The Ultimate CSS Selectors Cheat Sheet for Web Scraping

Valentina Skakun
Last update: 30 Apr 2024

There are several ways to scrape data. One of the most common is web scraping with CSS selectors.

Every website has its own structure, which is similar across all its pages, and every HTML element on the site has its own address. Selectors, in turn, define the rules used to pick out a single element or a group of elements from the HTML content.

Different types of CSS selectors and their uses in Web Scraping

CSS selectors fall into several groups depending on their purpose. The main types are:

Universal selectors

The universal selector is written as * and matches every HTML element. An example of a universal CSS selector:

* {
  font-family: Arial, Verdana, sans-serif;
}

Tag selectors

A tag selector applies the specified rule to every element with the given tag name. Tag selector example:

p {
  padding-bottom: 15px;
}

Class selectors

A class selector applies the specified rule to all elements whose class attribute contains the given class name. Example of a class selector:

.center {
  text-align: center;
}

ID selectors

An ID selector applies the rule to the element whose id attribute equals the given value. An id is meant to be unique within a page, so this normally matches a single element. For example:

#footer {
  margin-top: 50px;
}

Attribute selectors

They are used to select elements by the name or value of a particular attribute. The following types of attribute selectors exist:

  • [attr] - selects elements that have the attribute, whatever its value.

  • [attr=value] - selects elements whose attribute is exactly equal to value.

  • [attr^=value] - selects elements whose attribute value starts with value.

  • [attr|=value] - selects elements whose attribute is exactly value, or starts with value followed by a hyphen (for example, [lang|=en] matches both en and en-US).

  • [attr$=value] - selects elements whose attribute value ends with value.

  • [attr*=value] - selects elements whose attribute value contains value anywhere as a substring.

  • [attr~=value] - selects elements whose attribute is a space-separated list of words, one of which is exactly value.

With such CSS selectors, one can quickly collect a list of names or find data that matches strictly defined parameters among all products. A simple example of such a selector, using the href attribute:

[href*="google.com"] {
  background-color: green;
}
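The matching rules behind these operators can be sketched in a few lines of Python. This is only an illustration of the semantics, not how a browser or scraping library implements them; the function name attr_matches and the sample values are made up for the example.

```python
from typing import Optional


def attr_matches(operator: str, actual: Optional[str], expected: str = "") -> bool:
    """Return True if an attribute value satisfies a CSS attribute operator.

    `actual` is the attribute value found on the element, or None when the
    attribute is absent. An empty `operator` stands for the bare [attr] form.
    """
    if actual is None:
        return False                      # no attribute, nothing can match
    if operator == "":
        return True                       # [attr]: presence is enough
    if operator == "=":
        return actual == expected         # [attr=value]: exact match
    if operator in ("^=", "$=", "*=", "~=") and expected == "":
        return False                      # these never match an empty value
    if operator == "^=":
        return actual.startswith(expected)
    if operator == "$=":
        return actual.endswith(expected)
    if operator == "*=":
        return expected in actual         # substring anywhere
    if operator == "~=":
        return expected in actual.split() # one of the space-separated words
    if operator == "|=":
        return actual == expected or actual.startswith(expected + "-")
    raise ValueError(f"unknown operator: {operator!r}")


# [href*="google.com"] from the example above
print(attr_matches("*=", "https://www.google.com/maps", "google.com"))
# [class~=btn] matches the word "btn" but not "btn-primary" on its own
print(attr_matches("~=", "btn-primary", "btn"))
```

The difference between *= and ~= is the one that most often matters in scraping: *= also matches partial class names, while ~= matches whole words only.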

Descendant selectors or context selectors

A descendant selector matches elements nested inside another element, for example:

.wrapping p {
   padding: 30px;
}

In this case, it doesn’t matter whether the p element is a direct child of the element with the class .wrapping or not. In other words, it can be nested inside any other element within it.

Child selectors

The main difference from descendant selectors is that only first-level descendants (direct children) are matched.

.wrapping > p {
 padding: 35px;
}

Sibling selectors (elements located on the same level)

The adjacent sibling combinator (+) selects an element only if it comes immediately after the first element and both have the same parent. To match siblings that come later and do not directly follow the first element, use the general sibling combinator (~) instead.

h2 + p {
   padding-bottom: 10px;
}
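To make the difference between the descendant, child, and adjacent sibling combinators concrete, here is a minimal Python sketch. The standard library has no CSS engine, so the sketch walks a small tree by hand with xml.etree.ElementTree; the markup and the helper names are invented for illustration, and a real scraper would normally rely on a library that understands CSS selectors.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup used only for this illustration.
DOC = ET.fromstring(
    "<div class='wrapping'>"
    "<h2>Title</h2>"
    "<p>intro</p>"
    "<section><p>nested</p></section>"
    "<p>outro</p>"
    "</div>"
)


def descendants(root, tag):
    """Like `.wrapping p`: matching elements at any depth below root."""
    return [el for el in root.iter(tag) if el is not root]


def children(root, tag):
    """Like `.wrapping > p`: direct children of root only."""
    return [el for el in root if el.tag == tag]


def adjacent_siblings(root, first, second):
    """Like `h2 + p`: a `second` element right after a `first` element."""
    found = []
    for parent in root.iter():
        kids = list(parent)
        for prev, cur in zip(kids, kids[1:]):
            if prev.tag == first and cur.tag == second:
                found.append(cur)
    return found


print([el.text for el in descendants(DOC, "p")])
print([el.text for el in children(DOC, "p")])
print([el.text for el in adjacent_siblings(DOC, "h2", "p")])
```

On this sample, the descendant lookup returns all three paragraphs, the child lookup skips the one nested in section, and the adjacent sibling lookup returns only the paragraph directly after the heading.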

Pseudo-class selectors (state selectors)

These selectors work with pseudo-classes, which describe a particular state of an element (mouse hover, focus, visited link, and so on). For example:

a:hover {
   text-decoration: none;
}

As a rule, only the basic CSS selectors are needed for scraping web pages: #id (to find an element by ID), element (to find elements by tag name), .class (to find elements by class), [attribute] (to find elements by attribute), and * (to select all elements).

So selectors allow one to select elements of the same type on all pages of a website: links, images, prices, etc. This means that, using CSS selectors, one can quickly collect the data that is needed.

Picking the right CSS selectors using Inspect

The most common way to pick the right CSS selectors is to inspect the site with the browser’s developer tools. First, open DevTools either by pressing the F12 key or by right-clicking on the page and selecting Inspect.

Figure: Right-click on the web page and click 'Inspect' (available by default in the Chrome context menu) to open the developer tools.

After that, DevTools opens, where one can view the HTML structure of the page. When one hovers over a line of code, the corresponding element is highlighted on the page.

Figure: Element highlight.

It also works in reverse. One can select any element on the web page and view its code using the element picker or the Ctrl+Shift+C hotkey.

Figure: Selecting an element on the web page to view its code.

This way, one can find the code of the required element and use a web scraper to collect all the similar ones. For example, suppose one needs to find all titles on the page. Inspecting a title shows that it is an <a> element.

However, the tag alone is not a good selector. If one uses the <a> tag to scrape titles, the HTML parser will return a lot of noise. For example, the “Learn more” button is also an <a> element.

In this case, it’s a good idea to use a more specific CSS selector. One can use the a.link-block.w-inline-block.h3-5 selector to scrape all titles from the web page.

Figure: Finding all post titles on the page.

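In the DevTools console, run [...$$('a.link-block.w-inline-block.h3-5')].map(i => i.innerText) to get the titles of all post links. Here $$ is the console shorthand for document.querySelectorAll, so it returns every matching element; a plain $ returns only the first match unless the page defines its own $ function, as jQuery does.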

Figure: Getting innerText from all a.link-block.w-inline-block.h3-5 elements.

This way, it isn’t difficult to create a method that parses the downloaded HTML document and extracts data from the required HTML elements.
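As a rough illustration of such a method, the sketch below does by hand what the a.link-block.w-inline-block.h3-5 selector expresses, using only html.parser from the Python standard library. The class name TitleCollector, the function extract_titles, and the sample markup are assumptions made for this example; in practice a scraping library with CSS selector support would usually do this matching.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect the text of <a> elements that carry all the required classes."""

    REQUIRED = {"link-block", "w-inline-block", "h3-5"}

    def __init__(self):
        super().__init__()
        self.titles = []
        self._inside = False   # True while inside a matching <a>
        self._chunks = []      # text pieces of the current title

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        # Class order does not matter, just as in the CSS selector.
        if tag == "a" and self.REQUIRED.issubset(classes):
            self._inside = True
            self._chunks = []

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            self.titles.append("".join(self._chunks).strip())
            self._inside = False


def extract_titles(html):
    parser = TitleCollector()
    parser.feed(html)
    parser.close()
    return parser.titles


# Hypothetical markup: two post titles and one "Learn more" button.
SAMPLE = """
<a class="link-block w-inline-block h3-5" href="/blog/one">First post</a>
<a class="button" href="/more">Learn more</a>
<a class="h3-5 link-block w-inline-block" href="/blog/two"><span>Second</span> post</a>
"""

print(extract_titles(SAMPLE))
```

The "Learn more" link is skipped because it lacks the three classes, which is exactly the noise that a bare <a> selector would have picked up.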

How to use CSS selectors for web scraping

If one knows the CSS selector that matches the necessary information, one can quickly get this data. For example, with the selector a.snippet-cell[href], the scraper can get the link stored in the href attribute of each a element that has the class snippet-cell.

Let’s say someone needs to get all product names on a page using CSS selectors. The product names are specified in the title attribute of the <h4> tag. In addition to the title attribute, the <h4> tag also contains other information, such as:

<h4 href="/product/11" title="Pen">Pen</h4>

The name is stored in the title attribute of the elements matched by h4[title]. Moreover, one can get the names of all products on all pages of the site using the h4[title] selector.
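A minimal sketch of reading the title attribute from every h4 element, written with html.parser from the Python standard library. The class AttributeCollector, the function collect_attribute, and the second product in the sample are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collect one attribute from every element with a given tag name.

    Roughly what h4[title] selects, followed by reading the title value.
    """

    def __init__(self, tag, attribute):
        super().__init__()
        self.tag = tag
        self.attribute = attribute
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        for name, value in attrs:
            if name == self.attribute:
                self.values.append(value)
                break  # elements without the attribute are skipped


def collect_attribute(html, tag, attribute):
    parser = AttributeCollector(tag, attribute)
    parser.feed(html)
    parser.close()
    return parser.values


# The first line is the example from the article; the rest is made up.
SAMPLE = """
<h4 href="/product/11" title="Pen">Pen</h4>
<h4>Section without a title attribute</h4>
<h4 href="/product/12" title="Notebook">Notebook</h4>
"""

print(collect_attribute(SAMPLE, "h4", "title"))
```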

Selectors can be grouped, supplemented, and described in more detail in order to scrape more complex things.

For example, suppose one needs the first item from each product category. The selector would be the following:

#set_1 > div:nth-child(1) > div > div.title > a

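If the number of elements is constantly changing, they don’t have unique classes, and one needs a specific element, then one can use more advanced selectors such as :not(), :eq(), and :last. Note that :eq() and :last are jQuery extensions rather than standard CSS, so they work only in tools that support them; the closest standard pseudo-classes are :nth-child(), :nth-of-type(), :last-child, and :last-of-type.

To select elements that don’t match some selector, use :not(selector). It is the negation pseudo-class. Let’s say one needs to select paragraphs that don’t have the class .classy: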

p:not(.classy)

This CSS selector selects all <p> elements except those that have the classy class.

To select an element at a certain position, one can use :eq(number). Counting starts from zero. For example, :eq(0) selects the first element, :eq(1) selects the second, and :eq(10) selects the eleventh.

If the position of the item is not known, but it is known to be the last one, one can use :last. It returns strictly the last element, regardless of the total number of elements.
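The zero-based counting of :eq() is easy to get wrong, so here is a small Python sketch of the idea. It models the matched elements as a plain list and covers only non-negative indexes; the helper names eq and last and the product list are invented for the illustration.

```python
def eq(matched, index):
    """Mimic :eq(index): zero-based pick from the matched elements."""
    if index < 0:
        raise ValueError("this sketch handles non-negative indexes only")
    return matched[index:index + 1]   # empty list when out of range


def last(matched):
    """Mimic :last: the final matched element, however many there are."""
    return matched[-1:]


# Eleven hypothetical matches: "product 1" ... "product 11".
products = [f"product {n}" for n in range(1, 12)]

print(eq(products, 0))    # first element
print(eq(products, 10))   # eleventh element
print(last(products))     # last element, here also the eleventh
```

By contrast, the standard :nth-child(n) counts from one and looks at the position among all sibling elements rather than among the matched ones.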

Anyone can check on the site itself whether a CSS selector is written correctly. To do so, open the developer console (F12) and type $('selector'), for example $('h4[title]'). If everything is written correctly, the console returns the matching element, such as <h4 title="Pen">.

CSS Selector vs XPath

A CSS selector picks out the necessary element, while XPath is a query language that can extract data from a tag or attribute by its address in the source HTML of a web page.

CSS selectors were created for HTML, while XPath was created for searching XML documents. So, for CSS, class and ID are special attributes with their own syntax for searching HTML documents. From XPath’s point of view, these are “just attributes” that get no special treatment in the search. This is because the search is carried out not through an index table, but through the whole DOM tree.

CSS and XPath have both advantages and disadvantages, some of which are shown in the table.

| | CSS | XPath |
|---|---|---|
| Ability to search up the DOM tree | - | + |
| Ability to use subqueries | - | + |
| Search speed in Chrome | high | low |
| Search speed in Firefox | high | low |
| Search speed in Internet Explorer | low | high |

Conciseness and ease of writing were not included in the table because they are subjective. However, the syntax examples below let everyone decide which method suits them better.

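| Example | CSS | XPath |
|---|---|---|
| All elements | * | //* |
| All <a> elements | a | //a |
| All child elements of <a> | a > * | //a/* |
| Search by ID | #footer | //*[@id='footer'] |
| Search by class | .classy | //*[contains(@class,'classy')] |
| Search by attribute | *[title] | //*[@title] |
| First child of all <a> | a > *:first-child | //a/*[1] |
| All <a> elements with a child <p> | a:has(> p) in engines that support :has(); otherwise not possible | //a[p] |
| Next element | p + * | //p/following-sibling::*[1] |
| Previous element | N/A | //p/preceding-sibling::*[1] |

Note that XPath positions start at 1, not 0, so [1] selects the first element.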
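Several of the XPath expressions from the comparison can be tried out with xml.etree.ElementTree from the Python standard library. This is a sketch under a few assumptions: ElementTree supports only a subset of XPath, needs well-formed markup, and expects relative paths that start with a dot; the sample page is made up for the example. Position predicates such as [1] are used here after a tag name, which is the form ElementTree documents.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment.
PAGE = ET.fromstring(
    "<body>"
    "<a href='/one'><span>first</span><span>second</span></a>"
    "<a href='/two'><p>inside a paragraph</p></a>"
    "<div id='footer' title='Footer'>footer text</div>"
    "</body>"
)

all_links = PAGE.findall(".//a")               # all <a> elements
footer = PAGE.find(".//*[@id='footer']")       # search by ID
with_title = PAGE.findall(".//*[@title]")      # search by attribute
links_with_p = PAGE.findall(".//a[p]")         # <a> that has a child <p>
first_spans = PAGE.findall(".//a/span[1]")     # positions start at 1

print(len(all_links))
print(footer.text)
print([el.tag for el in with_title])
print([a.get("href") for a in links_with_p])
print([s.text for s in first_spans])
```

The //a[p] query has no counterpart among the basic CSS selectors, which illustrates the "subqueries" row of the first table.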

Conclusion and Takeaways

Each site has its own structure, which contains elements of the same type. CSS selectors are useful for scraping them: they select elements of the same type across all pages of the website.

This saves time by making it possible to quickly collect all the necessary information from all web pages and structure it for further processing.
