Web Scraping with XPath in Selenium

Valentina Skakun
Last update: 30 Apr 2024

For data scraping, the first step is to find the data itself. This can be done in many ways: through unique attributes, class names, ids, or CSS selectors. However, dynamic elements can make it harder to locate the data and to identify HTML elements. This is where XPath comes in handy.

When a web page is loaded, the browser builds a DOM (Document Object Model) structure. XPath is a query language that selects nodes in the DOM, which makes it a good way to locate web elements with Selenium as well.

Syntax of XPath

XML Path Language, commonly known as XPath, is a query language for XML documents. It lets one write a navigation path through the document to reach any web element.

The XPath syntax is built from DOM tags and attributes, which makes it possible to locate any element on a web page. In general, an XPath starts with "//" and looks like this:

//tag_name[@Attribute_name = "Value"]/child nodes

Here tag_name is the node name, @ marks the start of the attribute name, and Value is the attribute value used to filter the results.

An example XPath:

//*[@id="w-node"]/div/a[1]

Types of XPath

There are two types of XPath in Selenium: absolute and relative.

The examples use a web page with the following HTML code:

<!DOCTYPE html>
<html>
    <head>
        <title>A sample shop</title>
    </head>
    <body>
        <div class="product-item">
            <img src="example.com/item1.jpg">
            <div class="product-list">
                <h3>Pen</h3>
                <span class="price">10$</span>
                <a href="example.com/item1.html" class="button">Buy</a>
            </div>
        </div>
        <div class="product-item">
            <img src="example.com/item2.jpg">
            <div class="product-list">
                <h3>Book</h3>
                <span class="price">20$</span>
                <a href="example.com/item2.html" class="button">Buy</a>
            </div>
        </div>
    </body>
</html>

Absolute XPath

An absolute XPath spells out the full path from the document root to one specific element. For example, let's write the absolute XPath for the first product name:

Absolute XPath:

/html/body/div[1]/div/h3

To copy an XPath from Chrome DevTools (press F12 to open), inspect the element (Ctrl+Shift+C or the Inspect button). Then right-click the highlighted line in the Elements panel and choose Copy > Copy full XPath.

The resulting XPath can be checked in the console, for example with the $x() helper. There one can also copy the HTML code of the element: right-click the result and choose "Copy Object".

The result:

<h3>Pen</h3>

This method, also known as single-slash search, is the most vulnerable to minor changes in the structure of the page.

Relative XPath

A relative XPath is more flexible and does not depend on minor changes in the page structure. The following relative XPath matches the same element as the absolute XPath above, along with every other element that fits the pattern:

//*[@class="product-list"]/h3

Let’s check:

The result:

[<h3>Pen</h3>, <h3>Book</h3>]

A relative XPath can start searching anywhere in the DOM structure. Moreover, it is usually shorter than an absolute XPath.
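The difference between the two path types can also be checked offline. The following is a minimal sketch that uses Python's standard-library xml.etree.ElementTree instead of a browser; it is a stand-in for Selenium, not part of it, and it supports only a subset of XPath. The page is a trimmed, well-formed copy of the sample shop (the <img> tags are dropped, which is an assumption made so that the XML parser accepts it). ElementTree paths start at the root element, so the leading /html of the absolute path becomes a dot.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the sample shop page (assumption: <img> removed).
PAGE = """
<html>
  <body>
    <div class="product-item">
      <div class="product-list">
        <h3>Pen</h3>
        <span class="price">10$</span>
        <a href="example.com/item1.html" class="button">Buy</a>
      </div>
    </div>
    <div class="product-item">
      <div class="product-list">
        <h3>Book</h3>
        <span class="price">20$</span>
        <a href="example.com/item2.html" class="button">Buy</a>
      </div>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)  # root is the <html> element

# Absolute style: every step from the root is spelled out.
absolute = [h3.text for h3 in root.findall("./body/div[1]/div/h3")]

# Relative style: match at any depth by attribute value.
relative = [h3.text for h3 in root.findall(".//*[@class='product-list']/h3")]

print(absolute)  # ['Pen']
print(relative)  # ['Pen', 'Book']
```

If another wrapper element were added around the products, the absolute path would stop matching, while the relative one would keep working.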

XPath VS CSS Selectors

Someone who has already read about CSS selectors may find it hard to choose between the two. The main difference is that XPath can move both forward and backward through the DOM, while a CSS selector only moves forward and cannot select parent elements. However, XPath behavior can differ between browsers, so an expression is not guaranteed to be universal.

Thus, CSS selectors are the better choice when speed or simpler code matters, whereas XPath is more suitable for complex tasks. We cover CSS selectors in a separate full article.

Using XPath in Selenium

To scrape data with Selenium, the By class is used to specify how an element is located. Two methods find page elements in combination with the By class:

  1. find_element returns the first web element that matches the locator. If no element is found, the method raises a NoSuchElementException.

  2. find_elements returns a list of all web elements that match the locator, or an empty list if nothing matches.

So, to find the name of the first product (the pen) using XPath in Selenium:

from selenium.webdriver.common.by import By
driver.find_element(By.XPATH, '//*[@class="product-list"]/h3')

And to get a list containing all product names:

from selenium.webdriver.common.by import By
driver.find_elements(By.XPATH, '//*[@class="product-list"]/h3')
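Because find_elements returns an empty list instead of raising an exception, it can also be used to look up an optional element without a try/except block. The helpers below are a hedged sketch, not part of Selenium: the function names are made up for illustration, and driver is assumed to be an already created WebDriver with the page loaded. The locator strategy is passed as the plain string "xpath", which is the value behind By.XPATH, so the sketch needs no import; passing By.XPATH works the same way.

```python
# Assumption: "xpath" is the string value of selenium's By.XPATH constant,
# so either can be passed as the locator strategy.
XPATH = "xpath"

NAME_XPATH = '//*[@class="product-list"]/h3'


def first_text_or_none(driver, xpath=NAME_XPATH):
    """Return the text of the first match, or None when nothing matches."""
    matches = driver.find_elements(XPATH, xpath)
    return matches[0].text if matches else None


def all_texts(driver, xpath=NAME_XPATH):
    """Return the text of every matching element (possibly an empty list)."""
    return [element.text for element in driver.find_elements(XPATH, xpath)]
```

On the sample shop page, first_text_or_none(driver) would be expected to give the pen's name and all_texts(driver) both product names.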

Dynamic XPath in Selenium

For more specific queries, XPath offers operators, functions, and axes.

XPath Using Logical Operators: OR & AND

Logical operators narrow the search to elements that satisfy the specified conditions. XPath has two logical operators: or and and. They are case-sensitive, so writing "OR" or "AND" is incorrect.

Logical Operator OR

This XPath query returns the elements that match the first condition, the second condition, or both. For example:

//tag_name[@Attribute_name = "Value" or @Attribute_name2 = "Value2"]

It will return:

Attribute 1 | Attribute 2 | Result
False       | False       | No elements
True        | False       | Returns A
False       | True        | Returns B
True        | True        | Returns both

Let's modify the example above to see how the logical operator or works. Imagine that the price of the pen is stored in this element:

<span time-in="150" class="price">10$</span>

And book price:

<span time-in="100" class="price">20$</span>

Use the logical operator or:

//span[@time-in = "100" or @class = "price"]

The query returns both price elements because both of them have the class "price".

Logical Operator AND

This XPath query returns only the elements that match both conditions. For example:

//tag_name[@Attribute_name = "Value" and @Attribute_name2 = "Value2"]

It will return:

Attribute 1 | Attribute 2 | Result
False       | False       | No elements
True        | False       | No elements
False       | True        | No elements
True        | True        | Returns the element

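To check it, take the example above and replace the operator or with and:

//span[@time-in = "100" and @class = "price"]

Only the book price matches, because it is the only element that has both time-in="100" and the class "price".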
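The two truth tables can be reproduced offline with a small sketch based on Python's standard-library xml.etree.ElementTree, used here instead of Selenium so that it runs without a browser. ElementTree supports only a subset of XPath and has no or / and keywords, so this is an emulation and not the operators themselves: chained predicates stand in for and, and a merge of two single-condition queries stands in for or. The wrapping <div> is an assumption added to make the fragment a valid document.

```python
import xml.etree.ElementTree as ET

# The two modified price elements; the wrapping <div> is an assumption.
SPANS = ET.fromstring(
    '<div>'
    '<span time-in="150" class="price">10$</span>'
    '<span time-in="100" class="price">20$</span>'
    '</div>'
)

# "and": chained predicates, every condition has to hold.
both = SPANS.findall(".//span[@time-in='100'][@class='price']")

# "or": merge two single-condition queries, keeping document order
# and listing each element only once.
by_time = SPANS.findall(".//span[@time-in='100']")
by_class = SPANS.findall(".//span[@class='price']")
either = [el for el in SPANS.iter("span") if el in by_time or el in by_class]

both_texts = [el.text for el in both]
either_texts = [el.text for el in either]

print(both_texts)    # ['20$']
print(either_texts)  # ['10$', '20$']
```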

XPath using Starts-With()

This function finds elements whose text or attribute value starts with a given string. For example, let's find the article "Web Scraping with Python: from Fundamentals to Practice".

The XPath will be the following:

//a[starts-with(text(),'Web Scraping')]

or

//a[starts-with(text(),'Web')]

But the following will not match, because the link text does not start with this string:

//a[starts-with(text(),'Scraping with Python')]

The function works on attributes as well as text, which helps with dynamic elements (such as buttons) whose attribute values share a stable prefix. For example:

//span[starts-with(@class, 'read-more-link')]
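The prefix rule can be illustrated with a short emulation. Python's standard-library xml.etree.ElementTree does not implement starts-with(), so the sketch below applies str.startswith to the link text instead; it shows the same matching rule, not the XPath function itself. The second link and the surrounding list markup are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical list of article links; only the first title comes from the text.
LINKS = ET.fromstring(
    "<ul>"
    "<li><a>Web Scraping with Python: from Fundamentals to Practice</a></li>"
    "<li><a>Another article</a></li>"
    "</ul>"
)


def links_starting_with(prefix):
    """Return link texts that begin with prefix, like starts-with(text(), prefix)."""
    return [a.text for a in LINKS.iter("a") if (a.text or "").startswith(prefix)]


print(links_starting_with("Web Scraping"))          # the article title
print(links_starting_with("Scraping with Python"))  # [] (not a prefix)
```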

XPath using Index

Indexing is useful when several elements match and only one of them is needed. The index is 1-based. For example:

//tag[@attribute_name='value'][element_num]

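Let's return to the or example and keep only the first result. A trailing [1] is evaluated separately for each parent element, so to take the first match in the whole document the expression is wrapped in parentheses:

(//span[@time-in = "100" or @class = "price"])[1]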
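How the index is counted can be seen in the sketch below. It uses Python's standard-library xml.etree.ElementTree as a browser-free stand-in for Selenium, on a trimmed, well-formed copy of the sample shop page (an assumption). ElementTree does not support the parenthesized form, so the first match in the whole document is taken with find(), which returns the first result.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the sample shop page (assumption).
PAGE = """
<html>
  <body>
    <div class="product-item">
      <div class="product-list">
        <h3>Pen</h3>
        <span class="price">10$</span>
      </div>
    </div>
    <div class="product-item">
      <div class="product-list">
        <h3>Book</h3>
        <span class="price">20$</span>
      </div>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# div[2] is the second <div> among the children of <body>.
second_name = [h3.text for h3 in root.findall("./body/div[2]/div/h3")]

# span[1] is counted per parent: one match for each product list.
first_span_each = [s.text for s in root.findall(".//div[@class='product-list']/span[1]")]

# First match in the whole document.
first_overall = root.find(".//span[@class='price']").text

print(second_name)      # ['Book']
print(first_span_each)  # ['10$', '20$']
print(first_overall)    # 10$
```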

XPath using Following

This axis selects the element or elements that come after a known one in the document. The syntax is the following:

//tag[@attribute_name='value']/following::tag

The matched elements do not have to be next to the known element or on the same level. With find_element, Selenium returns the nearest one.

XPath using Following-Sibling

This axis selects the following elements that share the same parent as the known one. It has the following syntax:

//tag[@attribute_name='value']/following-sibling::tag

The result is the same as in the previous example.

XPath using Preceding

The preceding axis selects all the elements that come before the current node in the document:

//tag[@attribute_name='value']/preceding::tag

It searches on all levels. Adding an index to the step, as in preceding::tag[1], selects the nearest one.

XPath using Preceding-Sibling

This works like the previous axis, but only selects elements before the current node that share its parent:

//tag[@attribute_name='value']/preceding-sibling::tag

XPath using Child

This axis selects all the child elements of a particular node:

//tag[@attribute_name='value']/child::tag

XPath using Parent

This axis selects the parent element of a particular node:

//tag[@attribute_name='value']/parent::tag
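The child and parent steps can be tried offline with the sketch below, which uses Python's standard-library xml.etree.ElementTree rather than Selenium. ElementTree has no named axes, so the sketch relies on the abbreviated forms it does support: /* for child elements and /.. for the parent. The page is a trimmed, well-formed copy of the sample shop (an assumption).

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the sample shop page (assumption).
PAGE = """
<html>
  <body>
    <div class="product-item">
      <div class="product-list">
        <h3>Pen</h3>
        <span class="price">10$</span>
        <a href="example.com/item1.html" class="button">Buy</a>
      </div>
    </div>
    <div class="product-item">
      <div class="product-list">
        <h3>Book</h3>
        <span class="price">20$</span>
        <a href="example.com/item2.html" class="button">Buy</a>
      </div>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# Child step: every direct child of the first product list.
child_tags = [el.tag for el in root.findall("./body/div[1]/div/*")]

# Parent step: move up from each price element to its container.
parent_classes = [el.get("class") for el in root.findall(".//span[@class='price']/..")]

# Up to the parent and down again: the name that belongs to each price.
names_via_parent = [h3.text for h3 in root.findall(".//span[@class='price']/../h3")]

print(child_tags)        # ['h3', 'span', 'a']
print(parent_classes)    # ['product-list', 'product-list']
print(names_via_parent)  # ['Pen', 'Book']
```

The last query is the kind of backward movement that a CSS selector cannot express.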

XPath using Descendants

This axis selects all the descendants (children, grandchildren, etc.) of a particular node:

//tag[@attribute_name='value']/descendant::tag

XPath using Ancestors

This axis selects all the ancestors (parent, grandparent, etc.) of a particular node:

//tag[@attribute_name='value']/ancestor::tag
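The sketch below shows which elements count as descendants and ancestors. It uses Python's standard-library xml.etree.ElementTree instead of Selenium, on a trimmed, well-formed copy of the sample shop page (an assumption). ElementTree has no ancestor axis, so the ancestors are collected by walking a child-to-parent map; this is an emulation and not an XPath query.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the sample shop page (assumption).
PAGE = """
<html>
  <body>
    <div class="product-item">
      <div class="product-list">
        <h3>Pen</h3>
        <span class="price">10$</span>
        <a href="example.com/item1.html" class="button">Buy</a>
      </div>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# Descendants: everything below the product item, at any depth.
item = root.find(".//div[@class='product-item']")
descendant_tags = [el.tag for el in item.findall(".//*")]

# Ancestors: walk up from the product name, nearest ancestor first.
parent_of = {child: parent for parent in root.iter() for child in parent}
node = root.find(".//h3")
ancestor_tags = []
while node in parent_of:
    node = parent_of[node]
    ancestor_tags.append(node.tag)

print(descendant_tags)  # ['div', 'h3', 'span', 'a']
print(ancestor_tags)    # ['div', 'div', 'body', 'html']
```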

Conclusion and Takeaways

So, XPath in Selenium helps locate elements for further scraping. It works with both static and dynamic data. Moreover, unlike CSS selectors, XPath can move in any direction through the DOM structure, including up to parent elements.
