Web Scraping with XPath in Selenium
The first step in data scraping is locating the data itself. This can be done in many ways: through unique attributes, class names, ids, or CSS selectors. However, dynamic elements sometimes make it harder to find data and identify HTML elements. This is where XPath comes in handy.
When a web page is loaded in a browser, it generates a DOM (Document Object Model) structure, and XPath is a query language for selecting nodes in that DOM. This makes XPath a good way to locate web elements on a page with Selenium as well.
Syntax of XPath
XML Path, commonly known as XPath, is a query language for XML documents. It lets one describe a navigation path through the document to reach any web element.
XPath syntax is built from DOM tags and attributes, which makes it possible to locate any element on a web page through the DOM. In general, an XPath starts with "//" and looks like this:
//tag_name[@Attribute_name = "Value"]/child_node
Here tag_name is the node name, @ marks the attribute to match, and the value filters the results.
An example of an XPath might be the following:
//*[@id="w-node"]/div/a[1]
Types of XPath
There are only two types of XPaths in Selenium: absolute XPaths and relative XPaths.
The examples below use a web page with the following HTML code:
<!DOCTYPE html>
<html>
<head>
  <title>A sample shop</title>
</head>
<body>
  <div class="product-item">
    <img src="example.com/item1.jpg">
    <div class="product-list">
      <h3>Pen</h3>
      <span class="price">10$</span>
      <a href="example.com/item1.html" class="button">Buy</a>
    </div>
  </div>
  <div class="product-item">
    <img src="example.com/item2.jpg">
    <div class="product-list">
      <h3>Book</h3>
      <span class="price">20$</span>
      <a href="example.com/item2.html" class="button">Buy</a>
    </div>
  </div>
</body>
</html>
Absolute XPath
Using an absolute XPath helps to accurately locate one specific element. For example, let's write an absolute XPath for the product name:
Absolute XPath:
/html/body/div[1]/div/h3
To copy an XPath from Chrome DevTools (press F12 to open it), just inspect the element (Ctrl+Shift+C or the Inspect button):
Then right-click the highlighted line in the Elements panel and choose Copy > Copy full XPath:
The resulting XPath can be checked in the console:
Here one can also copy the HTML code of this element: just right-click the result and choose "Copy Object":
The result:
<h3>Pen</h3>
This method, also known as a single-slash search, is the most vulnerable to minor changes in the page structure.
Relative XPath
A relative XPath is more flexible and does not depend on minor changes in the page structure. The following relative XPath finds the same element as the absolute XPath above:
//*[@class="product-list"]/h3
Let’s check:
The result:
[ {<h3>Pen</h3>}, {<h3>Book</h3>} ]
A relative XPath can start its search anywhere in the DOM structure. Moreover, it is shorter than an absolute XPath.
XPath VS CSS Selectors
Someone who has already read about CSS selectors may find it hard to choose between the two. The main difference is that XPath can traverse the DOM both forward and backward, while a CSS selector only moves forward and cannot reach parent elements. However, XPath handling differs between browsers, so expressions are not always universal.
Thus, CSS selectors are best used when it is necessary to save time or simplify the code, whereas XPath is more suitable for more complex tasks. The full article about CSS selectors is here.
Using XPath in Selenium
To scrape data with Selenium, the By class is used. Two methods are useful for finding page elements in combination with the By class:
- find_element - returns the first web element in the DOM that matches the given locator. If no element is found, the method throws a NoSuchElementException.
- find_elements - returns a list of all web elements that match the given locator, or an empty list if none is found.
So, to find the pen's product name using XPath in Selenium:
from selenium.webdriver.common.by import By
driver.find_element(By.XPATH, '//*[@class="product-list"]/h3')
And to get a list of all product names:
from selenium.webdriver.common.by import By
driver.find_elements(By.XPATH, '//*[@class="product-list"]/h3')
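Putting it together, here is a minimal runnable sketch. It assumes the sample shop page above is saved locally (the file name and path are hypothetical) and that chromedriver is available:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes chromedriver is installed and on PATH
driver.get("file:///path/to/shop.html")  # hypothetical path to the sample page above

# first product name matching the relative XPath
first_name = driver.find_element(By.XPATH, '//*[@class="product-list"]/h3')
print(first_name.text)  # Pen

# all product names
names = driver.find_elements(By.XPATH, '//*[@class="product-list"]/h3')
print([n.text for n in names])  # ['Pen', 'Book']

driver.quit()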
Dynamic XPath in Selenium
To build more specific queries, one can use XPath operators, functions, and axes.
XPath Using Logical Operators: OR & AND
Logical operators allow a more precise search for elements depending on the specified conditions. XPath supports two logical operators: or and and. Remember that they are case-sensitive, so writing "OR" or "AND" would be incorrect.
Logical Operator OR
An XPath query with or returns the elements that match the first condition (A), the second condition (B), or both. For example:
//tag_name[@Attribute_name = "Value" or @Attribute_name2 = "Value2"]
It will return:
| Attribute 1 | Attribute 2 | Result |
| --- | --- | --- |
| False | False | No elements |
| True | False | Returns A |
| False | True | Returns B |
| True | True | Returns both |
Let's modify the example above and check how the or operator works. Imagine that the pen price is stored in a container like this:
<span time-in="150" class="price">10$</span>
And the book price like this:
<span time-in="100" class="price">20$</span>
Use the logical operator or:
//span[@time-in = "100" or @class = "price"]
The result:
The query returned both products because they both had the class “price”.
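In Selenium, the same check could look like this (a small sketch, assuming the driver already has the modified sample page loaded, as in the earlier examples):
prices = driver.find_elements(By.XPATH, '//span[@time-in = "100" or @class = "price"]')
print([p.text for p in prices])  # expected: both price values, 10$ and 20$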
Logical Operator AND
An XPath query with and returns only the elements that match both conditions. For example:
//tag_name[@Attribute_name = "Value" and @Attribute_name2 = "Value2"]
It will return:
| Attribute 1 | Attribute 2 | Result |
| --- | --- | --- |
| False | False | No elements |
| True | False | No elements |
| False | True | No elements |
| True | True | Returns the element |
To check it, just take the example above and change the or operator to and:
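On the sample page, only the book's price span carries both time-in="100" and class="price", so the query below matches a single element (a sketch based on the modified example above):
//span[@time-in = "100" and @class = "price"]
find_elements with this XPath returns only the 20$ span.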
XPath using Starts-With()
This function helps to find elements whose text or attribute value starts in a particular way. For example, let's find a link to the article "Web Scraping with Python: from Fundamentals to Practice".
The XPath will be the following:
//a[starts-with(text(),'Web Scraping')]
or
//a[starts-with(text(),'Web')]
Let’s check:
But the following will not work, because starts-with() matches only from the beginning of the text:
//a[starts-with(text(),'Scraping with Python')]
This function can be used not only for static elements but also for dynamic ones (such as buttons). For example:
//span[starts-with(@class, 'read-more-link')]
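In Selenium, such a locator is used like any other XPath; for instance, a sketch based on the article-link example above, assuming a driver as in the earlier snippets:
links = driver.find_elements(By.XPATH, "//a[starts-with(text(), 'Web Scraping')]")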
XPath using Index
Indexes are useful when one needs to pick a specific element out of several matches. For example:
//tag[@attribute_name='value'][element_num]
Let’s return to the operator OR example and try to find only the first result:
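A sketch based on the sample page above. Note that //span[@class="price"][1] would still match both prices here, because [1] is applied within each parent div; to get a single result, index a parent element or wrap the whole expression in parentheses:
//div[@class="product-item"][1]//span[@class="price"]
(//span[@time-in="100" or @class="price"])[1]
Both expressions return the 10$ span of the first product.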
XPath using Following
This axis is used to find the element or elements that come after a known one in the document. The syntax is:
//tag[@attribute_name='value']//following::tag
The following element does not have to be right next to the known tag or at the same level; with find_element, Selenium returns the nearest one:
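For example, on the sample shop page, this expression selects the price spans that appear after the pen's heading in the document; find_element returns the nearest one, the 10$ span:
//h3[text()="Pen"]/following::span[@class="price"]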
XPath using Following-Sibling
This axis finds the following elements that share the same parent. It has the following syntax:
//tag[@attribute_name='value']//following-sibling::tag
The result will be the same as in the previous example.
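For example, the pen's heading has exactly one span among its siblings, its price, so this expression returns the 10$ span again:
//h3[text()="Pen"]/following-sibling::span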
XPath using Preceding
The preceding axis finds all the elements that appear before the current node:
//tag[@attribute_name='value']//preceding::tag
With find_element, Selenium returns the nearest match, searching at all levels.
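For example, on the sample page this expression selects both headings (Pen and Book), because both appear before the book's price in the document:
//span[text()="20$"]/preceding::h3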
XPath using Preceding-Sibling
The same as the previous one, but it searches for elements before the current node that share the same parent:
//tag[@attribute_name='value']//preceding-sibling::tag
XPath using Child
This method is used to locate all the child elements of a particular node:
//tag[@attribute_name='value']//child::tag
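For example, this expression selects the product names, because each h3 is a direct child of a product-list div:
//div[@class="product-list"]/child::h3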
XPath using Parent
This method is used to locate the parent element of a particular node:
//tag[@attribute_name='value']//parent::tag
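For example, this expression selects the product-list div that directly contains each price span:
//span[@class="price"]/parent::div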
XPath using Descendants
This method is used to locate all the descendants (child nodes, grandchild nodes, etc.) of a particular node:
//tag[@attribute_name='value']//descendant::tag
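For example, this expression selects the Buy links, since each link is a descendant (though not a direct child) of a product-item div:
//div[@class="product-item"]/descendant::a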
XPath using Ancestors
This method is used to locate all the ancestors (parent, grandparent nodes, etc.) of a particular node:
//tag[@attribute_name='value']//ancestor::tag
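For example, this expression selects both the product-list and product-item containers of each price span:
//span[@class="price"]/ancestor::div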
Conclusion and Takeaways
So, XPath in Selenium can help to locate elements for further scraping. It works with both static and dynamic data. Moreover, unlike CSS selectors, XPath can operate on all levels of the DOM structure, including parent elements.