How to Select Elements By Text in XPath?
XPath (XML Path Language) is a query language designed explicitly for navigating and extracting elements from XML documents. It provides a precise way to identify and select specific nodes within an XML structure, making it a valuable tool for data analysis and processing.
Compared to CSS selectors, XPath offers greater flexibility and functionality for data manipulation and retrieval. While we’ve previously explored the capabilities and benefits of XPath, this article focuses on its application for text processing and searching.
Text selection is a fundamental operation in XPath as a significant portion of data in XML documents resides in textual form. By mastering text selection, you can extract the required data from XML and process it further based on your application or task requirements.
Basic Methods for Selecting Elements by Text
XPath provides several core methods for working with text, and a few that were not initially designed for this purpose but are still very convenient to use. Let’s start by talking about the basic text search methods. We will test them on this demo site as an example.
text(): Selecting elements with exact text match
The first method is text(), which finds elements by their full text content. This makes it a precise way to target elements when you know exactly what text they contain. For example, to select the element whose text is “MacBook”, we can use the following XPath:
//*[text()='MacBook']
To verify it in the browser, we can use the search function in the browser’s developer tools and put a dot (.) before the XPath, or we can run the expression in the console with the $x() function. The similar-looking $$() helper takes CSS selectors, not XPath expressions.
Let’s verify and try to find the element:
While the text() method is convenient for selecting elements by exact text, it has some drawbacks:
- The text() method requires an exact text match. If the element’s text contains extra whitespace, such as leading spaces or line breaks, it will not match the search string.
- It is case-sensitive, so “MacBook” and “macbook” are treated as different texts.
- Lastly, text() looks only at an element’s own text nodes and ignores text inside child elements. An element whose text sits in a nested child element will not be selected.
Therefore, in some cases, using more complex element selection strategies may be necessary based on other element characteristics or their context.
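The drawbacks listed above can be reproduced in a small experiment. This is a minimal sketch that assumes Python's standard xml.etree.ElementTree module, which does not support the text() function, so the predicate text()='MacBook' is emulated by comparing each element's own text. The sample markup and the helper name select_by_exact_text are invented for illustration and are not the demo site's real HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, made up for this illustration.
SAMPLE = """
<div>
  <a id="exact">MacBook</a>
  <a id="padded"> MacBook </a>
  <a id="lower">macbook</a>
  <p id="nested"><b id="child">MacBook</b></p>
</div>
"""

root = ET.fromstring(SAMPLE)


def select_by_exact_text(root, wanted):
    """Rough equivalent of //*[text()='wanted'] for elements with one text node."""
    # el.text holds only the element's own leading text, never its children's text.
    return [el for el in root.iter() if el.text == wanted]


matches = select_by_exact_text(root, "MacBook")
print([el.get("id") for el in matches])  # ['exact', 'child']
```

The padded and lowercase variants are skipped, and the p element is skipped as well because its text lives in the nested b element, which is matched instead.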
contains(): Selecting elements containing a substring
This method allows you to search for a substring within an element’s text or attribute value. Unlike the previous method, you don’t need to know the entire text of the element, only a part of it. This method is especially useful when selecting elements containing certain keywords or parts of text.
However, it’s important to note that the contains() method is case-sensitive. It will only find matches if the substring you are searching for has the same case as the text in the element.
Let’s return to the demo site and try to find all the rows containing the substring “EOS”. We can use the following XPath for this:
.//*[contains(text(), 'EOS')]
As a result, we will find two elements:
If you want to achieve more precise results, you can slightly modify the previously considered XPath and specify a specific tag, for example:
.//p[contains(text(), 'EOS')]
This variation will return only one result – the product description.
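To see how narrowing by tag changes the result, here is a minimal sketch that emulates contains(text(), 'EOS') in plain Python. It assumes the standard xml.etree.ElementTree module (which has no contains() support), and the markup and the helper select_containing are hypothetical stand-ins for the demo page.

```python
import xml.etree.ElementTree as ET

# Hypothetical product card, invented for this sketch.
SAMPLE = """
<div>
  <a class="title">Canon EOS 2000D</a>
  <p class="description">Canon EOS kit with 18-55mm lens</p>
  <p class="description">Nikon D3500 body</p>
</div>
"""

root = ET.fromstring(SAMPLE)


def select_containing(root, substring, tag=None):
    """Rough equivalent of .//*[contains(text(), substring)] or .//tag[...]."""
    # root.iter(None) walks every element; root.iter("p") walks only <p> elements.
    return [el for el in root.iter(tag) if substring in (el.text or "")]


any_tag = select_containing(root, "EOS")
only_p = select_containing(root, "EOS", tag="p")
print([el.tag for el in any_tag])  # ['a', 'p']
print([el.tag for el in only_p])   # ['p']
print(select_containing(root, "eos"))  # [] because the match is case-sensitive
```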
starts-with(): Selecting elements starting with a substring
If you need to find elements whose text starts with a specific word or sequence of characters, the starts-with() method is the most suitable option. It is especially useful when searching for elements with a known prefix.
Let’s find products that start with certain characters:
.//*[starts-with(text(), 'Ap')]
As a result, there will be only one:
Keep in mind that this method is also case-sensitive. If you want your XPath to be case-insensitive, you can proceed to the next section, where we will discuss how to achieve this.
ends-with(): Selecting elements ending with a substring
This method is the opposite of the previous one and searches for matches not at the beginning of the text but at the end. It can be especially useful when you need to select elements identified by their attributes ending with a specific sequence of characters.
Unfortunately, ends-with() is not part of XPath 1.0; it was introduced in XPath 2.0. Browsers and many scraping tools implement only XPath 1.0, so the expression below works only in engines that support XPath 2.0 or later:
.//*[ends-with(text(), 'ok')]
Instead of searching the entire text, you can specify a specific tag or attribute or otherwise designate the desired search area.
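In engines limited to XPath 1.0, a common workaround is to rebuild ends-with() from substring() and string-length(). The sketch below mirrors that arithmetic in Python so it can be checked without an XPath engine; the helper names are hypothetical, and the XPath string is shown only for reference and is not executed.

```python
def xpath_substring(value, start):
    """Mimic XPath 1.0 substring(value, start): 1-based, runs to the end of the string."""
    return value[max(start, 1) - 1:]


def ends_with_xpath1(value, suffix):
    """Mirror of: substring(v, string-length(v) - string-length(s) + 1) = s"""
    start = len(value) - len(suffix) + 1
    return xpath_substring(value, start) == suffix


# The XPath 1.0 expression this mirrors (for reference only, not evaluated here).
XPATH_1_0 = (
    ".//*[substring(text(), "
    "string-length(text()) - string-length('ok') + 1) = 'ok']"
)

print(ends_with_xpath1("MacBook", "ok"))       # True
print(ends_with_xpath1("Notebook Pro", "ok"))  # False
```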
Additional Methods for Text Selection
As we mentioned earlier, there are additional text search and processing methods beyond the basic ones. This section will discuss how to more accurately identify the necessary elements using additional XPath methods, which are well-suited for working with strings.
translate(): Ignoring Case Sensitivity
To find elements regardless of letter case, we can normalize the text by converting all letters to lowercase. Here’s how we can do it:
//*[starts-with(translate(text(), 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'mac')]
This expression uses the translate() function to replace each uppercase letter with its lowercase counterpart. Then the starts-with() function finds elements whose lowercased text starts with “mac”. Only the characters listed in the second argument are converted, so letters outside A-Z stay as they are.
XPath 2.0 also provides the lower-case() and upper-case() functions, but like ends-with(), they are not available in XPath 1.0 engines such as browsers.
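The character-by-character mapping performed by translate() can be mirrored with Python's str.maketrans, which makes its limits easy to see. This is a sketch only; the titles list and the helper starts_with_ignore_case are invented for illustration.

```python
UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = "abcdefghijklmnopqrstuvwxyz"

# str.maketrans builds the same one-to-one character mapping that
# translate(text(), UPPER, LOWER) applies in XPath.
TO_LOWER = str.maketrans(UPPER, LOWER)


def starts_with_ignore_case(text, prefix):
    """Mirror of starts-with(translate(text, UPPER, LOWER), prefix)."""
    # As in the XPath version, the prefix must already be lowercase.
    return text.translate(TO_LOWER).startswith(prefix)


titles = ["MacBook", "MACBOOK PRO", "macbook air", "iMac"]
print([t for t in titles if starts_with_ignore_case(t, "mac")])
# ['MacBook', 'MACBOOK PRO', 'macbook air']

# Characters missing from the mapping are left untouched.
print("ÉCLAIR".translate(TO_LOWER))  # Éclair
```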
not(): Excluding elements matching specific text
The XPath function not() excludes elements that match a specific text or pattern. It lets you filter the elements in a document and keep only those that do not contain specific text.
This is particularly useful when you need to drop certain elements from your query results. For example, let’s take the product titles and exclude MacBook:
We used the following XPath expression to select all h4 elements whose child a element does not contain the text “Mac”:
//h4[not(contains(a/text(), 'Mac'))]
To make the result more straightforward, we executed the query in the browser’s console, allowing us to access the results directly. To make the example even more concise, let’s only display the text of the selected elements instead of the entire elements themselves:
Thus, using the not() function, we got all the items except MacBook.
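The filter not(contains(a/text(), 'Mac')) can be emulated in plain Python to see exactly which elements survive. This is a rough sketch using the standard xml.etree.ElementTree module; the markup and the helper h4_without are hypothetical. It also shows one detail worth knowing: an h4 with no a child yields an empty string, so it passes the not() test too.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, invented for this sketch.
SAMPLE = """
<div>
  <h4><a>MacBook</a></h4>
  <h4><a>iPad</a></h4>
  <h4><a>ThinkPad</a></h4>
  <h4>$999.00</h4>
</div>
"""

root = ET.fromstring(SAMPLE)


def h4_without(root, substring):
    """Rough equivalent of //h4[not(contains(a/text(), substring))]."""
    selected = []
    for h4 in root.iter("h4"):
        link = h4.find("a")
        # A missing <a> becomes an empty string in XPath, so contains() is
        # false and not(...) keeps the element.
        link_text = (link.text or "") if link is not None else ""
        if substring not in link_text:
            selected.append(h4)
    return selected


kept = h4_without(root, "Mac")
print(["".join(h4.itertext()) for h4 in kept])  # ['iPad', 'ThinkPad', '$999.00']
```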
position(): Selecting elements by their position in the list
The XPath position() function returns an element’s position within the set of nodes currently being filtered. It is mostly used in predicates to select elements by their ordinal number. This is very convenient, for example, if we want to get only the first item in a list, not the whole list.
To get data about the first item, you can use the following XPath:
//h4[position() = 1]
To get the last item:
//h4[position() = last()]
You can also specify a range of elements:
//h4[position() >= 2 and position() <= 4]
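This makes the function especially useful when you need to work with specific elements based on their position. Keep in mind that in an expression like //h4[position() = 1], the position is counted among the h4 children of each parent separately; to count across the whole document, wrap the path in parentheses, for example (//h4)[1].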
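Positional predicates are among the few XPath features that Python's standard xml.etree.ElementTree supports directly, in the short forms [1] and [last()]. The sketch below, built on made-up markup, shows that a positional predicate on .//h4 is evaluated per parent, while document-wide positions (what (//h4)[1] would give in full XPath) can be obtained by slicing the complete result list.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup with two groups of headings.
SAMPLE = """
<root>
  <div><h4>A</h4><h4>B</h4><h4>C</h4></div>
  <div><h4>D</h4><h4>E</h4></div>
</root>
"""

root = ET.fromstring(SAMPLE)


def texts(elements):
    return [el.text for el in elements]


# Short forms of [position() = 1] and [position() = last()]:
# the position is counted among the <h4> children of each parent.
first_per_parent = texts(root.findall(".//h4[1]"))
last_per_parent = texts(root.findall(".//h4[last()]"))

# Document-wide positions: slice the full list (Python slices are 0-based).
all_h4 = root.findall(".//h4")
first_in_document = texts(all_h4[:1])
second_to_fourth = texts(all_h4[1:4])

print(first_per_parent)   # ['A', 'D']
print(last_per_parent)    # ['C', 'E']
print(first_in_document)  # ['A']
print(second_to_fourth)   # ['B', 'C', 'D']
```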
normalize-space(): Removing extra spaces from text
The normalize-space() function helps you work with messy text data. While it doesn’t find or exclude elements by itself, it plays an important role in text processing by removing unnecessary whitespace.
This is useful when text contains extra spaces, tabs, or newlines, which can break exact comparisons. normalize-space() strips all leading and trailing whitespace and replaces every sequence of whitespace characters within the text with a single space. The result is cleaner, more consistent text that is easier to process further.
normalize-space(" This is an example")
If we use normalize-space() on this text, we get the following result:
This is an example
The leading space has been removed. Extra spaces, tabs, and newlines inside the text would be collapsed in the same way, leaving us with a clean and concise string.
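The behavior of normalize-space() is simple enough to mirror in a few lines of Python, which is handy for cleaning text after extraction. This is a sketch of the same rule (trim the ends, collapse inner runs), restricted to the four characters XPath treats as whitespace; the function name normalize_space is our own.

```python
import re

# XPath treats space, tab, carriage return, and line feed as whitespace.
_XPATH_WS = " \t\r\n"


def normalize_space(value):
    """Mirror of XPath normalize-space(): trim the ends, collapse inner runs."""
    return re.sub(r"[ \t\r\n]+", " ", value.strip(_XPATH_WS))


raw = "  \n\tMacBook   Pro \n"
print(repr(normalize_space(raw)))                    # 'MacBook Pro'
print(repr(normalize_space(" This is an example")))  # 'This is an example'
```

In XPath itself, the same function is often combined with an exact comparison, for example //*[normalize-space(text())='MacBook'], so that stray whitespace no longer prevents a match.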
Advanced Techniques for Text Selection
Now that we’ve covered the primary methods for finding and processing text, let’s dive into more advanced XPath techniques for working with text. We’ll start by exploring how to use regular expressions to find elements.
Using regular expressions for text matching
Regular expressions (regex) are powerful tools for finding and matching patterns in text. They offer more flexibility than the contains() method, allowing you to search for patterns instead of fixed substrings.
For example, let’s consider email addresses. They all have the same structure: a local part, the @ symbol, and a domain (for instance, user@example.com). We can use regex to find all email addresses on a page, even if we don’t know the specific addresses:
//*[matches(text(), '[\w\.-]+@[\w\.-]+')]
Browsers and many automation tools support only XPath 1.0, which lacks the matches() function. To use regex inside XPath, you’ll need an engine that supports XPath 2.0 or higher.
Regular expressions are a powerful tool for text manipulation, allowing you to perform various search and manipulation operations on text data. However, their practical use requires understanding their syntax and basic concepts.
While regular expressions can be used in many different contexts, they are often used in conjunction with programming languages, for example, JavaScript or Python. This is because programming languages provide a convenient way to write and execute regular expression code. If you want to scrape data, you can find something useful in our article about using XPath in Selenium webdriver.
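When the XPath engine has no matches() function, the pattern can be applied on the programming-language side instead. The following is a minimal sketch in Python using the standard re and xml.etree.ElementTree modules; the markup and the example.com and example.org addresses are placeholders, and the pattern is the same loose one used above, not a full email validator.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical markup with placeholder addresses.
SAMPLE = """
<div>
  <p>Contact: sales@example.com</p>
  <p>Support is available 24/7</p>
  <span>jane.doe@example.org</span>
</div>
"""

# Same loose pattern as in the XPath 2.0 example.
EMAIL_PATTERN = re.compile(r"[\w.-]+@[\w.-]+")

root = ET.fromstring(SAMPLE)

found = []
for el in root.iter():
    # el.text is the element's own text; it is None for empty elements.
    match = EMAIL_PATTERN.search(el.text or "")
    if match:
        found.append((el.tag, match.group(0)))

print(found)  # [('p', 'sales@example.com'), ('span', 'jane.doe@example.org')]
```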
Combining different XPath methods for complex selections
As mentioned, the matches function is only supported in XPath 2.0+, so it won’t work in most browsers. Instead, we can combine the previously discussed methods to extract email addresses using the following XPath expression:
//div[contains(text(),'@') and contains(text(),'.') and not(contains(text(),' '))]/text()
This expression uses several criteria to determine whether a string is an email address:
- It must contain the @ symbol.
- It must contain at least one dot (.).
- It must not contain any spaces.
The element will be ignored if any of these conditions is not met.
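The three conditions can be written as a plain Python predicate to check how strict they really are. This is a sketch with invented sample strings and a hypothetical helper name; it suggests the rule is a heuristic, since a string that merely contains “@” and “.” without spaces also passes.

```python
def looks_like_email(text):
    """Mirror of: contains(t, '@') and contains(t, '.') and not(contains(t, ' '))"""
    return "@" in text and "." in text and " " not in text


candidates = [
    "sales@example.com",           # passes all three checks
    "Contact: sales@example.com",  # rejected: contains a space
    "example.com",                 # rejected: no @ symbol
    "v1.0@beta",                   # passes, although it is not an email address
]

print([c for c in candidates if looks_like_email(c)])
# ['sales@example.com', 'v1.0@beta']
```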
Best Practices and Tips
We’ve compiled a few tips and tricks to make your XPath usage even more efficient and productive. Here, we’ll cover common issues and how to solve them.
Choosing the right text selection method based on context
The method you use to select text in XPath should be tailored to the specific document structure and your task requirements. For example, if the text is located within a tag, you can use methods like text(), string(), or normalize-space(), depending on the context.
When writing an XPath expression, it’s crucial to visualize the desired outcome and understand the level at which each condition applies. For instance, you can use multiple predicates on the same level to refine a selection:
//h4[not(contains(a/text(), 'Mac'))][position() = 1]
Here, the predicates are applied sequentially: the first one removes all items containing “Mac”, and the second one picks the first element among the remaining ones. If you instead want an element that is both first in its original position and free of the substring “Mac”, the XPath would be different:
//h4[not(contains(a/text(), 'Mac')) and position() = 1]
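The difference between the two expressions is easiest to see on a list whose first item contains “Mac”. The sketch below emulates both forms in plain Python over made-up markup (the helper names are hypothetical): chained predicates renumber positions after the first filter, while a single predicate with “and” tests each element against its original position.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, invented for this sketch.
SAMPLE = """
<div>
  <h4><a>MacBook</a></h4>
  <h4><a>iPad</a></h4>
  <h4><a>ThinkPad</a></h4>
</div>
"""

root = ET.fromstring(SAMPLE)
h4s = root.findall("h4")  # the <h4> children of one parent, in document order


def link_text(h4):
    link = h4.find("a")
    return (link.text or "") if link is not None else ""


def has_no_mac(h4):
    return "Mac" not in link_text(h4)


# //h4[not(contains(a/text(), 'Mac'))][position() = 1]
# The second predicate counts positions among the elements the first one kept.
filtered = [h4 for h4 in h4s if has_no_mac(h4)]
chained = filtered[:1]

# //h4[not(contains(a/text(), 'Mac')) and position() = 1]
# Both conditions are checked against each element's original position.
combined = [
    h4 for position, h4 in enumerate(h4s, start=1)
    if has_no_mac(h4) and position == 1
]

print([link_text(h4) for h4 in chained])   # ['iPad']
print([link_text(h4) for h4 in combined])  # []
```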
Remember that adapting XPath to your document structure and task requirements is crucial.
Avoiding common mistakes in XPath text selection
Incorrect usage of axes or conditions is a common XPath mistake that can lead to incorrect data selection or unwanted results. Carefully analyze the structure and verify the correctness of your XPath expressions.
For example, the following variation of the previous example would be incorrect:
//h4[not(contains(a/text(), 'Mac'))] and [position() = 1]
This is because the “and” operator is only valid inside an expression, such as within a predicate; it cannot join two separate bracketed predicates. To make it work, we should either combine both conditions within a single predicate or write the two predicates back to back without “and”.
Testing XPath expressions for accuracy and efficiency
There are many tools and resources available to help you create and test XPath expressions. Some browser developer tools, such as Chrome DevTools or Firefox Developer Tools, provide convenient ways to test XPath on real web pages. There are also online tools and libraries for testing and debugging XPath expressions.
We used Chrome DevTools to verify and demonstrate the created XPath expressions. We strongly recommend using this approach before implementing XPath expressions in your scripts. It is much faster than other methods, allowing you to immediately track the required elements and observe the real-time operation of your expressions. You can also get detailed information about each element found.
Conclusion and Takeaways
In this tutorial, we explored various methods for locating elements on a web page by their text content using XPath expressions. XPath is a standard tool for working with XML and HTML documents, making it a versatile tool for web scraping and automation regardless of the technology or platform used.
When automating web actions such as filling out forms, clicking buttons, and collecting information, targeting specific text elements on a page is often necessary. XPath helps to locate and interact with these elements, simplifying the automation process.
Web scraping often requires extracting specific data from web pages, such as titles, prices, and descriptions. XPath allows for precise element targeting on a web page, including selection by text content, which standard CSS selectors do not offer. This can make the data extraction process more efficient and accurate.