8 Best Python Libraries and Tools for Web Scraping in 2024
Python is one of the most common programming languages for scraping, data science, and machine learning. Because it is quite easy to learn, it is suitable even for those who haven’t done programming before.
Python has a number of libraries for data scraping that allow you to simplify script writing. We’ve already written an introductory article on this topic before, but today we’ll focus in more detail on eight of the best Python web scraping libraries.
To understand which of the reviewed libraries is the most popular, we’ve collected the download statistics for some of the most popular Python web scraping libraries over the past three years. By analyzing these trends, we can gain insights into the evolving needs of web scraping developers and how they choose their preferred libraries.
The graph shows that Requests and UrlLib3 have significantly more downloads than other Python libraries for web scraping. This is not solely due to their popularity in web scraping projects but also because they are commonly used in many other Python projects involving interaction with websites. Requests, in particular, is a widely used HTTP library for Python and is often used for making API requests and web-based applications. UrlLib3, on the other hand, is a powerful library for handling URLs and is often used for parsing, reading, and opening URLs.
It is interesting to note that the more popular a library is, the simpler it is and the less functionality it has. This could be because simpler libraries offer a lower barrier for entry, allowing developers with less experience to get started quickly. Additionally, the more complex libraries are geared towards larger projects and require more technical knowledge to use effectively. This means those who don’t need the extra features or power these tools offer instead opt for simpler solutions.
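If you want to reproduce a comparison like this yourself, the public pypistats.org service exposes PyPI download counts as JSON. Below is a rough sketch, assuming its /api/packages/&lt;name&gt;/recent endpoint and the last_month field it returns; check the service's documentation before relying on the exact schema.
import requests

libraries = ["urllib3", "requests", "beautifulsoup4", "lxml", "selenium", "scrapy", "pyppeteer"]

for name in libraries:
    resp = requests.get(f"https://pypistats.org/api/packages/{name}/recent")
    resp.raise_for_status()
    # The response structure is assumed here: {"data": {"last_month": ...}, ...}
    print(name, resp.json()["data"]["last_month"])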
TOP 8 Python web scraping libraries
Here are the best Python libraries and frameworks for web scraping to help you extract data from websites.
UrlLib3
UrlLib3 is a Python library for making HTTP requests. It has fairly extensive functionality and supports a number of features for working with requests.
Advantages
UrlLib3 is a powerful HTTP client library. It has many advantages, such as:
Thread safety.
Client-side SSL/TLS verification.
Connection pooling.
Automatic request retries.
HTTP and SOCKS proxy support.
Because of this, the library quickly became one of the most popular and widely used in Python.
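To illustrate several of these features at once, here is a minimal sketch that combines retries, timeouts, and proxy routing; the proxy address is a placeholder, not a real endpoint.
import urllib3
from urllib3.util import Retry, Timeout

# Pool manager with automatic retries and per-request timeouts
http = urllib3.PoolManager(
    retries=Retry(total=3, backoff_factor=0.5),
    timeout=Timeout(connect=2.0, read=5.0),
)
resp = http.request("GET", "https://example.com")
print(resp.status)

# The same request routed through an HTTP proxy (placeholder address)
proxy = urllib3.ProxyManager("http://proxy.example.com:3128")
# resp = proxy.request("GET", "https://example.com")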
Disadvantages
Unfortunately, despite the advantages, UrlLib3 has a number of disadvantages:
It cannot parse the data it retrieves, so it is not suitable for scraping as a stand-alone library.
It is less user-friendly than the Requests library.
The urllib3 connection pool makes it difficult to work with cookies because it is not a stateful client.
Usage
The UrlLib3 library is used to execute requests and to send and receive data. It also works well with proxies.
To understand how to use it for scraping, let’s write a simple scraper.
Installing the UrlLib3 library
Let’s install the library by running the following command on the command line:
pip install urllib3
If the installation was successful, the following message appears:
C:\Users\user>pip install urllib3
Collecting urllib3
Downloading urllib3-1.26.15-py2.py3-none-any.whl (140 kB)
---------------------------------------- 140.9/140.9 kB 832.1 kB/s eta 0:00:00
Installing collected packages: urllib3
Successfully installed urllib3-1.26.15
If you encounter difficulties, make sure that the Python interpreter is installed. To do this, at the command line, type the command:
python -V
If the interpreter version is displayed, then this is not the problem. In this case, you should follow the hints from the error.
Using the UrlLib3 library
To use the UrlLib3 library, it must be included in a script file. To do this, create a file with any name and extension *.py and add the first line of code:
import urllib3
Now that the library is installed and imported, let's run a request to get the HTML code of a web page. For testing, we will send requests to https://example.com/.
Create a connection manager and store it in the http variable:
http = urllib3.PoolManager()
This connection manager makes it possible to connect to the site and retrieve data. The request itself takes a single line and consists of two parts: the request method and the URL to request:
resp = http.request('GET', 'https://example.com')
The result of the request is saved in the resp variable. To see it, let's print it to the screen:
print(resp.data)
This command prints the result exactly as it was received, as raw bytes on a single line:
D:\scripts>urllib3_test.py
b'<!doctype html>\n<html>\n<head>\n <title>Example Domain</title>\n\n <meta charset="utf-8" />\n <meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n <meta name="viewport" content="width=device-width, initial-scale=1" />\n <style type="text/css">\n body {\n background-color: #f0f0f2;\n margin: 0;\n padding: 0;\n font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n \n }\n div {\n width: 600px;\n margin: 5em auto;\n padding: 2em;\n background-color: #fdfdff;\n border-radius: 0.5em;\n box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n }\n a:link, a:visited {\n color: #38488f;\n text-decoration: none;\n }\n @media (max-width: 700px) {\n div {\n margin: 0 auto;\n width: auto;\n }\n }\n </style> \n</head>\n\n<body>\n<div>\n <h1>Example Domain</h1>\n <p>This domain is for use in illustrative examples in documents. You may use this\n domain in literature without prior coordination or asking for permission.</p>\n <p><a href="https://www.iana.org/domains/example">More information...</a></p>\n</div>\n</body>\n</html>\n'
To output the resulting code of the page in a convenient form, it must be decoded:
print(resp.data.decode('utf-8'))
The result of such a query:
D:\scripts>urllib3_test.py
<!doctype html>
<html>
<head>
<title>Example Domain</title>
…
@media (max-width: 700px) {
div {
margin: 0 auto;
width: auto;
}
}
</style>
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.</p>
<p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>
The script code:
import urllib3
http = urllib3.PoolManager()
resp = http.request('GET', 'https://example.com')
print(resp.data.decode('utf-8'))
Next, let's process the resulting code to extract only the useful data. This can be done either with dedicated parsing libraries or with regular expressions.
Since regular expressions are more universal, and we will cover the dedicated libraries later in this article, let's look at data processing with regular expressions first.
Data processing with regular expressions
Regular expressions in Python are handled by the built-in re module. It is part of the standard library, so it does not need to be installed via pip install; it only needs to be imported into the script file:
import re
Suppose we need to find all the headings. On this page they are in the h1 tag. To find them in the code, you can use the following:
re.findall(r'<h1>(.+?)</h1>', data)
Script result:
['Example Domain']
Full script code:
import urllib3
import re
http = urllib3.PoolManager()
resp = http.request('GET', 'https://example.com')
data = resp.data.decode('utf-8')
print(re.findall(r'<h1>(.+?)</h1>', data))
To get data from other tags or, for example, the content of tags, regular expressions can also be used.
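For example, continuing the same script, you can pull the link target and the paragraph text from the page in the same way (the re.S flag lets the pattern match across line breaks):
links = re.findall(r'<a href="(.+?)"', data)
paragraphs = re.findall(r'<p>(.+?)</p>', data, re.S)
print(links)
print(paragraphs)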
Requests
The next most popular library is Requests. Like the previous one, it is used for executing HTTP requests, but it is simpler and easier to use. It is not part of the standard library, but it is so widely used that it is often already present in your environment.
Advantages
The Requests library, like UrlLib3, is designed to perform various queries and also has a number of advantages:
It is built on top of urllib3 (which in turn uses http.client) and therefore inherits their capabilities.
Significantly easier to use than UrlLib3.
Supports HEAD, POST, PUT, PATCH and DELETE requests.
Automatically adds query strings to addresses and encodes POST data.
In addition to the advantages above, the library is user-friendly, has a very large community, and is well documented with clear examples.
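As a quick illustration of those conveniences, query parameters and POST data are encoded for you automatically; here is a small sketch using httpbin.org purely as a test endpoint.
import requests

# Query-string parameters are appended to the URL automatically
resp = requests.get("https://httpbin.org/get", params={"q": "web scraping", "page": 2})
print(resp.url)

# POST data is form-encoded for you
resp = requests.post("https://httpbin.org/post", data={"name": "test"})
print(resp.status_code)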
Disadvantages
Unfortunately, this library also has disadvantages:
It cannot parse data, so it is not very convenient to use for scraping as a stand-alone solution.
It is a synchronous library, meaning the program blocks while it waits for each request to complete.
However, both the synchronicity issue and the lack of data processing can be solved by adding extra libraries, as shown below.
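For instance, one simple way around the synchronous behavior is to run several requests in threads using only the standard library; a minimal sketch:
import requests
from concurrent.futures import ThreadPoolExecutor

urls = ["https://example.com", "https://quotes.toscrape.com/"]

# Fetch all pages concurrently instead of one after another
with ThreadPoolExecutor(max_workers=5) as pool:
    pages = list(pool.map(lambda u: requests.get(u).text, urls))

print([len(page) for page in pages])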
Usage
The Requests library is designed to simplify Python queries. Therefore, it is well suited for scraping when you want to get the code of the whole page.
Installing the Requests Library
If it is not already installed in your environment, install it with the command:
pip install requests
Now the library is ready for use.
Using the Requests library
This HTTP library, like the previous one, supports the two main request types: GET and POST. GET requests are used to retrieve data from a site, and POST requests are used to send data. Let's look at an example.
To do that, create a file with the *.py extension and import the requests library:
import requests
First, let's get the data using the GET method. Unlike the previous library, a single line is enough here:
print(requests.get('https://example.com').text)
Execution result:
D:\scripts>requests_test.py
<!doctype html>
<html>
<head>
<title>Example Domain</title>
<style>
…
</style>
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.</p>
<p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>
With this library, there is no need to decode or encode the result: the .text attribute already returns it in a readable form. Script code:
import requests
print(requests.get('https://example.com').text)
Unfortunately, this library only lets you send requests and receive raw data, so you must process the result yourself afterward. As before, we will use the regular expression library, since it is a universal tool.
Data processing with regular expressions
We already extracted the heading from the page, so this time let's get the link, which is stored in the <a> tag:
re.findall(r'<a\shref=\S(.+?)\S>', data)
Here we used the special character classes \S and \s. \S matches a single non-whitespace character (the quotation mark in this case), and \s matches a single whitespace character.
As a result of the script execution, we got a link, which was stored in the tag <a>:
D:\scripts>requests_test.py
['https://www.iana.org/domains/example']
Script code:
import requests
import re
data = requests.get('https://example.com').text
print(re.findall(r'<a\shref=\S(.+?)\S>', data))
As you can see from the code, the requests library is very convenient and functional. However, using regular expressions is not always comfortable, and connecting additional libraries for using CSS selectors can be quite difficult and resource-intensive.
In that case, you can use the Requests library and special tools for scraping, which take on the functions that special libraries usually perform. Let’s take our web scraping API as an example.
To use it, you must register and copy your account’s API key. You’ll need it a little later. To work with the resource, we need the POST method:
import requests
import json

url = "https://api.hasdata.com//scrape"

payload = json.dumps({
    "extract_rules": {
        "title": "h1"
    },
    "url": "example.com"
})

headers = {
    'x-api-key': 'YOUR-API-KEY',
    'Content-Type': 'application/json'
}

response = requests.request("POST", url, headers=headers, data=payload)
print(response.text)
It is enough to configure your rules in extract_rules, specify the necessary URL, and put in your API key to make it work.
HasData
HasData SDK is a library that provides a set of pre-written functions and methods that can be easily integrated into your Python code, allowing you to access and interact with our web scraping API. It is an optimal solution for protecting yourself from blocks, solving captchas, and issues with obtaining JavaScript data.
The HasData Python SDK simplifies web scraping by handling complex tasks like browser rendering, proxy management, and CAPTCHA avoidance, allowing you to focus on extracting the data you need.
The Google SERP API library for Python is a comprehensive solution that allows developers to integrate Google Search Engine Results Page (SERP) data. It provides a simplified way to get organic search results, snippets, knowledge graph data, and other data from the Google search engine.
Advantages
Since this library works with web scraping API, it has several advantages:
It has a simple syntax that even beginners can use.
It allows the use of proxies (residential and datacenter).
It scrapes dynamic websites that use JavaScript.
It parses data well from pages with any structure.
The library has many additional settings and functions, which are described in detail in the official documentation.
Disadvantages
It has one drawback:
- The scraped data is in JSON format, so additional processing is required to work with it.
This drawback is easily solved by using Python's built-in json library.
Usage
HasData library is designed for scraping data. It helps to scrape any website, even if your IP is blocked. It is suitable for both executing requests and obtaining specific data from a page.
Installing HasData library
You can install the library via pip install:
pip install scrapeit-cloud
We also recommend installing the libraries for scraping Google Maps and Google SERP, which we will also consider:
pip install google-serp-api
pip install sc-google-maps-api
To use these libraries, we need an API key, which you can find in your account after sign-up at HasData. After that, you can start creating a scraper.
Using HasData Library
Create a file with *.py extension and include the libraries:
from scrapeit_cloud import ScrapeitCloudClient
import json
Now put your API key:
client = ScrapeitCloudClient(api_key='YOUR_API_KEY')
Let’s get the code of the page and display the result:
response = client.scrape(
    params={
        "url": "https://quotes.toscrape.com/"
    }
)
print(response.text)
Check the result:
D:\scripts>scrape_test.py
{"status":"ok","scrapingResult":{"content":"<!DOCTYPE html><html lang=\"en\"><head>\n\t<meta charset=\"UTF-8\">\n\t<title>Quotes to Scrape</title>\n <link rel=\"stylesheet\" href=\"/static/bootstrap.min.css\">\n <link rel=\"stylesheet\" href=\"/static/main.css\">\n</head>\n<body>\n <div class=\"container\">\n <div class=\"row header-box\">\n <div class=\"col-md-8\">\n <h1>\n <a href=\"/\" style=\"text-decoration: none\">Quotes to Scrape</a>\n </h1>\n </div>\n <div class=\"col-md-4\">\n <p>\n \n <a href=\"/login\">Login</a>\n \n </p>\n … </>}}
As mentioned before, the library returns JSON, and if you look at the structure, the page code itself is stored in the "content" field. Let's import the json library and print only the contents of this field:
data = json.loads(response.text)
result = data["scrapingResult"]["content"]
print(result)
The result:
D:\scripts>scrape_test.py
<!DOCTYPE html><html lang="en"><head>
<meta charset="UTF-8">
<title>Quotes to Scrape</title>
<link rel="stylesheet" href="/static/bootstrap.min.css">
<link rel="stylesheet" href="/static/main.css">
</head>
<body>
<div class="container">
<div class="row header-box">
...
</html>
Script code:
from scrapeit_cloud import ScrapeitCloudClient
import json

client = ScrapeitCloudClient(api_key='YOUR-API-KEY')

response = client.scrape(
    params={
        "url": "https://quotes.toscrape.com/"
    }
)

data = json.loads(response.text)
result = data["scrapingResult"]["content"]
print(result)
Now the data is exactly as we wanted it.
Data processing with HasData Library
There are additional libraries for scraping resources like Google Maps or Google SERPs. Let’s look at examples of how to use them.
The basics of working with them are the same as with the HasData library; the only difference is that they take additional parameters.
For example, for scraping Google search results, you can use the following:
from google_serp_api import ScrapeitCloudClient
import json

client = ScrapeitCloudClient(api_key='YOUR_API_KEY')

response = client.scrape(
    params={
        "keyword": "pizza",
        "country": "US",
        "num_results": 10,
        "domain": "com"
    }
)

data = json.loads(response.text)
print(data)
To get data from Google Maps:
from sc_google_maps_api import ScrapeitCloudClient
import json

client = ScrapeitCloudClient(api_key='YOUR_API_KEY')

response = client.scrape(
    params={
        "keyword": "plumber in new york",
        "country": "US",
        "domain": "com"
    }
)

data = json.loads(response.text)
print(data)
Result:
D:\scripts>scrape_test.py
{'status': 'ok', 'scrapingResult': {'pagination': {'start': 0}, 'locals': [{'position': 1, 'title': 'RR Plumbing Roto-Rooter', 'phone': '+1 212-687-1215', 'address': '450 7th Ave Ste B, New York, NY 10123, United States', 'website': 'https://www.rotorooter.com/manhattan/', 'workingHours': {'timezone': 'America/New_York', 'days': [{'day': 'Wednesday', 'time': 'Open 24 hours'}
With these libraries, scraping will be easy enough even for beginners.
LXML
Lxml is a library for parsing and processing HTML and XML structures. It is fast and functional, but it cannot make GET requests by itself, so you will need an additional library, such as Requests, to execute them.
Advantages
Among the advantages of Lxml are the following:
Speed of operation.
It copes well even with poorly structured pages.
Support for XSLT transformations.
Works equally well with CSS selectors as with XPath.
As a result, the lxml library builds very convenient data trees.
Disadvantages
True, the disadvantages are also present:
It is a complicated library and not well suited for beginners.
High memory usage.
Lack of a large number of examples and active community.
Because of this, the library is usually used for specific tasks.
Usage
As mentioned above, lxml is not always the first choice. Because of its shortcomings, it is usually used when you need to parse XML documents or sites with a poor structure, since in those cases it copes with the task better than other libraries.
Installing the Lxml Library
The installation is done through pip install:
pip install lxml
After running the command, you can start working with the installed library.
Using the Lxml library
To understand how this library works, we need to retrieve data from the site and process it. To execute the request, we use the Requests library:
import requests
data = requests.get('https://quotes.toscrape.com')
To work with HTML pages, we import the html module of the lxml library:
from lxml import html
Parse the structure of the page into a variable:
tree = html.fromstring(data.content)
We can display this now, but then we get all the code of the page. So let’s move on to processing it further.
Data processing with Lxml
We can use XPath or CSS selectors to select specific data. Since CSS selectors require importing the additional lxml module cssselect, let's select the quotes using XPath and display the result:
quotes = tree.xpath('//div[@class="quote"]/span[@class="text"]/text()')
print(quotes)
The result:
D:\scripts>lxml_test.py
['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", '“Try not to become a man of success. Rather become a man of value.”', '“It is better to be hated for what you are than to be loved for what you are not.”', "“I have not failed. I've just found 10,000 ways that won't work.”", "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", '“A day without sunshine is like, you know, night.”']
To print the elements one per line, let's change the output:
for quote in quotes:
    print(quote)
The output is now executed on a new line. All script code:
import requests
from lxml import html

data = requests.get('https://quotes.toscrape.com')
tree = html.fromstring(data.content)
quotes = tree.xpath('//div[@class="quote"]/span[@class="text"]/text()')

for quote in quotes:
    print(quote)
Thus, we can conclude that lxml handles its functions quite well.
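As mentioned above, lxml can also use CSS selectors once the optional cssselect package is installed (pip install cssselect); here is a small sketch selecting the same quotes that way:
import requests
from lxml import html

data = requests.get('https://quotes.toscrape.com')
tree = html.fromstring(data.content)

# cssselect() requires the cssselect package to be installed
for quote in tree.cssselect('div.quote span.text'):
    print(quote.text)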
BeautifulSoup4
BeautifulSoup4, or simply BS4, is a library that was created for resource parsing. While we could use the previous two libraries to get data using queries, BS4 is used to process the resulting data. Both HTML and XML structures are suitable for processing.
This library cannot be used on its own. It needs an additional library for executing queries. In this article, we will use the Requests library, but you can use any other one.
Advantages
BS4 is a wonderful library that allows you to process web pages and retrieve data quickly and easily. It has several advantages:
It can work with both HTML and XML structures.
Beginner-friendly. Its simple syntax makes it easy to learn, even for those new to programming.
It has a number of search possibilities: by name, by id, by attributes (e.g. classes), and by text.
It is not resource-intensive.
Has well-documented features.
BeautifulSoup can handle both small and large pages.
If you’re interested in using beautiful soup for web scraping, we have a detailed tutorial that can help you get started.
Disadvantages
In connection with the functionality and tasks of the library, you can also highlight a number of disadvantages:
It works only with static pages. Unfortunately, BS4 does not drive a headless browser, so it cannot handle dynamic sites on its own.
The library is not designed to make requests, so it cannot be used for scraping on its own.
On pages with a confusing structure, it can work inaccurately.
Usage
The BeautifulSoup library is used to process XML or HTML code you have already received. It does a great job of searching through and processing that data.
However, to work properly, it must be paired with a request library: UrlLib3, Requests, or http.client.
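And if the target page is rendered with JavaScript, a common workaround is to let a headless browser load it and then hand the resulting HTML to BS4. Below is a minimal sketch using Selenium (covered later in this article); it assumes Selenium 4.6+, which can locate the browser driver on its own, and uses the JavaScript-rendered variant of the same test site.
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # Selenium 4.6+ resolves a matching driver automatically
driver.get("https://quotes.toscrape.com/js/")  # quotes here are injected by JavaScript
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

print(soup.find("span", class_="text").text)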
Installing the BeautifulSoup4 Library
The library is not preinstalled, so it can be installed using pip install:
pip install beautifulsoup4
After installation, a message about successful operation appears:
D:\scripts>pip install beautifulsoup4
Collecting beautifulsoup4
Downloading beautifulsoup4-4.12.0-py3-none-any.whl (132 kB)
---------------------------------------- 132.2/132.2 kB 865.1 kB/s eta 0:00:00
Requirement already satisfied: soupsieve>1.2 in c:\users\user\appdata\local\programs\python\python310\lib\site-packages (from beautifulsoup4) (2.3.2.post1)
Installing collected packages: beautifulsoup4
Successfully installed beautifulsoup4-4.12.0
If all goes well, let’s go to the next part - applying the library in practice.
Using the BeautifulSoup4 library
First, let's create a file with the *.py extension, import the necessary libraries, and get the code of the page. We have already done this before, so we won't look into it in detail:
import requests
from bs4 import BeautifulSoup
data = requests.get('https://example.com')
Now we need to process this code using the BS4 library:
soup = BeautifulSoup(data.text, "html.parser")
To make sure all is well, print the soup object with print(soup) and look at the result:
D:\scripts>bs4_test.py
<!DOCTYPE html>
<html>
<head>
<title>Example Domain</title>
<style type="text/css">
…
</style>
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.</p>
<p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>
On the surface, the result looks no different from simply executing a request and printing the content, but there is an important difference. By processing the code with the BS4 library, we can quickly get at the data without using regular expressions.
Data processing with BS4
The easiest way is to search for data by tags. Let’s find the headings as well as the contents of the <a> tag. That is, we will do the same thing that we did with regular expressions, only using the BS4 library:
titles = soup.find('h1').text
href = soup.find('a').get("href")
If there were several headings or links, we could use soup.find_all instead of soup.find. In addition, BS4 allows searching by class. A simple test page is not enough for this example, so let's use https://quotes.toscrape.com instead.
Quotes are stored in the <span> tag and have the class “text”. Let’s set these parameters in our code:
soup.find_all('span', class_='text')
However, this code will output all quotes along with tags:
D:\scripts>bs4_test.py
[<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>, <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>, <span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>, <span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>, <span class="text" itemprop="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span>, <span class="text" itemprop="text">“Try not to become a man of success. Rather become a man of value.”</span>, <span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for what you are not.”</span>, <span class="text" itemprop="text">“I have not failed. I've just found 10,000 ways that won't work.”</span>, <span class="text" itemprop="text">“A woman is like a tea bag; you never know how strong it is until it's in hot water.”</span>, <span class="text" itemprop="text">“A day without sunshine is like, you know, night.”</span>]
To fix this, let's go through the results and print only their text content:
for result in text:
    print(result.text)
The result of the script:
D:\scripts>bs4_test.py
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
“Try not to become a man of success. Rather become a man of value.”
“It is better to be hated for what you are than to be loved for what you are not.”
“I have not failed. I've just found 10,000 ways that won't work.”
“A woman is like a tea bag; you never know how strong it is until it's in hot water.”
“A day without sunshine is like, you know, night.”
Script code:
import requests
from bs4 import BeautifulSoup

data = requests.get('https://quotes.toscrape.com')
soup = BeautifulSoup(data.text, "html.parser")
text = soup.find_all('span', class_='text')

for result in text:
    print(result.text)
This flexibility and versatility have made the BS4 library so popular.
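Besides find and find_all, BS4 also accepts CSS selectors through the select method, which is often more compact for nested elements; a short sketch on the same page:
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('https://quotes.toscrape.com').text, "html.parser")

# CSS selector equivalent of soup.find_all('span', class_='text')
for quote in soup.select("div.quote span.text"):
    print(quote.text)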
Selenium
Selenium is a well-known Python 3 library that is used for both testing and scraping. It is good for the simulation of the behavior of a real user, which helps to avoid blocking. When using Selenium, a webdriver is launched that simulates the behavior of a browser. For a more in-depth exploration, check out our comprehensive guide to web scraping with Selenium Python.
Advantages
Selenium is a balanced library suitable for beginners and pros alike. It has the following advantages:
It supports working with headless browsers and can scrape dynamic data.
Supports a variety of ways to search for data, whether by attributes or using XPath.
Supports the ability to interact with elements on the page (such as buttons or input fields in web applications).
It provides analogous libraries for all popular programming languages.
Besides all of the above, the library is fairly simple and pleasant to work with, so getting started with it does not cause much difficulty.
Disadvantages
The disadvantages, in fact, are not so many:
Because the web driver is called to execute the script, it is pretty resource-intensive.
Scraping and other actions are performed only after the page is fully loaded, and that can take quite a long time.
Otherwise, the library has no such glaring disadvantages.
Usage
Selenium is used for automated testing and is great for scraping sites that use JavaScript. Using a headless browser allows you to collect data from dynamic websites successfully.
It is also great for cases where you need to perform some actions before scraping, such as clicking a button or logging in.
Installing Selenium
Like other libraries, Selenium can be installed using pip install:
pip install selenium
After that, it can be imported into the code and used as any other library.
Selenium also needs a WebDriver to control the browser. Download the ChromeDriver version that matches your Chrome browser and unzip it to the C drive. We will need it later.
Using Selenium
Before you start using Selenium, it is worth understanding in what form some web pages are stored and how developers resist scraping.
The main way sites resist scraping is by adding JavaScript code that runs after the page loads and pulls in data dynamically. In this case, the previously considered libraries would return an empty response, or a response containing the JavaScript code but not the page content.
In the case of Selenium, the request will return the content because WebDriver will wait until the page is completely loaded, and only then will it perform the necessary actions.
Create a file with the *.py extension and import Selenium WebDriver:
from selenium import webdriver
To make it work, let’s specify the path where we unzipped the WebDriver:
DRIVER_PATH = 'C:\chromedriver.exe'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
Now let’s get the data from the page:
driver.get("https://quotes.toscrape.com/")
Once the page has been received, you can work with it and filter the necessary information. For the script to finish its work correctly, you must close WebDriver at the end:
driver.close()
Script code:
from selenium import webdriver
DRIVER_PATH = 'C:\chromedriver.exe'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get("https://quotes.toscrape.com/")
driver.close()
Data processing with Selenium
Let’s extract the quotes again. To do this, we will need the By module of the Selenium library. Let’s import it:
from selenium.webdriver.common.by import By
Now we can use the imported module to search for and process the required data. The easiest way to select the quotes is with CSS selectors.
In addition to CSS selectors, Selenium also supports the following methods:
ID. Search by element id.
NAME. Search by element name.
XPATH. Search by XPath.
LINK_TEXT. Search by link text.
PARTIAL_LINK_TEXT. Search by partial link text.
TAG_NAME. Search by tag name.
CLASS_NAME. Search by class name.
CSS_SELECTOR. Search by CSS selector.
Find quotes by the CSS selector span.text and store them in the text variable:
text = driver.find_elements(By.CSS_SELECTOR, "span.text")
Loop over the results, print them, and make sure everything is working correctly:
D:\scripts>selenium_test.py
D:\scripts\selenium_test.py:5: DeprecationWarning: executable_path has been deprecated, please pass in a Service object
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
DevTools listening on ws://127.0.0.1:52684/devtools/browser/fe204bdd-2d10-47ca-999e-ddf22d5c3854
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
“Try not to become a man of success. Rather become a man of value.”
“It is better to be hated for what you are than to be loved for what you are not.”
“I have not failed. I've just found 10,000 ways that won't work.”
“A woman is like a tea bag; you never know how strong it is until it's in hot water.”
“A day without sunshine is like, you know, night.”
Script code:
from selenium import webdriver
from selenium.webdriver.common.by import By

DRIVER_PATH = 'C:\chromedriver.exe'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get("https://quotes.toscrape.com/")

text = driver.find_elements(By.CSS_SELECTOR, "span.text")
for res in text:
    print(res.text)

driver.close()
Thus, Selenium is a feature-rich library that reduces the risk of being blocked during scraping. It is also useful for data scientists and for writing bots.
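Since Selenium can also interact with the page before scraping, here is a short sketch that clicks the site's "Next" pagination link and then collects the quotes from the second page; it assumes Selenium 4.6+, which can find a matching browser driver on its own:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # Selenium 4.6+ locates the driver automatically
driver.get("https://quotes.toscrape.com/")

# Click the "Next" pagination link, then collect the quotes on page two
driver.find_element(By.PARTIAL_LINK_TEXT, "Next").click()
for quote in driver.find_elements(By.CSS_SELECTOR, "span.text"):
    print(quote.text)

driver.close()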
Scrapy
Scrapy is an open-source web scraping framework ideal for large or scalable projects. It has built-in functionality for creating scrapers, automatically adjusts the scraping speed, and allows you to save output data in JSON, CSV, and XML formats using built-in tools. In addition, Scrapy performs queries asynchronously so that you can run them at a faster speed. Read our comprehensive guide to learn more about web scraping with Scrapy.
Unlike the previous options, scrapers with Scrapy are best built using its own built-in project tooling.
Advantages
This framework has the following advantages:
Suitable for any project, including scalable ones.
Can create separate web crawlers within the same project, each responsible for its own tasks.
Supports creating entire projects with common and separate settings, upload rules, exceptions, and scripts.
It is asynchronous.
Unfortunately, for beginners, this web scraping framework is quite difficult.
Disadvantages
The disadvantages of Scrapy include:
High complexity of work for beginners.
High resource consumption.
Not suitable for scraping dynamic web pages.
Because of this, Scrapy is used quite rarely and usually for those projects that need to scale.
Usage
Scrapy is a specialized scraping framework for web crawling. It is great for large and scalable projects where there is a division by scraping direction. Because of its ability to create multiple web crawlers within a single project, Scrapy is popular even though it is difficult to use.
Installing Scrapy
To install, use pip install:
pip install scrapy
After that, you can start working with Scrapy.
Using Scrapy
To proceed to create a scraper, enter a command at the command line to create a new project:
scrapy startproject scrapy_test
If everything is done correctly, the following will appear:
D:\scripts>scrapy startproject scrapy_test
New Scrapy project 'scrapy_test', using template directory 'C:\Users\user\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\templates\project', created in:
D:\scripts\scrapy_test
You can start your first spider with:
cd scrapy_test
scrapy genspider example example.com
Now go to the project directory with the command:
cd scrapy_test
The following files were automatically created in the directory:
items.py. This file describes the item classes, i.e. the data containers the spider fills in.
pipelines.py. Describes the actions performed when the spider opens or closes and how the scraped data is saved.
settings.py. This file contains the user's settings for the spider.
spiders. The folder where the spiders of the project are stored.
Then we create the basic spider in this project:
scrapy genspider quotes quotes.toscrape.com
Code of the automatically generated file:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        pass
Data processing with Scrapy
The def parse(self, response) block describes the elements that need to be scraped. Scrapy supports both CSS selectors and XPath. Let’s do it using XPath:
item = DemoItem()
item["text"] = response.xpath("//span[@class='text']/text()").extract()
return item
Full script code:
import scrapy
from scrapy_test.items import DemoItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        item = DemoItem()
        item["text"] = response.xpath("//span[@class='text']/text()").extract()
        return item
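The spider references DemoItem, which Scrapy expects to find in the project's items.py. Here is a minimal sketch of that class; the field name simply mirrors the one used in the spider:
# items.py
import scrapy

class DemoItem(scrapy.Item):
    text = scrapy.Field()
The spider can then be started from the project directory, with the output file name here being just an example:
scrapy crawl quotes -o quotes.json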
Pyppeteer
The last library on the list is the Python version of the famous NodeJS library - Puppeteer. Like Selenium, Pyppeteer allows us to simulate user behavior and work with JavaScript.
Advantages
This library has a number of advantages:
It has an analog in NodeJS, which is very popular.
Has a very active community and well-written documentation.
Great for scraping dynamic pages thanks to the use of a headless browser.
It will be very convenient to work with for those who are used to using it on NodeJS.
Disadvantages
Unfortunately, the disadvantages are also present:
It is originally a NodeJS library and therefore does not have very many examples in Python.
Quite complex for beginners.
Resource-intensive.
However, in spite of all the disadvantages, it deserves attention.
Usage
Good for large projects. It is also worth using this library if you need to scrape dynamic data.
Thanks to its great functionality, it is suitable for most tasks, for example, data extraction or parsing HTML documents.
Installing Pyppeteer
Install library via pip install:
pip install pyppeteer
The asyncio library is usually used together with Pyppeteer. It is part of the Python standard library, so it does not need to be installed separately; it only needs to be imported.
Using Pyppeteer
First, create a file with the *.py extension and import the necessary libraries:
import asyncio
from pyppeteer import launch
Let's add an asynchronous function that gets the page code and prints it:
async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('https://quotes.toscrape.com')
    html = await page.content()
    print(html)
    await browser.close()
Let's run the code and see the result (as in the full script below, the file must end with asyncio.get_event_loop().run_until_complete(main()) to actually start the coroutine). On first use, Chromium may be downloaded:
D:\scripts\pyppeteer_test.py:12: DeprecationWarning: There is no current event loop
asyncio.get_event_loop().run_until_complete(main())
[INFO] Starting Chromium download.
4%|██
This will execute the code, which will display all the code of the page:
D:\scripts>pyppeteer_test.py
D:\scripts\pyppeteer_test.py:12: DeprecationWarning: There is no current event loop
asyncio.get_event_loop().run_until_complete(main())
[INFO] Starting Chromium download.
100%|████████████████████████████████████████████████████████████████████████████████| 137M/137M [06:26<00:00, 354kb/s]
[INFO] Beginning extraction
[INFO] Chromium extracted to: C:\Users\user\AppData\Local\pyppeteer\pyppeteer\local-chromium\588429
<!DOCTYPE html><html lang="en"><head>
<meta charset="UTF-8">
<title>Quotes to Scrape</title>
<link rel="stylesheet" href="/static/bootstrap.min.css">
<link rel="stylesheet" href="/static/main.css">
</head>
<body>
<div class="container">
<div class="row header-box">
<div class="col-md-8">
<h1>
<a href="/">Quotes to Scrape</a>
</h1>
…
Data processing with Pyppeteer
You can also use CSS selectors to select specific data. Let’s select quotations only:
text = await page.querySelectorAll("span.text")
for t in text:
    qout = await t.getProperty("textContent")
    print(await qout.jsonValue())
Execution result:
D:\scripts>pyppeteer_test.py
D:\scripts\pyppeteer_test.py:15: DeprecationWarning: There is no current event loop
asyncio.get_event_loop().run_until_complete(main())
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
“Try not to become a man of success. Rather become a man of value.”
“It is better to be hated for what you are than to be loved for what you are not.”
“I have not failed. I've just found 10,000 ways that won't work.”
“A woman is like a tea bag; you never know how strong it is until it's in hot water.”
“A day without sunshine is like, you know, night.”
Full script code:
import asyncio
from pyppeteer import launch


async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('https://quotes.toscrape.com')
    html = await page.content()

    text = await page.querySelectorAll("span.text")
    for t in text:
        qout = await t.getProperty("textContent")
        print(await qout.jsonValue())

    await browser.close()


asyncio.get_event_loop().run_until_complete(main())
Thus, Pyppeteer is a good option if you need to simulate the behavior of real users or extract data in a large volume quite quickly.
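For pages where the content only appears after scripts run, it can also help to wait explicitly for the target selector before querying it. A small sketch of the relevant lines inside the same main() coroutine, assuming the JavaScript-rendered variant of the test site:
await page.goto('https://quotes.toscrape.com/js/')
await page.waitForSelector('span.text')  # wait until the quotes are injected into the DOM
text = await page.querySelectorAll('span.text')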
Conclusion and Takeaways
Let’s give a comparative table to make it easier to choose the right tool:
| Library | User-Friendly | Performance | JavaScript Support |
| --- | --- | --- | --- |
| urllib3 | Easy to use | Fast for small requests but slower for large-scale scraping | No |
| Requests | Easy to use | Fast for small requests but slower for large-scale scraping | No |
| lxml | Easy to use | Very fast for parsing and manipulating XML and HTML | No |
| HasData | Easy to use | Very fast for scraping web resources | Yes |
| BeautifulSoup4 | Easy to use | Slower than some other libraries for large-scale scraping, but still fast for most use cases | No |
| Selenium | Can be complex for beginners | Relatively slower due to the overhead of web driver usage, but provides full browser emulation | Yes |
| Scrapy | Can be complex for beginners | Very fast and scalable for large-scale scraping projects | No |
| Pyppeteer | Can be complex for beginners, requires some knowledge of JavaScript and browser automation | Relatively slower due to the overhead of browser usage, but provides full browser emulation | Yes |
If you are a beginner, take a closer look at dedicated scraping libraries and APIs that help you avoid blocking. This will help you better understand how to write scrapers.
If you want to learn how to write scrapers on your own, Python is ideal - it’s quite easy to learn and has a wide range of tools.
Scraping in Python is a fairly easy task, whether you use a ready-made scraper builder or write your own code with one of the popular scraping libraries.
FAQ
Is Python good for web scraping?
Python is well-suited for web scraping and data processing in general. It has many functions and libraries aimed at obtaining and processing large amounts of data, and it is very convenient for writing small, functional scripts.
Is Scrapy better than BeautifulSoup?
Regarding Scrapy and BeautifulSoup, they are both useful for different tasks. Scrapy is more suitable for scalable or large projects, while BeautifulSoup is more suitable for projects where data does not need to be obtained from a website but rather searched and processed among data that has already been obtained.
What is the best Python library/framework for web scraping?
One of the best libraries for functionality is Selenium. It is suitable for practically any task, including scraping dynamic data. If data only needs to be processed rather than obtained, then BS4 or Lxml would be the best choice.
Is selenium better than BeautifulSoup?
Selenium has a broader range of functionality than BeautifulSoup. Unlike BeautifulSoup, it allows the scraping of websites with JavaScript, using headless browsers and making requests. However, BeautifulSoup is simpler and more suitable for beginners.