Pyppeteer: The Puppeteer Alternative for Python Web Scraping
Pyppeteer is an unofficial Python port of the Puppeteer library, which was originally developed for Node.js. Because it is built on asyncio, it uses asynchronous methods for processing pages and data, distinguishing it from most other web scraping libraries.
This article will cover various aspects of working with the Pyppeteer library, from installation and requirements to interacting with elements and data on a page.
We will also discuss common problems and how to solve them. To better understand the capabilities and benefits of this library, we will provide tables comparing Pyppeteer with other popular web scraping libraries.
What is Pyppeteer
Pyppeteer is a popular Python library for controlling headless Chromium browsers and simulating the actions of a real user. As mentioned, it was originally developed for Node.js as Puppeteer and later ported to Python.
The main features of the Pyppeteer library include page management, event handling, working with selectors, executing JavaScript code in the context of a page, and taking screenshots of web sessions.
Getting Started with Pyppeteer
To get started with the Pyppeteer library, you will need Python 3.6 or higher and a code editor or Python IDE. We will use Visual Studio Code, a lightweight yet powerful code editor with syntax highlighting and a built-in terminal.
How to install Pyppeteer
To install the library, open a terminal or command prompt and enter:
pip install pyppeteer
A compatible version of Chromium is downloaded and installed automatically the first time Pyppeteer runs. If you want to trigger the download manually ahead of time, use the following command:
pyppeteer-install
Once the library is installed, you can use it in your projects.
Pyppeteer supported browsers
Pyppeteer is a Python library that allows users to control Chromium browsers. It does not support other browsers, such as Firefox or Safari. Instead, it uses its own bundled version of Chromium, which is installed automatically with the library.
Basic Pyppeteer Project
Let’s create a new Python file with the .py extension. Import the necessary libraries, launch Chromium, and navigate to any page using Pyppeteer. To begin, we need to import the following libraries:
import asyncio
from pyppeteer import launch
The asyncio library is required because Pyppeteer runs in asynchronous mode. This is a more efficient way to run Pyppeteer, as it allows multiple tasks to be executed concurrently. asyncio is part of the Python standard library, so we don’t need to install it separately.
Next, create a function that you can call and execute asynchronously:
async def main():
In this function, we will perform all actions, including launching and closing the browser and navigating between pages. Let’s describe the commands for launching the browser, creating a new tab, navigating to a page, and closing the browser:
    browser = await launch()
    page = await browser.newPage()
    await page.goto('https://www.example.com')
    await browser.close()
Now that we have described the asynchronous function that performs all the necessary actions, let’s call it:
asyncio.get_event_loop().run_until_complete(main())
We will use this basic example to create more complex scripts in the future. However, the basic structure of the example will remain the same.
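Putting the snippets together, the complete basic script looks like this (on Python 3.7+, asyncio.run(main()) is an equivalent, more modern way to start the event loop):

import asyncio
from pyppeteer import launch

async def main():
    browser = await launch()                    # launch headless Chromium
    page = await browser.newPage()              # open a new tab
    await page.goto('https://www.example.com')  # navigate to the page
    await browser.close()                       # close the browser when done

asyncio.get_event_loop().run_until_complete(main())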
Advanced Configuration with Pyppeteer
In this section, we’ll explore the parameters that can be configured before calling the browser, which can make your script more functional. We’ll show you how to configure proxies, add and manage user agents, and handle cookies.
Using proxies
In previous articles, we covered how to use proxies with Python, including a discussion of the best proxy providers. In this tutorial, we will focus on how to use proxies with Pyppeteer.
To add a proxy, use the args parameter when launching the browser. For convenience, create variables to store the proxy server and port.
proxy_server = 'your_proxy_server'
proxy_port = 'your_proxy_port'
Then create the template for specifying proxies:
proxy_url = f'http://{proxy_server}:{proxy_port}'
Add a proxy argument as a parameter when calling the browser:
browser = await launch(args=[f'--proxy-server={proxy_url}'])
In all other respects, the primary example will remain unchanged.
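If your proxy requires authentication, you can supply the credentials with the page.authenticate() method after creating a page. A minimal sketch, where proxy_user and proxy_pass are placeholders for your actual credentials:

proxy_user = 'your_proxy_username'
proxy_pass = 'your_proxy_password'

browser = await launch(args=[f'--proxy-server={proxy_url}'])
page = await browser.newPage()
# Pass the proxy credentials before navigating anywhere
await page.authenticate({'username': proxy_user, 'password': proxy_pass})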
Modifying user agents
User agents, unlike proxies, are set after the browser has launched and are applied as a parameter of the opened page. Let’s create an empty page:
page = await browser.newPage()
Next, create a variable to store the user agent. This is necessary for convenience and to simplify the process of replacing and substituting it as needed.
user_agent = "Mozilla/5.0 (platform; rv:geckoversion) Gecko/geckotrail Firefox/firefoxversion"
Then, set the specified user agent.
await page.setUserAgent(user_agent)
Pyppeteer now uses the specified user agent when visiting a website. To reduce the risk of being blocked, you can create an array of user agents and randomly select one from the list when creating a new page. This will make it more difficult for websites to identify and block your scraping activity.
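Here is a short sketch of that idea, using Python’s built-in random module; the user agent strings below are illustrative examples, so replace them with current, real-world values:

import random

# A pool of example user agent strings (replace with real, current ones)
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0',
]

page = await browser.newPage()
# Pick a random user agent for each new page
await page.setUserAgent(random.choice(user_agents))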
Managing cookies
Cookies are set on the page in the same way as the user agent, as a parameter. Note that the cookie needs a url or domain field so the browser knows which site it belongs to:
cookie = {'name': 'example_cookie', 'value': '123456789', 'domain': 'example.com'}
await page.setCookie(cookie)
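To read cookies back, for example to save a session and reuse it later, use the page.cookies() method:

# Returns a list of cookie dicts for the current page
cookies = await page.cookies()
print(cookies)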
Pyppeteer supports many other parameters in addition to those discussed here. For a complete list, refer to the official documentation.
Navigating and Extracting Data
In the basic example, we showed how to navigate to a page using a link. In this section, we will explore how to extract any desired information from a page, including data from pages with dynamically loaded content. We will use the OpenCart demo site as an example.
Get HTML content
Let’s change the main function in the primary example and navigate to the OpenCart demo site.
await page.goto('https://demo.opencart.com/')
Next, extract the full content of the page.
content = await page.content()
And print the result on the screen:
print(content)
If you have followed the instructions correctly, you will see the entire HTML code of the page you requested in the terminal or command prompt.
Using XPath and CSS selectors for Data Scraping
Typically, we only need specific elements of a page, such as products, headings, or prices, not the entire HTML code. To extract this data, we can use CSS selectors or XPath. CSS selectors are easier for beginners, while XPath offers more advanced features and capabilities. Ultimately, the best option for you will depend on your specific needs.
Pyppeteer supports both CSS selectors and XPath. Let’s take a look at how to use each one. For example, we’ll extract all h1 headings from a page. To do this, we can use the querySelector() function:
await page.goto('https://demo.opencart.com/')
element = await page.querySelector('h1')
At this stage, we get the entire element, not just its content. To extract the text content from the element, we will use the following function:
text_content = await page.evaluate('(element) => element.textContent', element)
To verify that the extraction was successful, we can print the text content to the console:
print(text_content)
You can extract any necessary information from a page using CSS selectors. However, for a better example, let’s extract the same data using XPath. The extraction principle for XPath is similar to what we have already seen:
element = await page.xpath('//h1')
text_content = await page.evaluate('(element) => element.textContent', element[0])
Note that in the case of XPath, page.xpath() returns a list of elements, so we need to access the element by its index (for example, element[0]).
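When you need several elements at once, page.querySelectorAll() returns a list of element handles you can iterate over. A short sketch; the '.product-thumb h4 a' selector is a hypothetical example for product links on the OpenCart demo:

products = await page.querySelectorAll('.product-thumb h4 a')
for product in products:
    # Extract the text content of each matched element
    title = await page.evaluate('(el) => el.textContent', product)
    print(title)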
Dealing with timeouts
Pyppeteer provides methods for handling timeouts to allow the browser enough time to complete operations. For example, you can use the timeout parameter when navigating to a page to limit the maximum page load time:
await page.goto('https://demo.opencart.com/', {'timeout': 5000})
This will tell the browser to wait up to 5 seconds for the page to load before throwing a timeout error.
Implementing waits for page loading
Additionally, you can use the waitUntil option to specify when navigation should be considered successful. For example, you could use the load event to wait for the page to be fully loaded, or the domcontentloaded event to wait only for the DOM to be ready:
await page.goto('https://demo.opencart.com/', waitUntil='load')
This will tell the browser to wait for the load event to fire before continuing.
Scrape Dynamic Pages
To track the loading of a page with dynamically generated content, you can use a selector for a dynamically loaded element and wait for it to appear.
await page.waitForSelector('#someDynamicElement')
So, if you use this function after navigating to a page, your script will wait for the necessary element to appear on the page before continuing execution.
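Putting the pieces together, a minimal sketch; the URL and the '#someDynamicElement' selector are placeholders for an element your target page renders with JavaScript:

await page.goto('https://example.com/dynamic-page')  # hypothetical URL
# Wait up to 10 seconds for the element to appear
await page.waitForSelector('#someDynamicElement', {'timeout': 10000})
element = await page.querySelector('#someDynamicElement')
text = await page.evaluate('(el) => el.textContent', element)
print(text)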
Interacting with the Page
Data collection is an important part of the scraping process, but it’s also important to emulate user actions. This helps to reduce the risk of being blocked and to ensure that you get all the data you need, even if it’s hidden below the fold. Pyppeteer also allows you to interact with forms to extract data from them and fill them out. This can be important if you’re scraping a site that requires authentication.
Clicking buttons and elements
One of the most essential and frequently used actions is clicking a button or element. Pyppeteer does not distinguish between these actions, and they are performed in the same way: find the element by its selector and click it using the click() function.
Let’s look at an example of how to click a button with an id="button":
await page.click('#button')
Pyppeteer waits for the click itself to complete before executing any subsequent actions. The exception is when the click triggers navigation to another page; in that case, you should explicitly wait for the navigation to finish (for example, with await page.waitForNavigation()).
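A common pattern, carried over from Puppeteer, is to start waiting for the navigation before clicking so the navigation event is not missed:

# Run the navigation wait and the click concurrently
await asyncio.gather(
    page.waitForNavigation(),
    page.click('#button'),
)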
Fill an input field
The type() function is used to fill out forms. It works similarly to the click() function, but instead of clicking an element, it types text into it. To use type(), select the element by its selector and pass the text you want to type into the function.
await page.type('input[name="username"]', 'put_your_text')
On the other hand, if you want to get the text from an input form, you can use the following:
input_text = await page.evaluate("document.querySelector('input[name=username]').value")
This may be useful if you need to extract data from non-editable fields that are automatically generated.
Executing specific actions
Pyppeteer provides extensive capabilities for executing JavaScript code in the context of a page, which can be used to perform actions that are not possible with the standard APIs. For example, we previously used this approach to get the value of an input field. As we showed in the previous example, you can use the page.evaluate() method to execute any arbitrary JavaScript code:
await page.evaluate('console.log("Hello world!")')
If you want more control over simulating user actions, Pyppeteer supports a wide range of special functions, such as simulating pressing the Enter key:
await page.keyboard.press('Enter')
Or even hovering over an element:
await page.hover('.example-element')
You can find a complete list of functions in the official documentation.
Login with Pyppeteer
Let’s combine everything we’ve discussed and develop a small script to simulate user authentication using Pyppeteer. To do this, we’ll use the basic template from the first example and make a few modifications. After navigating to the page, we’ll find the necessary fields and fill out the form:
await page.type('#login-input', 'login')
await page.type('#password-input', 'password')
Then, we’ll find the Login button and click it.
await page.click('#login-button')
Alternatively, you can simulate pressing the Enter key instead, as the cursor remains in the input field after filling out forms.
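Putting it all together, here is a hedged sketch of the full login flow; the URL and the #login-input, #password-input, and #login-button selectors are placeholders, so inspect your target page for the real ones:

import asyncio
from pyppeteer import launch

async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('https://example.com/login')  # hypothetical login page
    await page.type('#login-input', 'login')
    await page.type('#password-input', 'password')
    # Wait for the post-login navigation triggered by the click
    await asyncio.gather(
        page.waitForNavigation(),
        page.click('#login-button'),
    )
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())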
Page Scrolling
To scroll a page in Pyppeteer, you need to use page.evaluate() to execute JavaScript code in the page’s context. For example, to scroll a page down by 300 pixels, you can use this code:
await page.evaluate('window.scrollBy(0, 300)')
Using this scrolling option, you can freely scroll the page horizontally and vertically.
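For pages that load more content as you scroll (infinite scroll), a common pattern is to keep scrolling to the bottom until the page height stops growing. A sketch of that idea:

# Scroll until no new content is loaded
prev_height = await page.evaluate('document.body.scrollHeight')
while True:
    await page.evaluate('window.scrollTo(0, document.body.scrollHeight)')
    await asyncio.sleep(1)  # give new content time to load
    new_height = await page.evaluate('document.body.scrollHeight')
    if new_height == prev_height:
        break
    prev_height = new_height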
Capturing screenshots
Pyppeteer provides a dedicated function for taking screenshots. This function supports saving screenshots in multiple formats, including PNG and JPEG. PNG is a lossless compression format that produces high-quality images, while JPEG is a lossy compression format that produces smaller file sizes.
await page.screenshot({'path': 'folder/path/screenshot.png'})
To save screenshots to a separate folder, ensure the folder exists and you have permission to write to it. Otherwise, the script will fail.
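By default, the screenshot captures only the visible viewport. To capture the entire scrollable page, pass the fullPage option:

await page.screenshot({'path': 'screenshot.png', 'fullPage': True})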
Troubleshooting and Error Handling
While using Pyppeteer, you may encounter errors caused by incorrect package installation, dependencies, or versions or by missing required components. Let’s look at the most common errors and how to fix them.
Common issues like “Pyppeteer is not installed”
The error message “Pyppeteer is not installed” indicates that the package is either missing or that Python cannot find it. This error usually occurs because the package installation was terminated with an error, for example, due to a version mismatch.
As we said initially, Pyppeteer only works with Python 3.6 and higher. If your Python version is older, you should update or reinstall Python. If you are already using Python 3.6 or later, you can try to install Pyppeteer one more time:
pip install pyppeteer
In some cases, problems can occur due to incompatibility between Pyppeteer and the installed Chromium. Try installing specific, compatible versions:
pip install pyppeteer==<version>
pyppeteer-install --force
Replace <version> with a specific release number, which can be found on the Pyppeteer releases page on GitHub.
Handling unexpected browser closures in Pyppeteer
The possible solutions to this error depend on its cause. For example, if you have just installed Pyppeteer and are experiencing problems launching the browser, it is possible that the Chromium installation was not successful. In this case, you can try reinstalling the browser and its dependencies.
pyppeteer-install
If you are concerned that the browser may close unexpectedly during script execution, you can wrap your code in standard try/except blocks to handle these cases.
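A minimal sketch of that approach, which also guarantees the browser is closed even if something fails mid-script:

browser = None
try:
    browser = await launch()
    page = await browser.newPage()
    await page.goto('https://demo.opencart.com/')
except Exception as e:
    # The browser crashed, closed unexpectedly, or navigation failed
    print(f'Browser error: {e}')
finally:
    if browser is not None:
        await browser.close()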
Pyppeteer vs. Other Tools
The most similar library to Pyppeteer in terms of functionality is Selenium. However, Pyppeteer also overlaps with other Python scraping libraries, such as Beautiful Soup and Scrapy. This section compares Pyppeteer with these popular alternatives and their scraping approaches.
Pyppeteer vs. BeautifulSoup
For a more concise and informative comparison, here is a table that summarizes the key differences between Pyppeteer and BeautifulSoup:
| Feature | Pyppeteer | BeautifulSoup |
| --- | --- | --- |
| Purpose | Automation of browser interactions, dynamic content execution, and user actions. | Parsing HTML and XML documents, extracting information from static markup. |
| Ease of Use | Complex due to the need to manage the browser instance and asynchronous code. | Offers a simpler and more declarative approach to HTML parsing. |
| Performance | Can be slower, especially when dealing with large volumes of dynamic content. | More efficient for static page parsing. |
| Use Cases | Interaction with dynamic web pages, scraping data after JavaScript execution, taking screenshots, and more. | Data extraction from static HTML, navigating and searching through markup. |
The choice of BeautifulSoup vs. Pyppeteer depends on the specific needs of your project. If you need to parse data from static web pages, BeautifulSoup is a simpler and faster option. However, if you need to simulate user actions or interact with dynamic web pages, Pyppeteer is a better choice.
Pyppeteer vs. Scrapy
Let’s look at a comparative table of Pyppeteer and Scrapy:
| Feature | Pyppeteer | Scrapy |
| --- | --- | --- |
| Purpose | Web scraping with browser automation | General-purpose web crawling and scraping |
| Browser Automation | Yes | No (focused on HTTP requests) |
| Ease of Use | More complex due to browser integration | Easier to use |
| Scalability | Suitable for smaller-scale projects | Designed for scalable and large-scale scraping |
| Flexibility | Provides flexibility for complex scenarios | Less flexible for scenarios requiring browser interactivity |
| Performance | Slower due to browser launch overhead | Faster for traditional HTTP-based scraping |
As you can see from the table, Pyppeteer and Scrapy are suited for different purposes. Pyppeteer is a better choice for small projects requiring processing dynamic web pages or executing JavaScript. Scrapy is better for large or scalable projects that scrape simple HTML pages.
Pyppeteer vs. Selenium
Let’s look at a comparative table of Pyppeteer and Selenium:
| Feature | Pyppeteer | Selenium |
| --- | --- | --- |
| Browser Support | Chromium only | Multiple browsers (Chrome, Firefox, Edge, etc.) |
| Asynchronicity | Asynchronous (async/await) | Synchronous (supports async, but traditionally synchronous) |
| Use Cases | Great for headless automation and web scraping | General-purpose web automation, testing, and browser interactions |
| Performance | Generally faster due to its asynchronous nature | Slightly slower due to the synchronous nature and additional layers |
Selenium and Pyppeteer are both popular open-source tools for web automation. They offer similar functionality and capabilities, but there are some key differences to consider when choosing between them.
The choice between the two is primarily based on the need for asynchronous execution and personal preference. In addition, Pyppeteer is often preferred for headless tasks and simplicity, while Selenium’s versatility makes it suitable for a wide range of web automation scenarios.
Conclusion and Takeaways
In this article, we discussed various aspects and use cases of the Pyppeteer library, from how to install it and its core features to potential problems and comparisons with other available scraping libraries and frameworks.
Specifically, we covered how to configure and personalize requests using user agents and cookies, find and extract data, emulate real-user actions, and control various elements on a page. We also discussed potential problems when using Pyppeteer, such as installation errors and unexpected browser closures, and how to resolve them.
Overall, Pyppeteer is a powerful and versatile library that can be used for various scraping tasks. It is easy to learn and use and offers a wide range of features, making it a good choice for both beginners and experienced scrapers.