Pyppeteer: The Puppeteer Alternative for Python Web Scraping

Valentina Skakun
Last update: 30 Apr 2024

Pyppeteer is an unofficial Python port of the Puppeteer library, which was originally developed for Node.js. It is built around asyncio, so pages and data are processed with asynchronous methods, which distinguishes it from most other web scraping libraries.

This article will cover various aspects of working with the Pyppeteer library, from installation and requirements to interacting with elements and data on a page.

We will also discuss common problems and how to solve them. To better understand the capabilities and benefits of this library, we will provide comparative tables with Pyppeteer and other popular web scraping libraries.

What is Pyppeteer

Pyppeteer is a popular Python library for controlling headless Chromium and simulating the actions of a real user. As mentioned, it was originally developed for Node.js as Puppeteer and later ported to Python.

The main features of the Pyppeteer library include page management, event handling, working with selectors, executing JavaScript code in the context of a page, and taking screenshots or generating PDFs of pages.

Getting Started with Pyppeteer

To start with the Pyppeteer library, you will need Python 3.6 or higher and a code editor or Python IDE. We will use Visual Studio Code, a lightweight and powerful code editor with syntax highlighting and an integrated terminal.

How to install Pyppeteer

To install the library, open a terminal or command prompt and enter:

pip install pyppeteer

A compatible version of Chromium is downloaded automatically the first time you launch Pyppeteer. If you want to download it ahead of time instead, use the following command:

pyppeteer-install

Once the library is installed, you can use it in your projects.

Pyppeteer supported browsers

Pyppeteer only controls Chromium-based browsers; it does not support other browsers, such as Firefox or Safari. By default, it uses its own bundled build of Chromium, downloaded automatically as described above.

Basic Pyppeteer Project

Let’s create a new Python file with the .py extension. Import the necessary libraries, launch Chromium, and navigate to any page using Pyppeteer. To begin, we need to import the following libraries:

import asyncio
from pyppeteer import launch

The asyncio library is required because Pyppeteer runs asynchronously, which is more efficient as it allows multiple tasks to run concurrently. asyncio is part of the Python standard library, so it doesn't need to be installed separately.

Next, create a function that you can call and execute asynchronously:

async def main():

In this function, we will perform all actions, including launching and closing the browser and navigating between pages. Let’s describe the commands for launching the browser, creating a new tab, navigating to a page, and closing the browser:

    browser = await launch()
    page = await browser.newPage()
    await page.goto('https://www.example.com')
    await browser.close()

Now that we have defined the asynchronous function that performs all the necessary actions, let's call it:

asyncio.get_event_loop().run_until_complete(main())

We will use this basic example to create more complex scripts in the future. However, the basic structure of the example will remain the same.
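For reference, here is the complete basic script assembled from the pieces above:

import asyncio
from pyppeteer import launch

async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('https://www.example.com')
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())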

Advanced Configuration with Pyppeteer

In this section, we’ll explore the parameters that can be configured before calling the browser, which can make your script more functional. We’ll show you how to configure proxies, add and manage user agents, and handle cookies.

Using proxies

In previous articles, we covered how to use proxies with Python, including a discussion of the best proxy providers. In this tutorial, we will focus on how to use proxies with Pyppeteer.

To add a proxy, use the args parameter when launching the browser. For convenience, create variables to store the proxy server and port.

    proxy_server = 'your_proxy_server'
    proxy_port = 'your_proxy_port'

Then create the template for specifying proxies:

    proxy_url = f'http://{proxy_server}:{proxy_port}'

Add a proxy argument as a parameter when calling the browser:

    browser = await launch(args=[f'--proxy-server={proxy_url}'])

In all other respects, the primary example will remain unchanged.
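If your proxy requires authentication, Pyppeteer inherits Puppeteer's page.authenticate() method for supplying credentials. A minimal sketch, where your_username and your_password are placeholder values:

    browser = await launch(args=[f'--proxy-server={proxy_url}'])
    page = await browser.newPage()
    # Provide the proxy credentials before navigating
    await page.authenticate({'username': 'your_username', 'password': 'your_password'})
    await page.goto('https://www.example.com')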

Modifying user agents

Unlike proxies, user agents are set after the browser has launched, as a parameter of the opened page. Let’s create an empty page:

    page = await browser.newPage()

Next, create a variable to store the user agent. This makes it easier to replace or rotate it later. (The value below is a template; substitute a real user-agent string for the placeholders.)

    user_agent = "Mozilla/5.0 (platform; rv:geckoversion) Gecko/geckotrail Firefox/firefoxversion"

Then, set the specified user agent.

    await page.setUserAgent(user_agent)

Pyppeteer now uses the specified user agent when visiting a website. To reduce the risk of being blocked, you can keep a list of user agents and randomly select one each time you create a new page, as shown below. This makes it harder for websites to identify and block your scraping activity.
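A minimal sketch of that idea, using Python's built-in random module (the user-agent strings below are just examples):

import random

user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0',
]

Then, inside the main function:

    page = await browser.newPage()
    # Rotate the user agent for each new page
    await page.setUserAgent(random.choice(user_agents))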

Managing cookies

Cookies are passed to the page in the same way as the user agent, as a parameter. Note that a cookie must be associated with a URL or domain: either set it after navigating to the target site or include a 'url' (or 'domain') key:

    cookie = {'name': 'example_cookie', 'value': '123456789', 'url': 'https://www.example.com'}
    await page.setCookie(cookie)
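Conversely, to read back the cookies set for the current page, Pyppeteer provides page.cookies():

    # Returns a list of cookie dictionaries for the current page
    cookies = await page.cookies()
    print(cookies)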

Pyppeteer supports many other parameters in addition to those discussed here. For a complete list, refer to the official documentation.

Scraping Data with Pyppeteer

In the basic example, we showed how to navigate to a page using a link. In this section, we will explore how to extract any desired information from a page, including data from pages with dynamically loaded content. We will use the OpenCart demo site as an example.

Get HTML content

Let’s change the main function in the primary example and navigate to the OpenCart demo site.

    await page.goto('https://demo.opencart.com/')

Next, extract the full content of the page.

    content = await page.content()

And print the result on the screen:

    print(content)

If you have followed the instructions correctly, you will see the entire HTML code of the page you requested in the terminal or command prompt.

Using XPath and CSS selectors for Data Scraping

Typically, we only need specific elements of a page, such as products, headings, or prices, not the entire HTML code. To extract this data, we can use CSS selectors or XPath. CSS selectors are easier for beginners, while XPath offers more advanced features and capabilities. Ultimately, the best option for you will depend on your specific needs.

Pyppeteer supports both CSS selectors and XPath. Let’s take a look at how to use each one. For example, we’ll extract the first h1 heading from the page using the querySelector() function:

    await page.goto('https://demo.opencart.com/')
    element = await page.querySelector('h1')

At this stage, we get an element handle rather than its text content. To extract the text from the element, we will use the following function:

    text_content = await page.evaluate('(element) => element.textContent', element)

To verify that the extraction was successful, we can print the text content to the console:

    print(text_content)

You can extract any necessary information from a page using CSS selectors. For comparison, let’s extract the same data using XPath. The principle is similar to what we have already seen:

    element = await page.xpath('//h1')
    text_content = await page.evaluate('(element) => element.textContent', element[0])

Note that in the case of XPath, page.xpath() returns a list of elements, so we need to access the element by its index (for example, element[0]).
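To collect several elements at once, for example every product title on the demo page, you can use querySelectorAll() and loop over the results. A short sketch; the '.product-thumb h4 a' selector is an assumption about the demo page's markup:

    # Grab every matching element, then read each one's text content
    elements = await page.querySelectorAll('.product-thumb h4 a')
    for element in elements:
        title = await page.evaluate('(el) => el.textContent', element)
        print(title.strip())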

Dealing with timeouts

Pyppeteer provides methods for handling timeouts to allow the browser enough time to complete operations. For example, you can use the timeout parameter when navigating to a page to limit the maximum page load time:

    await page.goto('https://demo.opencart.com/', {'timeout': 5000})

This will tell the browser to wait up to 5 seconds for the page to load before throwing a timeout error.
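On timeout, Pyppeteer raises pyppeteer.errors.TimeoutError. A sketch of catching it (aliased here to avoid shadowing Python's built-in TimeoutError):

from pyppeteer.errors import TimeoutError as PyppeteerTimeoutError

    try:
        await page.goto('https://demo.opencart.com/', {'timeout': 5000})
    except PyppeteerTimeoutError:
        # The page did not finish loading within 5 seconds
        print('The page took too long to load')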

Implementing waits for page loading

Additionally, you can use the waitUntil parameter to specify when navigation should be considered successful. For example, you could use the load event to wait for the page to be fully loaded, or the domcontentloaded event to wait only for the DOM:

    await page.goto('https://demo.opencart.com/', waitUntil='load')

This will tell the browser to wait for the load event to fire before continuing.

Scrape Dynamic Pages

To track the loading of a page with dynamically generated content, you can use a selector for a dynamically loaded element and wait for it to appear.

    await page.waitForSelector('#someDynamicElement')

So, if you use this function after navigating to a page, your script will wait for the necessary element to appear on the page before continuing execution.
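waitForSelector() also accepts its own timeout option (in milliseconds) and raises the same TimeoutError if the element never appears:

    # Wait up to 10 seconds for the dynamic element to appear
    await page.waitForSelector('#someDynamicElement', {'timeout': 10000})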

Interacting with the Page

Data collection is an important part of the scraping process, but it’s also important to emulate user actions. This helps to reduce the risk of being blocked and to ensure that you get all the data you need, even if it’s hidden below the fold. Pyppeteer also allows you to interact with forms to extract data from them and fill them out. This can be important if you’re scraping a site that requires authentication.

Clicking buttons and elements

One of the most essential and frequently used actions is clicking a button or element. Pyppeteer does not distinguish between these actions, and they are performed in the same way:

  1. Find the element by its selector.

  2. Click it using the click() function.

Let’s look at an example of how to click a button with an id="button":

    await page.click('#button')

Pyppeteer waits for the click itself to complete before executing subsequent actions. The exception is when the click triggers navigation to another page; in that case, you should explicitly wait for the navigation with await page.waitForNavigation().
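A common pattern in that case, carried over from Puppeteer, is to start the navigation wait and the click together so a fast navigation event is not missed:

    # Start waiting for navigation before clicking to avoid a race
    await asyncio.gather(
        page.waitForNavigation(),
        page.click('#button'),
    )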

Fill an input field

The type() function is used to fill out forms. It works similarly to the click() function, but instead of clicking on an element, it types text into it. To use type(), select the element by its selector and pass the text you want to type into the function.

    await page.type('input[name="username"]', 'put_your_text')

On the other hand, if you want to get the text from an input form, you can use the following:

    input_text = await page.evaluate('document.querySelector("input[name=username]").value')

This may be useful if you need to extract data from non-editable fields that are automatically generated.

Executing specific actions

Pyppeteer provides extensive capabilities for executing JavaScript code in the context of a page, which can be used to perform actions that are not possible with the standard APIs. For example, we previously used this approach to get the value of an input field. As shown there, you can use the page.evaluate() method to execute arbitrary JavaScript code:

    await page.evaluate('console.log("Hello world!")')

Note that console.log() writes to the browser's console, not to your Python terminal.

If you want more control over simulating user actions, Pyppeteer supports a wide range of special functions, such as simulating pressing the Enter key:

     await page.keyboard.press('Enter')

Or even hovering over an element:

    await page.hover('.example-element')

You can find a complete list of functions in the official documentation.

Login with Pyppeteer

Let’s combine everything we’ve discussed and develop a small script that simulates user authentication with Pyppeteer. We’ll use the basic template from the first example and make a few modifications. After navigating to the page, we’ll find the necessary fields and fill out the form:

    await page.type('#login-input', 'login')
    await page.type('#password-input', 'password')

Then, we’ll find the Login button and click it.

    await page.click('#login-button')

Alternatively, you can simulate pressing the Enter key instead, as the cursor remains in the input field after filling out forms.
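Putting the pieces together, a minimal login sketch (the #login-input, #password-input, and #login-button selectors are placeholders for whatever your target site actually uses):

    await page.type('#login-input', 'login')
    await page.type('#password-input', 'password')
    # Click the button and wait for the post-login navigation together
    await asyncio.gather(
        page.waitForNavigation(),
        page.click('#login-button'),
    )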

Page Scrolling

To scroll a page in Pyppeteer, you need to use page.evaluate() to execute JavaScript code in the page’s context. For example, to scroll a page down by 300 pixels, you can use this code:

    await page.evaluate('window.scrollBy(0, 300)')

Using this scrolling option, you can freely scroll the page horizontally and vertically.
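A common application is handling infinite scroll: keep scrolling to the bottom until the page height stops growing. A minimal sketch:

    previous_height = await page.evaluate('document.body.scrollHeight')
    while True:
        # Jump to the bottom of the page
        await page.evaluate('window.scrollTo(0, document.body.scrollHeight)')
        await asyncio.sleep(1)  # give newly loaded content time to render
        new_height = await page.evaluate('document.body.scrollHeight')
        if new_height == previous_height:
            break  # no new content loaded; we've reached the end
        previous_height = new_height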

Capturing screenshots

Pyppeteer provides a dedicated function for taking screenshots. This function supports saving screenshots in multiple formats, including PNG and JPEG. PNG is a lossless compression format that produces high-quality images, while JPEG is a lossy compression format that produces smaller file sizes.

    await page.screenshot({'path': 'folder/path/screenshot.png'})

To save screenshots to a separate folder, ensure the folder exists and you have permission to write to it. Otherwise, the script will fail.
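You can create the folder from the script itself, and pass the fullPage option to capture the entire scrollable page rather than just the visible viewport:

import os

    # Ensure the target folder exists before saving
    os.makedirs('folder/path', exist_ok=True)
    await page.screenshot({'path': 'folder/path/screenshot.png', 'fullPage': True})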

Troubleshooting and Error Handling

While using Pyppeteer, you may encounter errors caused by incorrect package installation, dependencies, or versions or by missing required components. Let’s look at the most common errors and how to fix them.

Common issues like “Pyppeteer is not installed”

The error message “Pyppeteer is not installed” indicates that the package is missing or that Python cannot find it. This usually happens because the installation was interrupted by an error, for example due to a version mismatch.

As we said initially, Pyppeteer only works with Python 3.6 and higher. If you are running an older version, update or reinstall Python. If you are already using Python 3.6 or later, try installing Pyppeteer once more:

pip install pyppeteer

In some cases, problems can occur due to incompatibility between Pyppeteer and the installed Chromium. Try installing specific, compatible versions:

pip install pyppeteer==<version>
pyppeteer-install

Replace <version> with a specific release number, which can be found on the Pyppeteer releases page on GitHub.

Handling unexpected browser closures in Pyppeteer

The possible solutions to this error depend on its cause. For example, if you have just installed Pyppeteer and are experiencing problems launching the browser, it is possible that the Chromium installation was not successful. In this case, you can try reinstalling the browser and its dependencies.

pyppeteer-install

If you are concerned that the browser may close unexpectedly during script execution, you can use standard try/except blocks to handle these cases, as shown below.
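A minimal sketch of that pattern, closing the browser in a finally block so Chromium is cleaned up even when something goes wrong:

    browser = await launch()
    try:
        page = await browser.newPage()
        await page.goto('https://www.example.com')
    except Exception as e:
        print(f'Browser operation failed: {e}')
    finally:
        await browser.close()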

Pyppeteer vs. Other Tools

The most similar library to Pyppeteer in terms of functionality is Selenium. However, Pyppeteer also overlaps with other Python scraping tools, such as Beautiful Soup and Scrapy. This section compares Pyppeteer with these popular libraries and their approaches to scraping and data processing.

Pyppeteer vs. BeautifulSoup

For a more concise and informative comparison, here is a table that summarizes the key differences between Pyppeteer and BeautifulSoup:

Feature | Pyppeteer | BeautifulSoup
Purpose | Automation of browser interactions, dynamic content execution, and user actions | Parsing HTML and XML documents, extracting information from static markup
Ease of Use | More complex due to the need to manage the browser instance and asynchronous code | Simpler, more declarative approach to HTML parsing
Performance | Can be slower, especially with large volumes of dynamic content | More efficient for static page parsing
Use Cases | Interaction with dynamic web pages, scraping data after JavaScript execution, taking screenshots, and more | Data extraction from static HTML, navigating and searching through markup

The choice of BeautifulSoup vs. Pyppeteer depends on the specific needs of your project. If you need to parse data from static web pages, BeautifulSoup is a simpler and faster option. However, if you need to simulate user actions or interact with dynamic web pages, Pyppeteer is a better choice.

Pyppeteer vs. Scrapy

Let’s look at a comparative table of Pyppeteer and Scrapy:

Feature | Pyppeteer | Scrapy
Purpose | Web scraping with browser automation | General-purpose web crawling and scraping
Browser Automation | Yes | No (focused on HTTP requests)
Ease of Use | More complex due to browser integration | Easier to use
Scalability | Suitable for smaller-scale projects | Designed for scalable, large-scale scraping
Flexibility | Flexible for complex, interactive scenarios | Less flexible where browser interactivity is required
Performance | Slower due to browser launch overhead | Faster for traditional HTTP-based scraping

As the table shows, Pyppeteer and Scrapy are suited to different purposes. Pyppeteer is the better choice for smaller projects that need to process dynamic web pages or execute JavaScript, while Scrapy is better for large, scalable projects that scrape static HTML pages.

Pyppeteer vs. Selenium

Let’s look at a comparative table of Pyppeteer and Selenium:

Feature | Pyppeteer | Selenium
Browser Support | Chromium only | Multiple browsers (Chrome, Firefox, Edge, etc.)
Asynchronicity | Asynchronous (async/await) | Traditionally synchronous (async supported)
Use Cases | Headless automation and web scraping | General-purpose web automation, testing, and browser interactions
Performance | Generally faster due to its asynchronous nature | Slightly slower due to synchronous operation and additional layers

Selenium and Pyppeteer are both popular open-source tools for web automation. They offer similar functionality and capabilities, but there are some key differences to consider when choosing between them.

The choice between the two is primarily based on the need for asynchronous execution and personal preference. In addition, Pyppeteer is often preferred for headless tasks and simplicity, while Selenium’s versatility makes it suitable for a wide range of web automation scenarios.

Conclusion and Takeaways

In this article, we discussed various aspects and use cases of the Pyppeteer library, from how to install it and its core features to potential problems and comparisons with other available scraping libraries and frameworks.

Specifically, we covered how to configure proxies, user agents, and cookies, find and extract data, emulate real-user actions, and control various elements on a page. We also discussed common problems when using Pyppeteer, such as installation issues and unexpected browser closures.

Overall, Pyppeteer is a powerful and versatile library that can be used for various scraping tasks. It is easy to learn and use and offers a wide range of features, making it a good choice for both beginners and experienced scrapers.
