Web Scraping with Playwright and Node.js

Roman Milyushkevich
Last update: 30 Apr 2024

Extracting data from websites has become essential in various fields, ranging from data collection to competitive analysis. However, content that is loaded through JavaScript, AJAX requests, and complex interactions can be tricky to capture with basic tools. This is where headless browser automation tools, such as Playwright, come into play.

In this blog, we’ll dive into how to use Playwright for web scraping. We’ll explore its advantages, scraping single and multiple pages, error handling, button clicks, form submissions, and crucial techniques like leveraging proxies to bypass detection and intercepting requests. Additionally, we’ll compare Playwright with other web scraping and automation tools like Puppeteer and Selenium.

What is Playwright

Playwright is a popular open-source framework built on Node.js for web testing and automation. It allows testing Chromium, Firefox, and WebKit with a single API. Playwright was developed by Microsoft and offers efficient, reliable, and fast cross-browser web automation. It works across multiple platforms, including Windows, Linux, and macOS, and supports all modern web browsers. Additionally, Playwright provides cross-language support for TypeScript, JavaScript, Python, .NET, and Java.

Getting Started with Playwright

Before scraping websites, let’s prepare our system’s Node.js and Python environments for Playwright.

Project Setup & Installation

For Node.js:

Ensure you have the latest version of Node.js installed on your system. For this project, we’ll be using Node.js v20.9.0.

Create a new directory for your project, navigate to it, and initialize a Node.js project by running npm init. The -y flag accepts the default prompts, creating a package.json file where your project’s dependencies will be recorded.

mkdir playwright-scraping
cd playwright-scraping
npm init -y

Now, you can install Playwright using npm:

npm install playwright

To use Playwright, you’ll also need to install a compatible browser. Each Playwright version requires specific browser binary versions. Run the following command to install the latest browser versions:

npx playwright install

This will install the latest versions of Chromium, Firefox, and WebKit. You can use any of these browsers in your code, but we’ll use Chromium for this tutorial.
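Since we’ll only be using Chromium in this tutorial, you can optionally install just that browser to save time and disk space:

npx playwright install chromium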

Here’s how the complete process looks:

Install npm packages

Open the package.json file and add "type": "module" to support modern JavaScript syntax.

Add package.json file
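For reference, after adding "type": "module", your package.json should look roughly like this (the version numbers below are only examples and may differ on your machine):

{
  "name": "playwright-scraping",
  "version": "1.0.0",
  "type": "module",
  "dependencies": {
    "playwright": "^1.40.0"
  }
}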

Finally, open the project in your preferred code editor and create a new file named index.js.

Create index.js file

For Python:

Ensure you have the latest version of Python installed on your system. Next, install the Playwright Python library and the necessary browsers.

pip install playwright
playwright install

Here’s how the complete process looks:

Install playwright using pip

Launching Playwright

For Node.js:

Let’s write our first Playwright code that opens a new page in a Chromium browser.

import { chromium } from 'playwright';

async function main() {
    // Launch a new instance of a Chromium browser with headless mode
    // disabled for visibility
    const browser = await chromium.launch({
        headless: false
    });

    // Create a new Playwright context to isolate browsing session
    const context = await browser.newContext();
    // Open a new page/tab within the context
    const page = await context.newPage();

    // Navigate to the GitHub topics homepage
    await page.goto('https://github.com/topics');

    // Wait for 1 second to ensure page content loads properly
    await page.waitForTimeout(1000);

    // Close the browser instance after task completion
    await browser.close();
}

// Execute the main function
main();

The above code imports the chromium module to control Chromium-based browsers. It then launches a new, visible instance of the Chromium browser using chromium.launch() with the headless option set to false. A new browser page is opened, and the page.goto() function navigates to the GitHub topics web page. A one-second wait allows the user to see the page before it is finally closed.

For Python:

Playwright for Python offers both synchronous and asynchronous APIs. The following example shows the asynchronous API in action:

from playwright.async_api import async_playwright
import asyncio

async def main():
    # Initialize Playwright asynchronously
    async with async_playwright() as p:
        # Launch a Chromium browser instance with headless mode disabled
        browser = await p.chromium.launch(headless=False)

        # Create a new context within the browser to isolate browsing session
        context = await browser.new_context()

        # Create a new page/tab within the context
        page = await context.new_page()

        # Navigate to the GitHub topics homepage
        await page.goto('https://github.com/topics')

        # Wait for 1 second to ensure page content loads properly
        await page.wait_for_timeout(1000)

        # Close the browser instance after task completion
        await browser.close()

# Run the main function asynchronously
asyncio.run(main())

Both Node.js and Python code share similarities, but there are some key differences. Python uses the asyncio library for asynchronous operations. Additionally, function naming conventions differ, with Python using snake_case (e.g., wait_for_timeout) and JavaScript using camelCase (e.g., waitForTimeout).

The following example shows the synchronous API in action:

from playwright.sync_api import sync_playwright

def main():
    # Initialize Playwright synchronously
    with sync_playwright() as p:
        # Launch a Chromium browser instance with headless mode disabled
        browser = p.chromium.launch(headless=False)

        # Create a new context within the browser to isolate browsing session
        context = browser.new_context()

        # Create a new page/tab within the context
        page = context.new_page()

        # Navigate to the GitHub topics homepage
        page.goto('https://github.com/topics')

        # Wait for 1 second to ensure page content loads properly
        page.wait_for_timeout(1000)

        # Close the browser instance after task completion
        browser.close()

if __name__ == '__main__':
    main()

Both the Node.js and Python scripts above will open the following page.

Research github page

Basic Scraping with Playwright

Now that your environment is set up, let’s dive into some basic web scraping with Playwright. You can do everything you normally do manually in the browser, from generating screenshots to crawling multiple pages.

Selecting Data to Scrape

We’ll be extracting data from GitHub topics. This will allow you to select the topic and the number of repositories you want to extract. The scraper will then return the information associated with the chosen topic.

Scrape NodeJS packages

We’ll use Playwright to launch a browser, navigate to the GitHub topics page, and extract the necessary information. This includes details such as the repository owner, repository name, repository URL, the number of stars the repository has, its description, and any associated tags.

Extract only useful data

Locating Elements and Extracting Data

When you open the topic page, you’ll see 20 repositories. Each entry, shown as an <article> element, displays information about a specific repository. You can expand each element to view more detailed information about the corresponding repository.

Find tags using DevTools

The image below shows an expanded <article> element, displaying all the information about the repository.

Use classes

Extracting User and Repository Information:

  1. User: Use h3 > a:first-child to target the first anchor tag directly within a <h3> tag.

  2. Repository Name: Target the second child element within the same <h3> parent. This child holds both the name and URL. Use the textContent property to extract the name and the getAttribute('href') method to extract the URL.

  3. Number of Stars: Use #repo-stars-counter-star to select the element and extract the actual number from its title attribute.

  4. Repository Description: Use div.px-3 > p to select the first paragraph within a div with the class px-3.

  5. Repository Tags: Use a.topic-tag to select all anchor tags with the class topic-tag.

Common Functions:

To use the above selectors effectively, here are the common functions:

  • $$eval(selector, function): This function selects all elements matching the selector and passes them as an array to the function. The function’s return value is then returned.

  • $eval(selector, function): This function selects the first element matching the selector and passes it as an argument to the function. The function’s return value is then returned.

  • querySelector(selector): This function returns the first element matching the selector.

  • querySelectorAll(selector): This function returns a list of all elements matching the selector.
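For instance, here’s a quick sketch (assuming the GitHub topic page from the screenshots above is already open in page) that uses $eval to read the first repository’s name and $$eval to collect every tag on the page:

// Grab the name of the first repository on the page
const firstRepoName = await page.$eval(
    'article.border h3 > a:nth-child(2)',
    el => el.textContent.trim()
);

// Collect the text of every topic tag on the page
const allTags = await page.$$eval(
    'a.topic-tag',
    tags => tags.map(tag => tag.textContent.trim())
);

console.log(firstRepoName, allTags);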

Here’s a code snippet:

repos.forEach(repo => {
    const user = repo.querySelector('h3 > a:first-child').textContent.trim();
    const repoLink = repo.querySelector('h3 > a:nth-child(2)');
    const repoName = repoLink.textContent.trim();
    const repoUrl = repoLink.getAttribute('href');
    const repoStar = repo.querySelector('#repo-stars-counter-star').getAttribute('title');
    const repoDescription = repo.querySelector('div.px-3 > p').textContent.trim();
    const tagsElements = Array.from(repo.querySelectorAll('a.topic-tag'));
    const tags = tagsElements.map(tag => tag.textContent.trim());
    // ...push these values into an array (shown below)
});

Let’s look at the complete process for extracting all repositories from a single page.

The process begins with the page.$$eval function. It selects all <article> elements with the border class and passes them as an array to the provided callback. We’ll define all variables and selectors within this callback.

const extractedRepos = await page.$$eval('article.border', repos => { ... });

Also, create an empty array called repoData to store the extracted information.

const repoData = [];

Next, iterate through each element in the repos array to extract the relevant data for each repository using the selectors defined above.

repos.forEach(repo => { ... });

Finally, the extracted data for each repository is added to the repoData array, and the array is returned.

repoData.push({ user, repoName, repoStar, repoDescription, tags, repoUrl });

Here’s the code for all the above steps.

const extractedRepos = await page.$$eval('article.border', repos => {
    const repoData = [];

    repos.forEach(repo => {
        const user = repo.querySelector('h3 > a:first-child').textContent.trim();
        const repoLink = repo.querySelector('h3 > a:nth-child(2)');
        const repoName = repoLink.textContent.trim();
        const repoUrl = repoLink.getAttribute('href');
        const repoStar = repo.querySelector('#repo-stars-counter-star').getAttribute('title');
        const repoDescription = repo.querySelector('div.px-3 > p').textContent.trim();
        const tagsElements = Array.from(repo.querySelectorAll('a.topic-tag'));
        const tags = tagsElements.map(tag => tag.textContent.trim());

        repoData.push({ user, repoName, repoStar, repoDescription, tags, repoUrl });
    });

    return repoData;
});

Here’s the complete code. When you run it, it’ll extract the repositories from the first page of the nodejs topic on GitHub.

import { chromium } from 'playwright';

(async () => {
    // Launch a headless browser
    const browser = await chromium.launch({ headless: true });

    // Open a new page
    const context = await browser.newContext();
    const page = await context.newPage();

    // Navigate to the Node.js topic page on GitHub
    await page.goto('https://github.com/topics/nodejs');

    const extractedRepos = await page.$$eval('article.border', repos => {
        // Array to store extracted data
        const repoData = [];

        // Extract data from each repository element
        repos.forEach(repo => {
            const user = repo.querySelector('h3 > a:first-child').textContent.trim();
            const repoLink = repo.querySelector('h3 > a:nth-child(2)');
            const repoName = repoLink.textContent.trim();
            const repoUrl = repoLink.getAttribute('href');
            const repoStar = repo.querySelector('#repo-stars-counter-star').getAttribute('title');
            const repoDescription = repo.querySelector('div.px-3 > p').textContent.trim();
            const tagsElements = Array.from(repo.querySelectorAll('a.topic-tag'));
            const tags = tagsElements.map(tag => tag.textContent.trim());

            // Add extracted data to the array
            repoData.push({ user, repoName, repoStar, repoDescription, tags, repoUrl });
        });

        // Return the extracted data
        return repoData;
    });

    console.log(`Total repositories extracted: ${extractedRepos.length}\n`);

    // Print extracted data to the console
    console.dir(extractedRepos, { depth: null }); // Show all nested data

    // Close the browser
    await browser.close();
})();

The result is:

Print Repositories Data

If you’re interested in exploring web scraping with Node.js beyond Playwright, you might find our comprehensive guide on Node.js web scraping helpful. It covers additional libraries and techniques to enhance your scraping capabilities with Node.js.

Advanced Scraping Techniques

We’ve successfully scraped a single page. Now, let’s move on to advanced scraping with Playwright. You can click buttons, fill forms, crawl multiple pages, and rotate headers and proxies to make your scraping more reliable.

Clicking Buttons and Waiting for Actions

You can load more repositories by clicking the ‘Load more…’ button at the bottom of the page. Here are the actions to tell Playwright to load more repositories:

  1. Wait for the “Load more…” button to appear.

  2. Click the “Load more…” button.

  3. Wait for the new repositories to load before proceeding.

Add Load more button processing

Clicking buttons in Playwright is straightforward! Simply pass a valid CSS selector to the locator method, which efficiently finds the element on the page.

const button = page.locator('button[type="submit"].ajax-pagination-btn.f6');
await button.click();

In this case, the page.locator() method searches for a button element with attributes type="submit" and CSS classes ajax-pagination-btn and f6.

Handling Dynamic Content and Navigation

To scrape multiple pages using Playwright, you need to click the “Load more…” button repeatedly until you reach the end. However, we can write code to automate this process and scrape a specific number of repositories you want. For example, imagine there are 10,000 repositories for “nodejs” and you only want to extract data for 1,000 of them.

Here’s how the script will crawl multiple pages:

  1. Open the ‘nodejs’ topic page on GitHub.

  2. Create an empty set to store unique repository data entries.

  3. The $$eval function extracts data from each repository, such as the username, repository name, star count, description, tags, and URL.

  4. The extracted data is stored only if it’s unique.

  5. The code checks for a “Load more…” button using a specific locator. If available, it clicks the button to move to the next page and repeats the scraping process.

  6. If no button is found, the loop exits.

  7. Finally, the launched browser instance closes.

Here’s the code:

import { chromium } from 'playwright';

async function scrapeData(numRepos) {
    const browser = await chromium.launch({ headless: true });
    const context = await browser.newContext();
    const page = await context.newPage();
    await page.goto('https://github.com/topics/nodejs');

    const uniqueRepos = new Set();

    while (uniqueRepos.size < numRepos) {
        const extractedData = await page.$$eval('article.border', repos => {
            return repos.map(repo => {
                const userLink = repo.querySelector('h3 > a:first-child');
                const repoLink = repo.querySelector('h3 > a:last-child');
                const user = userLink?.textContent.trim() || '';
                const repoName = repoLink?.textContent.trim() || '';
                const repoStar = repo.querySelector('#repo-stars-counter-star')?.title || '';
                const repoDescription = repo.querySelector('div.px-3 > p')?.textContent.trim() || '';
                const tags = Array.from(repo.querySelectorAll('a.topic-tag')).map(tag => tag.textContent.trim());
                const repoUrl = repoLink?.href || '';

                return { user, repoName, repoStar, repoDescription, tags, repoUrl };
            });
        });

        extractedData.forEach(entry => uniqueRepos.add(JSON.stringify(entry)));

        if (uniqueRepos.size >= numRepos) {
            break;
        }

        const button = page.locator('button[type="submit"].ajax-pagination-btn.f6');
        if (await button.count() === 0) {
            console.log('Pagination button not found. All data scraped.');
            break;
        }

        await button.click();
        await page.waitForLoadState('networkidle');
    }

    const uniqueList = Array.from(uniqueRepos).slice(0, numRepos).map(entry => JSON.parse(entry));
    console.dir(uniqueList, { depth: null });

    await browser.close();
}

scrapeData(30);

Handling Errors

Several errors can arise during scraping web pages due to various factors, including human errors like providing a non-functioning URL or failing to click a button. Additionally, the targeted data element might be absent on the page, for example, a code repository might lack a description or stars.

Fortunately, there are strategies to handle these challenges. A common one is using try/catch blocks. These blocks allow you to gracefully handle errors, such as failed page navigation or timeouts, preventing your code from crashing and enabling continued execution.

Here’s the complete code with handling errors and edge cases:

import { chromium } from 'playwright';

async function scrapeData(numRepos) {
    let browser;
    try {
        browser = await chromium.launch({ headless: true });
        const context = await browser.newContext();
        const page = await context.newPage();
        await page.goto('https://github.com/topics/nodejs');

        const uniqueRepos = new Set();

        while (uniqueRepos.size < numRepos) {
            const extractedData = await page.$$eval('article.border', repos => {
                return repos.map(repo => {
                    const userLink = repo.querySelector('h3 > a:first-child');
                    const repoLink = repo.querySelector('h3 > a:last-child');
                    const user = userLink?.textContent.trim() || '';
                    const repoName = repoLink?.textContent.trim() || '';
                    const repoStar = repo.querySelector('#repo-stars-counter-star')?.title || '';
                    const repoDescription = repo.querySelector('div.px-3 > p')?.textContent.trim() || '';
                    const tags = Array.from(repo.querySelectorAll('a.topic-tag')).map(tag => tag.textContent.trim());
                    const repoUrl = repoLink?.href || '';

                    return { user, repoName, repoStar, repoDescription, tags, repoUrl };
                });
            });

            if (extractedData.length === 0) {
                console.log('No articles found on this page.');
                break;
            }

            extractedData.forEach(entry => uniqueRepos.add(JSON.stringify(entry)));

            if (uniqueRepos.size >= numRepos) {
                break;
            }

            const button = page.locator('button[type="submit"].ajax-pagination-btn.f6');
            if (await button.count() === 0) {
                console.log('Next button not found. All data scraped.');
                break;
            }

            await button.click();
            await page.waitForLoadState('networkidle');
        }

        const uniqueList = Array.from(uniqueRepos).slice(0, numRepos).map(entry => JSON.parse(entry));
        console.dir(uniqueList, { depth: null });
    } catch (error) {
        console.error('Error during scraping:', error);
    } finally {
        if (browser) {
            await browser.close();
        }
    }
}

scrapeData(30).catch(error => console.error('Unhandled error:', error));

Using Proxies with Playwright

Scraping data from websites can sometimes be challenging. Websites may restrict access based on your location or block your IP address. This is where proxies come in handy. Proxies help bypass these restrictions by hiding your real IP address and location.

First, get your proxy from the Free Proxy List. Then, add a proxy object to your browser launch options. Within the proxy object, set the server parameter to your proxy URL. Finally, launch the browser by calling the chromium.launch function, providing the launchOptions object you just defined.

import { chromium } from 'playwright';

const proxyUrl = 'http://20.210.113.32:80';

const launchOptions = {
    proxy: {
        server: proxyUrl
    }
};

(async () => {
    const browser = await chromium.launch(launchOptions);
    const page = await browser.newPage();

    await page.goto('http://httpbin.org/ip');

    const pageContent = await page.textContent('body');
    console.log(pageContent);

    await browser.close();
})();

The result is:

Check your proxy work

We did it! The IP address shown on the page matches the proxy’s address, confirming that Playwright is routing requests through the specified proxy.

Note: Free proxies are not recommended due to their unreliability. Specifically, their short lifespan makes them unsuitable for real-world scenarios.
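In practice, you’ll usually work with a paid proxy that requires authentication. Playwright’s proxy option also accepts a username and password, as in the following sketch (the proxy address and credentials here are placeholders, not a real endpoint):

import { chromium } from 'playwright';

const launchOptions = {
    proxy: {
        server: 'http://proxy.example.com:8080', // placeholder proxy endpoint
        username: 'your-username',               // placeholder credentials
        password: 'your-password'
    }
};

(async () => {
    const browser = await chromium.launch(launchOptions);
    const page = await browser.newPage();

    // Verify the outgoing IP address through the authenticated proxy
    await page.goto('http://httpbin.org/ip');
    console.log(await page.textContent('body'));

    await browser.close();
})();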

Intercepting HTTP Requests

With Playwright, you can easily monitor and modify network traffic, such as HTTP and HTTPS requests, XMLHttpRequests (XHRs), and fetch requests. Below is a code snippet that shows how to modify a request header.

import { chromium } from 'playwright';

(async () => {
    const browser = await chromium.launch({ headless: true });
    const context = await browser.newContext();
    const page = await context.newPage();

    await page.route('https://httpbin.org/headers', (route, request) => {
        // Get original headers
        const originalHeaders = request.headers();

        // Modify the Accept-Language and User-Agent headers
        const modifiedHeaders = { ...originalHeaders };
        modifiedHeaders['accept-language'] = 'fr-FR'; // Change to French
        modifiedHeaders['user-agent'] = 'Mozilla/5.0 (Windows Phone 10.0; Android 4.2.1; Microsoft; RM-1127_16056) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Mobile Safari/537.36 Edge/12.10536';

        // Continue the request with modified headers
        route.continue({
            headers: modifiedHeaders,
        });
    });

    // Make the request with modified headers
    await page.goto('https://httpbin.org/headers');

    // Extract the data to see if the fields are updated
    const response = await page.evaluate(() => {
        return JSON.parse(document.querySelector('pre').textContent);
    });

    console.log('Response:', response);

    await browser.close();
})();

The code sets up a route handler for the URL https://httpbin.org/headers. This handler intercepts requests and modifies specific headers, like “Accept-Language” and “User-Agent”, before sending them.

Inside the handler function, it modifies the Accept-Language header to 'fr-FR' (French) and the User-Agent header as well. After these modifications, the handler continues the request with the updated headers using route.continue().

Original header - visit https://httpbin.org/headers to see the original header.

Research your headers

Modified header - returned by our code.

Research your headers
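Request interception is also handy for blocking resources you don’t need, which can speed up scraping noticeably. As a small sketch (the glob pattern below is just one example), you could abort all image requests before navigating:

// Abort any request whose URL ends with a common image extension
await page.route('**/*.{png,jpg,jpeg,gif,svg}', route => route.abort());

// Subsequent navigations will load without images
await page.goto('https://github.com/topics/nodejs');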

Filling Forms

Filling and submitting forms is straightforward using Playwright. You just have to pass the selectors to the .fill() function to fill in the values in the input fields such as username and password. Then use the .click() function to submit the form.

Let’s see this in action by using Playwright to log in to our Reddit account.

import { chromium } from 'playwright';

(async () => {

    const browser = await chromium.launch({ headless: true });

    const page = await browser.newPage();

    await page.goto('https://reddit.com/login');

    await page.fill('input[name="username"]', "satyamtri");
    await page.fill('input[name="password"]', "<secret-password>");

    await page.click('.login');

    await page.waitForNavigation();

    await page.screenshot({
        path: "reddit.png",
        fullPage: false
    });

    await browser.close();

})();

Research Reddit Page

Screenshot Capture of Web Pages

Playwright lets you capture screenshots of web pages. This feature is valuable for visual verification, as it allows you to capture the snapshot at any point in time.

import { chromium } from 'playwright';

async function screenShot() {
    const browser = await chromium.launch({
        headless: true
    });

    const context = await browser.newContext();
    const page = await context.newPage();

    await page.setViewportSize({ width: 1280, height: 800 }); // set screenshot dimension
    await page.goto('https://github.com/topics/nodejs')
    await page.screenshot({ path: 'images/screenshot.png' })
    await browser.close()
}
screenShot();

Find Repositories

To capture the entire page, set the fullPage property to true. You can also save the screenshot as a JPEG by using a .jpg or .jpeg file extension in the path.

await page.screenshot({ path: 'images/screenshot.jpg', fullPage: true });

To capture a specific area on a webpage, use the clip property. It requires defining four values:

  • x: Horizontal distance from the top-left corner of the capture area.

  • y: Vertical distance from the top-left corner of the capture area.

  • width: Width of the capture area.

  • height: Height of the capture area.

await page.screenshot({
    path: "images/screenshot.png",
    fullPage: false,
    clip: {
        x: 5,
        y: 5,
        width: 320,
        height: 160
    }
});

Use topics

Save Scraped Data to Excel

Awesome! We’ve successfully scraped the data. Let’s save it to an Excel file instead of printing it to the console.

We’ll use the exceljs package to write the data to the Excel file. But before doing this, install it using the following command:

npm install exceljs

Here’s the code snippet to store data in the Excel file:

const workbook = new ExcelJS.Workbook();
const sheet = workbook.addWorksheet('GitHub Repositories');
sheet.columns = [
    { header: 'User', key: 'user' },
    { header: 'Repository Name', key: 'repoName' },
    { header: 'Stars', key: 'repoStar' },
    { header: 'Description', key: 'repoDescription' },
    { header: 'Repository URL', key: 'repoUrl' },
    { header: 'Tags', key: 'tags' }
];

uniqueList.forEach(entry => {
    sheet.addRow(entry);
});

await workbook.xlsx.writeFile('github_repos.xlsx');

The code snippet uses several functions:

  1. new ExcelJS.Workbook(): Creates a new ExcelJS Workbook object.

  2. workbook.addWorksheet('GitHub Repositories'): Adds a new worksheet named “GitHub Repositories” to the workbook.

  3. sheet.columns: Defines the columns in the worksheet. Each column object specifies the header and key for that column.

  4. sheet.addRow(entry): Adds a row to the worksheet with data from the entry object (represents data for a single row).

  5. workbook.xlsx.writeFile('github_repos.xlsx'): Writes the workbook to a file named “github_repos.xlsx” in the XLSX format.

The complete code:

import { chromium } from 'playwright';
import ExcelJS from 'exceljs';

async function scrapeData(numRepos) {
    let browser;
    try {
        browser = await chromium.launch({ headless: true });
        const context = await browser.newContext();
        const page = await context.newPage();
        await page.goto('https://github.com/topics/nodejs');

        const uniqueRepos = new Set();

        while (uniqueRepos.size < numRepos) {
            const extractedData = await page.$$eval('article.border', repos => {
                return repos.map(repo => {
                    const userLink = repo.querySelector('h3 > a:first-child');
                    const repoLink = repo.querySelector('h3 > a:last-child');
                    const user = userLink?.textContent.trim() || '';
                    const repoName = repoLink?.textContent.trim() || '';
                    const repoStar = repo.querySelector('#repo-stars-counter-star')?.title || '';
                    const repoDescription = repo.querySelector('div.px-3 > p')?.textContent.trim() || '';
                    const tags = Array.from(repo.querySelectorAll('a.topic-tag')).map(tag => tag.textContent.trim());
                    const repoUrl = repoLink?.href || '';

                    return { user, repoName, repoStar, repoDescription, tags, repoUrl };
                });
            });

            if (extractedData.length === 0) {
                console.log('No articles found on this page.');
                break;
            }

            extractedData.forEach(entry => uniqueRepos.add(JSON.stringify(entry)));

            if (uniqueRepos.size >= numRepos) {
                break;
            }

            const button = page.locator('button[type="submit"].ajax-pagination-btn.f6');
            if (await button.count() === 0) {
                console.log('Next button not found. All data scraped.');
                break;
            }

            await button.click();
            await page.waitForLoadState('networkidle');
        }

        const uniqueList = Array.from(uniqueRepos).slice(0, numRepos).map(entry => JSON.parse(entry));

        // Save data to Excel file
        const workbook = new ExcelJS.Workbook();
        const sheet = workbook.addWorksheet('GitHub Repositories');
        sheet.columns = [
            { header: 'User', key: 'user' },
            { header: 'Repository Name', key: 'repoName' },
            { header: 'Stars', key: 'repoStar' },
            { header: 'Description', key: 'repoDescription' },
            { header: 'Repository URL', key: 'repoUrl' },
            { header: 'Tags', key: 'tags' }
        ];

        uniqueList.forEach(entry => {
            sheet.addRow(entry);
        });
        await workbook.xlsx.writeFile('github_repos.xlsx');
        console.log('Data saved to excel file.');

    } catch (error) {
        console.error('Error during scraping:', error);
    } finally {
        if (browser) {
            await browser.close();
        }
    }
}
scrapeData(30).catch(error => console.error('Unhandled error:', error));

File with Results

Comparison with Other Tools

Other tools like Selenium and Puppeteer offer similar functionalities to Playwright. However, each tool has its strengths in terms of execution speed, developer experience, and community support.

Playwright stands out for its ability to run seamlessly across multiple browsers (including Chromium, WebKit, and Firefox) using a single API. It also has extensive documentation and supports various programming languages like Python, Node.js, Java, and .NET.

While Puppeteer is also developer-friendly and easy to set up, it’s limited to JavaScript and Chromium browsers. Selenium, on the other hand, offers the broadest browser and language support, but it can be slower and less user-friendly.

In terms of speed, Puppeteer generally takes the lead, followed closely by Playwright (with Playwright even surpassing Puppeteer in some cases). Selenium lags in performance.

Let’s examine the npm trends and popularity of these three libraries. The data suggests Playwright’s adoption is growing among developers.

Libraries Full Comparison

Let’s take a look at the comparison table:

| Parameter | Playwright | Puppeteer | Selenium |
| --- | --- | --- | --- |
| Speed | Fast | Fast | Slow |
| Documentation | Excellent | Excellent | Fair |
| Developer Experience | Best | Good | Fair |
| Language Support | JavaScript, Python, C#, Java | JavaScript | Java, Python, C#, Ruby, JavaScript, Kotlin |
| By | Microsoft | Google | Community and Sponsors |
| Community | Small but active | Large and active | Large and active |
| Browser Support | Chromium, Firefox, and WebKit | Chromium | Chrome, Firefox, IE, Edge, Opera, Safari |

Conclusion

Playwright offers a powerful and versatile toolkit for web scraping tasks with excellent documentation and a growing community. By leveraging its capabilities, you can efficiently extract valuable data from websites, automate repetitive browser interactions, and streamline various workflows.

In this guide, we focused on the GitHub Topics page, where you can choose a topic (like nodejs) and specify the number of repositories to scrape. We covered handling errors and edge cases during scraping, using proxies to avoid detection, and how Playwright compares with Selenium and Puppeteer.

As you gain experience, explore advanced techniques in detail like configuring proxies, intercepting requests, managing cookies, and blocking unnecessary resources and images. You can learn more about Playwright by visiting its official documentation, which is easy to understand and detailed.
