Web Scraping with Playwright and Node.js
Extracting data from websites has become essential in various fields, ranging from data collection to competitive analysis. However, content that is loaded through JavaScript, AJAX requests, and complex interactions can be tricky to capture with basic tools. This is where headless browser automation tools, such as Playwright, come into play.
In this blog, we’ll dive into how to use Playwright for web scraping. We’ll explore its advantages, scraping single and multiple pages, error handling, button clicks, form submissions, and crucial techniques like leveraging proxies to bypass detection and intercepting requests. We’ll also compare Playwright with other web scraping and automation tools like Puppeteer and Selenium.
What is Playwright?
Playwright is a popular open-source framework built on Node.js for web testing and automation. It allows testing Chromium, Firefox, and WebKit with a single API. Playwright was developed by Microsoft and offers efficient, reliable, and fast cross-browser web automation. It works across multiple platforms, including Windows, Linux, and macOS, and supports all modern web browsers. Additionally, Playwright provides cross-language support for TypeScript, JavaScript, Python, .NET, and Java.
Getting Started with Playwright
Before scraping websites, let’s prepare our system’s Node.js and Python environments for Playwright.
Project Setup & Installation
For Node.js:
Ensure you have the latest version of Node.js installed on your system. For this project, we’ll be using Node.js v20.9.0.
Create a new directory for your project, navigate to it, and initialize a Node.js project by running npm init -y. The -y flag skips the interactive prompts and creates a package.json file with default values.
mkdir playwright-scraping
cd playwright-scraping
npm init -y
Now, you can install Playwright using NPM:
npm install playwright
To use Playwright, you’ll also need to install a compatible browser. Each Playwright version requires specific browser binary versions. Run the following command to install the latest browser versions:
npx playwright install
This will install the latest versions of Chromium, Firefox, and WebKit. You can use any of these browsers in your code, but we’ll use Chromium for this tutorial.
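If you only plan to use Chromium, as we do in this tutorial, you can optionally install just that browser instead of all three:
npx playwright install chromium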
Here’s how the complete process looks:
Open the package.json file and add "type": "module" to support modern JavaScript syntax.
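After this change, your package.json might look roughly like the following; the exact fields and version numbers depend on what npm generated and which Playwright version you installed:
{
  "name": "playwright-scraping",
  "version": "1.0.0",
  "type": "module",
  "dependencies": {
    "playwright": "^1.40.0"
  }
}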
Finally, open the project in your preferred code editor and create a new file named index.js.
For Python:
Ensure you have the latest version of Python installed on your system. Next, install the Playwright Python library and the necessary browsers.
pip install playwright
playwright install
Here’s how the complete process looks:
Launching Playwright
For Node.js:
Let’s write our first Playwright code that opens a new page in a Chromium browser.
import { chromium } from 'playwright';
async function main() {
// Launch a new instance of a Chromium browser with headless mode
// disabled for visibility
const browser = await chromium.launch({
headless: false
});
// Create a new Playwright context to isolate browsing session
const context = await browser.newContext();
// Open a new page/tab within the context
const page = await context.newPage();
// Navigate to the GitHub topics homepage
await page.goto('https://github.com/topics');
// Wait for 1 second to ensure page content loads properly
await page.waitForTimeout(1000);
// Close the browser instance after task completion
await browser.close();
}
// Execute the main function
main();
The above code imports the chromium module to control Chromium-based browsers. It then launches a new, visible instance of the Chromium browser using chromium.launch() with the headless option set to false. A new browser page is opened, and the page.goto() function navigates to the GitHub Topics web page. A one-second wait lets you see the page before the browser is finally closed.
For Python:
Playwright for Python offers both synchronous and asynchronous APIs. The following example shows the asynchronous API in action:
from playwright.async_api import async_playwright
import asyncio
async def main():
# Initialize Playwright asynchronously
async with async_playwright() as p:
# Launch a Chromium browser instance with headless mode disabled
browser = await p.chromium.launch(headless=False)
# Create a new context within the browser to isolate browsing session
context = await browser.new_context()
# Create a new page/tab within the context
page = await context.new_page()
# Navigate to the GitHub topics homepage
await page.goto('https://github.com/topics')
# Wait for 1 second to ensure page content loads properly
await page.wait_for_timeout(1000)
# Close the browser instance after task completion
await browser.close()
# Run the main function asynchronously
asyncio.run(main())
Both the Node.js and Python code share similarities, but there are some key differences. Python uses the asyncio library for asynchronous operations. Additionally, function naming conventions differ: Python uses snake_case (e.g., wait_for_timeout), while JavaScript uses camelCase (e.g., waitForTimeout).
The following example shows the synchronous API in action:
from playwright.sync_api import sync_playwright
def main():
# Initialize Playwright synchronously
with sync_playwright() as p:
# Launch a Chromium browser instance with headless mode disabled
browser = p.chromium.launch(headless=False)
# Create a new context within the browser to isolate browsing session
context = browser.new_context()
# Create a new page/tab within the context
page = context.new_page()
# Navigate to the GitHub topics homepage
page.goto('https://github.com/topics')
# Wait for 1 second to ensure page content loads properly
page.wait_for_timeout(1000)
# Close the browser instance after task completion
browser.close()
if __name__ == '__main__':
main()
The Node.js and Python code above will open the following page.
Basic Scraping with Playwright
Now that your environment is set up, let’s dive into some basic web scraping with Playwright. You can do everything you normally do manually in the browser, from generating screenshots to crawling multiple pages.
Selecting Data to Scrape
We’ll be extracting data from GitHub topics. This will allow you to select the topic and the number of repositories you want to extract. The scraper will then return the information associated with the chosen topic.
We’ll use Playwright to launch a browser, navigate to the GitHub topics page, and extract the necessary information. This includes details such as the repository owner, repository name, repository URL, the number of stars the repository has, its description, and any associated tags.
Locating Elements and Extracting Data
When you open the topic page, you’ll see 20 repositories. Each entry, shown as an <article> element, displays information about a specific repository. You can expand each element to view more detailed information about the corresponding repository.
The image below shows an expanded <article> element, displaying all the information about the repository.
Extracting User and Repository Information:
- User: Use h3 > a:first-child to target the first anchor tag directly within an <h3> tag.
- Repository Name: Target the second child element within the same <h3> parent. This child holds both the name and the URL. Use the textContent property to extract the name and the getAttribute('href') method to extract the URL.
- Number of Stars: Use #repo-stars-counter-star to select the element and extract the actual number from its title attribute.
- Repository Description: Use div.px-3 > p to select the first paragraph within a div with the class px-3.
- Repository Tags: Use a.topic-tag to select all anchor tags with the class topic-tag.
Common Functions:
To use the above selectors effectively, here are the common functions (a short usage sketch follows this list):
- $$eval(selector, function): Selects all elements matching the selector and passes them as an array to the function. The function’s return value is then returned.
- $eval(selector, function): Selects the first element matching the selector and passes it as an argument to the function. The function’s return value is then returned.
- querySelector(selector): Returns the first element matching the selector.
- querySelectorAll(selector): Returns a list of all elements matching the selector.
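For instance, here’s a minimal sketch of how $eval and $$eval are typically called, assuming page has already navigated to a GitHub topic page and reusing the selectors listed above:
// $eval runs the callback on the first matching element and returns its result
const firstUser = await page.$eval('article.border h3 > a:first-child', el => el.textContent.trim());
// $$eval runs the callback once, receiving the full array of matches
const allTags = await page.$$eval('a.topic-tag', tags => tags.map(tag => tag.textContent.trim()));
console.log(firstUser, allTags.length);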
Here’s a code snippet:
repos.forEach(repo => {
const user = repo.querySelector('h3 > a:first-child').textContent.trim();
const repoLink = repo.querySelector('h3 > a:nth-child(2)');
const repoName = repoLink.textContent.trim();
const repoUrl = repoLink.getAttribute('href');
const repoStar = repo.querySelector('#repo-stars-counter-star').getAttribute('title');
const repoDescription = repo.querySelector('div.px-3 > p').textContent.trim();
const tagsElements = Array.from(repo.querySelectorAll('a.topic-tag'));
const tags = tagsElements.map(tag => tag.textContent.trim());
Let’s look at the complete process for extracting all repositories from a single page.
The process begins with the page.$$eval function. This function selects all <article> elements that have the border class and passes them as an array to the provided function. We’ll define all variables and selectors within this function.
const extractedRepos = await page.$$eval('article.border', repos => { ... });
Also, create an empty array called repoData to store the extracted information.
const repoData = [];
Next, iterate through each element in the repos array to extract all the relevant data for each repository using the provided selectors.
repos.forEach(repo => { ... });
Finally, the extracted data for each repository is added to the repoData array, and the array is returned.
repoData.push({ user, repoName, repoStar, repoDescription, tags, repoUrl });
Here’s the code for all the above steps.
const extractedRepos = await page.$$eval('article.border', repos => {
const repoData = [];
repos.forEach(repo => {
const user = repo.querySelector('h3 > a:first-child').textContent.trim();
const repoLink = repo.querySelector('h3 > a:nth-child(2)');
const repoName = repoLink.textContent.trim();
const repoUrl = repoLink.getAttribute('href');
const repoStar = repo.querySelector('#repo-stars-counter-star').getAttribute('title');
const repoDescription = repo.querySelector('div.px-3 > p').textContent.trim();
const tagsElements = Array.from(repo.querySelectorAll('a.topic-tag'));
const tags = tagsElements.map(tag => tag.textContent.trim());
repoData.push({ user, repoName, repoStar, repoDescription, tags, repoUrl });
});
return repoData;
});
Here’s the complete code. When you run this code, it’ll extract the first page of GitHub Topics.
import { chromium } from 'playwright';
(async () => {
// Launch a headless browser
const browser = await chromium.launch({ headless: true });
// Open a new page
const context = await browser.newContext();
const page = await context.newPage();
// Navigate to the Node.js topic page on GitHub
await page.goto('https://github.com/topics/nodejs');
const extractedRepos = await page.$$eval('article.border', repos => {
// Array to store extracted data
const repoData = [];
// Extract data from each repository element
repos.forEach(repo => {
const user = repo.querySelector('h3 > a:first-child').textContent.trim();
const repoLink = repo.querySelector('h3 > a:nth-child(2)');
const repoName = repoLink.textContent.trim();
const repoUrl = repoLink.getAttribute('href');
const repoStar = repo.querySelector('#repo-stars-counter-star').getAttribute('title');
const repoDescription = repo.querySelector('div.px-3 > p').textContent.trim();
const tagsElements = Array.from(repo.querySelectorAll('a.topic-tag'));
const tags = tagsElements.map(tag => tag.textContent.trim());
// Add extracted data to the array
repoData.push({ user, repoName, repoStar, repoDescription, tags, repoUrl });
});
// Return the extracted data
return repoData;
});
console.log(`Total repositories extracted: ${extractedRepos.length}\n`);
// Print extracted data to the console
console.dir(extractedRepos, { depth: null }); // Show all nested data
// Close the browser
await browser.close();
})();
The result is:
If you’re interested in exploring more about web scraping with Node.js beyond Playwright, you might find our comprehensive guide on Node.js web scraping helpful. It covers additional libraries and techniques to enhance your scraping capabilities with Node.js.
Advanced Scraping Techniques
We’ve successfully scraped a single page. Now, let’s move on to advanced scraping with Playwright: clicking buttons, filling forms, crawling multiple pages, and rotating headers and proxies to make your scraping more reliable.
Clicking Buttons and Waiting for Actions
You can load more repositories by clicking the ‘Load more…’ button at the bottom of the page. Here are the actions to tell Playwright to load more repositories:
Wait for the “Load more…” button to appear.
Click the “Load more…” button.
Wait for the new repositories to load before proceeding.
Clicking buttons in Playwright is straightforward! Simply pass a valid CSS selector to the locator method, which efficiently finds the element on the page.
const button = await page.locator('button[type="submit"].ajax-pagination-btn.f6');
await button.click();
In this case, the page.locator() method searches for a button element with the attribute type="submit" and the CSS classes ajax-pagination-btn and f6.
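Putting the three steps together, a minimal sketch (assuming page is already on the topic page) could look like this:
// Locate the "Load more..." button by its CSS selector
const loadMoreButton = page.locator('button[type="submit"].ajax-pagination-btn.f6');
// 1. Wait for the button to become visible
await loadMoreButton.waitFor({ state: 'visible' });
// 2. Click it
await loadMoreButton.click();
// 3. Wait for the network to go idle so the new repositories are in the DOM
await page.waitForLoadState('networkidle');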
Handling Dynamic Content and Navigation
To scrape multiple pages using Playwright, you need to click the “Load more…” button repeatedly until you reach the end. We can write code to automate this process and scrape only as many repositories as you want. For example, imagine there are 10,000 repositories for “nodejs” and you only want to extract data for 1,000 of them.
Here’s how the script will crawl multiple pages:
- Open the ‘nodejs’ topic page on GitHub.
- Create an empty Set to store unique repository data entries.
- The $$eval function extracts data from each repository: username, repository name, star count, description, tags, and URL.
- The extracted data is stored only if it’s unique.
- The code checks for the pagination (“Load more…”) button using a specific locator. If it’s available, the script clicks it to load the next batch and repeats the scraping process.
- If no button is found, the loop exits.
- Finally, the launched browser instance is closed.
Here’s the code:
import { chromium } from 'playwright';
async function scrapeData(numRepos) {
const browser = await chromium.launch({ headless: true });
const context = await browser.newContext();
const page = await context.newPage();
await page.goto('https://github.com/topics/nodejs');
const uniqueRepos = new Set();
while (uniqueRepos.size < numRepos) {
const extractedData = await page.$$eval('article.border', repos => {
return repos.map(repo => {
const userLink = repo.querySelector('h3 > a:first-child');
const repoLink = repo.querySelector('h3 > a:last-child');
const user = userLink?.textContent.trim() || '';
const repoName = repoLink?.textContent.trim() || '';
const repoStar = repo.querySelector('#repo-stars-counter-star')?.title || '';
const repoDescription = repo.querySelector('div.px-3 > p')?.textContent.trim() || '';
const tags = Array.from(repo.querySelectorAll('a.topic-tag')).map(tag => tag.textContent.trim());
const repoUrl = repoLink?.href || '';
return { user, repoName, repoStar, repoDescription, tags, repoUrl };
});
});
extractedData.forEach(entry => uniqueRepos.add(JSON.stringify(entry)));
if (uniqueRepos.size >= numRepos) {
break;
}
const button = page.locator('button[type="submit"].ajax-pagination-btn.f6');
// locator() always returns a Locator object, so check whether it actually matches an element
if (await button.count() === 0) {
console.log('Pagination button not found. All data scraped.');
break;
}
await button.click();
await page.waitForLoadState('networkidle');
}
const uniqueList = Array.from(uniqueRepos).slice(0, numRepos).map(entry => JSON.parse(entry));
console.dir(uniqueList, { depth: null });
await browser.close();
}
scrapeData(30);
Handling Errors
Several errors can arise while scraping web pages, due to factors including human error, such as providing a non-working URL or failing to click a button. Additionally, the targeted data element might be absent from the page; for example, a repository might lack a description or stars.
Fortunately, there are strategies to handle these challenges. A common one is using try/catch
blocks. These blocks allow you to gracefully handle errors, such as failed page navigation or timeouts, preventing your code from crashing and enabling continued execution.
Here’s the complete code with handling errors and edge cases:
import { chromium } from 'playwright';
async function scrapeData(numRepos) {
let browser;
try {
browser = await chromium.launch({ headless: true });
const context = await browser.newContext();
const page = await context.newPage();
await page.goto('https://github.com/topics/nodejs');
const uniqueRepos = new Set();
while (uniqueRepos.size < numRepos) {
const extractedData = await page.$$eval('article.border', repos => {
return repos.map(repo => {
const userLink = repo.querySelector('h3 > a:first-child');
const repoLink = repo.querySelector('h3 > a:last-child');
const user = userLink?.textContent.trim() || '';
const repoName = repoLink?.textContent.trim() || '';
const repoStar = repo.querySelector('#repo-stars-counter-star')?.title || '';
const repoDescription = repo.querySelector('div.px-3 > p')?.textContent.trim() || '';
const tags = Array.from(repo.querySelectorAll('a.topic-tag')).map(tag => tag.textContent.trim());
const repoUrl = repoLink?.href || '';
return { user, repoName, repoStar, repoDescription, tags, repoUrl };
});
});
if (extractedData.length === 0) {
console.log('No articles found on this page.');
break;
}
extractedData.forEach(entry => uniqueRepos.add(JSON.stringify(entry)));
if (uniqueRepos.size >= numRepos) {
break;
}
const button = page.locator('button[type="submit"].ajax-pagination-btn.f6');
// locator() always returns a Locator object, so check whether it actually matches an element
if (await button.count() === 0) {
console.log('Next button not found. All data scraped.');
break;
}
await button.click();
await page.waitForLoadState('networkidle');
}
const uniqueList = Array.from(uniqueRepos).slice(0, numRepos).map(entry => JSON.parse(entry));
console.dir(uniqueList, { depth: null });
} catch (error) {
console.error('Error during scraping:', error);
} finally {
if (browser) {
await browser.close();
}
}
}
scrapeData(30).catch(error => console.error('Unhandled error:', error));
Using Proxies with Playwright
Scraping data from websites can sometimes be challenging. Websites may restrict access based on your location or block your IP address. This is where proxies come in handy. Proxies help bypass these restrictions by hiding your real IP address and location.
Firstly, get your proxy from the Free Proxy List. Then, add a proxy object to your browser launch options. Within the proxy object, set the server parameter to your proxyUrl. Finally, launch the browser by calling the chromium.launch function, providing the launchOptions object you just defined.
import { chromium } from 'playwright';
const proxyUrl = 'http://20.210.113.32:80';
const launchOptions = {
proxy: {
server: proxyUrl
}
};
(async () => {
const browser = await chromium.launch(launchOptions);
const page = await browser.newPage();
await page.goto('http://httpbin.org/ip');
const pageContent = await page.textContent('body');
console.log(pageContent);
await browser.close();
})();
The result is:
We did it! The IP address shown matches the proxy’s IP, confirming that Playwright is routing traffic through the specified proxy.
Note: Free proxies are not recommended due to their unreliability. Specifically, their short lifespan makes them unsuitable for real-world scenarios.
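If you switch to a paid provider, Playwright’s proxy option also accepts credentials. Here’s a minimal sketch; the server address, username, and password below are placeholders, not a real endpoint:
import { chromium } from 'playwright';

(async () => {
  const browser = await chromium.launch({
    proxy: {
      server: 'http://proxy.example.com:8000', // placeholder host:port from your provider
      username: 'your-username',               // placeholder credentials
      password: 'your-password'
    }
  });
  const page = await browser.newPage();
  await page.goto('http://httpbin.org/ip');
  console.log(await page.textContent('body')); // should print the proxy's IP
  await browser.close();
})();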
Intercepting HTTP Requests
With Playwright, you can easily monitor and modify network traffic, such as HTTP and HTTPS requests, XMLHttpRequests (XHRs), and fetch requests. Below is a code snippet that shows how to modify a request header.
import { chromium } from 'playwright';
(async () => {
const browser = await chromium.launch({ headless: true });
const context = await browser.newContext();
const page = await context.newPage();
await page.route('https://httpbin.org/headers', (route, request) => {
// Get original headers
const originalHeaders = request.headers();
// Modify the Accept-Language and User-Agent headers
const modifiedHeaders = { ...originalHeaders };
modifiedHeaders['accept-language'] = 'fr-FR'; // Change to French
modifiedHeaders['user-agent'] = 'Mozilla/5.0 (Windows Phone 10.0; Android 4.2.1; Microsoft; RM-1127_16056) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Mobile Safari/537.36 Edge/12.10536';
// Continue the request with modified headers
route.continue({
headers: modifiedHeaders,
});
});
// Make the request with modified headers
await page.goto('https://httpbin.org/headers');
// Extract the data to see if the fields are updated
const response = await page.evaluate(() => {
return JSON.parse(document.querySelector('pre').textContent);
});
console.log('Response:', response);
await browser.close();
})();
The code sets up a route handler for the URL https://httpbin.org/headers. This handler intercepts requests and modifies specific headers, like “Accept-Language” and “User-Agent”, before sending them.
Inside the handler function, it sets the Accept-Language header to 'fr-FR' (French) and replaces the User-Agent header as well. After these modifications, the handler continues the request with the updated headers using route.continue().
Original header - visit https://httpbin.org/headers to see the original header.
Modified header - returned by our code.
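The same route API can also abort requests entirely, which is useful for skipping heavy resources you don’t need while scraping. Here’s a small sketch that blocks images and fonts; the catch-all pattern and the chosen resource types are just one possible configuration:
import { chromium } from 'playwright';

(async () => {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();
  // Abort image and font requests; let everything else continue
  await page.route('**/*', route => {
    const type = route.request().resourceType();
    if (type === 'image' || type === 'font') {
      return route.abort();
    }
    return route.continue();
  });
  await page.goto('https://github.com/topics/nodejs');
  console.log(await page.title());
  await browser.close();
})();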
Filling Forms
Filling and submitting forms is straightforward with Playwright. Simply pass the selectors to the .fill() function to fill in values for input fields such as username and password, then use the .click() function to submit the form.
Let’s see this in action by using Playwright to log in to our Reddit account.
import { chromium } from 'playwright';
(async () => {
const browser = await chromium.launch({ headless: true });
const page = await browser.newPage();
await page.goto('https://reddit.com/login');
await page.fill('input[name="username"]', "satyamtri");
await page.fill('input[name="password"]', "<secret-password>");
await page.click('.login');
await page.waitForNavigation();
await page.screenshot({
path: "reddit.png",
fullPage: false
});
await browser.close();
})();
Screenshot Capture of Web Pages
Playwright lets you capture screenshots of web pages. This feature is valuable for visual verification, as it allows you to capture the snapshot at any point in time.
import { chromium } from 'playwright';
async function screenShot() {
const browser = await chromium.launch({
headless: true
});
const context = await browser.newContext();
const page = await context.newPage();
await page.setViewportSize({ width: 1280, height: 800 }); // set screenshot dimension
await page.goto('https://github.com/topics/nodejs')
await page.screenshot({ path: 'images/screenshot.png' })
await browser.close()
}
screenShot();
To capture the entire page, set the fullPage property to true. You can also change the image format to jpg or jpeg to save it in a different format.
await page.screenshot({ path: 'images/screenshot.jpg', fullPage: true });
To capture a specific area on a webpage, use the clip property. It requires defining four values:
- x: Horizontal distance from the top-left corner of the capture area.
- y: Vertical distance from the top-left corner of the capture area.
- width: Width of the capture area.
- height: Height of the capture area.
await page.screenshot({
path: "images/screenshot.png", fullPage: false, clip: {
x: 5,
y: 5,
width: 320,
height: 160
}
});
Save Scraped Data to Excel
Awesome! We’ve successfully scraped the data. Let’s save it to an Excel file instead of printing it to the console.
We’ll use the exceljs package to write the data to an Excel file. But before doing this, install it using the following command:
npm install exceljs
Here’s the code snippet to store data in the Excel file:
const workbook = new ExcelJS.Workbook();
const sheet = workbook.addWorksheet('GitHub Repositories');
sheet.columns = [
{ header: 'User', key: 'user' },
{ header: 'Repository Name', key: 'repoName' },
{ header: 'Stars', key: 'repoStar' },
{ header: 'Description', key: 'repoDescription' },
{ header: 'Repository URL', key: 'repoUrl' },
{ header: 'Tags', key: 'tags' }
];
uniqueList.forEach(entry => {
sheet.addRow(entry);
});
await workbook.xlsx.writeFile('github_repos.xlsx');
The code snippet uses several functions:
- new ExcelJS.Workbook(): Creates a new ExcelJS Workbook object.
- workbook.addWorksheet('GitHub Repositories'): Adds a new worksheet named “GitHub Repositories” to the workbook.
- sheet.columns: Defines the columns in the worksheet. Each column object specifies the header and key for that column.
- sheet.addRow(entry): Adds a row to the worksheet with data from the entry object (representing a single repository).
- workbook.xlsx.writeFile('github_repos.xlsx'): Writes the workbook to a file named “github_repos.xlsx” in the XLSX format.
The complete code:
import { chromium } from 'playwright';
import ExcelJS from 'exceljs';
async function scrapeData(numRepos) {
let browser;
try {
browser = await chromium.launch({ headless: true });
const context = await browser.newContext();
const page = await context.newPage();
await page.goto('https://github.com/topics/nodejs');
const uniqueRepos = new Set();
while (uniqueRepos.size < numRepos) {
const extractedData = await page.$$eval('article.border', repos => {
return repos.map(repo => {
const userLink = repo.querySelector('h3 > a:first-child');
const repoLink = repo.querySelector('h3 > a:last-child');
const user = userLink?.textContent.trim() || '';
const repoName = repoLink?.textContent.trim() || '';
const repoStar = repo.querySelector('#repo-stars-counter-star')?.title || '';
const repoDescription = repo.querySelector('div.px-3 > p')?.textContent.trim() || '';
const tags = Array.from(repo.querySelectorAll('a.topic-tag')).map(tag => tag.textContent.trim());
const repoUrl = repoLink?.href || '';
return { user, repoName, repoStar, repoDescription, tags, repoUrl };
});
});
if (extractedData.length === 0) {
console.log('No articles found on this page.');
break;
}
extractedData.forEach(entry => uniqueRepos.add(JSON.stringify(entry)));
if (uniqueRepos.size >= numRepos) {
break;
}
const button = page.locator('button[type="submit"].ajax-pagination-btn.f6');
// locator() always returns a Locator object, so check whether it actually matches an element
if (await button.count() === 0) {
console.log('Next button not found. All data scraped.');
break;
}
await button.click();
await page.waitForLoadState('networkidle');
}
const uniqueList = Array.from(uniqueRepos).slice(0, numRepos).map(entry => JSON.parse(entry));
// Save data to Excel file
const workbook = new ExcelJS.Workbook();
const sheet = workbook.addWorksheet('GitHub Repositories');
sheet.columns = [
{ header: 'User', key: 'user' },
{ header: 'Repository Name', key: 'repoName' },
{ header: 'Stars', key: 'repoStar' },
{ header: 'Description', key: 'repoDescription' },
{ header: 'Repository URL', key: 'repoUrl' },
{ header: 'Tags', key: 'tags' }
];
uniqueList.forEach(entry => {
sheet.addRow(entry);
});
await workbook.xlsx.writeFile('github_repos.xlsx');
console.log('Data saved to excel file.');
} catch (error) {
console.error('Error during scraping:', error);
} finally {
if (browser) {
await browser.close();
}
}
}
scrapeData(30).catch(error => console.error('Unhandled error:', error));
Comparison with Other Tools
Other tools like Selenium and Puppeteer offer similar functionalities to Playwright. However, each tool has its strengths in terms of execution speed, developer experience, and community support.
Playwright stands out for its ability to run seamlessly across multiple browsers (including Chromium, WebKit, and Firefox) using a single API. It also has extensive documentation and supports various programming languages like Python, Node.js, Java, and .NET.
While Puppeteer is also developer-friendly and easy to set up, it’s limited to JavaScript and Chromium browsers. Selenium, on the other hand, offers the broadest browser and language support, but it can be slower and less user-friendly.
In terms of speed, Puppeteer generally takes the lead, followed closely by Playwright (with Playwright even surpassing Puppeteer in some cases). Selenium lags in performance.
Let’s examine the npm trends and popularity of these three libraries. The data suggests Playwright’s adoption is growing among developers.
Let’s take a look at the comparison table:
| Parameter | Playwright | Puppeteer | Selenium |
|---|---|---|---|
| Speed | Fast | Fast | Slow |
| Documentation | Excellent | Excellent | Fair |
| Developer Experience | Best | Good | Fair |
| Language Support | JavaScript, Python, C#, Java | JavaScript | Java, Python, C#, Ruby, JavaScript, Kotlin |
| Backed By | Microsoft | Google | Community and Sponsors |
| Community | Small but active | Large and active | Large and active |
| Browser Support | Chromium, Firefox, and WebKit | Chromium | Chrome, Firefox, IE, Edge, Opera, Safari |
Conclusion
Playwright offers a powerful and versatile toolkit for web scraping tasks with excellent documentation and a growing community. By leveraging its capabilities, you can efficiently extract valuable data from websites, automate repetitive browser interactions, and streamline various workflows.
In this guide, we focused on the GitHub Topics page, where you can choose a topic (like nodejs) and specify the number of repositories to scrape. We covered how to handle errors and edge cases during scraping, how to use proxies to avoid detection, and how Playwright compares with Selenium and Puppeteer.
As you gain experience, explore advanced techniques in more detail, like configuring proxies, intercepting requests, managing cookies, and blocking unnecessary resources and images. You can learn more about Playwright by visiting its official documentation, which is detailed and easy to follow.