HasData

PHP Web Scraping: The Complete Guide [2026]

Valentina Skakun
Last update: 6 Apr 2026

Web scraping in PHP means fetching web pages over HTTP and extracting structured data from their HTML using DOM parsing libraries. PHP handles this well: it has a built-in cURL extension, mature libraries like Guzzle and Symfony DomCrawler, and deploys easily on most servers.

PHP is considered less popular for scraping than Python, but PHP 8.3+ handles async operations, has proper typing, and runs anywhere Composer works. If you already use PHP, you can build scrapers without switching languages.

This guide covers raw HTTP fetching, headless browser automation, and commercial API integration. It starts with native cURL for simple requests, then moves to Guzzle for async and session work, DomCrawler for HTML parsing, and Symfony Panther for JavaScript rendering. Each section shows working code you can adapt.

Best PHP Libraries for Web Scraping in 2026

PHP has libraries for every scraping task. Here are the ones that work in production.

| Library | Category | Install | Best For | Skip When |
|---|---|---|---|---|
| cURL (native) | HTTP client | Built-in | Simple scripts, no Composer | Concurrent requests needed |
| Guzzle 7.x | HTTP client | composer require guzzlehttp/guzzle | Async, middleware, production | Quick one-off scripts |
| Symfony HTTP Client | HTTP client | Included in Symfony | HTTP/2, Symfony projects | No Symfony in the stack |
| Requests for PHP | HTTP client | composer require rmccue/requests | Lightweight scripts | Middleware or async needed |
| Symfony DomCrawler | HTML parser | composer require symfony/dom-crawler | CSS selectors, XPath, production | Speed is the only concern |
| DiDOM | HTML parser | composer require imangazaliev/didom | Large documents, fast parsing | Need XPath power features |
| voku/simple_html_dom | HTML parser | composer require voku/simple_html_dom | Simple scripts, familiar API | High-volume loops |
| RoachPHP | Framework | composer require roach-php/core | Pipelines, Laravel projects | Single-page scraping |
| Spatie Crawler | Framework | composer require spatie/crawler | Recursive crawling, robots.txt | Not crawling whole sites |
| Symfony Panther | Headless browser | composer require symfony/panther | JS rendering, full interactions | Static HTML pages |
| Spatie Browsershot | Headless browser | composer require spatie/browsershot | Screenshots, HTML snapshots | No Node.js on server |
| chrome-php/chrome | Headless browser | composer require chrome-php/chrome | DevTools Protocol, no Node.js | Multiple concurrent browsers |
| php-webdriver | Headless browser | composer require php-webdriver/webdriver | Manual Selenium control | Simple wait-and-extract |

Pick one HTTP client, one parser, add a headless browser only when the site needs it. That covers most scraping work.

HTTP Clients

Guzzle remains the standard. It handles async requests, connection pooling, and middleware. Use versions 7.4.5+ to avoid CVE-2022-31042 (header leaks on redirects).

Symfony HTTP Client works well if you already use Symfony components. Optimized for HTTP/2 multiplexing, which means multiple requests over one TCP connection.

Requests for PHP has a simple API similar to Python’s requests library. Used in WordPress core. Good for lightweight scripts.

cURL is PHP’s built-in HTTP extension. No install required. Every server running PHP almost certainly has it. Good for simple fetch scripts where adding a library is not worth the overhead.

HTML Parsing

Symfony DomCrawler handles CSS selectors and XPath. Works with valid or broken HTML. Pair it with symfony/css-selector for CSS queries.

composer require symfony/dom-crawler symfony/css-selector

DiDOM parses faster than Simple HTML DOM on large documents. Uses DOMDocument under the hood with a jQuery-like API. Actively maintained.
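A minimal sketch of DiDOM's jQuery-like API, parsing an inline HTML string (the markup here is invented for illustration):

```php
<?php
require 'vendor/autoload.php';

use DiDom\Document;

// Second argument false = treat the string as HTML, not a file path
$document = new Document('<div class="item"><a href="/p1">First</a></div>', false);

// find() accepts CSS selectors and returns an array of elements
foreach ($document->find('.item a') as $link) {
    echo $link->text() . ' => ' . $link->attr('href') . "\n";
}
// First => /p1
```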

voku/simple_html_dom is a maintained fork of the original Simple HTML DOM. The original parser had memory leaks and stopped receiving updates in 2019. The fork fixes those issues and supports PHP 8+.

Crawling Frameworks

RoachPHP brings Scrapy architecture to PHP. Spiders, item pipelines, and middleware for data extraction workflows. Laravel adapter available. Requires PHP 8.2+.

Spatie Crawler does recursive site crawling with Guzzle async requests. Respects robots.txt, handles delays, filters URLs. PHP 8.4+ required.

Headless Browsers

Symfony Panther controls Chrome or Firefox via WebDriver. Renders JavaScript, executes complex scenarios. Heavy on resources but handles single-page apps.

Spatie Browsershot wraps Puppeteer for screenshots and HTML after JS execution. Requires Node.js and Puppeteer installed. Faster than Panther for static HTML snapshots.

php-webdriver gives low-level control over Selenium WebDriver. Use when you need manual control over every browser action.

chrome-php/chrome talks directly to Chrome via DevTools Protocol. No Node.js required. Good middle ground between Panther and Browsershot.

Libraries to Avoid

Goutte (deprecated 2023). The creator moved functionality into Symfony BrowserKit. Migrate to Symfony\Component\BrowserKit\HttpBrowser for the same API with active support.

Simple HTML DOM Parser (original). Memory leaks in continuous loops. The parser holds circular references that PHP garbage collection can’t clear. Use the voku fork instead.

PHPScraper by theultrasoft (archived 2023). Problems with modern JS and CSS.

PHP-Spider (mvdbos). Last commits in 2022. Use Spatie Crawler or RoachPHP.

cURL for PHP Scraping

cURL ships with PHP. If PHP is already running on the server, cURL is almost certainly available too. For scraping it covers the essentials. It fetches pages, submits login forms, and keeps session cookies alive between requests. For quick one-off scripts, it is often the fastest way to get something working.

Our target is a test site designed for scraping practice. Its login form sends an email and password in a plain POST request: no CSRF tokens, no JavaScript challenges. Open DevTools and watch the Network tab during login to see just two fields being sent.

[Screenshot: DevTools Network tab showing the login POST request with email and password parameters]

Fetching a Page

The simplest scraping request is a GET with browser-like headers. Without headers most sites respond differently to cURL than to a real browser, sometimes returning a stripped version of the page or blocking the request entirely.

<?php
$ch = curl_init('https://www.scrapingcourse.com/dashboard');

curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_TIMEOUT        => 15,
    CURLOPT_HTTPHEADER     => [
        'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/146.0.0.0 Safari/537.36',
        'Accept-Language: en-US,en;q=0.9',
    ],
]);

$html  = curl_exec($ch);
$error = curl_error($ch);

curl_close($ch);

if ($error) {
    echo "Failed: {$error}\n";
    exit(1);
}

echo strlen($html) . " bytes\n";

CURLOPT_RETURNTRANSFER is the setting people miss most often. Without it cURL prints the response body straight to stdout and returns true. Set it and the response comes back as a string.

Logging In

Most login forms send credentials as a POST with URL-encoded fields.

<?php
$cookieFile = '/tmp/session.txt';

$ch = curl_init('https://www.scrapingcourse.com/login');

curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => http_build_query([
        'email'    => getenv('SCRAPER_EMAIL'),
        'password' => getenv('SCRAPER_PASSWORD'),
    ]),
    CURLOPT_COOKIEJAR  => $cookieFile,
    CURLOPT_COOKIEFILE => $cookieFile,
]);

curl_exec($ch);
curl_close($ch);

CURLOPT_COOKIEJAR writes the cookies the server sends back into a file. CURLOPT_COOKIEFILE reads from that file and attaches the cookies to outgoing requests. Set both and subsequent requests carry the session automatically.

Scraping Behind a Login

After login the session lives in the cookie file. Pass that file to every request that needs authentication and cURL handles the rest.

<?php
$cookieFile = '/tmp/session.txt'; // same file from the login step

$ch = curl_init('https://www.scrapingcourse.com/dashboard');

curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_COOKIEFILE     => $cookieFile,
    CURLOPT_COOKIEJAR      => $cookieFile,
]);

$html  = curl_exec($ch);
$error = curl_error($ch);

curl_close($ch);

if ($error) {
    echo "Dashboard request failed: {$error}\n";
    exit(1);
}

// parse $html here

The cookie file persists between script runs as long as the server session stays alive. Only run the login step again when the file is missing or the session has expired.

cURL vs Guzzle

cURL handles sequential scraping well. When the script needs multiple requests running at the same time, Guzzle becomes the better choice.

| Situation | Use |
|---|---|
| Quick script, no Composer | cURL |
| Need to control every byte of the request | cURL |
| cURL extension already in use | cURL |
| Working in an existing Guzzle project | Guzzle |
| Need async or concurrent requests | Guzzle |
| Need middleware or retry logic | Guzzle |

Guzzle runs on top of cURL anyway, so nothing gets replaced. It just adds async support, a cleaner API, and a middleware chain on top of what cURL already does.
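Because Guzzle sits on the cURL handler, raw cURL options can still be passed through when the defaults are not enough. A sketch using the `curl` request option (the specific option values here are illustrative):

```php
<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;

$client = new Client();

$response = $client->request('GET', 'https://www.scrapingcourse.com/dashboard', [
    // Hand raw options straight to the underlying cURL handle
    'curl' => [
        CURLOPT_TCP_KEEPALIVE     => 1,   // keep the connection warm
        CURLOPT_DNS_CACHE_TIMEOUT => 300, // cache DNS lookups for 5 minutes
    ],
]);
```

This keeps the Guzzle API for everything else while still reaching the cURL layer for the rare option Guzzle doesn't expose directly.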

HTML Fetching and Session Management

Guzzle builds on cURL and adds a cookie jar, middleware, and async support. Install it first:

composer require guzzlehttp/guzzle

To see the difference, the target here is the same dashboard that requires login and session persistence across requests.

Single Request Login

The obvious approach sends credentials in one POST request.

<?php
use GuzzleHttp\Client;

$client = new Client();

$response = $client->request('POST', 'https://www.scrapingcourse.com/login', [
    'form_params' => [
        'email' => getenv('SCRAPER_EMAIL'),
        'password' => getenv('SCRAPER_PASSWORD')
    ]
]);

echo $response->getStatusCode(); // 200

This logs you in, but the session cookie disappears immediately. Try accessing the dashboard next:

$response = $client->request('GET', 'https://www.scrapingcourse.com/dashboard');
// Redirects to login page - no active session

The server sends a session cookie in the login response, but Guzzle doesn’t store it anywhere. Each request starts fresh with no authentication.

A cookie jar captures and reuses session cookies automatically across requests.

<?php
use GuzzleHttp\Client;
use GuzzleHttp\Cookie\CookieJar;

$jar = new CookieJar();
$client = new Client(['cookies' => $jar]);

// Login - cookie jar captures session cookie
$client->request('POST', 'https://www.scrapingcourse.com/login', [
    'form_params' => [
        'email' => getenv('SCRAPER_EMAIL'),
        'password' => getenv('SCRAPER_PASSWORD')
    ]
]);

// Session cookie sent automatically
$response = $client->request('GET', 'https://www.scrapingcourse.com/dashboard');
$html = $response->getBody()->getContents();

The cookie jar intercepts the Set-Cookie header from the login response and includes that cookie in every subsequent request to the same domain. No manual cookie handling required.
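The jar can also be inspected or pre-seeded by hand, which helps when debugging why a session is not sticking. A sketch with an invented cookie name and value:

```php
<?php
require 'vendor/autoload.php';

use GuzzleHttp\Cookie\CookieJar;
use GuzzleHttp\Cookie\SetCookie;

$jar = new CookieJar();

// Seed a cookie manually, e.g. one exported from a browser session
$jar->setCookie(new SetCookie([
    'Name'   => 'session_id',
    'Value'  => 'abc123',
    'Domain' => 'www.scrapingcourse.com',
]));

// Dump everything the jar currently holds
foreach ($jar->toArray() as $cookie) {
    echo "{$cookie['Name']}={$cookie['Value']} ({$cookie['Domain']})\n";
}
```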

Form-Based Login with HttpBrowser

For sites with complex forms, Symfony HttpBrowser provides automatic form handling.

<?php
require 'vendor/autoload.php';
use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\HttpClient;

$browser = new HttpBrowser(HttpClient::create());
$crawler = $browser->request('GET', 'https://www.scrapingcourse.com/login');

// Find and populate the form
$form = $crawler->selectButton('Login')->form([
    'email' => getenv('SCRAPER_EMAIL'),
    'password' => getenv('SCRAPER_PASSWORD'),
]);

// Submit and follow redirect
$crawler = $browser->submit($form);

// Extract data from dashboard
$products = $crawler->filter('.product-item')->each(function ($node) {
    return [
        'name' => $node->filter('.product-name')->text(),
        'price' => $node->filter('.product-price')->text(),
    ];
});

HttpBrowser handles cookies, redirects, and form submissions automatically. It simulates a real browser session without the overhead of running actual Chrome or Firefox.

Saving Sessions Between Runs

For scrapers that run periodically (cron jobs, scheduled tasks), save cookies to disk to skip re-authentication.

<?php
use GuzzleHttp\Cookie\FileCookieJar;

$jar = new FileCookieJar('/tmp/scraper_session.json', true);
$client = new Client(['cookies' => $jar]);

// Check if session exists
if (!file_exists('/tmp/scraper_session.json') || filesize('/tmp/scraper_session.json') === 0) {
    // First run - login and save cookies
    $client->request('POST', 'https://www.scrapingcourse.com/login', [
        'form_params' => [
            'email' => getenv('SCRAPER_EMAIL'),
            'password' => getenv('SCRAPER_PASSWORD')
        ]
    ]);
}

// Session loaded from file automatically
$response = $client->request('GET', 'https://www.scrapingcourse.com/dashboard');

FileCookieJar writes cookies to JSON format. The session persists between script executions as long as the server-side session hasn’t expired (usually 15-30 minutes for most sites).

Watch out for session expiration though. If the server invalidates the session, the scraper hits login-protected pages unauthenticated and gets redirected back to the login form. Add a check for this:

<?php
$response = $client->request('GET', 'https://www.scrapingcourse.com/dashboard', [
    'allow_redirects' => ['track_redirects' => true],
]);

// Check if we landed back on the login page (session expired)
$redirectHistory = $response->getHeader('X-Guzzle-Redirect-History');
$finalUrl = !empty($redirectHistory) ? end($redirectHistory) : '';
if (str_contains($finalUrl, '/login') || str_contains($response->getBody()->getContents(), '<form action="/login"')) {
    // Re-authenticate
    unlink('/tmp/scraper_session.json');
    $client->request('POST', 'https://www.scrapingcourse.com/login', [
        'form_params' => [
            'email' => getenv('SCRAPER_EMAIL'),
            'password' => getenv('SCRAPER_PASSWORD')
        ]
    ]);

    $response = $client->request('GET', 'https://www.scrapingcourse.com/dashboard');
}

Handling Redirects During Login

Some sites redirect after successful login (from /login to /dashboard or /home). Guzzle follows redirects automatically by default, but you can control this behavior.

<?php
// Track redirect chain
$response = $client->request('POST', 'https://www.scrapingcourse.com/login', [
    'form_params' => [
        'email' => getenv('SCRAPER_EMAIL'),
        'password' => getenv('SCRAPER_PASSWORD'),
    ],
    'allow_redirects' => [
        'max' => 5,
        'track_redirects' => true
    ]
]);

// See where login took you
$redirects = $response->getHeader('X-Guzzle-Redirect-History');
print_r($redirects);

Disable redirects to capture the intermediate response:

$response = $client->request('POST', 'https://www.scrapingcourse.com/login', [
    'form_params' => ['email' => getenv('SCRAPER_EMAIL'), 'password' => getenv('SCRAPER_PASSWORD')],
    'allow_redirects' => false
]);

$statusCode = $response->getStatusCode(); // 302
$location = $response->getHeader('Location')[0]; // https://www.scrapingcourse.com/dashboard

This helps debug login flows or extract tokens from redirect URLs.
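As a sketch, a token can be pulled out of the Location URL with PHP's standard URL helpers (the `token` parameter name here is hypothetical):

```php
<?php
// e.g. the Location header captured with 'allow_redirects' => false
$location = 'https://www.scrapingcourse.com/dashboard?token=abc123';

// Isolate the query string, then parse it into an array
$query = parse_url($location, PHP_URL_QUERY);
parse_str($query ?? '', $params);

echo $params['token'] ?? 'no token'; // prints "abc123"
```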

DOM Parsing and Data Extraction

After fetching HTML, extract the data using CSS selectors or XPath. The target is a demo product catalog with nested elements and optional fields.

CSS Selectors vs XPath

Both approaches work for DOM traversal. CSS selectors offer readability, XPath provides more power and better performance.

CSS Selectors handle most scraping tasks with jQuery-like syntax. These examples use file_get_contents() for brevity, which requires allow_url_fopen=On in php.ini (enabled by default on most setups). For production scrapers, use Guzzle or cURL instead.

<?php

use Symfony\Component\DomCrawler\Crawler;

$html = file_get_contents('https://electronics.nop-templates.com/laptops');
$crawler = new Crawler($html);

// Get all product containers
$products = $crawler->filter('.product-item');

// Navigate nested elements
$titles = $crawler->filter('.product-item .product-title a');

// Attribute extraction
$images = $crawler->filter('.product-item img')->each(function (Crawler $node) {
    return $node->attr('src');
});

XPath handles complex relationships CSS can’t express and runs faster in Symfony DomCrawler.

<?php

// Same product containers
$products = $crawler->filterXPath('//div[@class="product-item"]');

// Get products with old price (on sale)
$saleProducts = $crawler->filterXPath('//div[@class="product-item"][.//span[@class="old-price"]]');

// Navigate to parent then sibling
$prices = $crawler->filterXPath('//span[@class="actual-price"]/..//span[@class="old-price"]');

// Position-based selection (skip first 3 products)
$remainingProducts = $crawler->filterXPath('//div[@class="product-item"][position() > 3]');

When to use each:

| Task | Use CSS | Use XPath |
|---|---|---|
| Simple class/ID selection | + | + |
| Attribute contains value | + | + |
| Navigate to parent element | - | + |
| Position-based filtering | - | + |
| Text content matching | - | + |
| Complex boolean logic | - | + |
| Performance critical code | - | + |

Performance comparison shows XPath significantly faster in DomCrawler. Running 1,000 iterations of the same query:

  • CSS Selectors: 3.865 seconds
  • XPath: 0.564 seconds
  • XPath is 585% faster in Symfony DomCrawler

Why? Symfony DomCrawler converts CSS selectors to XPath internally using the CssSelector component. This translation happens on every filter() call. Using XPath directly skips this conversion layer and queries libxml2 immediately.

For production scrapers processing thousands of pages, use XPath. For quick scripts where readability matters more than performance, CSS selectors work fine.
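The conversion layer is visible directly through the CssSelector component. One option is to translate a selector once up front and reuse the resulting XPath in the hot loop, a sketch:

```php
<?php
require 'vendor/autoload.php';

use Symfony\Component\CssSelector\CssSelectorConverter;

$converter = new CssSelectorConverter();

// The same translation DomCrawler performs on every filter() call
$xpath = $converter->toXPath('.product-item .product-title a');
echo $xpath . "\n";

// Reuse $xpath with $crawler->filterXPath($xpath) inside the loop
// instead of paying for the conversion on each iteration
```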

Extracting Product Data

The laptop catalog has products with varying structures. Some have old prices (sales), some don’t. Some products have ribbons (New, Sale), others don’t. Even when elements are expected to exist, it’s safer to validate them before accessing their values to avoid runtime errors.

<?php

require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = file_get_contents('https://electronics.nop-templates.com/laptops');
$crawler = new Crawler($html);

$products = $crawler->filterXPath('//div[@class="product-item"]')->each(function (Crawler $node) {
    // Title always exists (but still validated for safety)
    $titleNode = $node->filterXPath('.//h2[@class="product-title"]/a');
    $title = $titleNode->count() ? $titleNode->text() : 'N/A';

    // Description might be empty
    $description = $node->filterXPath('.//div[@class="description"]');
    $descText = $description->count() ? trim($description->text()) : 'N/A';

    // Actual price always present (but still validated)
    $priceNode = $node->filterXPath('.//span[@class="price actual-price"]');
    $actualPrice = $priceNode->count() ? $priceNode->text() : '0';

    // Old price only on sale items
    $oldPrice = $node->filterXPath('.//span[@class="price old-price"]');
    $oldPriceText = $oldPrice->count() ? $oldPrice->text() : null;

    // Product URL
    $url = $titleNode->count() ? $titleNode->attr('href') : null;

    // Image source
    $imageNode = $node->filterXPath('.//div[@class="picture"]//img');
    $image = $imageNode->count() ? $imageNode->attr('src') : null;

    return [
        'title' => $title,
        'description' => $descText,
        'price' => $actualPrice,
        'old_price' => $oldPriceText,
        'url' => $url ? 'https://electronics.nop-templates.com' . $url : null,
        'image' => $image,
    ];
});

foreach ($products as $product) {
    echo "Title: {$product['title']}\n";
    echo "Price: {$product['price']}";
    if ($product['old_price']) {
        echo " (was {$product['old_price']})";
    }
    echo "\n";
    echo "Description: {$product['description']}\n";
    echo "URL: {$product['url']}\n";
    echo str_repeat('-', 80) . "\n";
}

Output from the test run:

Title: Acer 5750
Price: $1,400.00 (was $1,600.00)
Description: Acer Aspire 5750
URL: https://electronics.nop-templates.com/acer-5750

Title: Dell Inspiron N5110
Price: $1,800.00 (was $200.00)
Description: Dell Inspiron N5110 series
URL: https://electronics.nop-templates.com/dell-inspiron-n5110

Title: HP Pavilion G6
Price: $1,950.00
Description: HP Pavilion G6-1105SQ series
URL: https://electronics.nop-templates.com/hp-pavilion-g6

The count() check prevents crashes when elements don’t exist. Always validate before calling text() or attr() on optional fields.

Handling Broken HTML

In practice, HTML is messy. Unclosed tags, missing quotes, invalid nesting. DOMDocument handles most issues but throws warnings.

<?php
$html = '<div><p>Unclosed paragraph<div>Invalid nesting</p></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
// PHP Warning:  DOMDocument::loadHTML(): Unexpected end tag : p in Entity, line: 1

Suppress warnings with libxml_use_internal_errors().

<?php
$html = '<div><p>Unclosed paragraph<div>Invalid nesting</p></div>';

libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

$errors = libxml_get_errors();
libxml_clear_errors();

// Optionally log errors
foreach ($errors as $error) {
    error_log("HTML Parse Error: {$error->message}");
}

DOMDocument auto-corrects most HTML issues in recovery mode (enabled by default). It closes unclosed tags, fixes nesting, and adds missing elements.

<?php
$brokenHtml = '
<div class="product">
    <h2>Product Name
    <p>Description without closing tag
    <span class="price">$19.99
</div>
';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($brokenHtml);

$xpath = new DOMXPath($doc);
$price = $xpath->query('//span[@class="price"]')->item(0)->textContent;
echo $price; // $19.99

For extremely broken HTML, use the HTML5 parser from Masterminds.

composer require masterminds/html5

<?php
use Masterminds\HTML5;

$html5 = new HTML5();
$dom = $html5->loadHTML($brokenHtml);

$xpath = new DOMXPath($dom);
$elements = $xpath->query('//div[@class="product"]');

The HTML5 parser follows WHATWG spec and handles modern HTML better than DOMDocument’s libxml2 parser.

Asynchronous Scraping Architecture

Scraping 100 product pages one by one takes time. Each request waits for the server to respond before moving to the next URL. The scraper sits idle while network packets travel back and forth.

Async scraping sends multiple requests at once. While waiting for one server to respond, the scraper fires off ten more requests. This turns a 5-minute job into a 30-second task.

Concurrent Requests with Guzzle Promises

Guzzle Promises send requests without blocking. The scraper fires all requests immediately and processes responses as they arrive.

<?php
require 'vendor/autoload.php';
use GuzzleHttp\Client;
use GuzzleHttp\Promise;

$client = new Client([
    // Security warning: 'verify' => false disables SSL certificate
    // verification. Never use it in production - it exposes the scraper
    // to man-in-the-middle attacks. It is here only to simplify the
    // example against test endpoints.
    'verify' => false,
]);
$promises = [];
$urls = [
    'https://electronics.nop-templates.com/laptops',
    'https://electronics.nop-templates.com/desktops',
    'https://electronics.nop-templates.com/monitors',
    'https://electronics.nop-templates.com/tablets',
    'https://electronics.nop-templates.com/notebooks',
    'https://electronics.nop-templates.com/accessories-2'
];

foreach ($urls as $url) {
    $promises[$url] = $client->requestAsync('GET', $url);
}

$start = microtime(true);

$results = Promise\Utils::unwrap($promises);

$elapsed = microtime(true) - $start;
echo "Time: {$elapsed}s\n";

This sends all requests at once. The server receives a flood of connections, and responses come back in parallel. Total time drops to roughly the duration of the slowest single response: 1.79 seconds in our test, against several times that when fetching sequentially.

But this approach has problems. Sending 100 simultaneous requests can overwhelm the target server or trigger rate limiting. The scraper also consumes significant memory holding all promises.

Controlled Concurrency with Pools

Guzzle Pool limits concurrent requests to a reasonable number.

<?php
require 'vendor/autoload.php';
use GuzzleHttp\Client;
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;

$client = new Client([
    'verify' => false   // WARNING: DEV ONLY, remove in production
]);

$urls = [
    'https://electronics.nop-templates.com/laptops',
    'https://electronics.nop-templates.com/desktops',
    'https://electronics.nop-templates.com/monitors',
    'https://electronics.nop-templates.com/tablets',
    'https://electronics.nop-templates.com/notebooks',
    'https://electronics.nop-templates.com/accessories-2',
    // ... more category URLs
];

$requests = function ($urls) {
    foreach ($urls as $url) {
        yield new Request('GET', $url);
    }
};

$products = [];

$pool = new Pool($client, $requests($urls), [
    'concurrency' => 10,
    'fulfilled' => function ($response, $index) use (&$products, $urls) {
        $html = $response->getBody()->getContents();

        // Parse products
        $crawler = new \Symfony\Component\DomCrawler\Crawler($html);
        $count = $crawler->filterXPath('//div[@class="product-item"]')->count();

        $products[$urls[$index]] = $count;
        echo "Scraped {$urls[$index]}: {$count} products\n";
    },
    'rejected' => function ($reason, $index) use ($urls) {
        echo "Failed {$urls[$index]}: {$reason}\n";
    },
]);

$start = microtime(true);
$promise = $pool->promise();
$promise->wait();
$elapsed = microtime(true) - $start;

echo "\nTotal time: {$elapsed}s\n";
echo "Total products: " . array_sum($products) . "\n";

With concurrency set to 10, the pool maintains 10 active requests at all times. When one finishes, the pool immediately starts the next queued request. This keeps the network saturated without overwhelming the server.

Real benchmark with 10 URLs:

  • Synchronous: 4.1 seconds
  • Asynchronous (concurrency 10): 1.25 seconds
  • Speedup: 3.3x faster

The speedup scales with more URLs. For 100 pages, async can be 10-20x faster depending on server response times.

Memory Leak Prevention

Async scraping processes many pages quickly, which exposes memory management issues. A scraper that runs fine on 10 pages crashes on 1,000 due to memory leaks.

The problem comes from circular references in DOM objects. DomCrawler and DOMDocument create object graphs where parent nodes reference children and children reference parents. PHP’s garbage collector can break these cycles, but it relies on reference counting first.

Circular references keep reference counts above zero, so the collector only finds and frees them when its cycle detection phase runs, which happens automatically only when an internal root buffer fills up. In a high-volume scraping loop, cycles accumulate far faster than that threshold triggers.
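This behavior can be observed with gc_status() (PHP 7.3+), which reports how often the cycle collector has run and how many objects it freed. A self-contained sketch:

```php
<?php
// Create enough circular references to fill the GC root buffer
// (default threshold: ~10,000 roots) several times over.
for ($i = 0; $i < 50000; $i++) {
    $a = new stdClass();
    $b = new stdClass();
    $a->ref = $b;
    $b->ref = $a; // cycle: refcounts never reach zero on unset
    unset($a, $b);
}

$status = gc_status();
echo "collector runs: {$status['runs']}, collected: {$status['collected']}\n";
// runs > 0 shows the collector fired only once the buffer filled
```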

<?php
$pool = new Pool($client, $requests($urls), [
    'concurrency' => 50,
    'fulfilled' => function ($response, $index) {
        $html = $response->getBody()->getContents();

        $crawler = new Crawler($html);
        $products = $crawler->filterXPath('//div[@class="product-item"]');

        // Process products...

        // Memory leak: $crawler and DOM objects stay in memory
    },
]);

After processing 1,000 pages, memory usage climbs to several gigabytes. The script eventually crashes with “Allowed memory size exhausted.”

Solution 1: Unset variables explicitly

<?php
'fulfilled' => function ($response, $index) {
    $html = $response->getBody()->getContents();

    $crawler = new Crawler($html);
    $products = $crawler->filterXPath('//div[@class="product-item"]')->each(function ($node) {
        return [
            'title' => $node->filterXPath('.//h2[@class="product-title"]/a')->text(),
            'price' => $node->filterXPath('.//span[@class="actual-price"]')->text(),
        ];
    });

    // Save products...

    unset($crawler, $html, $products);
},

This helps but doesn’t fully solve the problem. The DOMDocument inside Crawler still holds references.

Solution 2: Trigger garbage collection

<?php
'fulfilled' => function ($response, $index) use (&$processedCount) {
    $html = $response->getBody()->getContents();

    $crawler = new Crawler($html);
    // Process data...

    unset($crawler, $html);

    $processedCount++;

    // Force GC every 100 pages
    if ($processedCount % 100 === 0) {
        gc_collect_cycles();
        echo "Memory: " . round(memory_get_usage() / 1024 / 1024, 2) . " MB\n";
    }
},

gc_collect_cycles() forces PHP to break circular references and free memory. Call it periodically, not on every page (GC has overhead).

Solution 3: Process in batches

For very large scraping jobs (10,000+ pages), process URLs in batches and restart the script between batches.

<?php
$batchSize = 1000;
$offset = (int)($argv[1] ?? 0);

$urlBatch = array_slice($allUrls, $offset, $batchSize);

// Process batch...

echo "Processed URLs " . $offset . " to " . ($offset + $batchSize) . "\n";
echo "Next batch: php scraper.php " . ($offset + $batchSize) . "\n";

Each batch starts with clean memory. This prevents any leaks from accumulating across the entire job.

Monitoring memory during scraping:

<?php
$pool = new Pool($client, $requests($urls), [
    'concurrency' => 50,
    'fulfilled' => function ($response, $index) use (&$startMemory) {
        if (!isset($startMemory)) {
            $startMemory = memory_get_usage();
        }

        $html = $response->getBody()->getContents();
        $crawler = new Crawler($html);

        // Process...

        unset($crawler, $html);

        $currentMemory = memory_get_usage();
        $leaked = $currentMemory - $startMemory;

        if ($leaked > 50 * 1024 * 1024) { // 50MB leaked
            echo "Warning: Memory leak detected ({$leaked} bytes)\n";
            gc_collect_cycles();
            $startMemory = memory_get_usage();
        }
    },
]);

This tracks memory growth and triggers GC when leaks exceed a threshold.

Dynamic Content and JavaScript Rendering

Single-page applications load content after the initial page renders. A standard HTTP client receives an empty shell. Headless browsers execute JavaScript and wait for the full DOM to build.

Headless Browsers with Symfony Panther

Panther controls Chrome or Firefox through WebDriver. It loads the page, executes JavaScript, and returns the rendered DOM.

| Use Panther when: | Skip Panther when: |
|---|---|
| Content loads via JavaScript after page render | Site works with Guzzle (check with a quick test) |
| Site uses React, Vue, Angular, or similar frameworks | API endpoints exist for the data (inspect Network tab) |
| Data appears only after user interactions (clicks, scrolls) | Static HTML contains all needed information |
| Anti-bot protection requires browser fingerprints | Scraping thousands of pages (headless browsers are slow) |
Run a test request with Guzzle first. If the HTML contains data, skip the browser. Headless scraping is 10-20x slower than HTTP clients.
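That quick test can be as small as this sketch: fetch once with a plain HTTP client and check whether the target markup is already in the raw HTML (the URL and the `<article` marker are placeholders for your target):

```php
<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;

$client = new Client(['timeout' => 10]);
$html = (string) $client->request('GET', 'https://example-spa.com')->getBody();

// If the markup is already present, a headless browser is unnecessary
if (substr_count($html, '<article') > 0) {
    echo "Data is in the raw HTML - use Guzzle\n";
} else {
    echo "Empty shell - JavaScript rendering needed\n";
}
```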

composer require symfony/panther

Basic setup launches Chrome in headless mode.

<?php
use Symfony\Component\Panther\Client;

$client = Client::createChromeClient();
$crawler = $client->request('GET', 'https://example-spa.com');

// Chrome loads the page, executes JS, renders content
$titles = $crawler->filter('article h2')->each(function ($node) {
    return $node->text();
});

$client->quit();

The browser runs without a visible window. Memory usage is higher than Guzzle (Chrome needs 200-400MB per instance) but you get full JavaScript execution.

Waiting for Dynamic Content

JavaScript rendering takes time. Articles load via AJAX after page load. Use wait strategies to ensure content appears before extraction.

Explicit waits pause until a specific element exists.

<?php
$client->request('GET', 'https://medium.com');

// waitFor pauses until .article-list is in the DOM
$client->waitFor('.article-list', 10);

// Re-fetch the crawler AFTER the wait, not before
$crawler = $client->getCrawler();
$articles = $crawler->filter('.article-list article');

Visibility waits ensure elements are not just in the DOM but actually visible.

<?php
// Wait for element to be visible (not display:none or hidden)
$client->waitForVisibility('.article-title', 10);

// Or wait for loading spinner to disappear
$client->waitForInvisibility('.loading-spinner', 10);

Custom conditions handle complex scenarios.

<?php
// Wait until at least 5 articles have loaded
$client->wait(10)->until(function() use ($client) {
    return $client->getCrawler()->filter('article')->count() >= 5;
});

// Refresh the crawler after the wait
$crawler = $client->getCrawler();

Network idle waits pause until all AJAX requests complete. Panther has no built-in network-idle wait, but Spatie's Browsershot (a separate package that drives headless Chrome via Puppeteer) provides one.

<?php
use Spatie\Browsershot\Browsershot;

// Wait until network is idle for 500ms
$html = Browsershot::url('https://medium.com')
    ->waitUntilNetworkIdle()
    ->bodyHtml();

Timeout errors happen when elements never appear. Always set realistic timeout values and handle failures.

<?php
try {
    $client->waitFor('.articles', 10);
    $crawler = $client->getCrawler();
    $articles = $crawler->filter('.articles article');
} catch (\Exception $e) {
    echo "Articles failed to load: " . $e->getMessage();
    // Fallback or retry logic
}

HasData API Integration

Headless browsers solve JavaScript rendering but add complexity. HasData API handles rendering, proxies, and anti-bot bypass in one request.

| When to use APIs | When to use Panther |
|---|---|
| Scraping sites with strong anti-bot protection (Cloudflare, PerimeterX) | Full control over browser behavior needed |
| No server resources for running Chrome instances | Scraping internal tools or sites without anti-bot |
| Need instant scaling without infrastructure setup | Budget constraints (API costs per request) |
| Need proxies, CAPTCHA solving, and JS rendering | Complex interactions (multi-step forms, authenticated sessions) |

Minimal example with HasData web scraping API.

<?php
require 'vendor/autoload.php';
use GuzzleHttp\Client;

$client = new Client([
    'verify' => false     // WARNING: DEV ONLY, remove in production
]);
$apiKey = 'HASDATA-API-KEY';
$url = 'https://dev.to/';

$response = $client->request('POST', 'https://api.hasdata.com/scrape/web', [
    'headers' => [
        'Content-Type' => 'application/json',
        'x-api-key' => $apiKey
    ],
    'json' => [
        'url' => $url,
        'jsRendering' => true,
        'proxyType' => 'datacenter',
        'proxyCountry' => 'US'
    ]
]);

$result = json_decode($response->getBody(), true);
$html = $result['content'];
echo $html;

The API executes JavaScript, rotates proxies, and returns rendered HTML. No ChromeDriver installation, no memory management, no browser crashes.

AI-powered extraction removes the need for CSS selectors. Define what you want in plain English.

<?php
require 'vendor/autoload.php';
use GuzzleHttp\Client;

$client = new Client([
    'verify' => false,  // WARNING: DEV ONLY, remove in production
    'timeout' => 80,
    'connect_timeout' => 15,
]);
$apiKey = 'HASDATA-API-KEY';

$response = $client->request('POST', 'https://api.hasdata.com/scrape/web', [
    'headers' => [
        'Content-Type' => 'application/json',
        'x-api-key' => $apiKey
    ],
    'json' => [
        'url' => 'https://dev.to/',
        'aiExtractRules' => [
            'articles' => [
                'type' => 'list',
                'output' => [
                    'title' => [
                        'description' => 'article title',
                        'type' => 'string'
                    ],
                    'author' => [
                        'type' => 'string'
                    ],
                    'publishDate' => [
                        'type' => 'string'
                    ]
                ]
            ]
        ]
    ]
]);

$result = json_decode($response->getBody(), true);
$articles = $result['aiResponse']['articles'];
foreach ($articles as $article) {
    echo "Title: {$article['title']}\n";
    echo "Author: {$article['author']}\n";
    echo "Publish Date: {$article['publishDate']}\n\n";
}

No XPath, no CSS selectors, no dealing with class name changes. The AI identifies the data and extracts it automatically.

Output formats include HTML, Markdown, JSON, or plain text.

Common Setup Issues

ChromeDriver not found. Panther downloads ChromeDriver automatically but can fail on restrictive networks. Download manually and specify the path.

$client = Client::createChromeClient('/path/to/chromedriver');

Chrome profiles conflict (Windows). Multiple Chrome profiles cause "Could not start chrome" errors. Use the --no-first-run and --no-default-browser-check flags.

$client = Client::createChromeClient(null, [
    '--headless',
    '--disable-gpu',
    '--no-sandbox',
    '--no-first-run',
    '--no-default-browser-check'
]);

Missing ChromeDriver. If Panther fails to download ChromeDriver automatically, install it manually.

composer require --dev dbrekelmans/bdi
php vendor/bin/bdi detect drivers

This detects your Chrome version and downloads the matching driver.

Memory exhaustion. Each browser instance consumes 200-400MB. Limit concurrent browsers and close them properly.

$client = Client::createChromeClient();

// Scrape pages
$crawler = $client->request('GET', $url);
// Extract data...

// Always quit to free memory
$client->quit();

For batch jobs, restart the browser every 50-100 pages to prevent memory leaks.

$urls = [...]; // 500 URLs
$batchSize = 50;

for ($i = 0; $i < count($urls); $i += $batchSize) {
    $client = Client::createChromeClient();

    $batch = array_slice($urls, $i, $batchSize);
    foreach ($batch as $url) {
        $crawler = $client->request('GET', $url);
        // Extract data...
    }

    $client->quit(); // Free memory
    gc_collect_cycles();
}

Timeouts on slow sites. Increase default timeout for heavy pages.

$client = Client::createChromeClient(null, [
    '--headless',
], [
    'connection_timeout_in_ms' => 30000,  // 30 seconds
    'request_timeout_in_ms' => 60000      // 60 seconds
]);

SSL certificate errors. Sites with self-signed certificates or expired SSL fail with verification errors.

// Quick fix for testing (NOT for production)
$client = new Client([
    'verify' => false
]);

For production, download the CA bundle and verify properly.

$client = new Client([
    'verify' => '/path/to/cacert.pem'
]);

Never disable SSL verification in production scrapers. Download the latest CA bundle from https://curl.se/docs/caextract.html

Bypassing Anti-Bot Protections

Sites block scrapers through IP tracking, browser fingerprinting, and behavior analysis. Bypass these defenses with proxy rotation, realistic headers, and rate limiting.

Proxy Rotation

Rotating proxies prevents IP-based blocks. Build a simple rotator that cycles through a pool and handles failures.

<?php
class ProxyRotator
{
    private array $proxies;
    private int $currentIndex = 0;
    private array $failedProxies = [];

    public function __construct(array $proxies)
    {
        $this->proxies = $proxies;
    }

    public function getNext(): ?string
    {
        if (empty($this->proxies)) {
            return null;
        }

        // Skip failed proxies
        $attempts = 0;
        while ($attempts < count($this->proxies)) {
            $proxy = $this->proxies[$this->currentIndex];
            $this->currentIndex = ($this->currentIndex + 1) % count($this->proxies);

            if (!in_array($proxy, $this->failedProxies)) {
                return $proxy;
            }

            $attempts++;
        }

        return null; // All proxies failed
    }

    public function markFailed(string $proxy): void
    {
        if (!in_array($proxy, $this->failedProxies)) {
            $this->failedProxies[] = $proxy;
        }
    }

    public function resetFailed(): void
    {
        $this->failedProxies = [];
    }
}

Use it with Guzzle requests.

<?php
use GuzzleHttp\Client;

$proxies = [
    'http://proxy1:port1',
    'http://proxy2:port2',
    'http://proxy3:port3',
];

$rotator = new ProxyRotator($proxies);
$client = new Client();

foreach ($urls as $url) {
    $proxy = $rotator->getNext();

    if (!$proxy) {
        echo "All proxies failed\n";
        break;
    }

    try {
        $response = $client->request('GET', $url, [
            'proxy' => $proxy,
            'timeout' => 10
        ]);

        // Process response...

    } catch (\Exception $e) {
        $rotator->markFailed($proxy);
        echo "Proxy {$proxy} failed: {$e->getMessage()}\n";
    }
}

Proxies fail due to bans, timeouts, or invalid credentials. Track failures and skip bad proxies automatically.

Header Spoofing

Default Guzzle headers look like a bot. Real browsers send specific header combinations.

<?php
// Bot-like request (gets blocked)
$response = $client->request('GET', $url);

// Browser-like request
$response = $client->request('GET', $url, [
    'headers' => [
        'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/146.0.0.0 Safari/537.36',
        'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
        'Accept-Language' => 'en-US,en;q=0.9',
        'Accept-Encoding' => 'gzip, deflate, br',
        'Referer' => 'https://google.com/',
        'DNT' => '1',
        'Connection' => 'keep-alive',
        'Upgrade-Insecure-Requests' => '1'
    ]
]);

Rotate User-Agent strings to avoid pattern detection.

<?php
$userAgents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/146.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/146.0.0.0 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/146.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/146.0.7680.178 Mobile Safari/537.36'
];

$headers = [
    'User-Agent' => $userAgents[array_rand($userAgents)],
    'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language' => 'en-US,en;q=0.9',
];

$response = $client->request('GET', $url, ['headers' => $headers]);

Set Referer to simulate navigation from search engines or social media.

$referers = [
    'https://www.google.com/',
    'https://www.bing.com/',
    'https://twitter.com/',
    'https://www.facebook.com/'
];

$headers['Referer'] = $referers[array_rand($referers)];

Header order matters for advanced fingerprinting. Real browsers send headers in a characteristic order, and an HTTP client's default ordering can differ enough to stand out. Most sites ignore this, but some check.

Rate Limiting

Scraping too fast triggers rate limits. Add delays between requests and handle 429 responses gracefully.

<?php
$delayMs = 1000; // 1 second between requests

foreach ($urls as $url) {
    $response = $client->request('GET', $url);

    // Process response...

    usleep($delayMs * 1000); // Convert ms to microseconds
}

Implement exponential backoff for 429 (Too Many Requests) responses.

<?php
use GuzzleHttp\Client;
use Psr\Http\Message\ResponseInterface;

function fetchWithBackoff(Client $client, string $url, int $maxRetries = 5): ?ResponseInterface
{
    $attempt = 0;
    $baseDelay = 1000; // Start with 1 second

    while ($attempt < $maxRetries) {
        try {
            $response = $client->request('GET', $url, ['http_errors' => false]);

            if ($response->getStatusCode() === 429) {
                $delay = $baseDelay * pow(2, $attempt); // Exponential: 1s, 2s, 4s, 8s, 16s
                echo "Rate limited. Waiting {$delay}ms...\n";
                usleep($delay * 1000);
                $attempt++;
                continue;
            }

            return $response;

        } catch (\GuzzleHttp\Exception\ConnectException $e) {
            // Network error (timeout, DNS failure) — retry
            $attempt++;
            $delay = $baseDelay * pow(2, $attempt);
            echo "Connection failed, retrying in {$delay}ms: {$e->getMessage()}\n";
            usleep($delay * 1000);
        } catch (\Exception $e) {
            echo "Request failed: {$e->getMessage()}\n";
            return null;
        }
    }

    echo "Max retries exceeded for {$url}\n";
    return null;
}
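Many servers also include a Retry-After header (in seconds) with 429 responses. Honoring it beats guessing. A small sketch, where `backoffDelayMs` is a hypothetical helper that prefers the server's value over the exponential schedule:

```php
<?php
// Compute the backoff delay in milliseconds, preferring the server's
// Retry-After header (whole seconds) over the exponential schedule
// when it is present and numeric.
function backoffDelayMs(?string $retryAfter, int $attempt, int $baseDelayMs = 1000): int
{
    if ($retryAfter !== null && ctype_digit($retryAfter)) {
        return ((int) $retryAfter) * 1000; // server knows best
    }
    return $baseDelayMs * (2 ** $attempt); // 1s, 2s, 4s, ...
}

// Inside the 429 branch of fetchWithBackoff it would be used as:
// $delay = backoffDelayMs($response->getHeaderLine('Retry-After') ?: null, $attempt);
```

Note that Retry-After can also carry an HTTP date instead of seconds; this sketch ignores that form.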

Per-domain rate limiting prevents hitting the same site too frequently.

<?php
class RateLimiter
{
    private array $lastRequest = [];
    private int $delayMs;

    public function __construct(int $delayMs = 1000)
    {
        $this->delayMs = $delayMs;
    }

    public function wait(string $domain): void
    {
        if (!isset($this->lastRequest[$domain])) {
            $this->lastRequest[$domain] = microtime(true);
            return;
        }

        $elapsed = (microtime(true) - $this->lastRequest[$domain]) * 1000;
        $remaining = $this->delayMs - $elapsed;

        if ($remaining > 0) {
            usleep((int) ($remaining * 1000)); // usleep expects an int
        }

        $this->lastRequest[$domain] = microtime(true);
    }
}

$limiter = new RateLimiter(2000); // 2 seconds per domain

foreach ($urls as $url) {
    $domain = parse_url($url, PHP_URL_HOST);
    $limiter->wait($domain);

    $response = $client->request('GET', $url);
}

Before scraping any site, check its robots.txt file and respect Crawl-delay directives. Spatie Crawler handles this automatically, but with Guzzle or cURL you need to parse it yourself. Fetch https://example.com/robots.txt, check if your target paths are allowed for your user agent, and add delays between requests. Ignoring robots.txt can get your IP blocked and, depending on the jurisdiction, create legal issues.
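As a sketch of the manual approach, here is a deliberately minimal robots.txt parser (`parseRobots` and `isAllowed` are hypothetical helpers). It only handles Disallow prefixes and Crawl-delay for one user-agent group, and ignores Allow precedence and wildcard patterns, so treat it as a starting point rather than a compliant implementation.

```php
<?php
// Minimal robots.txt parser: collects Disallow paths and Crawl-delay
// for the given user-agent (falling back to the '*' group).
// Not spec-complete: no Allow precedence, no wildcard patterns.
function parseRobots(string $robotsTxt, string $agent = '*'): array
{
    $rules = ['disallow' => [], 'crawlDelay' => null];
    $inGroup = false;

    foreach (preg_split('/\r?\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line)); // strip comments
        if ($line === '' || !str_contains($line, ':')) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            $inGroup = ($value === $agent || $value === '*');
        } elseif ($inGroup && $field === 'disallow' && $value !== '') {
            $rules['disallow'][] = $value;
        } elseif ($inGroup && $field === 'crawl-delay') {
            $rules['crawlDelay'] = (float) $value;
        }
    }
    return $rules;
}

// Prefix match against the collected Disallow rules.
function isAllowed(array $rules, string $path): bool
{
    foreach ($rules['disallow'] as $prefix) {
        if (str_starts_with($path, $prefix)) {
            return false;
        }
    }
    return true;
}
```

Call `parseRobots` on the body of a GET to /robots.txt before crawling, skip paths where `isAllowed` returns false, and sleep `crawlDelay` seconds between requests to that host.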

Conclusion

The patterns in this guide work for price monitoring, content aggregation, lead generation, and competitive analysis. Start with the simplest approach that works (Guzzle + DomCrawler) and add complexity only when needed.

Code examples and complete working scripts are on GitHub; the repository contains production-ready implementations of all patterns covered in this guide.

Questions or want to discuss scraping strategies? Join our Discord community.

Valentina Skakun
Valentina is a software engineer who builds data extraction tools before writing about them. With a strong background in Python, she also leverages her experience in JavaScript, PHP, R, and Ruby to reverse-engineer complex web architectures. If data renders in a browser, she will find a way to script its extraction.