
E-Commerce Scraping with Python: The 2026 Guide

Valentina Skakun
Last update: 6 Jan 2026

E-commerce web scraping transforms unstructured HTML from online marketplaces into structured JSON datasets for business intelligence systems. Engineering teams use this technology to map competitor catalogs and analyze inventory distribution across different regions.

Scraping modern e-commerce requires handling JavaScript hydration, infinite scrolling, and anti-bot defenses. Below is a production-ready Python template that solves 90% of retail scraping challenges; the self-hosted Playwright approach is covered later in this guide.

Production-Ready Snippet

The following script uses the HasData web scraping API to render JavaScript (handling hydration) and extract structured product data using AI-driven extraction rules instead of CSS selectors. This approach eliminates the need for maintaining Selenium grids or managing proxy pools.

import requests
import json
 
# Configuration
API_KEY = "HASDATA_API_KEY"
TARGET_URL = "https://electronics.nop-templates.com/laptops"
OUTPUT_FILE = "ecommerce_data.json"
 
# AI Extraction Schema
# Using AI rules decouples your scraper from the specific DOM structure,
# preventing breakage during minor frontend updates.
extraction_rules = {
    "products": {
        "type": "list",
        "description": "List of all laptop products visible on the page",
        "output": {
            "name": {"type": "string", "description": "Product full title"},
            "price": {"type": "string", "description": "Current price value"},
            "currency": {"type": "string", "description": "Currency symbol"},
            "availability": {"type": "string", "description": "Stock status (e.g. In Stock)"},
            "rating": {"type": "number", "description": "Average user rating (0-5)"},
            "image": {"type": "string", "description": "Main product image URL"}
        }
    }
}
 
# Payload Construction
payload = {
    "url": TARGET_URL,
    "proxyType": "datacenter",  # Use 'residential' for stricter targets (Amazon/Nike)
    "proxyCountry": "US",       # Localizes pricing and currency
    "jsRendering": True,        # Essential for SPA (React/Vue/Angular) sites
    "blockAds": True,           # Speeds up rendering and reduces bandwidth
    "aiExtractRules": extraction_rules
}
 
try:
    response = requests.post(
        "https://api.hasdata.com/scrape/web",
        headers={
            "Content-Type": "application/json",
            "x-api-key": API_KEY
        },
        json=payload,
        timeout=30 # Production timeouts prevent hanging processes
    )
 
    if response.status_code == 200:
        data = response.json()
       
        # Check if AI extraction was successful
        # Adjust 'aiResponse' key based on exact API response structure
        ai_content = data.get("aiResponse") or data
       
        if ai_content:
            with open(OUTPUT_FILE, "w", encoding="utf-8") as f:
                json.dump(ai_content, f, indent=2, ensure_ascii=False)
            print(f"Scraping successful. Data saved to {OUTPUT_FILE}")
        else:
             print(f"Warning: API returned 200 OK but no AI content found. Full response: {data}")
       
    else:
        print(f"Extraction failed. Status: {response.status_code}, Details: {response.text}")
 
except Exception as e:
    print(f"Network or Protocol Error: {str(e)}")

This pipeline ensures data continuity even when the target website updates its HTML structure.

Core Engineering Challenges

Legacy scraping scripts using standard libraries often fail against modern e-commerce platforms like Amazon, Nike, or Zalando. These sites deploy multi-layered defense systems that require a sophisticated architectural approach.

Here is why simple requests.get() calls are no longer sufficient:

Perimeter Defense

It is a common misconception that blocking happens solely based on IP addresses. Modern WAFs (Web Application Firewalls) analyze the TLS Handshake (JA3 fingerprint). Standard Python libraries like requests or urllib have distinct signatures that differ from real browsers (Chrome/Firefox).

If your scraper gets a 403 Forbidden or Error 1020 instantly, your IP might be clean, but your TLS fingerprint is flagged. Understanding how to spoof these signatures is critical. Read our guide on Bypass Cloudflare 1020.
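
For teams that want to stay on plain HTTP clients, one common workaround is a client that mimics a real browser's TLS handshake. The sketch below assumes the third-party curl_cffi library, which exposes an impersonate option for Chrome-like fingerprints; it is an illustration, not part of the HasData stack shown above.

# Sketch: send the request with a Chrome-like TLS/JA3 signature
# using the third-party curl_cffi library.
from curl_cffi import requests as curl_requests

resp = curl_requests.get(
    "https://electronics.nop-templates.com/laptops",
    impersonate="chrome",  # mimic a recent Chrome handshake instead of python-requests
    timeout=30,
)
print(resp.status_code, len(resp.text))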

Proxy Rotation

E-commerce sites aggressively ban entire subnets of Datacenter IPs (AWS, DigitalOcean). For stable data extraction, you need Residential Proxies (IPs assigned to real devices by ISPs), which are harder to detect but more expensive.

Furthermore, a static proxy will eventually be throttled. Production systems require a “Rotating Proxy” middleware that assigns a new IP for every request or session.
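
As a minimal illustration of the rotation pattern (the gateway URLs below are placeholders, not real endpoints), a per-request random pick from a proxy pool looks like this:

import random
import requests

# Placeholder gateways; substitute your provider's residential endpoints
PROXY_POOL = [
    "http://user:pass@proxy-1.example.com:8000",
    "http://user:pass@proxy-2.example.com:8000",
    "http://user:pass@proxy-3.example.com:8000",
]

def fetch_with_rotation(url):
    # Assign a fresh exit node to every request
    proxy = random.choice(PROXY_POOL)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=15)

response = fetch_with_rotation("https://electronics.nop-templates.com/laptops")
print(response.status_code)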

Dynamic Rendering

Modern storefronts built on Next.js or React often ship empty HTML shells. The actual product data is loaded asynchronously via JavaScript (Hydration). A standard HTTP client will retrieve the shell without the price or inventory data, requiring either a Headless Browser (like Playwright) or reverse-engineering of the internal API.
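
A quick way to diagnose whether a target ships an empty shell is to fetch the raw HTML and check for product markup before choosing a rendering strategy. The selector below matches the demo store used throughout this guide; adjust it for your target.

import requests
from bs4 import BeautifulSoup

html = requests.get("https://electronics.nop-templates.com/laptops", timeout=15).text
soup = BeautifulSoup(html, "html.parser")

# '.product-item' matches the demo store used in this guide
if soup.select(".product-item"):
    print("Product markup is server-rendered; a plain HTTP client is enough.")
else:
    print("Empty shell detected; render with a headless browser or hit the internal API.")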

Extraction Strategy

Effective data extraction requires a systematic discovery process rather than blindly guessing CSS classes. We prioritize stable APIs and structured data over fragile DOM elements.

Data Scraping Protocol

Stable data pipelines prioritize raw JSON payloads over rendered HTML. Parsing the DOM is computationally expensive and fragile. Your extraction logic should follow a strict hierarchy of data sources to ensure long-term reliability.

Flowchart illustrating the optimal e-commerce data extraction hierarchy. The process starts by checking for internal APIs, moves to hidden JSON objects (like JSON-LD or Next.js data), and defaults to HTML parsing only as a fallback measure.

Hierarchy of Scraping Protocol

Always inspect the Network tab first. Modern Single Page Applications fetch data via internal API endpoints. Filter network traffic by Fetch/XHR and trigger pagination or product modals. Capturing these JSON responses provides the cleanest data structure. It bypasses the need for HTML parsing entirely.
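
Once an endpoint is captured, you can usually call it directly with a plain HTTP client. The endpoint and parameters below are hypothetical placeholders; the real values come from your own Network tab inspection.

import requests

# Hypothetical endpoint captured from the Fetch/XHR filter in DevTools;
# real paths, parameters, and auth headers vary per site
api_url = "https://store.example.com/api/catalog/products"
params = {"category": "laptops", "page": 1, "pageSize": 48}

resp = requests.get(
    api_url,
    params=params,
    headers={
        "Accept": "application/json",
        "X-Requested-With": "XMLHttpRequest",  # many internal APIs expect an AJAX-style header
    },
    timeout=15,
)
products = resp.json().get("items", [])
print(f"Captured {len(products)} products without parsing any HTML.")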

If the Network tab yields no results, inspect the page source for embedded JSON blobs. Frameworks like Next.js or Nuxt.js inject the initial state into script tags to hydrate the client-side application. Look for __NEXT_DATA__, window.initialData, or standard application/ld+json tags. Extracting this dictionary is faster and more reliable than traversing the DOM tree.
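
Here is a minimal sketch of both patterns against a generic product page. The exact nesting inside __NEXT_DATA__ varies per site, so inspect the blob before hardcoding any paths.

import json
import requests
from bs4 import BeautifulSoup

html = requests.get("https://store.example.com/product/123", timeout=15).text
soup = BeautifulSoup(html, "html.parser")

# 1. Next.js initial state (keys inside the blob differ per site)
next_data = soup.find("script", id="__NEXT_DATA__")
if next_data and next_data.string:
    state = json.loads(next_data.string)
    print("Next.js state keys:", list(state.get("props", {}).keys()))

# 2. Schema.org structured data (application/ld+json)
for tag in soup.find_all("script", type="application/ld+json"):
    try:
        blob = json.loads(tag.string or "")
    except json.JSONDecodeError:
        continue
    if isinstance(blob, dict) and blob.get("@type") == "Product":
        offers = blob.get("offers") or {}
        price = offers.get("price") if isinstance(offers, dict) else None
        print(blob.get("name"), price)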

Use CSS selectors only as a fallback. When HTML parsing is unavoidable, focus on semantic attributes rather than visual class names. Attributes like data-testid, itemprop, or data-product-id rarely change during design updates.

CMS Patterns

Identifying the underlying e-commerce platform allows you to use standardized extraction patterns. Platforms like Shopify, Magento, and WooCommerce have distinct signatures and predictable DOM structures.

Platform Signatures:

  • Shopify. Look for cdn.shopify.com in image sources or the window.Shopify object in the console.
  • WooCommerce. Identified by woocommerce-page classes in the <body> tag or /wp-content/ paths.
  • Magento. Check for Mage objects in JavaScript or x-magento-init script tags.
  • BigCommerce. Often contains “Powered by BigCommerce” in the footer or specific data-content-region attributes.

The following table maps common data points across the e-commerce engines.

| Element | Shopify | WooCommerce | Magento | BigCommerce | PrestaShop | OpenCart |
|---|---|---|---|---|---|---|
| Title | .product-title | .product_title | .page-title | .productView-title | h1[itemprop="name"] | h1 |
| Price | .price | .woocommerce-Price-amount | .price-final_price | .price--withoutTax | .current-price | .price |
| Image | .product__image | .wp-post-image | .fotorama__img | .productView-image--default | .js-qv-product-cover | .thumbnail |
| Add to Cart | button[name="add"] | .single_add_to_cart_button | #product-addtocart-button | #form-action-addToCart | .add-to-cart | #button-cart |
| Description | .product-description | #tab-description | .product.attribute.description | .productView-description | .product-description | #tab-description |
| SKU | .variant-sku | .sku | .sku | [data-product-sku] | [itemprop="sku"] | li:contains("Product Code:") |
| Stock | .product-form__item--stock | .stock | .stock | [data-product-stock] | #product-availability | li:contains("Availability:") |

Standardizing these selectors allows you to build a factory pattern scraper that auto-detects the platform and applies the correct extraction strategy.
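
A simplified sketch of that factory pattern might look like the following. It reuses the signature checks from the list above and a small subset of the selector table (title and price only, for brevity).

from bs4 import BeautifulSoup

# Subset of the selector table above (title and price only, for brevity)
PLATFORM_SELECTORS = {
    "shopify":     {"title": ".product-title", "price": ".price"},
    "woocommerce": {"title": ".product_title", "price": ".woocommerce-Price-amount"},
    "magento":     {"title": ".page-title",    "price": ".price-final_price"},
}

def detect_platform(html):
    # Signature checks from the list above
    if "cdn.shopify.com" in html or "window.Shopify" in html:
        return "shopify"
    if "woocommerce-page" in html or "/wp-content/" in html:
        return "woocommerce"
    if "x-magento-init" in html:
        return "magento"
    return None

def extract_product(html):
    platform = detect_platform(html)
    if platform is None:
        raise ValueError("Unknown platform; fall back to custom selectors")
    soup = BeautifulSoup(html, "html.parser")
    result = {"platform": platform}
    for field, selector in PLATFORM_SELECTORS[platform].items():
        node = soup.select_one(selector)
        result[field] = node.get_text(strip=True) if node else None
    return result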

Automating CSS Selectors with Python

Manual selector selection is slow at scale. You can automate this using heuristic algorithms that detect repeating DOM patterns.

Python-based Streamlit dashboard interface for debugging e-commerce scrapers. The screenshot shows the heuristic scoring output, highlighting extracted product containers with their calculated confidence scores based on image, price, and keyword detection.

Streamlit Demo

We developed an open-source tool hosted on Streamlit that automates this process. It parses the HTML DOM to cluster repeating elements and assigns confidence scores based on currency symbols ($ / €) and semantic class names (price, title, image).

This automated approach significantly reduces the time spent on selector maintenance for custom-built e-commerce sites.
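
The hosted tool is more elaborate, but the core heuristic can be sketched in a few lines: cluster elements by class name and score each cluster by currency symbols, images, and semantic keywords. The weights below are illustrative, not the ones used in the Streamlit demo.

import re
from collections import Counter
from bs4 import BeautifulSoup

CURRENCY_RE = re.compile(r"[$€£]\s?\d")
HINT_WORDS = ("price", "title", "product", "card", "item")

def rank_product_containers(html, top_n=5):
    """Score repeating class-name clusters by how 'product-like' their content looks."""
    soup = BeautifulSoup(html, "html.parser")
    scores = Counter()
    for el in soup.find_all(class_=True):
        key = " ".join(sorted(el.get("class")))
        text = el.get_text(" ", strip=True)[:300]
        score = 0
        if CURRENCY_RE.search(text):        # contains something that looks like a price
            score += 2
        if el.find("img"):                  # contains an image
            score += 1
        score += sum(1 for w in HINT_WORDS if w in key.lower())  # semantic class hints
        scores[key] += score
    return scores.most_common(top_n)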

Tool Selection & Architecture

In e-commerce scraping, stack selection is dictated by the target’s defense layer and the volume of SKUs. A script that scrapes 50 prices from a small boutique requires a completely different architecture than a pipeline monitoring 1 million products daily on Amazon or Zalando.

The following matrix maps architectural patterns to specific retail scraping use cases.

| Method | Stack Examples | Pros | Cons | Best For |
|---|---|---|---|---|
| HTTP Clients | Requests, httpx | Extremely fast, low CPU/RAM usage | Fails on JS hydration, high ban rate without TLS spoofing | Accessing undocumented JSON endpoints or simple targets like Shopify sites exposing .json or XML sitemaps |
| Scraping Framework | Scrapy | High performance (async), scalable architecture, built-in throttling/retries, structured data pipelines | Complex to learn, no native JS rendering (needs middleware), overkill for small scripts | Crawling entire categories to map new SKUs; sites with weak protection but thousands of pages |
| Standard Headless | Selenium, Puppeteer | Renders full JS, handles hydration | Easily detected (WebDriver flags), resource heavy | Checkout emulation, verifying marketing tags firing |
| Stealth Headless | SeleniumBase UC, Playwright Stealth, Undetected ChromeDriver | Bypasses basic bot detection, renders JS | Unstable maintenance, breaks on browser updates, high RAM | Tracking prices on mid-tier sites (e.g., BestBuy, regional retailers) |
| Scraping APIs | HasData | Zero infra maintenance, auto-scaling, built-in unblocking | Cost per request, external dependency | Monitoring competitors (Amazon/Walmart) where blocking is aggressive, checking localized prices from 50+ countries |

Understanding these trade-offs ensures you allocate engineering resources efficiently.

Local Headless (Playwright/Puppeteer)

Running your own browser grid gives you granular control over the fingerprint, but e-commerce sites present a unique challenge: payload bloat.

Product Detail Pages (PDPs) are notoriously heavy. To support features like “Image Zoom” and “360-View”, retailers load unoptimized high-resolution images (often 2MB+ each). Additionally, third-party scripts (Criteo, Bazaarvoice, Facebook Pixel) consume significant CPU cycles on the client side.

Rendering a full PDP just to extract a price and a title is engineering malpractice. It burns bandwidth and crashes instances due to memory leaks.

To scrape efficiently, you must intercept the network layer. The following Playwright script blocks heavy assets that carry no data value, focusing strictly on the DOM structure needed for extraction.

import asyncio
from playwright.async_api import async_playwright


async def scrape_optimized_pdp():
    async with async_playwright() as p:
        # Launch with flags to minimize rendering overhead
        browser = await p.chromium.launch(
            headless=True,
            args=["--disable-blink-features=AutomationControlled", "--no-sandbox"]
        )
       
        context = await browser.new_context(
            # Use a realistic desktop User-Agent (rotate it per context in production
            # to avoid being served "mobile" versions with different selectors)
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36 Edg/142.0.3595.94"
        )
       
        page = await context.new_page()


        # E-COMMERCE OPTIMIZATION:
        # We aggressively block media and fonts.
        # On a retailer site, this prevents downloading 5-10MB of images per page.
        await page.route("**/*", lambda route: (
            route.abort()
            if route.request.resource_type in ["image", "media", "font", "stylesheet"]
            else route.continue_()
        ))
       
        # Navigate to a heavy product page
        await page.goto("https://electronics.nop-templates.com/laptops")
       
        # Wait for the price element (Critical Path only)
        # Don't wait for 'networkidle' as marketing pixels often keep connections open forever
        await page.wait_for_selector(".product-grid")
       
        products = await page.evaluate("""() => {
            return Array.from(document.querySelectorAll('.product-item')).map(p => ({
                title: p.querySelector('.product-title')?.innerText,
                price: p.querySelector('.price')?.innerText
            }));
        }""")
       
        print(f"Extracted {len(products)} products without rendering images.")
        await browser.close()


if __name__ == "__main__":
    asyncio.run(scrape_optimized_pdp())

Network interception significantly reduces the I/O overhead required to load a category page.

We benchmarked this impact by simulating a catalog crawl of 1000 SKUs against a JS-heavy target. The test compared a full rendering approach against the media-blocked optimization shown above. 

Test Environment Parameters

  • Concurrency: 10 Browser Contexts
  • Total Requests: 1000
  • Target: JS-heavy e-commerce listing (React Hydration)

The following table details the resource consumption for both scenarios.

| Metric | Standard Execution | Optimized (No Media/Fonts) | Delta |
|---|---|---|---|
| Total Duration | 313.08 sec | 235.51 sec | -24.80% |
| Throughput | 3.19 req/sec | 4.24 req/sec | +1.05 req/sec |
| Avg Process CPU | 8.49% | 45.55% | +436% |
| Process RAM (Avg) | ~40 MB | ~59 MB | +47% |

The optimized script improves individual request speed but fails to address the primary scaling constraint. Even with resource blocking, a headless browser requires significant CPU cycles to parse JavaScript and render the DOM tree. Scaling this architecture to handle 10,000 requests per minute forces engineering teams to manage complex orchestration layers using Kubernetes or AWS Lambda. This operational overhead often outweighs the cost savings of running self-hosted scrapers.

The Web Scraping API Approach

Scaling to thousands of pages per minute requires a shift in architecture. Managing a grid of headless browsers and rotating residential proxies becomes a DevOps problem.

Offloading the execution to a Scraping API removes the infrastructure layer entirely. The architecture shifts from Resource Management (CPU/RAM/Proxies) to Pipeline Management (Data Flow/Validation).

A critical advantage for e-commerce aggregators is the ability to use AI-driven extraction rules. Maintaining unique CSS selectors across fifty different competitor sites creates significant technical debt. LLM-based extraction allows you to apply a single data schema to multiple domains without writing unique parsers for each retailer.

Here is the implementation using the HasData API. The logic abstracts TLS spoofing, proxy rotation, and DOM parsing into a single payload.

import requests
 
# The architecture simplifies to a single HTTP request
# Infrastructure complexity is offloaded to the API provider
API_ENDPOINT = "https://api.hasdata.com/scrape/web"
API_KEY = "HASDATA_API_KEY"


# AI Extraction Schema 
# This single schema works across Amazon, Walmart, and Shopify 
# It decouples the scraper from specific HTML structures
payload = {
    "url": "https://electronics.nop-templates.com/laptops",
    "proxyType": "datacenter",
    "proxyCountry": "US",
    "jsRendering": True, # Replaces local Playwright instance
    "aiExtractRules": {
        "products": {
            "type": "list",
            "output": {
                "name": {"type": "string"},
                "price": {"type": "string"}
            }
        }
    }
}
 
try:
    response = requests.post(
        API_ENDPOINT,
        headers={"x-api-key": API_KEY, "Content-Type": "application/json"},
        json=payload,
        timeout=30
    )
   
    if response.status_code == 200:
        print(f"Success. Infrastructure handled remotely.")
        # Process JSON data here
    else:
        print(f"Error: {response.status_code}")
 
except Exception as e:
    print(f"Connection error: {e}")

This architecture allows you to scale from 10 to 10,000 concurrent requests without upgrading your local hardware or managing grid orchestration. The AI extraction layer also ensures data continuity even when target sites deploy A/B tests or change their frontend class names.

Advanced Handling

High-volume scraping on React or Vue storefronts presents specific engineering challenges regarding DOM stability and data integrity. We prioritize handling virtualized lists and enforcing strict types at the ingestion point.

Infinite Scroll

Major retailers implement DOM virtualization to maintain performance on long catalog pages. This technique actively removes off-screen elements from the DOM tree to conserve browser memory. A script that scrolls to the footer before extraction will fail to capture the first 80% of the product list.

Reliable extraction requires an iterative approach. The scraper must extract visible data from the viewport before triggering the next scroll event.

async def scrape_virtual_scroll(page):
    all_products = {} # Dict for automatic deduplication by SKU
   
    while True:
        # Extract currently visible items
        # We map by SKU/ID to prevent duplicates from overlapping viewports
        visible_items = await page.evaluate("""() => {
            return Array.from(document.querySelectorAll('.product-card')).map(p => ({
                sku: p.getAttribute('data-sku'),
                price: p.querySelector('.price')?.innerText,
                title: p.querySelector('h3')?.innerText
            }));
        }""")
       
        # Update the master dataset
        for item in visible_items:
            if item['sku']:
                all_products[item['sku']] = item
       
        previous_height = await page.evaluate("document.body.scrollHeight")
       
        # Scroll down by one viewport height
        await page.evaluate("window.scrollBy(0, window.innerHeight)")
       
        # Give lazy-loaded cards time to render before measuring the new scroll height (debounce)
        await page.wait_for_timeout(1000)
           
        new_height = await page.evaluate("document.body.scrollHeight")
        current_scroll = await page.evaluate("window.scrollY + window.innerHeight")
       
        # Terminate if we reached the bottom or height stopped increasing
        if current_scroll >= new_height and new_height == previous_height:
            break
           
    return list(all_products.values())

This Scrape-as-you-Go pattern ensures zero data loss even when the target site unloads top-level DOM nodes.

Data Normalization

Raw HTML extraction yields dirty string data containing localization artifacts and whitespace. Ingesting unverified strings like "€ 1.200,00" into a pricing engine causes immediate downstream failures. You must enforce type safety at the edge.

We use Pydantic for runtime validation. It handles currency normalization and boolean coercion before the data enters the database.

from pydantic import BaseModel, ConfigDict, field_validator
import re

class ProductSchema(BaseModel):
    # Strip stray whitespace from every string field at validation time
    model_config = ConfigDict(str_strip_whitespace=True)

    title: str
    price: float
    currency: str = "USD"
    in_stock: bool

    @field_validator('price', mode='before')
    def clean_price(cls, v):
        if isinstance(v, (float, int)):
            return v
        # Remove currency symbols and thousands separators.
        # Note: this targets US-style formats ("$2,499.00"); European
        # strings like "1.200,00" need locale-aware parsing first.
        cleaned = re.sub(r'[^\d.]', '', str(v))
        return float(cleaned)

    @field_validator('in_stock', mode='before')
    def parse_availability(cls, v):
        if isinstance(v, bool):
            return v
        # Normalize text variations
        return v.lower() in ['in stock', 'available', 'buy now']

# Example Usage
raw_data = {
    "title": "  Gaming Laptop  ",
    "price": "$2,499.00",
    "in_stock": "In Stock"
}

product = ProductSchema(**raw_data)
print(product.model_dump())
# Output: {'title': 'Gaming Laptop', 'price': 2499.0, 'currency': 'USD', 'in_stock': True}

Implementing this validation layer reduces the need for complex ETL cleaning scripts later in the pipeline.

Legal and Compliance Boundaries

Automated data collection from public e-commerce catalogs is generally legal when it respects server resources and intellectual property boundaries. The legal distinction typically rests on the difference between accessing public facts and unauthorized intrusion into private systems. You can find a detailed analysis of case law in our dedicated article Is Web Scraping Legal? Yes, If You Do It Right.

Engineering teams must implement specific safeguards to operate within the “Safe Harbor” of data scraping.

  1. Fact vs Creative Copyright. Price points, SKU numbers, and technical specifications are factual data and generally not subject to copyright protection. Product descriptions, editorial reviews, and high-resolution photography are creative assets. Limit your extraction pipeline to factual attributes to minimize copyright infringement risks.
  2. Traffic Control. High-concurrency scraping that degrades website performance can be legally classified as a cyberattack. Limit your request rate to match human browsing behavior, and ensure your bot backs off immediately upon receiving HTTP 429 or 503 status codes (a minimal backoff sketch follows this list).
  3. Denial of Inventory. Checking stock levels by adding items to a shopping cart consumes server session storage and temporarily reserves inventory. Performing this at scale creates a “Denial of Inventory” state for real customers. Use frontend “In Stock” indicators rather than backend cart endpoints whenever possible.
  4. GDPR in User Reviews. E-commerce reviews often contain full names or locations of customers. Scrubbing this Personally Identifiable Information (PII) at the extraction layer is mandatory for GDPR and CCPA compliance. Do not store raw HTML containing customer names.
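
Here is a minimal backoff helper for point 2, using only the standard requests library; the retry counts and delays are illustrative defaults.

import time
import requests

def polite_get(url, max_retries=5, base_delay=2.0):
    """Back off when the server signals overload with 429 or 503."""
    for attempt in range(max_retries):
        resp = requests.get(url, timeout=15)
        if resp.status_code not in (429, 503):
            return resp
        # Honor Retry-After when the server provides it, otherwise back off exponentially
        retry_after = resp.headers.get("Retry-After", "")
        delay = float(retry_after) if retry_after.isdigit() else base_delay * (2 ** attempt)
        time.sleep(delay)
    raise RuntimeError(f"Gave up on {url} after {max_retries} throttled attempts")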

Operational compliance safeguards your pipeline against IP blocking and ensures your data collection strategy remains sustainable.

Strategic Engineering Use Cases

Data extraction projects must align with revenue goals. We focus on four specific patterns where automated scraping delivers measurable ROI.

Inventory Gap Detection

Competitor stockouts create an immediate opportunity to capture market share. You can increase ad spend for specific keywords when a rival product goes offline. The scraper detects availability by parsing “Add to Cart” button states or specific “Out of Stock” text strings. This signal triggers an alert to your marketing team.

def check_availability(soup):
    # Common patterns for stock status
    out_of_stock_keywords = ["sold out", "currently unavailable", "notify me"]
    page_text = soup.get_text().lower()
    
    if any(kw in page_text for kw in out_of_stock_keywords):
        return False
        
    # Check for disabled buttons
    cart_btn = soup.find("button", {"name": "add-to-cart"})
    if cart_btn and "disabled" in cart_btn.attrs:
        return False
        
    return True

The boolean output drives your downstream ad bidding logic. You should pause campaigns for your own out-of-stock items and increase bids when a primary competitor runs dry.
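
A toy decision function makes that mapping explicit; the action names are placeholders for whatever your bidding platform expects.

def bid_adjustment(own_in_stock, competitor_in_stock):
    # Placeholder action names; map them to your ad platform's API
    if not own_in_stock:
        return "pause_campaign"     # do not pay for clicks you cannot convert
    if not competitor_in_stock:
        return "raise_bid"          # capture demand while the rival is out of stock
    return "keep_current_bid"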

Dynamic Pricing Engines

Real-time price intelligence allows you to adjust margins based on market movements. You can implement algorithmic repricing to undercut competitors or maximize profit during high demand. The scraper feeds the competitor price into a rule engine. This engine updates your CMS via API when specific thresholds are met.

def calculate_optimal_price(competitor_price, my_cost, min_margin=0.15):
    target_price = competitor_price * 0.99
    floor_price = my_cost * (1 + min_margin)
    
    if target_price < floor_price:
        return floor_price
    return target_price

You can read more about building complete repricing pipelines in our dedicated Price Monitoring Guide.

Catalog Discovery and Crawling

Category crawling reveals new product launches and assortment gaps. You must iterate through pagination to capture the full product list. This function extracts product URLs from a grid layout and identifies the next page link to continue the crawl.

def discover_products(soup):
    product_links = []
    # Extracts all links inside product card containers
    for card in soup.select(".product-card"):
        link = card.select_one("a.product-link")
        if link and link.has_attr("href"):
            product_links.append(link["href"])
            
    # Identifies the next page for recursion
    next_button = soup.select_one("a.pagination-next")
    next_url = next_button["href"] if next_button else None
    
    return product_links, next_url

Efficient crawlers use this logic to populate a URL frontier queue. You must implement a depth limit and a URL deduplication set to prevent infinite loops during the discovery phase.
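
A minimal frontier built on discover_products() from the snippet above could look like this. The fetch_soup argument is an injected callable (requests-based or API-based) and is assumed rather than defined here.

from collections import deque
from urllib.parse import urljoin

def crawl_category(start_url, fetch_soup, max_depth=5):
    """Breadth-first category crawl with URL deduplication and a depth limit."""
    frontier = deque([(start_url, 0)])
    seen_pages = {start_url}
    product_urls = set()

    while frontier:
        url, depth = frontier.popleft()
        soup = fetch_soup(url)                       # injected fetcher returning BeautifulSoup
        links, next_url = discover_products(soup)    # function defined above
        product_urls.update(urljoin(url, href) for href in links)

        if next_url and depth < max_depth:
            absolute = urljoin(url, next_url)
            if absolute not in seen_pages:           # deduplication guard
                seen_pages.add(absolute)
                frontier.append((absolute, depth + 1))

    return product_urls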

Automated Catalog Discovery

Retailers constantly update their catalogs with new SKUs. You can crawl category pages to detect these additions immediately. This “New Arrival” monitoring helps marketing teams align their campaigns with emerging market trends.

def discover_new_products(soup, known_skus):
    new_products = []
    product_cards = soup.find_all("div", class_="product-card")
    
    for card in product_cards:
        sku = card.get("data-sku")
        if sku and sku not in known_skus:
            title = card.find("h3").get_text(strip=True)
            link = card.find("a")["href"]
            new_products.append({"sku": sku, "title": title, "url": link})
            
    return new_products

This logic relies on a pre-populated database of known SKUs. You should run this diff check daily to catch soft launches that competitors release without a main page banner.

Conclusion

Reliable e-commerce data pipelines prioritize structural stability over visual parsing. We have demonstrated that targeting internal APIs and embedded JSON schemas reduces maintenance time compared to fragile CSS selectors.

Scaling these operations introduces infrastructure challenges regarding proxy rotation and TLS fingerprinting. Engineering teams must decide whether to allocate resources to manage a headless browser fleet or offload this complexity to a dedicated scraping provider.

You can implement the production-ready strategies discussed here by cloning the repository or integrating the HasData API to handle the anti-bot layer automatically.

Next Step: Get your free HasData API key to test the AI-driven extraction rules on your target site today.

Valentina Skakun
Valentina is a software engineer who builds data extraction tools before writing about them. With a strong background in Python, she also leverages her experience in JavaScript, PHP, R, and Ruby to reverse-engineer complex web architectures. If data renders in a browser, she will find a way to script its extraction.