
How to Scrape Prices with Python

Valentina Skakun
Last update: 20 Jan 2026

Price scraping is the automated extraction of financial data from e-commerce sources. It focuses on numerical precision rather than broad catalog indexing. Unlike general web scraping, price extraction requires strict data normalization to handle floating-point errors, regional currency formatting, and dynamic DOM updates.

A production-grade scraper must solve three engineering challenges. It needs to normalize values like 1.200,00 and 1,200.00 into a unified Decimal format. It must bypass anti-bot systems detecting non-browser TLS fingerprints. It requires logic to parse complex commercial offers such as bulk discounts or tax-inclusive pricing.

This guide outlines a production architecture for extracting prices using Python. We cover handling ISO 4217 currency codes, intercepting internal JSON APIs, and using LLMs to parse bundled offers that standard Regex cannot handle.

TL;DR Price Extraction Snippet

The following script automates the extraction of complex multi-variant pricing (e.g., size/color combinations) that standard HTML parsers miss. It uses the HasData API to render JavaScript and applies AI extraction rules to normalize currency and stock status into structured JSON/CSV formats.

import requests
import json
import csv
import re
from decimal import Decimal


# Configuration
API_KEY = "YOUR_HASDATA_API_KEY"
TARGET_URL = "https://www.amazon.com/dp/B0C138SH1L/"


# We use AI extraction rules to bypass fragile CSS selectors.
# This schema forces the LLM to return strictly typed financial data.
payload = {
    "url": TARGET_URL,
    "proxyType": "residential", # Essential for Amazon/Nike
    "proxyCountry": "US",       # Localizes currency to USD
    "jsRendering": True,        # Handles React hydration
    "aiExtractRules": {
        "price_data": {
            "type": "list",
            "output": {
                "product_variant": {"type": "string", "description": "Variant name (Color/Size)"},
                "current_price": {"type": "string", "description": "Price with currency symbol"},
                "original_price": {"type": "string", "description": "MSRP before discount"},
                "currency": {"type": "string", "description": "ISO 4217 code (USD, EUR)"},
                "availability": {"type": "string", "description": "Stock status"}
            }
        }
    }
}


def normalize_price(price_str):
    """
    Converts raw strings like '$1,299.00' into normalized numeric strings (via Decimal) for financial math.
    Returns None if the price is missing or 'out of stock'.
    """
    if not price_str:
        return None
    # Remove non-numeric characters except the dot
    clean = re.sub(r"[^\d.]", "", price_str)
    return str(Decimal(clean)) if clean else None


try:
    print(f"Fetching data for {TARGET_URL}...")
    response = requests.post(
        "https://api.hasdata.com/scrape/web",
        headers={"x-api-key": API_KEY, "Content-Type": "application/json"},
        json=payload,
        timeout=30
    )


    if response.status_code == 200:
        data = response.json()
        price_info = data.get("aiResponse", {}).get("price_data", [])
       
        rows = []
        for item in price_info:
            rows.append({
                "variant": item.get("product_variant", "Unknown"),
                "current_price": normalize_price(item.get("current_price")),
                "original_price": normalize_price(item.get("original_price")),
                "currency": item.get("currency", "USD"),
                "availability": item.get("availability", "Unknown"),
            })


        # CLI Visualization
        print("\nExtraction Results:")
        header = f"{'Variant':45} | {'Current':10} | {'Original':10} | {'Cur':3} | Availability"
        print(header)
        print("-" * len(header))


        for r in rows:
            print(
                f"{r['variant'][:45]:45} | "
                f"{str(r['current_price']):10} | "
                f"{str(r['original_price']):10} | "
                f"{r['currency']:3} | "
                f"{r['availability']}"
            )


        # Export for Analysis (skip if nothing was extracted)
        if rows:
            with open("prices.csv", "w", newline="", encoding="utf-8") as f:
                writer = csv.DictWriter(f, fieldnames=rows[0].keys())
                writer.writeheader()
                writer.writerows(rows)
                print(f"\nSaved {len(rows)} variants to prices.csv")


    else:
        print(f"Extraction Failed: {response.status_code} - {response.text}")


except Exception as e:
    print(f"Runtime Error: {e}")

The script handles the heavy lifting (rotating residential proxies and parsing the DOM), returning clean, normalized data ready for database ingestion.

Intercepting Internal JSON APIs

Modern e-commerce sites load pricing via JavaScript. Instead of rendering the full DOM, intercept the XHR endpoint that returns structured JSON. This method eliminates parsing overhead and provides cleaner data.

Before writing a single line of selector logic, inspect the traffic between the browser and the server.

  1. Open Chrome DevTools (F12) and navigate to the Network tab.
  2. Select the Fetch/XHR filter to isolate API calls.
  3. Refresh the product page.
  4. Search the response bodies for the price value (e.g., 1299 or the product SKU).

You are looking for endpoints returning structured JSON. These often contain clean numeric values, stock levels, and future pricing data not yet rendered in the DOM.

Locating the hidden JSON payload containing raw pricing data

Attempting to replicate this request using requests or cURL requires perfect header mimicry (User-Agent, Origin, Sec-Fetch-Mode). If the site uses TLS fingerprinting (JA3), standard Python HTTP clients are typically blocked outright.

The superior engineering approach is Passive Interception. We use Playwright to launch a real browser that handles the WAF handshake naturally, while we simply “eavesdrop” on the incoming network traffic to capture the JSON payloads.

This script launches a browser, navigates to the category page, and silently captures the API response without explicitly requesting it.

from playwright.sync_api import sync_playwright


# We target the specific API endpoint discovered in the Network tab
TARGET_URL = "https://www.nike.com/us/w/futbol-1gdj0"
API_PART = "product-proxy-v2.adtech-prod.nikecloud.com/products"


def scrape_nike_json():
    with sync_playwright() as p:
        # Launching headless=True is faster, but headless=False helps debugging WAFs
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()


        # Set to prevent duplicate processing if the API sends redundant data
        seen_ids = set()


        def handle_response(response):
            # We filter only for the internal API, ignoring images/CSS
            if API_PART in response.url:
                try:
                    data = response.json()
                    # Nike nests products under 'hydratedProducts' or 'objects'
                    products = data.get("hydratedProducts", [])


                    for product in products:
                        pid = product.get("cloudProductId")
                        if pid in seen_ids:
                            continue
                        seen_ids.add(pid)


                        current = product.get("currentPrice")
                        full = product.get("fullPrice")


                        # Business Logic: Calculate hidden discount percentages
                        discount = None
                        if current and full and full > current:
                            discount = round((1 - current / full) * 100, 1)


                        print(
                            f"Name: {product.get('name')}\n"
                            f"Brand: {product.get('brand')}\n"
                            f"Category: {product.get('category')}\n"
                            f"Color: {product.get('color')}\n"
                            f"Current price: {current} USD\n"
                            f"Full price: {full} USD\n"
                            f"On sale: {product.get('isOnSale')}\n"
                            f"Discount: {discount}%\n"
                            f"{'-'*40}"
                        )
                except Exception as e:
                    # In production, log this error to Sentry/Datadog
                    pass


        # Attach the listener BEFORE navigating
        page.on("response", handle_response)
       
        # networkidle ensures the initial hydration is complete
        page.goto(TARGET_URL, wait_until="networkidle")
       
        # Explicit wait to allow for any lazy-loaded XHR requests to fire
        page.wait_for_timeout(8000)


        browser.close()


if __name__ == "__main__":
    scrape_nike_json()

Execution Log

Name: NIKE COURT LEGACY NN
Brand: Nike
Category: FOOTWEAR
Color: DH3162-100
Current price: 47.97 USD
Full price: 70 USD
On sale: True
Discount: 31.5%
----------------------------------------
Name: AIR JORDAN 1 RETRO LOW OG
Brand: Jordan
Category: FOOTWEAR
Color: HQ6998-600
Current price: 145 USD
Full price: 145 USD
On sale: False
Discount: 0%

Even if Nike changes their HTML class names (e.g., from .product-card to .card-v2), this script will continue to work as long as the underlying API schema remains stable.
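The trade-off of relying on an internal API is that its schema can also change, just less often. A lightweight guard that checks the expected keys before parsing makes such drift obvious instead of silently producing empty rows. The sketch below is a minimal example; the key names (cloudProductId, currentPrice, fullPrice) are taken from the response handled above and should be treated as assumptions about Nike's current payload.

REQUIRED_KEYS = {"cloudProductId", "currentPrice", "fullPrice", "name"}


def validate_product_schema(product):
    """Returns True if the API payload still carries the fields we rely on.

    The key names are assumptions based on the response observed above;
    adjust them if the API renames fields in a future version.
    """
    missing = REQUIRED_KEYS - product.keys()
    if missing:
        # In production, send this to your alerting channel instead of printing
        print(f"Schema drift detected, missing keys: {sorted(missing)}")
        return False
    return True

Calling this check inside handle_response before reading prices turns a silent failure into an explicit alert.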

The Classic HTML Parsing Stack

When JSON endpoints are unavailable, CSS selectors provide a fallback. A basic price scraper relies on a single CSS selector (e.g., .product-price), which breaks the moment a frontend developer changes a class name from price to text-lg.

To build a resilient price monitor, you must implement a Stability Hierarchy. This strategy prioritizes machine-readable data (JSON-LD) over visual markup: visual classes change weekly, while schema.org definitions rarely change.

We implement a fallback pattern: the scraper attempts to read the most stable data source first and only falls back to basic CSS selectors if the primary sources fail.

  1. JSON-LD / Structured Data. Invisible to users but designed specifically for machines (Google Shopping bots).
  2. Meta Tags & Microdata. Standardized attributes like itemprop="price" that rarely change.
  3. Data Attributes. data-price attributes often used by analytics scripts.
  4. CSS Selectors. Extremely fragile and prone to breaking on UI updates.

The following script implements this priority logic. It uses the HasData API to handle the network transport (proxies and JS rendering) and BeautifulSoup to execute the extraction waterfall.

import requests
from bs4 import BeautifulSoup
from decimal import Decimal
import re
import json


# Configuration
API_KEY = "YOUR_HASDATA_API_KEY"
TARGET_URL = "https://demo.nopcommerce.com/leica-t-mirrorless-digital-camera"


def scrape_price_with_fallbacks():
    """
    Implements the 'Hierarchy of Reliability':
    1. Structured Data (JSON-LD). Most stable, machine-readable.
    2. Semantic HTML (Meta Tags). Very stable, used for SEO.
    3. Data Attributes. Stable, used for internal JS logic.
    4. CSS Classes. Fragile, prone to design changes.
    """
    payload = {
        "url": TARGET_URL,
        "proxyType": "residential",
        "proxyCountry": "US",       # Ensures currency is in USD
        "jsRendering": True,        # Essential for modern React/Vue sites
        "outputFormat": ["html"]    # We want the raw HTML to parse locally
    }


    print(f"Fetching {TARGET_URL}...")
    response = requests.post(
        "https://api.hasdata.com/scrape/web",
        headers={"x-api-key": API_KEY, "Content-Type": "application/json"},
        json=payload,
        timeout=30
    )
   
    if response.status_code != 200:
        raise ConnectionError(f"API Error: {response.status_code}")


    # The API returns JSON; the rendered page is in the 'html' field
    html_content = response.json().get("html", "")
    soup = BeautifulSoup(html_content, "html.parser")
   
    # Priority 1: JSON-LD Structured Data
    # E-commerce sites use this for Google Shopping. It rarely changes.
    json_ld = soup.find("script", {"type": "application/ld+json"})
    if json_ld:
        try:
            data = json.loads(json_ld.string)
            # JSON-LD structures vary; look for 'offers' key
            if "offers" in data:
                offer = data["offers"]
                # Handle list of offers (variants) vs single offer
                if isinstance(offer, list):
                    offer = offer[0]
               
                price_str = str(offer.get("price", ""))
                currency = offer.get("priceCurrency", "USD")
               
                if price_str:
                    print("Source: JSON-LD (Priority 1)")
                    return Decimal(price_str), currency
        except json.JSONDecodeError:
            pass
   
    # Priority 2: Semantic HTML (Schema.org microdata)
    # SEO tags like <meta itemprop="price" content="1200.00">
    price_meta = soup.find("meta", {"itemprop": "price"})
    if price_meta and price_meta.get("content"):
        currency_meta = soup.find("meta", {"itemprop": "priceCurrency"})
        currency = currency_meta.get("content", "USD") if currency_meta else "USD"
        print("Source: Meta Tags (Priority 2)")
        return Decimal(price_meta["content"]), currency
   
    # Priority 3: Common data-* attributes
    # Developers often put raw numbers in data attributes for JS calculations
    for attr in ["data-price", "data-product-price", "data-price-amount"]:
        elem = soup.find(attrs={attr: True})
        if elem:
            price_str = elem.get(attr)
            clean = re.sub(r'[^\d.]', '', price_str)
            if clean:
                print(f"Source: Attribute [{attr}] (Priority 3)")
                return Decimal(clean), "USD"
   
    # Priority 4: Class-based selectors (Fragile fallback)
    # Only use this if all above fail.
    price_selectors = [
        ".product-price", ".price-value-2", ".projected-price", ".money", ".price", ".product__single__price"
    ]
    for selector in price_selectors:
        elem = soup.select_one(selector)
        if elem:
            text = elem.get_text(strip=True)
            # Remove currency symbols and non-numeric chars
            clean = re.sub(r'[^\d.]', '', text)
            if clean:
                print(f"Source: CSS Selector [{selector}] (Priority 4)")
                return Decimal(clean), "USD"
   
    raise ValueError(f"No price found on {TARGET_URL}")


# Usage
if __name__ == "__main__":
    try:
        price, currency = scrape_price_with_fallbacks()
        print(f"Final Result: {price} {currency}")
    except Exception as e:
        print(f"Extraction failed: {e}")

Execution Log

Fetching https://demo.nopcommerce.com/leica-t-mirrorless-digital-camera...
Source: JSON-LD (Priority 1)
Final Result: 530.00 USD


Fetching https://demo.hyva.io/default/chaz-kangeroo-hoodie.html...
Source: Meta Tags (Priority 2)
Final Result: 52 EUR


Fetching https://demo.evershop.io/accessories/modern-ceramic-vase-green...
Source: CSS Selector [.product__single__price] (Priority 4)
Final Result: 25.00 USD

By checking application/ld+json first, you bypass the visual layer entirely. E-commerce sites frequently A/B test their buttons and layouts (breaking CSS selectors), but they rarely break their SEO schemas because doing so would hurt their Google rankings.

Cleaning and Normalizing Financial Data

Raw HTML contains localization artifacts that break numeric operations. European formats use commas as decimal separators (1.234,56 €), while US formats use periods ($1,234.56). You must normalize before storing.

Handling Decimal Separators and Formats

The primary challenge is distinguishing between the Decimal Point . and the Thousands Separator ,. The usage varies globally.

  • US/UK/China: 1,234.56 (Dot is decimal)
  • EU/South America: 1.234,56 (Comma is decimal)

To handle this at scale, we avoid float() entirely due to floating-point precision errors (IEEE 754). We use Python’s decimal.Decimal module and a heuristic algorithm to detect the format based on separator position.

This function automatically detects the number format. It looks at the right-most separator to determine if the string follows US or EU standards.

from decimal import Decimal, InvalidOperation
import re


def normalize_price(raw_text, locale_hint="AUTO"):
    """
    Converts localized price strings to precise Decimal objects.
   
    Args:
        raw_text: Dirty strings like "€ 1.234,56", "$1,234.56", or "£1 234.56"
        locale_hint: "US", "EU", or "AUTO" for heuristic detection
   
    Returns:
        Decimal: Safe for financial calculations (never float)
    """
    if not raw_text:
        return None


    # Step 1: Remove artifacts
    # We strip everything except digits, commas, and dots
    # This handles space separators (e.g., "1 200.00" becomes "1200.00")
    cleaned = re.sub(r'[^\d.,]', '', raw_text)
   
    if not cleaned:
        raise ValueError(f"No numeric data found in: {raw_text}")
   
    # Step 2: Detect Format
    if locale_hint == "AUTO":
        # If both separators exist, the right-most one is the decimal
        if ',' in cleaned and '.' in cleaned:
            last_comma = cleaned.rfind(',')
            last_period = cleaned.rfind('.')
            locale_hint = "EU" if last_comma > last_period else "US"
       
        # If only comma exists, check context
        elif ',' in cleaned:
            # Ambiguous Case: "1,234"
            # Logic: If exactly 2 digits follow the comma, assume EU (cents)
            # Otherwise assume US thousands separator
            parts = cleaned.split(',')
            locale_hint = "EU" if len(parts[-1]) == 2 else "US"
       
        else:
            # Default to US if no comma is present
            locale_hint = "US"
   
    # Step 3: Normalize to Python Standard (US)
    if locale_hint == "EU":
        # Convert "1.234,56" -> "1234.56"
        normalized = cleaned.replace('.', '').replace(',', '.')
    else:
        # Convert "1,234.56" -> "1234.56"
        normalized = cleaned.replace(',', '')
   
    try:
        return Decimal(normalized)
    except InvalidOperation:
        raise ValueError(f"Normalization failed: {raw_text} -> {normalized}")


# Unit Tests for Validation
if __name__ == "__main__":
    test_cases = [
        ("$1,234.56", "US", Decimal("1234.56")),
        ("€ 1.234,56", "EU", Decimal("1234.56")),
        ("Price: 1,200", "US", Decimal("1200")),    # US Integer
        ("1,20 €", "AUTO", Decimal("1.20")),        # EU Decimal
    ]


    for raw, locale, expected in test_cases:
        result = normalize_price(raw, locale)
        print(f"Input: {raw:15} | Mode: {locale:4} | Result: {result}")
        assert result == expected

Execution Log

Input: $1,234.56       | Mode: US   | Result: 1234.56
Input: € 1.234,56      | Mode: EU   | Result: 1234.56
Input: Price: 1,200    | Mode: US   | Result: 1200
Input: 1,20 €          | Mode: AUTO | Result: 1.20

This heuristic handles the vast majority of formatting issues without requiring manual configuration for every target country. The logic prioritizes the position of the separator over the specific character: if a comma is followed by exactly two digits (as in 1,20 €), it is treated as a decimal marker regardless of the currency symbol.

Implementing ISO 4217 Currency Standards

Financial databases require strict three-letter codes. Relying on symbols is dangerous because they are not unique. The symbol kr is shared by Sweden (SEK), Norway (NOK), and Denmark (DKK). The symbol ¥ is shared by Japan (JPY) and China (CNY).

To resolve this, your scraper must be Context-Aware. You cannot determine the currency solely from the text string. You must pass the country_code of the proxy used during the request into your parsing logic.

The following script maps symbols to ISO codes. It uses the proxy location to resolve ambiguous symbols like $ or kr.

import re


# Static mapping for unique symbols
# Ambiguous symbols like '$' default to USD unless overridden by context
CURRENCY_MAP = {
    "$": "USD",  
    "€": "EUR",
    "£": "GBP",
    "¥": "JPY",   # Default to JPY, requires 'CN' context for CNY
    "₹": "INR",
    "₽": "RUB",
    "₩": "KRW",
    "฿": "THB",
    "R$": "BRL",
    "C$": "CAD",
    "A$": "AUD",
    "CHF": "CHF",
    "kr": "SEK",  # Default to SEK, requires 'NO' or 'DK' context
}


def extract_currency(text_snippet, proxy_country=None):
    """
    Resolves ISO 4217 codes using symbol lookup and geo-context.
   
    Args:
        text_snippet: The raw price string (e.g., "C$ 24.99")
        proxy_country: The ISO 3166-1 alpha-2 country code of your proxy (e.g., "CA")
    """
    if not text_snippet:
        return "USD"


    # Strategy 1: Explicit ISO Code Search
    # Some sites display "24.99 USD" directly
    iso_match = re.search(r'\b([A-Z]{3})\b', text_snippet)
    if iso_match:
        code = iso_match.group(1)
        # Validate against a known whitelist to avoid capturing unrelated uppercase words
        if code in ["USD", "EUR", "GBP", "JPY", "CAD", "AUD", "CHF", "CNY", "INR"]:
            return code
   
    # Strategy 2: Symbol Lookup with Geo-Context Override
    # Check multi-character symbols (R$, C$, A$) before the bare '$' so that
    # "C$ 24.99" is not prematurely matched as a plain dollar sign
    for symbol, default_iso in sorted(CURRENCY_MAP.items(), key=lambda kv: -len(kv[0])):
        if symbol in text_snippet:
           
            # Handle the generic Dollar Sign '$'
            if symbol == "$" and proxy_country:
                country_upper = proxy_country.upper()
                if country_upper == "CA": return "CAD"
                if country_upper == "AU": return "AUD"
                if country_upper == "SG": return "SGD"
                if country_upper == "MX": return "MXN"
           
            # Handle the Krone 'kr'
            if symbol == "kr" and proxy_country:
                country_upper = proxy_country.upper()
                if country_upper == "NO": return "NOK"
                if country_upper == "DK": return "DKK"


            return default_iso
   
    # Fallback assumption
    return "USD"


# Usage Example
if __name__ == "__main__":
    # Scenario: Scraping a Canadian site using a Canadian Residential Proxy
    raw_price = "$49.99"
    proxy_loc = "CA"
   
    iso_currency = extract_currency(raw_price, proxy_country=proxy_loc)
    print(f"Input: {raw_price} | Proxy: {proxy_loc} | Detected: {iso_currency}")

Execution Log

Input: $49.99 | Proxy: CA | Detected: CAD

This logic decouples the visual presentation from the financial meaning. Even if the website simply displays $, your database correctly logs the value as CAD because you injected the known context of your scraping infrastructure.

Removing Hidden Costs and Junk Text

Product pages inject marketing text near prices (“Was $99”, “Save 20%”). The following function implements an aggressive cleaning strategy. It removes patterns like “MSRP $XX” completely so the subsequent extraction logic sees only the valid price. Note that we rely on the normalize_price function defined in the previous section.

import re
from decimal import Decimal


# Ensure you import normalize_price from the previous section
# from normalization import normalize_price


def extract_clean_price(html_snippet):
    """
    Isolates the transactional price by scrubbing marketing copy.
   
    Logic:
    1. Lowercase the input for case-insensitive matching.
    2. Aggressively remove "noise phrases" AND the numbers following them.
    3. Extract the remaining valid price.
    """
    if not html_snippet:
        return None


    cleaned = html_snippet.lower()


    # Noise patterns to strip entirely
    # We include the number pattern within the removal regex to delete "Was $129.99"
    # Not just the word "Was"
    noise_patterns = [
        # Remove "Was $129.99" or "MSRP $50"
        r'\b(was|originally|msrp|rrp|old price)\s*[:\s]?\s*[\$£€¥]?\s*\d+(?:[.,]\d+)*',
       
        # Remove "(Save $10)" savings claims
        r'\(save\s*[\$£€¥]?\s*\d+(?:[.,]\d+)*\)',
       
        # Remove "From" or "As low as" (Misleading unit prices)
        r'\b(from|as low as|starting at)\b',
       
        # Remove per-unit qualifiers which skew logic
        r'\b(per\s+\w+|each)\b'
    ]
   
    for pattern in noise_patterns:
        cleaned = re.sub(pattern, '', cleaned, flags=re.IGNORECASE)
   
    # Extract the remaining numeric price with currency symbol
    # This regex looks for currency symbols followed by standard digit formats
    price_match = re.search(r'[\$£€¥]\s*(\d{1,3}(?:[,\s]\d{3})*(?:[.,]\d{2})?)', cleaned)
   
    if price_match:
        price_str = price_match.group(1)
        # Use the locale-aware normalizer from the previous section
        return normalize_price(price_str)
   
    # Fallback: Try finding numbers without currency symbols if strict match fails
    loose_match = re.search(r'(\d{1,3}(?:[,\s]\d{3})*(?:[.,]\d{2})?)', cleaned)
    if loose_match:
         return normalize_price(loose_match.group(1))


    raise ValueError(f"No valid price found after cleaning: {html_snippet}")


# Usage Example
if __name__ == "__main__":
    test_snippets = [
        "Was $129.99 Now $99.99",       # Should extract 99.99
        "$49.99 (Save $10.00)",         # Should extract 49.99
        "MSRP $199.00 Our Price $149",  # Should extract 149
        "From $29.99",                  # Should extract 29.99 (stripped 'from')
    ]


    for snippet in test_snippets:
        try:
            price = extract_clean_price(snippet)
            print(f"Input: {snippet:35} -> Parsed: ${price}")
        except ValueError as e:
            print(f"Error: {e}")

Execution Log

Input: Was $129.99 Now $99.99              -> Parsed: $99.99
Input: $49.99 (Save $10.00)                -> Parsed: $49.99
Input: MSRP $199.00 Our Price $149         -> Parsed: $149
Input: From $29.99                         -> Parsed: $29.99

Removing only the keyword (e.g., “Was”) leaves the decoy price ($129.99) in the string. The extractor would then capture the first available number, resulting in a false price spike in your database. The regex r'\b(was|msrp)...\d+' ensures the decoy price is deleted alongside the keyword.
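To see why the number must be deleted along with the keyword, compare a naive keyword-only strip against the aggressive pattern above. This stand-alone snippet assumes the extractor simply takes the first dollar amount it finds, which mirrors the fallback behavior of extract_clean_price.

import re

snippet = "Was $129.99 Now $99.99"

# Naive approach: strip only the keyword, leaving the decoy price behind
naive = re.sub(r'\bwas\b', '', snippet, flags=re.IGNORECASE)
print(re.search(r'\$(\d+(?:\.\d{2})?)', naive).group(1))       # 129.99 (false price spike)

# Aggressive approach: remove the keyword AND the number that follows it
aggressive = re.sub(r'\b(was|msrp|rrp)\s*[:\s]?\s*\$?\s*\d+(?:[.,]\d+)*', '', snippet, flags=re.IGNORECASE)
print(re.search(r'\$(\d+(?:\.\d{2})?)', aggressive).group(1))  # 99.99 (actual selling price)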

Parsing Complex Pricing Logic with AI

Bundled offers, tiered discounts, and subscription pricing break traditional regex patterns. LLMs can understand context like “Buy 2 get 20% off” and extract the effective unit price.

Consider these pricing patterns:

  • “3 for $10 or $3.99 each”
  • “Subscribe & Save 15% - $8.49/delivery”
  • “Members pay $299/year + $9.99/mo”

Regex cannot parse the conditional logic. AI extraction maps these phrases to structured data.

The following script uses the HasData API to apply AI extraction rules. It bypasses the need for local DOM parsing by offloading the logic to an LLM optimized for web structures.

import requests
import json
import re
from decimal import Decimal


# Configuration
API_KEY = "YOUR_HASDATA_API_KEY"
TARGET_URL = "https://www.amazon.com/dp/B0C138SH1L/"


# We use AI extraction rules to bypass fragile CSS selectors.
# This schema forces the LLM to return strictly typed financial data.


payload = {
    "url": TARGET_URL,
    "proxyType": "residential",
    "proxyCountry": "US",       # Localizes currency to USD
    "jsRendering": True,        # Handles React hydration
    "aiExtractRules": {
        "price_data": {
            "type": "list",
            "output": {
                "product_variant": {
                    "type": "string",
                    "description": "Current product or product variant"
                },
                "current_price": {
                    "type": "string",
                    "description": "Current selling price with currency symbol"
                },
                "original_price": {
                    "type": "string",
                    "description": "Original price before discount if available"
                },
                "currency": {
                    "type": "string",
                    "description": "ISO 4217 currency code (USD, EUR, GBP)"
                },
                "availability": {
                    "type": "string",
                    "description": "Stock status: In Stock, Out of Stock, or specific quantity"
                }
            }
        }
    }
}


def normalize_price(price_str):
    """
    Converts raw strings like '$1,299.00' into Decimal objects.
    """
    if not price_str:
        return None
    # Remove non-numeric characters except the dot
    clean = re.sub(r"[^\d.]", "", price_str)
    return str(Decimal(clean)) if clean else None


try:
    print(f"Fetching data for {TARGET_URL}...")
    response = requests.post(
        "https://api.hasdata.com/scrape/web",
        headers={"x-api-key": API_KEY, "Content-Type": "application/json"},
        json=payload,
        timeout=30
    )


    if response.status_code == 200:
        data = response.json()
        # The API returns the LLM's structured response directly
        price_info = data.get("aiResponse", {}).get("price_data", [])
       
        print(f"\nExtracted {len(price_info)} variants:")
        header = f"{'Variant':45} | {'Current':10} | {'Original':10} | Availability"
        print(header)
        print("-" * len(header))


        for item in price_info:
            variant = item.get("product_variant", "Unknown")[:45]
            curr = normalize_price(item.get("current_price"))
            orig = normalize_price(item.get("original_price"))
            avail = item.get("availability", "Unknown")
           
            print(f"{variant:45} | {str(curr):10} | {str(orig):10} | {avail}")


    else:
        print(f"Extraction Failed: {response.status_code} - {response.text}")


except Exception as e:
    print(f"Runtime Error: {e}")

Execution Log

Fetching data for https://www.amazon.com/dp/B0C138SH1L/...


Extraction Results:
Variant                                       | Current    | Original   | Cur | Availability
--------------------------------------------------------------------------------------------
(001) Black / Black / Reflective              | 25.81      | 30.00      | USD | In Stock
(025) Castlerock / Castlerock / Reflective    | 26.80      | 30.00      | USD | In Stock
(100) White / White / Reflective              | 28.00      | 30.00      | USD | In Stock

The aiExtractRules parameter acts as a contract. The LLM parses the entire HTML document and looks for patterns matching “Price with currency symbol” relative to “Variant name”. It handles the logic of associating specific prices with specific size/color combinations, information that is often spread across disjointed DOM elements.

Managing Geo Specific Pricing

Dynamic geo-pricing is a standard revenue optimization strategy for SaaS companies, airlines, and global retailers. If your scraper uses a generic datacenter IP, it will likely default to the US version of the site (displaying prices in USD without VAT) or be blocked entirely.

Impact of IP Location

  • Currency Redirects. Visitors with German IP addresses are auto-redirected to /de/ and shown euros.
  • VAT Inclusion. EU prices must include tax, US prices usually exclude it.
  • Inventory Segmentation. Products available in the US warehouse might be “Out of Stock” for UK visitors.

To capture accurate local market data, you must bind your scraping session to a specific region. This requires a proxy network capable of granular country targeting.

The following script audits a target URL across multiple geographies. It iterates through a list of country codes, configuring the HasData residential proxy network to exit from that specific region. This allows you to compare price variances.

import requests
import json
from decimal import Decimal
import re


# Configuration
API_KEY = "YOUR_HASDATA_API_KEY"
TARGET_URL = "https://www.amazon.com/dp/B0DMXKG2QL/"


# We want to audit pricing across these specific markets
TARGET_REGIONS = ["US", "DE", "IN", "BR"]


def get_price_from_region(country_code):
    """
    Fetches the price as seen by a user in a specific country.
    """
    payload = {
        "url": TARGET_URL,
        "proxyType": "residential",
        # This parameter routes traffic through a physical ISP in the target nation
        "proxyCountry": country_code,
        "jsRendering": True,
        # Using AI extraction to handle different layouts/languages per region automatically
        "aiExtractRules": {
            "price": {
                "type": "string",
                "description": "The price of current product variant"
            }
        }
    }


    try:
        response = requests.post(
            "https://api.hasdata.com/scrape/web",
            headers={"x-api-key": API_KEY, "Content-Type": "application/json"},
            json=payload,
            timeout=45
        )
       
        if response.status_code == 200:
            data = response.json()
            raw_price = data.get("aiResponse", {}).get("price")
            return raw_price
        else:
            return f"Error {response.status_code}"
           
    except Exception as e:
        return f"Failed: {str(e)}"


# Execution Loop
print(f"{'Region':6} | {'Detected Price':20}")
print("-" * 30)


for region in TARGET_REGIONS:
    price_display = get_price_from_region(region)
    print(f"{region:6} | {str(price_display):20}")

Execution Log

Region | Detected Price      
------------------------------
US     | $24.69
DE     | EUR 17.56
IN     | INR 1,848.03
BR     | $24.69

By rotating the proxyCountry parameter, you simulate a physical presence in that region. The HasData API handles the underlying routing, ensuring that the target website sees a legitimate residential IP, which triggers the correct localized pricing logic on the frontend.

Designing a Price Monitoring Architecture

A production-ready price monitor does not simply save the current price. It records a snapshot of the price at a specific point in time. This creates an audit trail allowing you to calculate inflation, verify discount claims, and trigger alerts based on relative changes.

Never UPDATE a price row. Always INSERT a new record with a timestamp. This allows you to query historical trends and calculate volatility. While production systems often use PostgreSQL or TimescaleDB, SQLite is sufficient for lightweight monitoring agents.

The following class implements a “Snapshot Storage” pattern. It persists every price observation and includes logic to detect percentage drops between the two most recent data points.

import sqlite3
from datetime import datetime
from decimal import Decimal


class PriceTracker:
    """
    Minimal price monitoring system using SQLite.
   
    Architecture Note:
    In production (Postgres/MySQL), use the DECIMAL/NUMERIC type for the 'price' column.
    SQLite stores this as REAL (float), so we cast back to Decimal in Python
    to ensure calculation precision.
    """
   
    def __init__(self, db_path="prices.db"):
        self.conn = sqlite3.connect(db_path)
        self._setup()
   
    def _setup(self):
        """Initializes the time-series schema."""
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS price_history (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                url TEXT NOT NULL,
                price REAL NOT NULL,
                currency TEXT DEFAULT 'USD',
                scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            )
        """)
        # Indexing URL is critical for fast history lookups
        self.conn.execute("CREATE INDEX IF NOT EXISTS idx_url ON price_history(url)")
        self.conn.commit()
   
    def save(self, url, price, currency="USD"):
        """
        Persists a price snapshot.
        Never updates old records; always appends new history.
        """
        self.conn.execute(
            "INSERT INTO price_history (url, price, currency) VALUES (?, ?, ?)",
            (url, float(price), currency)
        )
        self.conn.commit()
   
    def check_drop(self, url, threshold_percent=10):
        """
        Calculates variance between the latest two snapshots.
        Returns an alert dict if the drop exceeds the threshold.
        """
        cursor = self.conn.execute(
            "SELECT price FROM price_history WHERE url = ? ORDER BY scraped_at DESC LIMIT 2",
            (url,)
        )
        # Fetch latest two prices and convert back to Decimal for precise math
        prices = [Decimal(str(row[0])) for row in cursor.fetchall()]
       
        # Need at least two data points to compare
        if len(prices) < 2:
            return None
       
        current, previous = prices[0], prices[1]
       
        # Sanity Check: Ignore 0.00 prices (often scraping errors)
        if current <= 0 or previous <= 0:
            return None


        # Calculate percentage drop
        if current < previous:
            drop_percent = ((previous - current) / previous) * 100
           
            if drop_percent >= threshold_percent:
                return {
                    "previous": previous,
                    "current": current,
                    "savings": previous - current,
                    "discount": drop_percent
                }
        return None


# Usage: Monitor product prices
if __name__ == "__main__":
    tracker = PriceTracker()


    # Simulated Scrape Event
    target_url = "https://demo.hyva.io/default/chaz-kangeroo-hoodie.html"
   
    # Assume we scraped these values over time
    # tracker.save(target_url, Decimal("249.99")) # Yesterday
    tracker.save(target_url, Decimal("199.99"))   # Today (Sale)


    # Check for price drops
    alert = tracker.check_drop(target_url, threshold_percent=10)
    if alert:
        print(f"ALERT: Price dropped for {target_url}")
        print(f"Old: ${alert['previous']} | New: ${alert['current']}")
        print(f"Savings: ${alert['savings']} ({alert['discount']:.1f}% off)")

The check_drop method implements a variance check. In a real scenario, you must add a Sanity Filter (included above). If a scraper breaks and returns 0.00 or 0.01, you do not want to trigger a “99% Discount” alert. Always validate that current_price > reasonable_min_value before dispatching notifications.
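A minimal sketch of that sanity filter is shown below. The plausibility bounds are illustrative placeholders, not values taken from the tracker above; in practice you would tune them per product category or derive them from price history.

from decimal import Decimal


# Illustrative bounds; tune per product or derive from historical data
MIN_PLAUSIBLE = Decimal("1.00")
MAX_PLAUSIBLE = Decimal("10000.00")


def is_sane_price(price, previous=None, max_drop_percent=Decimal("90")):
    """Rejects values that are more likely scraper failures than real discounts."""
    if price is None or price < MIN_PLAUSIBLE or price > MAX_PLAUSIBLE:
        return False
    # A >90% overnight drop is almost always a parsing error, not a sale
    if previous and previous > 0:
        drop = (previous - price) / previous * 100
        if drop >= max_drop_percent:
            return False
    return True

Run this gate on every scraped value before calling tracker.save() or dispatching alerts.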

Scaling Without Triggering Bans

Scaling a price monitor from 100 to 1 million requests requires overcoming Web Application Firewalls (WAFs) like Cloudflare and Akamai. These systems do not rely solely on IP reputation. They use TLS Fingerprinting.

When a Python script initiates an HTTPS connection, it sends a ClientHello packet. Standard libraries like requests or urllib produce a handshake structure that clearly identifies them as scripted clients. Even with a premium residential IP, a WAF can still block you because your “digital fingerprint” matches a script, not a browser.

To scrape prices at scale, your infrastructure must mimic a human user at the network layer.

  1. TLS Spoofing. You must modify your HTTP client to send the same cipher suites and extensions as Chrome or Firefox. Libraries like curl-impersonate or specialized scraping APIs handle this automatically.
  2. Residential Proxy Rotation. Datacenter IPs (AWS, DigitalOcean) are flagged by default on sites like Nike or Amazon. You need a pool of residential proxies, which are IPs assigned to real home devices by ISPs (Comcast, Verizon).
  3. Concurrency Management. Hammering a domain with 500 concurrent threads will trigger rate limiting. Implement exponential backoff, where the scraper waits 2, 4, then 8 seconds after receiving a 429 Too Many Requests status (a minimal sketch follows below).
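The following sketch shows the backoff pattern with the standard requests library. The retry count, base delay, and Retry-After handling are illustrative defaults rather than values required by any particular WAF.

import time
import requests


def fetch_with_backoff(url, headers=None, max_retries=4, base_delay=2):
    """GETs a URL, backing off exponentially after 429 Too Many Requests."""
    for attempt in range(max_retries):
        response = requests.get(url, headers=headers, timeout=30)
        if response.status_code != 429:
            return response  # success, or a non-rate-limit error handled upstream
        # Honor Retry-After if present; otherwise wait 2, 4, 8, 16 seconds
        retry_after = response.headers.get("Retry-After", "")
        delay = int(retry_after) if retry_after.isdigit() else base_delay * (2 ** attempt)
        print(f"429 received, sleeping {delay}s (retry {attempt + 1}/{max_retries})")
        time.sleep(delay)
    raise RuntimeError(f"Still rate limited after {max_retries} retries: {url}")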

The HasData API abstracts this complexity. By setting proxyType="residential", you route traffic through a legitimate user device while the API handles the TLS handshake randomization on the backend.

Legal and Compliance Boundaries

Price monitoring sits at the intersection of business intelligence and contract law. Engineering teams must adhere to specific boundaries to mitigate liability.

Price points and SKU numbers are factual data. Facts generally do not qualify for copyright protection in the US and EU. You can scrape prices freely but avoid copying creative assets like editorial descriptions or product photography.

Checking stock levels by adding items to a shopping cart is technically aggressive. This action reserves inventory in the backend database and creates a temporary stockout for real customers. Legal teams classify this as a “Denial of Inventory” attack. Always rely on frontend text indicators or metadata rather than interacting with the cart API.
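Stock status can usually be read from the same structured data used for prices, which keeps the scraper read-only. The sketch below assumes the page exposes a schema.org Offer with an availability field; the field names follow the JSON-LD convention rather than any specific retailer's markup.

import json
from bs4 import BeautifulSoup


def extract_availability(html):
    """Reads stock status from JSON-LD instead of touching the cart API."""
    soup = BeautifulSoup(html, "html.parser")
    for script in soup.find_all("script", {"type": "application/ld+json"}):
        try:
            data = json.loads(script.string or "")
        except json.JSONDecodeError:
            continue
        if not isinstance(data, dict):
            continue
        offer = data.get("offers", {})
        if isinstance(offer, list):
            offer = offer[0] if offer else {}
        if not isinstance(offer, dict):
            continue
        availability = offer.get("availability", "")
        if availability:
            # Values look like "https://schema.org/InStock" or plain "OutOfStock"
            return availability.rsplit("/", 1)[-1]
    return "Unknown"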

Conclusion

Production price scraping requires engineering discipline across five critical domains:

  1. Data Extraction. Prioritize JSON APIs over HTML parsing for stability
  2. Normalization. Use Decimal types and ISO standards to prevent data corruption
  3. AI Parsing. Deploy LLMs with schema validation for complex pricing logic
  4. Geo-Awareness. Rotate proxies by country to capture localized pricing
  5. Compliance. Implement rate limiting and respect robots.txt

The architectural choice between self-hosted scrapers and managed APIs depends on scale. For monitoring 50-500 SKUs, BeautifulSoup with proper error handling suffices. Beyond 1,000 products or when targeting anti-bot platforms (Amazon, Nike), offload infrastructure to a scraping API like HasData.

Next Steps:

Price data becomes strategic intelligence only when extracted consistently and stored correctly. Start with the TL;DR snippet, then scale using the monitoring architecture as your dataset grows.

Valentina Skakun
Valentina is a software engineer who builds data extraction tools before writing about them. With a strong background in Python, she also leverages her experience in JavaScript, PHP, R, and Ruby to reverse-engineer complex web architectures. If data renders in a browser, she will find a way to script its extraction.