
Web Scraping for SEO: Technical Guide

Sergey Ermakovich
Last update: 17 Feb 2026

SEO scraping is the automated extraction of SERP data, on-page elements, and competitor content for analysis. It replaces manual analysis with programmatic scale.

This guide shows you how to build production-grade SEO scrapers using Python. You will extract organic rankings, AI Overviews, competitor keywords, content structure, and technical SEO issues.

Core Tech Stack

Component | Recommended Tool | Function
Request Engine | httpx or requests | Sends HTTP requests to target URLs.
Parsing | parsel or BeautifulSoup | Extracts specific data points (titles, ranks) from HTML.
Anti-Bot | HasData API / Proxies | Handles IP rotation, CAPTCHAs, and fingerprinting.
Storage | pandas / CSV | Organizes unstructured HTML into analysis-ready datasets.

Common Use Cases

  1. Rank Tracking: Monitor keyword positions across locations and devices.
  2. SERP Feature Analysis: Extract PAA questions, AI Overviews, and Featured Snippets.
  3. Competitor Auditing: Scrape competitor headings, word counts, and schema markup.
  4. Technical SEO Monitoring: Check your site for broken links, missing meta tags, and noindex flags.
  5. Brand Monitoring: Detect unauthorized mentions or negative reviews.

Prerequisites and Setup

Install the required Python libraries for HTTP requests and HTML parsing.

pip install httpx parsel pandas requests

What each library does:

  • httpx: HTTP/2 client with better connection pooling than requests.
  • parsel: XPath and CSS selector engine (used by Scrapy).
  • pandas: Data structuring and CSV export.
  • requests: Standard HTTP client for API calls.

HasData API Setup

For scraping Google SERPs or JavaScript-rendered sites, use the HasData API. It handles headless browsers, CAPTCHA solving, and proxy rotation automatically.

  1. Get your API Key: Sign up at HasData.com and copy your API key from the dashboard. Your key allows you to use both the Web Scraping API (for any website) and the Google SERP API (specifically for search results).
  2. Test the Connection: Verify your setup with a simple request.
import requests

API_KEY = "HASDATA_API_KEY"  # replace with the key from your dashboard

url = "https://api.hasdata.com/scrape/web"
payload = {
    "url": "https://httpbin.org/ip",  # echoes the requesting IP, confirming the proxy works
    "proxyType": "datacenter"
}
headers = {"x-api-key": API_KEY, "Content-Type": "application/json"}
response = requests.post(url, json=payload, headers=headers)
print(response.json())

Once you have a successful response, you are ready to start building the scrapers.

Scraping Your Own Website for Technical SEO Audit

Before analyzing competitors, you must ensure your own foundation is solid. Analyzing thousands of pages manually is impossible. A custom scraper allows you to audit your entire site structure, detect technical errors, and verify metadata implementation in minutes.

Standard SEO crawlers like Screaming Frog are excellent but are limited by their GUI and lack of customization. A Python scraper gives you direct access to raw HTML. You control what you extract and how you structure the output.

Bulk Metadata Extractor

This script extracts core on-page SEO elements from a list of URLs. It identifies missing canonicals, duplicate titles, and accidental noindex tags.

import httpx
from parsel import Selector
import pandas as pd
import time
import os

OUTPUT_DIR = "output/1_metadata"

URLS = [
    "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "https://hasdata.com/blog/python-for-seo"
]


def scrape_metadata(url: str) -> dict:
    """Scrapes SEO metadata from a single URL. Returns a dict of metadata fields."""
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    }
    try:
        response = httpx.get(url, headers=headers, follow_redirects=True, timeout=10.0)

        if response.status_code != 200:
            return {"url": url, "status": response.status_code, "error": f"HTTP {response.status_code}"}

        selector = Selector(text=response.text)

        title = selector.xpath("//title/text()").get() or ""
        meta_desc = selector.xpath("//meta[@name='description']/@content").get() or ""

        return {
            "url": url,
            "final_url": str(response.url),
            "status": response.status_code,
            "title": title,
            "title_length": len(title),
            "meta_desc": meta_desc,
            "meta_desc_length": len(meta_desc),
            "canonical": selector.xpath("//link[@rel='canonical']/@href").get(),
            "h1": selector.xpath("//h1/text()").get(),
            "h1_count": len(selector.xpath("//h1").getall()),
            "word_count": len(" ".join(selector.xpath("//body//text()").getall()).split()),
            "robots_meta": selector.xpath("//meta[@name='robots']/@content").get(),
            "og_title": selector.xpath("//meta[@property='og:title']/@content").get(),
            "og_description": selector.xpath("//meta[@property='og:description']/@content").get(),
            "og_image": selector.xpath("//meta[@property='og:image']/@content").get(),
        }

    except Exception as e:
        return {"url": url, "status": 0, "error": str(e)}


def run_metadata_audit(urls: list = None, output_dir: str = OUTPUT_DIR) -> list:
    """
    Audits a list of URLs for SEO metadata.
    Saves results to {output_dir}/seo_audit.csv
    Returns list of result dicts.
    """
    if urls is None:
        urls = URLS

    os.makedirs(output_dir, exist_ok=True)
    results = []

    print(f"Auditing {len(urls)} pages...")
    for url in urls:
        print(f"  Scraping: {url}")
        data = scrape_metadata(url)
        results.append(data)
        time.sleep(1)

    df = pd.DataFrame(results)
    out_path = os.path.join(output_dir, "seo_audit.csv")
    df.to_csv(out_path, index=False)
    print(f"Saved {len(df)} rows → {out_path}\n")
    return results


if __name__ == "__main__":
    run_metadata_audit()

What This Detects:

Issue | Detection Logic | Impact
Missing Title | title field is empty | Pages cannot rank without titles
Title Too Long | title_length > 60 | Google truncates in SERP, poor CTR
Duplicate H1 | h1_count > 1 | Dilutes topical focus
Missing Canonical | canonical field is empty | Risk of duplicate content penalty
Accidental Noindex | robots_meta contains "noindex" | Page excluded from index
Thin Content | word_count < 300 | Low ranking potential
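
You can apply this detection logic directly to the seo_audit.csv the script writes. A minimal sketch:

import pandas as pd

df = pd.read_csv("output/1_metadata/seo_audit.csv")

# One boolean column per check from the table above
issues = pd.DataFrame({
    "url": df["url"],
    "missing_title": df["title"].fillna("") == "",
    "title_too_long": df["title_length"] > 60,
    "duplicate_h1": df["h1_count"] > 1,
    "missing_canonical": df["canonical"].isna(),
    "noindex": df["robots_meta"].fillna("").str.contains("noindex", case=False),
    "thin_content": df["word_count"] < 300,
})

# Report only pages with at least one flagged issue
flag_cols = issues.columns.drop("url")
print(issues[issues[flag_cols].any(axis=1)].to_string(index=False))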

Redirect Chain Detector

Long redirect chains (A → B → C → D) waste crawl budget and increase page load latency. Standard crawlers report the final status, but they often miss the intermediate hops. This script traces the full path of every URL.

import httpx
import pandas as pd
import os

OUTPUT_DIR = "output/2_redirects"

URLS_TO_CHECK = [
    "http://httpbin.org/redirect/3",
    "http://httpbin.org/redirect/1",
    "http://httpbin.org/status/200",
]


def check_redirect_chain(url: str) -> list:
    """
    Follows all redirects for a single URL.
    Returns a list of dicts, one per hop (including the final destination).
    """
    history = []
    try:
        response = httpx.get(url, follow_redirects=True, timeout=10.0)

        for resp in response.history:
            history.append({"url": str(resp.url), "status": resp.status_code})

        history.append({"url": str(response.url), "status": response.status_code})

    except Exception as e:
        history.append({"url": url, "status": 0, "error": str(e)})

    return history


def run_redirect_audit(urls: list = None, output_dir: str = OUTPUT_DIR) -> list:
    """
    Checks redirect chains for a list of URLs.
    Saves all hops to {output_dir}/redirect_chains.csv
    Returns flat list of hop dicts.
    """
    if urls is None:
        urls = URLS_TO_CHECK

    os.makedirs(output_dir, exist_ok=True)
    all_chains = []

    for url in urls:
        print(f"\nChecking: {url}")
        chain = check_redirect_chain(url)

        for i, step in enumerate(chain):
            step["original_url"] = url
            step["hop_number"] = i + 1
            all_chains.append(step)
            print(f"  [{step['status']}] → {step['url']}")

        if len(chain) > 2:
            print(f"  ⚠ Chain length: {len(chain)} (recommended ≤ 2)")

    df = pd.DataFrame(all_chains)
    out_path = os.path.join(output_dir, "redirect_chains.csv")
    df.to_csv(out_path, index=False)
    print(f"\nSaved {len(all_chains)} hops → {out_path}\n")
    return all_chains


if __name__ == "__main__":
    run_redirect_audit()

What This Detects:

Issue | Detection Logic | Impact
Long Redirect Chains | hop_number > 2 | Wastes crawl budget, increases page load time
Temporary Redirects | status == 302 | Google may not pass full link equity
Redirect Loops | Final status ≠ 200 | Page unreachable, complete crawl failure
Unintended Domain Change | Original domain ≠ final domain | Link equity loss if redirect is accidental
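
The same table translates into a few lines over redirect_chains.csv:

import pandas as pd
from urllib.parse import urlparse

df = pd.read_csv("output/2_redirects/redirect_chains.csv")

for original_url, chain in df.groupby("original_url", sort=False):
    final = chain.iloc[-1]
    problems = []
    if len(chain) > 2:
        problems.append(f"long chain ({len(chain)} hops)")
    if (chain["status"] == 302).any():
        problems.append("302 in chain")
    if final["status"] != 200:
        problems.append(f"final status {final['status']}")
    if urlparse(original_url).netloc != urlparse(str(final["url"])).netloc:
        problems.append("domain change")
    if problems:
        print(f"{original_url}: {', '.join(problems)}")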

What You Have Now

You can audit thousands of pages for technical issues without Screaming Frog’s GUI. You can version control your audits in CSV format and automate them via cron or GitHub Actions.

But technical health is only half the battle. SEO is competitive. You need to understand what content structure and keywords competitors use to rank. The next section shows how to extract competitor headings, calculate keyword density, and identify schema markup patterns.

Scraping Competitor Websites for Content Analysis

While technical health is the foundation, content drives rankings. To outrank a competitor, you need to understand what they publish and how they structure it. This section extracts heading structures, keyword patterns, and schema markup from top-ranking pages.

Content Structure Analysis (H2/H3 & Word Count)

Search engines value comprehensive coverage. By scraping the heading hierarchy (H1-H3) of top-ranking pages, you can create a “master outline” that covers all subtopics your competitors address.

This function extracts the skeleton of any article, allowing you to visualize its logical flow and depth.

import httpx
from parsel import Selector
import pandas as pd
import time
import os
from urllib.parse import urlparse

OUTPUT_DIR = "output/3_structure"

COMPETITOR_URLS = [
    "https://books.toscrape.com/",
    "https://hasdata.com/blog/python-for-seo"
]


def extract_content_structure(url: str) -> dict:
    """
    Extracts content structure from a single URL.
    Returns a dict with summary stats and heading lists.
    """
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    }
    try:
        response = httpx.get(url, headers=headers, follow_redirects=True, timeout=10.0)

        if response.status_code != 200:
            return {"url": url, "error": f"HTTP {response.status_code}"}

        selector = Selector(text=response.text)

        return {
            "url": url,
            "h1": selector.xpath("//h1//text()").get(),
            "h2_list": [h.strip() for h in selector.xpath("//h2//text()").getall() if h.strip()],
            "h3_list": [h.strip() for h in selector.xpath("//h3//text()").getall() if h.strip()],
            "h2_count": len(selector.xpath("//h2").getall()),
            "h3_count": len(selector.xpath("//h3").getall()),
            "word_count": len(" ".join(selector.xpath("//body//text()").getall()).split()),
            "paragraph_count": len(selector.xpath("//p").getall()),
            "internal_links": len(selector.xpath("//a[starts-with(@href, '/')]").getall()),
            "external_links": len(selector.xpath("//a[starts-with(@href, 'http')]").getall()),
        }

    except Exception as e:
        return {"url": url, "error": str(e)}


def run_content_structure_audit(urls: list = None, output_dir: str = OUTPUT_DIR) -> list:
    """
    Analyzes content structure for a list of URLs.
    For each URL saves two files in {output_dir}/{domain}/:
      - summary.csv   — one row with counts (word_count, h2_count, etc.)
      - headings.csv  — one row per heading (level | text)
    Also saves a combined structure_summary.csv across all URLs.
    Returns list of result dicts.
    """
    if urls is None:
        urls = COMPETITOR_URLS

    os.makedirs(output_dir, exist_ok=True)
    results = []

    for url in urls:
        print(f"  Analyzing: {url}")
        data = extract_content_structure(url)
        results.append(data)

        # Per-site subfolder
        domain = urlparse(url).netloc.replace(".", "_")
        site_dir = os.path.join(output_dir, domain)
        os.makedirs(site_dir, exist_ok=True)

        # Summary row (no list columns)
        summary_row = {k: v for k, v in data.items() if k not in ("h2_list", "h3_list")}
        pd.DataFrame([summary_row]).to_csv(os.path.join(site_dir, "summary.csv"), index=False)

        # Headings flat (one heading per row)
        headings_rows = []
        for h2 in data.get("h2_list", []):
            headings_rows.append({"level": "H2", "text": h2})
        for h3 in data.get("h3_list", []):
            headings_rows.append({"level": "H3", "text": h3})
        pd.DataFrame(headings_rows).to_csv(os.path.join(site_dir, "headings.csv"), index=False)

        print(f"Saved → {site_dir}/summary.csv + headings.csv")
        time.sleep(1)

    # Combined summary across all URLs
    all_summary = [{k: v for k, v in r.items() if k not in ("h2_list", "h3_list")} for r in results]
    combined_path = os.path.join(output_dir, "structure_summary.csv")
    pd.DataFrame(all_summary).to_csv(combined_path, index=False)
    print(f"Combined summary → {combined_path}\n")

    return results


if __name__ == "__main__":
    run_content_structure_audit()

What This Reveals:

Insight | Detection Method | Action
Missing topic sections | H2 appears in 3+ competitors but not in your content | Add that section to your article
Content depth gap | Competitors average 2,500 words, you have 800 | Expand content or add subsections
Multimedia deficit | Competitors use 8-12 images, you have 2 | Add diagrams, screenshots, or charts
Weak internal linking | Competitors link to 5-10 related pages, you link to 1 | Build topic cluster with internal links
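
One way to quantify these gaps is to compare your own page’s row in structure_summary.csv against the competitor average. A rough sketch (YOUR_URL is a placeholder for a page you have already run through the audit):

import pandas as pd

YOUR_URL = "https://example.com/your-article"  # placeholder

df = pd.read_csv("output/3_structure/structure_summary.csv")
yours = df[df["url"] == YOUR_URL]
competitors = df[df["url"] != YOUR_URL]

if yours.empty:
    print("Run run_content_structure_audit() on your own URL first.")
else:
    for metric in ["word_count", "h2_count", "h3_count", "image_count", "internal_links"]:
        own = float(yours.iloc[0][metric])
        avg = float(competitors[metric].mean())
        flag = "GAP" if own < avg else "ok"
        print(f"{metric:15s} you: {own:7.0f}  competitor avg: {avg:9.1f}  [{flag}]")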

For JavaScript-rendered competitor sites, use HasData’s Web Scraping API with render: true to get the full DOM.
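
A hedged sketch of that request, reusing the Web Scraping API endpoint from the setup section. The exact field that carries the rendered HTML in the response JSON is an assumption here; verify it against the HasData docs:

import requests
from parsel import Selector

API_KEY = "HASDATA_API_KEY"

payload = {
    "url": "https://example.com/js-heavy-page",  # placeholder target
    "render": True,  # ask the API to render JavaScript in a headless browser first
    "proxyType": "datacenter",
}
headers = {"x-api-key": API_KEY, "Content-Type": "application/json"}

response = requests.post("https://api.hasdata.com/scrape/web", json=payload, headers=headers, timeout=60)
data = response.json()

# Assumption: the rendered HTML is returned under "content" (or "html")
html = data.get("content") or data.get("html") or ""
selector = Selector(text=html)
print(selector.xpath("//h1//text()").get())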

Keyword Extraction (N-Gram Analysis)

Competitors use specific phrases repeatedly. These are often called LSI (Latent Semantic Indexing) keywords. Extracting them reveals the semantic context Google associates with your topic.

This script extracts the 20 most frequent 2-word and 3-word phrases from competitor content. You compare this list against your own content to find missing terminology.

import httpx
from parsel import Selector
from collections import Counter
import re
import pandas as pd
import os
from urllib.parse import urlparse

OUTPUT_DIR = "output/4_keywords"

DEFAULT_URL = "https://hasdata.com/blog/python-for-seo"

STOPWORDS = {
    "the", "is", "at", "which", "on", "a", "an", "as", "are", "was", "were",
    "be", "been", "in", "of", "to", "for", "with", "and", "or", "but", "not",
    "this", "that", "by", "from", "it", "can", "will", "you", "your", "has",
    "have", "had", "all", "its", "our", "we", "do", "so", "if", "about",
    "what", "how", "more", "than", "when", "their", "also", "into", "other",
}


def extract_ngrams(text: str, n: int = 2, top_n: int = 20) -> list:
    """Extracts top N n-grams from text after filtering stopwords."""
    words = re.findall(r"\b[a-z]+\b", text.lower())
    words = [w for w in words if w not in STOPWORDS and len(w) > 2]
    ngrams = zip(*[words[i:] for i in range(n)])
    ngram_list = [" ".join(gram) for gram in ngrams]
    return Counter(ngram_list).most_common(top_n)


def analyze_competitor_keywords(url: str) -> dict:
    """
    Fetches a page and extracts top bigrams and trigrams from body text.
    Returns dict with url, bigrams, trigrams.
    """
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}
    try:
        response = httpx.get(url, headers=headers, timeout=15.0)
        selector = Selector(text=response.text)

        body_text = " ".join(selector.xpath("//article//text() | //main//text()").getall())
        if not body_text.strip():
            body_text = " ".join(selector.xpath("//body//p//text()").getall())

        return {
            "url": url,
            "bigrams": extract_ngrams(body_text, n=2, top_n=20),
            "trigrams": extract_ngrams(body_text, n=3, top_n=15),
        }

    except Exception as e:
        return {"url": url, "error": str(e)}


def run_keyword_extraction(url: str = None, output_dir: str = OUTPUT_DIR) -> dict:
    """
    Runs keyword extraction for a URL.
    Saves bigrams and trigrams to {output_dir}/keywords_{domain}.csv
    Returns result dict.
    """
    if url is None:
        url = DEFAULT_URL

    os.makedirs(output_dir, exist_ok=True)
    result = analyze_competitor_keywords(url)

    if "error" in result:
        print(f"Error: {result['error']}")
        return result

    domain = urlparse(url).netloc.replace(".", "_")
    rows = []
    for phrase, count in result.get("bigrams", []):
        rows.append({"type": "bigram", "phrase": phrase, "count": count})
    for phrase, count in result.get("trigrams", []):
        rows.append({"type": "trigram", "phrase": phrase, "count": count})

    df = pd.DataFrame(rows)
    out_path = os.path.join(output_dir, f"keywords_{domain}.csv")
    df.to_csv(out_path, index=False)

    print(f"Top 2-Word Phrases:")
    for phrase, count in result["bigrams"]:
        print(f"  {phrase}: {count}")
    print(f"\nTop 3-Word Phrases:")
    for phrase, count in result["trigrams"]:
        print(f"  {phrase}: {count}")
    print(f"\nSaved → {out_path}\n")

    return result


if __name__ == "__main__":
    run_keyword_extraction()

Example Output:

Top 2-Word Phrases:

  web scraping: 47
  search results: 34
  rank tracking: 28
  serp features: 22
  organic rankings: 19

Top 3-Word Phrases:

  google search results: 18
  people also ask: 15
  web scraping api: 12

How to Use This:

  1. Run this script on your top 3 competitors.
  2. Export their keyword lists to CSV.
  3. Compare against your own content using Ctrl+F in your draft.
  4. If competitors mention ‘serp features’ frequently and you do not, you may be underweighting a core topic.

This is not keyword stuffing. This is semantic alignment. Google expects certain terminology in comprehensive content. Missing it signals incomplete coverage.

This is frequency-based analysis, not TF-IDF. For weighted keyword importance across multiple competitors, see Script 4 in our Python SEO guide.
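
To automate step 3 instead of using Ctrl+F, you can diff a competitor’s keyword CSV against your draft. A minimal sketch (both file paths are placeholders):

import pandas as pd

competitor_csv = "output/4_keywords/keywords_hasdata_com.csv"  # from the script above
draft_path = "my_draft.txt"  # your own article text

phrases = pd.read_csv(competitor_csv)
with open(draft_path, encoding="utf-8") as f:
    draft = f.read().lower()

# Phrases the competitor repeats that never appear in your draft
missing = phrases[~phrases["phrase"].apply(lambda p: p in draft)]
print("Competitor phrases missing from your draft:")
for _, row in missing.iterrows():
    print(f"  [{row['type']}] {row['phrase']} ({row['count']}x)")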

Schema Markup Extraction

Structured data (schema.org markup) helps pages appear in rich snippets, PAA boxes, and knowledge panels. If your competitors implement FAQPage or HowTo schema and you do not, you are invisible in those SERP features.

This script extracts all JSON-LD structured data from a competitor page.

import httpx
from parsel import Selector
import json
import os
from urllib.parse import urlparse

OUTPUT_DIR = "output/5_markup"

DEFAULT_URL = "https://hasdata.com/blog/python-for-seo"


def extract_schema(url: str) -> dict:
    """
    Extracts all JSON-LD schema blocks from a page.
    Returns dict with url, schemas list, and schema_types list.
    """
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}
    try:
        response = httpx.get(url, headers=headers, timeout=15.0)
        selector = Selector(text=response.text)

        raw_schemas = selector.xpath("//script[@type='application/ld+json']/text()").getall()

        parsed_schemas = []
        for raw in raw_schemas:
            try:
                parsed_schemas.append(json.loads(raw))
            except json.JSONDecodeError:
                continue

        return {
            "url": url,
            "schema_count": len(parsed_schemas),
            "schema_types": [s.get("@type") for s in parsed_schemas if "@type" in s],
            "schemas": parsed_schemas,
        }

    except Exception as e:
        return {"url": url, "error": str(e)}


def run_markup_extraction(url: str = None, output_dir: str = OUTPUT_DIR) -> dict:
    """
    Extracts schema markup from a URL and saves to JSON.
    Saves to {output_dir}/schema_{domain}.json
    Returns result dict.
    """
    if url is None:
        url = DEFAULT_URL

    os.makedirs(output_dir, exist_ok=True)
    result = extract_schema(url)

    if "error" in result:
        print(f"Error: {result['error']}")
        return result

    print(f"Schema Types Found: {result['schema_types']}")
    for schema in result["schemas"]:
        schema_type = schema.get("@type", "Unknown")
        print(f"\n--- {schema_type} ---")
        print(json.dumps(schema, indent=2)[:400])

    domain = urlparse(url).netloc.replace(".", "_")
    out_path = os.path.join(output_dir, f"schema_{domain}.json")
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(result, f, indent=2, ensure_ascii=False)
    print(f"\nSaved → {out_path}\n")

    return result


if __name__ == "__main__":
    run_markup_extraction()

Example Output:

Schema Types Found: ['Article', 'FAQPage', 'BreadcrumbList']

--- Article ---
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "The Complete Guide to SEO Scraping",
  "author": {
    "@type": "Person",
    "name": "John Doe"
  },
  "datePublished": "2026-01-15"
}

--- FAQPage ---
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "What is SEO scraping?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "SEO scraping is..."
      }
    }
  ]
}

What This Tells You:

Schema Type | Purpose | Action
Article | Marks content as an editorial article | Add to your blog posts
FAQPage | Enables PAA-style rich snippets | Add if you have a Q&A section
HowTo | Shows step-by-step instructions in SERP | Use for tutorial content
Product | Displays price, reviews, availability | Required for e-commerce
BreadcrumbList | Shows site hierarchy in SERP | Improves site navigation signals

If your top-ranking competitor has FAQPage schema and appears in the “People Also Ask” box, their schema is working. You should implement the same structure.
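
Building the equivalent block is mostly templating. A minimal sketch that generates FAQPage JSON-LD from your own Q&A pairs (the questions here are placeholders):

import json

faq_pairs = [
    ("What is SEO scraping?", "The automated extraction of SERP and on-page data for analysis."),
    ("What tools do I need?", "An HTTP client, an HTML parser, and an anti-bot layer."),
]

schema = {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [
        {
            "@type": "Question",
            "name": question,
            "acceptedAnswer": {"@type": "Answer", "text": answer},
        }
        for question, answer in faq_pairs
    ],
}

# Paste the output into a <script type="application/ld+json"> tag on your page
print(json.dumps(schema, indent=2))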

What You Have Now

You can audit competitor content structure, extract their keyword patterns, and reverse-engineer their schema strategy. This data tells you exactly what topics to cover, which phrases to emphasize, and which structured data to implement.

But content lives on the SERP. The next section shows how to scrape Google Search results directly to see what ranks, what AI Overviews say, and what questions users ask.

Scraping Google Search Results for Strategic Intelligence

Analyzing your own pages is useful, but the real ROI comes from analyzing the search results themselves. You cannot optimize for rankings you do not measure. The Google SERP is the most valuable data source in SEO. It reveals three layers of intelligence: what ranks (organic results), what Google synthesizes (AI Overviews), and what users want to know next (PAA and Related Searches).

We will use the HasData Google SERP API. It returns structured JSON, eliminating the need to parse volatile HTML or manage headless browsers for JavaScript rendering.

Extracting Organic Rankings

This is your baseline rank tracker. For every keyword, you need position, URL, title, and snippet to detect title rewrites and identify competitors.

import requests
import pandas as pd
import os
from urllib.parse import urlparse

OUTPUT_DIR = "output/6_rankings"
API_KEY = "HASDATA_API_KEY"  # https://app.hasdata.com/

KEYWORDS = [
    "python web scraping tutorial",
    "seo audit tools",
]


def scrape_organic_results(keyword: str, api_key: str = None) -> list:
    """
    Fetches organic search results for a keyword via HasData SERP API.
    Returns list of result dicts (position, title, url, snippet, domain).
    """
    if api_key is None:
        api_key = API_KEY

    url = "https://api.hasdata.com/scrape/google/serp"
    headers = {"x-api-key": api_key}
    params = {
        "q": keyword,
        "gl": "us",
        "hl": "en",
        "location": "New York,New York,United States",
    }

    try:
        response = requests.get(url, headers=headers, params=params, timeout=30)

        if response.status_code != 200:
            print(f"  Error {response.status_code}: {response.text[:200]}")
            return []

        data = response.json()
        organic = data.get("organicResults", [])

        return [
            {
                "keyword": keyword,
                "position": item.get("position"),
                "title": item.get("title"),
                "url": item.get("link"),
                "snippet": item.get("snippet"),
                "domain": item.get("displayedLink", "").split("/")[0],
            }
            for item in organic
        ]

    except Exception as e:
        print(f"  Exception: {e}")
        return []


def run_organic_rankings(
    keywords: list = None, api_key: str = None, output_dir: str = OUTPUT_DIR
) -> list:
    """
    Scrapes organic rankings for a list of keywords.
    Saves one CSV per keyword to {output_dir}/organic_{keyword}.csv
    Also saves a combined file: organic_all.csv
    Returns flat list of all result dicts.
    """
    if keywords is None:
        keywords = KEYWORDS

    os.makedirs(output_dir, exist_ok=True)
    all_data = []

    for kw in keywords:
        print(f"  Scraping: {kw}")
        results = scrape_organic_results(kw, api_key=api_key)
        all_data.extend(results)

        if results:
            safe_kw = kw.replace(" ", "_")
            out_path = os.path.join(output_dir, f"organic_{safe_kw}.csv")
            pd.DataFrame(results).to_csv(out_path, index=False)
            print(f"  {len(results)} results → {out_path}")

    if all_data:
        combined_path = os.path.join(output_dir, "organic_all.csv")
        pd.DataFrame(all_data).to_csv(combined_path, index=False)
        print(f"\nCombined: {len(all_data)} rows → {combined_path}\n")

    return all_data


if __name__ == "__main__":
    run_organic_rankings()

What This Tells You:

  • Your current position for target keywords.
  • Title tag patterns Google rewards (length, structure, keyword placement; see the sketch below).
  • Snippet length distribution (short vs. detailed).
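
A quick sketch for the title-pattern point, profiling organic_all.csv from the script above:

import pandas as pd

df = pd.read_csv("output/6_rankings/organic_all.csv")
df["title_length"] = df["title"].fillna("").str.len()

# Compare title lengths of top-3 results against the rest
df["bucket"] = df["position"].le(3).map({True: "top 1-3", False: "position 4+"})
print(df.groupby("bucket")["title_length"].agg(["mean", "min", "max"]))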

Extracting AI Overviews

AI Overviews appear on approximately 30% of Google searches, depending on query intent. If Google synthesizes an answer above your organic result, you must analyze its content and citations.

Monitor whether your domain is cited in AI Overviews for your target keywords. Track the narrative Google is presenting.

import requests
import json
import os

OUTPUT_DIR = "output/7_ai_overviews"
API_KEY = "HASDATA_API_KEY"  # https://app.hasdata.com/
DEFAULT_KEYWORD = "what is seo scraping"


def get_ai_overview(keyword: str, api_key: str = None) -> dict | None:
    """
    Fetches the AI Overview block for a keyword from Google SERP.
    Returns the aiOverview dict if present, otherwise None.
    """
    if api_key is None:
        api_key = API_KEY

    url = "https://api.hasdata.com/scrape/google/serp"
    params = {"q": keyword, "location": "United States", "gl": "us", "hl": "en"}
    headers = {"x-api-key": api_key}

    try:
        response = requests.get(url, params=params, headers=headers, timeout=30)

        if response.status_code != 200:
            print(f"  Error {response.status_code}: {response.text[:200]}")
            return None

        data = response.json()
        return data.get("aiOverview") or None

    except Exception as e:
        print(f"  Exception: {e}")
        return None


def run_ai_overviews(
    keyword: str = None, api_key: str = None, output_dir: str = OUTPUT_DIR
) -> dict | None:
    """
    Fetches and displays the AI Overview for a keyword.
    Saves result to {output_dir}/ai_overview_{keyword}.json
    Returns the aiOverview dict or None.
    """
    if keyword is None:
        keyword = DEFAULT_KEYWORD

    os.makedirs(output_dir, exist_ok=True)
    ai_overview = get_ai_overview(keyword, api_key=api_key)

    if ai_overview:
        print(f"\n--- AI Overview for '{keyword}' ---")
        print(f"Summary: {ai_overview.get('text', 'N/A')}\n")

        sources = ai_overview.get("sources", [])
        if sources:
            print("Cited Sources:")
            for src in sources:
                print(f"  - {src.get('title')}: {src.get('link')}")

        safe_kw = keyword.replace(" ", "_")
        out_path = os.path.join(output_dir, f"ai_overview_{safe_kw}.json")
        with open(out_path, "w", encoding="utf-8") as f:
            json.dump({"keyword": keyword, "ai_overview": ai_overview}, f, indent=2, ensure_ascii=False)
        print(f"\nSaved → {out_path}\n")
    else:
        print(f"No AI Overview found for '{keyword}'")

    return ai_overview


if __name__ == "__main__":
    run_ai_overviews()

Use Case:

  • Audit which competitor domains are cited most frequently across your keyword set (sketched below).
  • Reverse-engineer their content patterns (depth, citation density, schema usage).
  • Track when AI Overviews appear or disappear for your keywords (indicates query intent shifts).

For tracking AI Overview citations at scale across hundreds of keywords, see our guide on Monitoring AI Overviews.
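
A sketch of that audit at small scale, reusing get_ai_overview from the script above (the keyword list is illustrative):

from collections import Counter
from urllib.parse import urlparse

from _7_ai_overviews import get_ai_overview  # the script above, saved as _7_ai_overviews.py

KEYWORDS = ["what is seo scraping", "how to scrape google", "serp api python"]

citations = Counter()
for kw in KEYWORDS:
    overview = get_ai_overview(kw)
    for src in (overview or {}).get("sources", []):
        domain = urlparse(src.get("link") or "").netloc
        if domain:
            citations[domain] += 1

print("Most-cited domains across the keyword set:")
for domain, count in citations.most_common(10):
    print(f"  {domain}: {count}")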

Extracting People Also Ask (PAA)

PAA questions are intent signals. They tell you what subtopics users expect in a complete answer.

import requests
import pandas as pd
import os

OUTPUT_DIR = "output/8_paa"
API_KEY = "HASDATA_API_KEY"  # https://app.hasdata.com/
DEFAULT_KEYWORD = "how to do seo for a website"


def scrape_paa(keyword: str, api_key: str = None) -> list:
    """
    Fetches People Also Ask questions for a keyword.
    Returns list of dicts: keyword, question, snippet, source.
    """
    if api_key is None:
        api_key = API_KEY

    url = "https://api.hasdata.com/scrape/google/serp"
    headers = {"x-api-key": api_key}
    params = {"q": keyword, "gl": "us", "hl": "en"}

    try:
        response = requests.get(url, headers=headers, params=params, timeout=30)

        if response.status_code != 200:
            print(f"  Error {response.status_code}: {response.text[:200]}")
            return []

        data = response.json()
        questions = []
        for item in data.get("relatedQuestions", []):
            questions.append(
                {
                    "keyword": keyword,
                    "question": item.get("question"),
                    "snippet": item.get("snippet"),
                    "source": item.get("link"),
                }
            )
        return questions

    except Exception as e:
        print(f"  Exception: {e}")
        return []


def run_people_also_ask(
    keyword: str = None, api_key: str = None, output_dir: str = OUTPUT_DIR
) -> list:
    """
    Fetches PAA questions for a keyword, prints and saves them.
    Saves to {output_dir}/paa_{keyword}.csv
    Returns list of question dicts.
    """
    if keyword is None:
        keyword = DEFAULT_KEYWORD

    os.makedirs(output_dir, exist_ok=True)
    paa_data = scrape_paa(keyword, api_key=api_key)

    if not paa_data:
        print(f"No PAA questions found for '{keyword}'")
        return []

    print(f"\n--- People Also Ask: '{keyword}' ---")
    for q in paa_data:
        print(f"Q: {q['question']}")
        snippet = (q.get("snippet") or "")[:120]
        print(f"A: {snippet}...\n")

    safe_kw = keyword.replace(" ", "_")
    out_path = os.path.join(output_dir, f"paa_{safe_kw}.csv")
    pd.DataFrame(paa_data).to_csv(out_path, index=False)
    print(f"Saved {len(paa_data)} questions → {out_path}\n")

    return paa_data


if __name__ == "__main__":
    run_people_also_ask()

Use Case:

  • Map PAA questions to H2/H3 sections in your content (a coverage check is sketched below).
  • Identify semantic gaps (questions competitors answer but you do not).
  • Monitor PAA volatility (new questions appearing indicates shifting user interest).

To build a recursive PAA topic graph (scraping questions within questions), refer to Script 3 in this guide.
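
For the mapping point, here is a rough coverage check that matches PAA questions against the headings.csv produced by the content structure script. The paths are examples, and token overlap is a crude heuristic, not semantic matching:

import pandas as pd

paa = pd.read_csv("output/8_paa/paa_how_to_do_seo_for_a_website.csv")
headings = pd.read_csv("output/3_structure/hasdata_com/headings.csv")

heading_text = " ".join(headings["text"].astype(str)).lower()

for question in paa["question"].dropna():
    # Share of significant question words already present in your headings
    words = [w for w in question.lower().split() if len(w) > 3]
    hits = sum(w in heading_text for w in words)
    ratio = hits / len(words) if words else 0.0
    status = "covered" if ratio >= 0.5 else "GAP"
    print(f"[{status}] {question}")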

Extracting Related Searches

Related Searches appear at the bottom of the SERP. They represent semantic variations and adjacent intents, which makes them keyword expansion opportunities: long-tail modifiers and cluster topics for pillar pages.

import requests
import pandas as pd
import os

OUTPUT_DIR = "output/9_related"
API_KEY = "HASDATA_API_KEY"  # https://app.hasdata.com/
DEFAULT_KEYWORD = "seo scraping tools"


def get_related_searches(keyword: str, api_key: str = None) -> list:
    """
    Fetches related search queries for a keyword from Google SERP.
    Returns a list of query strings.
    """
    if api_key is None:
        api_key = API_KEY

    url = "https://api.hasdata.com/scrape/google/serp"
    params = {"q": keyword, "location": "United States", "gl": "us", "hl": "en"}
    headers = {"x-api-key": api_key}

    try:
        response = requests.get(url, params=params, headers=headers, timeout=30)

        if response.status_code != 200:
            print(f"  Error {response.status_code}: {response.text[:200]}")
            return []

        data = response.json()
        return [item.get("query") for item in data.get("relatedSearches", []) if item.get("query")]

    except Exception as e:
        print(f"  Exception: {e}")
        return []


def run_related_searches(
    keyword: str = None, api_key: str = None, output_dir: str = OUTPUT_DIR
) -> list:
    """
    Fetches related searches for a keyword, prints and saves them.
    Saves to {output_dir}/related_{keyword}.csv
    Returns list of query strings.
    """
    if keyword is None:
        keyword = DEFAULT_KEYWORD

    os.makedirs(output_dir, exist_ok=True)
    queries = get_related_searches(keyword, api_key=api_key)

    if not queries:
        print(f"No related searches found for '{keyword}'")
        return []

    print(f"\n--- Related Searches for '{keyword}' ---")
    for q in queries:
        print(f"  - {q}")

    safe_kw = keyword.replace(" ", "_")
    out_path = os.path.join(output_dir, f"related_{safe_kw}.csv")
    pd.DataFrame({"keyword": keyword, "related_query": queries}).to_csv(out_path, index=False)
    print(f"\nSaved {len(queries)} queries → {out_path}\n")

    return queries


if __name__ == "__main__":
    run_related_searches()

Use Case:

  • Identify content clusters (group related terms with high overlap; a naive sketch follows below).
  • Discover informational vs. transactional modifiers.
  • Build internal linking architecture around these semantic connections.
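
A naive sketch for the clustering point: group queries that share a significant token. Embeddings would cluster better, but token overlap works as a first pass:

from collections import defaultdict

# Illustrative input; in practice, feed in the output of get_related_searches()
queries = [
    "seo scraping tools free",
    "best seo scraping tools",
    "python serp scraper",
    "serp api comparison",
]

GENERIC = {"best", "free", "tools", "with", "without"}

clusters = defaultdict(list)
for q in queries:
    for token in q.split():
        if token not in GENERIC and len(token) > 3:
            clusters[token].append(q)

# Any token shared by 2+ queries suggests a cluster / pillar-page topic
for token, members in clusters.items():
    if len(members) > 1:
        print(f"{token}: {members}")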

Complete Workflow Example

Combine all four data sources into one audit function.

import json
import os
from datetime import datetime
from urllib.parse import urlparse

from _1_metadata_extractor         import scrape_metadata, run_metadata_audit
from _2_redirect_chain_detector    import check_redirect_chain, run_redirect_audit
from _3_content_structure_analyzer import extract_content_structure, run_content_structure_audit
from _4_keyword_extraction         import analyze_competitor_keywords, run_keyword_extraction
from _5_markup_extraction          import extract_schema, run_markup_extraction
from _6_organic_rankings           import scrape_organic_results, run_organic_rankings
from _7_ai_overviews               import get_ai_overview, run_ai_overviews
from _8_people_also_ask            import scrape_paa, run_people_also_ask
from _9_related_searches           import get_related_searches, run_related_searches

OUTPUT_DIR = "output"
API_KEY = "HASDATA_API_KEY"  # https://app.hasdata.com/


#  MODE 1 — SERP AUDIT  (keyword → scripts 6, 7, 8, 9)

def serp_audit(keyword: str, api_key: str = None) -> dict:
    """
    Runs a full SERP audit for a keyword.
    """
    if api_key is None:
        api_key = API_KEY

    _header(f"SERP AUDIT: '{keyword}'")

    print("[6/9] Organic Rankings")
    organic = run_organic_rankings(keywords=[keyword], api_key=api_key)

    print("[7/9] AI Overview")
    ai_overview = run_ai_overviews(keyword=keyword, api_key=api_key)

    print("[8/9] People Also Ask")
    paa = run_people_also_ask(keyword=keyword, api_key=api_key)

    print("[9/9] Related Searches")
    related = run_related_searches(keyword=keyword, api_key=api_key)

    summary = {
        "mode": "serp_audit",
        "keyword": keyword,
        "timestamp": datetime.now().isoformat(),
        "organic_results": len(organic),
        "ai_overview_triggered": bool(ai_overview),
        "paa_questions": len(paa),
        "related_searches": len(related),
        "top_organic_urls": [r.get("url") for r in organic[:5]],
        "paa_questions_list": [q.get("question") for q in paa],
        "related_queries": related,
    }

    _save_summary(summary, label=keyword)
    _footer_serp(summary)
    return summary


#  MODE 2 — SITE AUDIT  (url → scripts 1, 2, 3, 4, 5)

def site_audit(url: str) -> dict:
    """
    Runs a full on-page audit for a single URL.
    """
    domain = urlparse(url).netloc
    _header(f"SITE AUDIT: {url}")

    print("[1/5] Metadata")
    metadata_list = run_metadata_audit(urls=[url])
    metadata = metadata_list[0] if metadata_list else {}

    print("[2/5] Redirect Chain")
    chain = check_redirect_chain(url)
    redirect_hops = len(chain)
    has_redirect_issue = redirect_hops > 2

    print("[3/5] Content Structure")
    structure_list = run_content_structure_audit(urls=[url])
    structure = structure_list[0] if structure_list else {}

    print("[4/5] Keyword Extraction")
    keywords = run_keyword_extraction(url=url)

    print("[5/5] Schema Markup")
    schema = run_markup_extraction(url=url)

    summary = {
        "mode": "site_audit",
        "url": url,
        "domain": domain,
        "timestamp": datetime.now().isoformat(),
        # Metadata
        "title": metadata.get("title"),
        "title_length": metadata.get("title_length"),
        "meta_desc_length": metadata.get("meta_desc_length"),
        "h1": metadata.get("h1"),
        "word_count": structure.get("word_count"),
        # Redirects
        "redirect_hops": redirect_hops,
        "redirect_warning": has_redirect_issue,
        # Structure
        "h2_count": structure.get("h2_count"),
        "h3_count": structure.get("h3_count"),
        "internal_links": structure.get("internal_links"),
        "external_links": structure.get("external_links"),
        # Schema
        "schema_types": schema.get("schema_types", []),
        # Keywords
        "top_bigrams": [phrase for phrase, _ in (keywords.get("bigrams") or [])[:5]],
        "top_trigrams": [phrase for phrase, _ in (keywords.get("trigrams") or [])[:5]],
    }

    _save_summary(summary, label=domain)
    _footer_site(summary)
    return summary


#  MODE 3 — FULL AUDIT  (keyword + url → all 9 scripts)

def full_audit(keyword: str, url: str, api_key: str = None) -> dict:
    """
    Runs both serp_audit and site_audit.
    Returns a combined summary dict with both results.
    """
    if api_key is None:
        api_key = API_KEY

    serp_summary = serp_audit(keyword=keyword, api_key=api_key)
    site_summary = site_audit(url=url)

    combined = {
        "mode": "full_audit",
        "keyword": keyword,
        "url": url,
        "timestamp": datetime.now().isoformat(),
        "serp": serp_summary,
        "site": site_summary,
    }

    _save_summary(combined, label=f"full_{keyword.replace(' ', '_')}")
    return combined


#  Helpers

def _header(title: str) -> None:
    print(f"\n{'═' * 60}")
    print(f"  {title}")
    print(f"{'═' * 60}\n")


def _save_summary(summary: dict, label: str) -> None:
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    safe = label.replace(" ", "_").replace("/", "_").replace(".", "_")
    path = os.path.join(OUTPUT_DIR, f"summary_{safe}.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(summary, f, indent=2, ensure_ascii=False)
    print(f"\nSummary saved → {path}")


def _footer_serp(s: dict) -> None:
    print(f"\n{'═' * 60}")
    print(f"  SERP AUDIT DONE — '{s['keyword']}'")
    print(f"  Organic     : {s['organic_results']} results")
    print(f"  AI Overview : {'Yes' if s['ai_overview_triggered'] else 'No'}")
    print(f"  PAA         : {s['paa_questions']} questions")
    print(f"  Related     : {s['related_searches']} queries")
    print(f"{'═' * 60}\n")


def _footer_site(s: dict) -> None:
    print(f"\n{'═' * 60}")
    print(f"  SITE AUDIT DONE — {s['url']}")
    print(f"  Title       : {s['title']} ({s['title_length']} chars)")
    print(f"  Words       : {s['word_count']}")
    print(f"  Headings    : {s['h2_count']} H2 / {s['h3_count']} H3")
    print(f"  Redirects   : {s['redirect_hops']} hop(s)" + (" ⚠" if s['redirect_warning'] else ""))
    print(f"  Schema      : {s['schema_types'] or 'none'}")
    print(f"  Top bigrams : {', '.join(s['top_bigrams'])}")
    print(f"{'═' * 60}\n")


#  Entry point — edit mode and values below

if __name__ == "__main__":
    DEMO_KEYWORD = "python web scraping"
    DEMO_URL = "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"

    # Pick one:

    # Mode 1 — SERP only (requires API key)
    # serp_audit(keyword=DEMO_KEYWORD, api_key=API_KEY)

    # Mode 2 — Site only (no API key needed)
    # site_audit(url=DEMO_URL)

    # Mode 3 — Full audit (requires API key)
    full_audit(keyword=DEMO_KEYWORD, url=DEMO_URL, api_key=API_KEY)

This gives you a snapshot of the entire competitive landscape in minutes. No manual copying. No screenshots. Just structured data ready for analysis.

For more advanced SERP automation (intent clustering, trend monitoring, content gap analysis), see our guide on 7 Python Scripts to Automate SEO in 2026.

When to Use Scraping vs Traditional Tools

Use scraping when you need:

  • Real-time data (not cached databases)
  • Custom location or device targeting
  • Bulk analysis across thousands of keywords
  • Integration with your internal dashboards

Use traditional tools (Ahrefs, Semrush) when you need:

  • Historical trend data
  • Backlink analysis
  • Competitor domain overviews

SEO scraping transforms you from a passive user of third-party data into an owner of your own intelligence pipeline. Whether you are auditing technical health, reverse-engineering competitor content, or tracking real-time rankings, the scripts in this guide give you the raw capability to see what others miss.

Sergey Ermakovich
Sergey is the Co-founder and CMO at HasData, a web scraping API handling billions of requests. He specializes in web data extraction infrastructure, anti-bot evasion strategies, and technical SEO. Sergey writes extensively on headless browser orchestration, API development, and scaling data pipelines for enterprise applications.