Web Scraping for SEO: Technical Guide
SEO scraping is the automated extraction of SERP data, on-page elements, and competitor content for analysis. It replaces manual review with programmatic scale.
This guide shows you how to build production-grade SEO scrapers using Python. You will extract organic rankings, AI Overviews, competitor keywords, content structure, and technical SEO issues.
Core Tech Stack
| Component | Recommended Tool | Function |
|---|---|---|
| Request Engine | httpx or requests | Sends HTTP requests to target URLs. |
| Parsing | parsel or BeautifulSoup | Extracts specific data points (titles, ranks) from HTML. |
| Anti-Bot | HasData API / Proxies | Handles IP rotation, CAPTCHAs, and fingerprinting. |
| Storage | Pandas / CSV | Organizes unstructured HTML into analysis-ready datasets. |
Common Use Cases
- Rank Tracking: Monitor keyword positions across locations and devices.
- SERP Feature Analysis: Extract PAA questions, AI Overviews, and Featured Snippets.
- Competitor Auditing: Scrape competitor headings, word counts, and schema markup.
- Technical SEO Monitoring: Check your site for broken links, missing meta tags, and noindex flags.
- Brand Monitoring: Detect unauthorized mentions or negative reviews.
Prerequisites and Setup
Install the required Python libraries for HTTP requests and HTML parsing.
pip install httpx parsel pandas requests
What each library does:
- httpx: HTTP/2 client with better connection pooling than requests.
- parsel: XPath and CSS selector engine (used by Scrapy).
- pandas: Data structuring and CSV export.
- requests: Standard HTTP client for API calls.
HasData API Setup
For scraping Google SERP or sites with JavaScript rendering, use the HasData API. It handles headless browsers, CAPTCHA solving, and proxy rotation automatically.
- Get your API Key: Sign up at HasData.com and copy your API key from the dashboard. Your key allows you to use both the Web Scraping API (for any website) and the Google SERP API (specifically for search results).
- Test the Connection: Verify your setup with a simple request.
import requests
API_KEY = "HASDATA_API_KEY"
url = "https://api.hasdata.com/scrape/web"
payload = {
"url": "https://httpbin.org/ip",
"proxyType": "datacenter"
}
headers = {"x-api-key": API_KEY, "Content-Type": "application/json"}
response = requests.post(url, json=payload, headers=headers)
print(response.json())
Once you have a successful response, you are ready to start building the scrapers.
Scraping Your Own Website for Technical SEO Audit
Before analyzing competitors, you must ensure your own foundation is solid. Analyzing thousands of pages manually is impossible. A custom scraper allows you to audit your entire site structure, detect technical errors, and verify metadata implementation in minutes.
Standard SEO crawlers like Screaming Frog are excellent but are limited by their GUI and lack of customization. A Python scraper gives you direct access to raw HTML. You control what you extract and how you structure the output.
Bulk Metadata Extractor
This script extracts core on-page SEO elements from a list of URLs. It identifies missing canonicals, duplicate titles, and accidental noindex tags.
import httpx
from parsel import Selector
import pandas as pd
import time
import os
OUTPUT_DIR = "output/1_metadata"
URLS = [
"https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
"https://hasdata.com/blog/python-for-seo"
]
def scrape_metadata(url: str) -> dict:
"""Scrapes SEO metadata from a single URL. Returns a dict of metadata fields."""
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}
try:
response = httpx.get(url, headers=headers, follow_redirects=True, timeout=10.0)
if response.status_code != 200:
return {"url": url, "status": response.status_code, "error": f"HTTP {response.status_code}"}
selector = Selector(text=response.text)
title = selector.xpath("//title/text()").get() or ""
meta_desc = selector.xpath("//meta[@name='description']/@content").get() or ""
return {
"url": url,
"final_url": str(response.url),
"status": response.status_code,
"title": title,
"title_length": len(title),
"meta_desc": meta_desc,
"meta_desc_length": len(meta_desc),
"canonical": selector.xpath("//link[@rel='canonical']/@href").get(),
"h1": selector.xpath("//h1/text()").get(),
"h1_count": len(selector.xpath("//h1").getall()),
"word_count": len(" ".join(selector.xpath("//body//text()").getall()).split()),
"robots_meta": selector.xpath("//meta[@name='robots']/@content").get(),
"og_title": selector.xpath("//meta[@property='og:title']/@content").get(),
"og_description": selector.xpath("//meta[@property='og:description']/@content").get(),
"og_image": selector.xpath("//meta[@property='og:image']/@content").get(),
}
except Exception as e:
return {"url": url, "status": 0, "error": str(e)}
def run_metadata_audit(urls: list = None, output_dir: str = OUTPUT_DIR) -> list:
"""
Audits a list of URLs for SEO metadata.
Saves results to {output_dir}/seo_audit.csv
Returns list of result dicts.
"""
if urls is None:
urls = URLS
os.makedirs(output_dir, exist_ok=True)
results = []
print(f"Auditing {len(urls)} pages...")
for url in urls:
print(f" Scraping: {url}")
data = scrape_metadata(url)
results.append(data)
time.sleep(1)
df = pd.DataFrame(results)
out_path = os.path.join(output_dir, "seo_audit.csv")
df.to_csv(out_path, index=False)
print(f"Saved {len(df)} rows → {out_path}\n")
return results
if __name__ == "__main__":
    run_metadata_audit()
What This Detects:
| Issue | Detection Logic | Impact |
|---|---|---|
| Missing Title | title field is empty | Pages cannot rank without titles |
| Title Too Long | title_length > 60 | Google truncates in SERP, poor CTR |
| Duplicate H1 | h1_count > 1 | Dilutes topical focus |
| Missing Canonical | canonical field is empty | Risk of duplicate content issues |
| Accidental Noindex | robots_meta contains “noindex” | Page excluded from index |
| Thin Content | word_count < 300 | Low ranking potential |
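The detection logic in the table above can be applied directly to the audit results. Here is a minimal sketch of a rule checker that mirrors those thresholds; the sample rows are illustrative, shaped like the dicts `scrape_metadata` returns (in practice you would load `seo_audit.csv` into pandas instead):

```python
import pandas as pd

def flag_seo_issues(row: dict) -> list:
    """Return issue labels for one audit row, using the table's thresholds."""
    issues = []
    if not row.get("title"):
        issues.append("missing_title")
    elif row.get("title_length", 0) > 60:
        issues.append("title_too_long")
    if row.get("h1_count", 0) > 1:
        issues.append("duplicate_h1")
    if not row.get("canonical"):
        issues.append("missing_canonical")
    if "noindex" in (row.get("robots_meta") or ""):
        issues.append("noindex")
    if row.get("word_count", 0) < 300:
        issues.append("thin_content")
    return issues

# Illustrative audit rows (same columns the extractor writes)
df = pd.DataFrame([
    {"url": "https://example.com/a", "title": "Short title", "title_length": 11,
     "h1_count": 1, "canonical": "https://example.com/a",
     "robots_meta": "index,follow", "word_count": 1200},
    {"url": "https://example.com/b", "title": "", "title_length": 0,
     "h1_count": 2, "canonical": None, "robots_meta": "noindex", "word_count": 90},
])
df["issues"] = df.apply(lambda r: ", ".join(flag_seo_issues(r.to_dict())), axis=1)
print(df[["url", "issues"]])
```

Keeping the rules in one function makes the thresholds easy to adjust per project (for example, raising the thin-content cutoff for long-form niches).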
Redirect Chain Detector
Long redirect chains (A → B → C → D) waste crawl budget and increase page load latency. Standard crawlers report the final status, but they often miss the intermediate hops. This script traces the full path of every URL.
import httpx
import pandas as pd
import os
OUTPUT_DIR = "output/2_redirects"
URLS_TO_CHECK = [
"http://httpbin.org/redirect/3",
"http://httpbin.org/redirect/1",
"http://httpbin.org/status/200",
]
def check_redirect_chain(url: str) -> list:
"""
Follows all redirects for a single URL.
Returns a list of dicts, one per hop (including the final destination).
"""
history = []
try:
response = httpx.get(url, follow_redirects=True, timeout=10.0)
for resp in response.history:
history.append({"url": str(resp.url), "status": resp.status_code})
history.append({"url": str(response.url), "status": response.status_code})
except Exception as e:
history.append({"url": url, "status": 0, "error": str(e)})
return history
def run_redirect_audit(urls: list = None, output_dir: str = OUTPUT_DIR) -> list:
"""
Checks redirect chains for a list of URLs.
Saves all hops to {output_dir}/redirect_chains.csv
Returns flat list of hop dicts.
"""
if urls is None:
urls = URLS_TO_CHECK
os.makedirs(output_dir, exist_ok=True)
all_chains = []
for url in urls:
print(f"\nChecking: {url}")
chain = check_redirect_chain(url)
for i, step in enumerate(chain):
step["original_url"] = url
step["hop_number"] = i + 1
all_chains.append(step)
print(f" [{step['status']}] → {step['url']}")
if len(chain) > 2:
print(f" ⚠ Chain length: {len(chain)} (recommended ≤ 2)")
df = pd.DataFrame(all_chains)
out_path = os.path.join(output_dir, "redirect_chains.csv")
df.to_csv(out_path, index=False)
print(f"\nSaved {len(all_chains)} hops → {out_path}\n")
return all_chains
if __name__ == "__main__":
    run_redirect_audit()
What This Detects:
| Issue | Detection Logic | Impact |
|---|---|---|
| Long Redirect Chains | hop_number > 2 | Wastes crawl budget, increases page load time |
| Temporary Redirects | status == 302 | Google may not pass full link equity |
| Redirect Loops | Final status ≠ 200 | Page unreachable, complete crawl failure |
| Unintended Domain Change | Original domain ≠ final domain | Link equity loss if redirect is accidental |
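The four checks in the table reduce to a few comparisons over a chain's hop list. A minimal sketch, assuming hops shaped like the dicts `check_redirect_chain` returns (the example chain is illustrative):

```python
from urllib.parse import urlparse

def summarize_chain(hops: list) -> dict:
    """Classify one redirect chain (a list of {"url", "status"} hops) using the table's rules."""
    final = hops[-1]
    return {
        "hops": len(hops),
        "long_chain": len(hops) > 2,                                  # wastes crawl budget
        "has_temporary_redirect": any(h["status"] == 302 for h in hops[:-1]),
        "unreachable": final["status"] != 200,                        # loop or dead end
        "domain_changed": urlparse(hops[0]["url"]).netloc != urlparse(final["url"]).netloc,
    }

# Example: a 3-hop chain with a temporary redirect in the middle
chain = [
    {"url": "http://example.com/old", "status": 301},
    {"url": "http://example.com/tmp", "status": 302},
    {"url": "http://example.com/new", "status": 200},
]
print(summarize_chain(chain))
```

Running this per original URL gives you a one-row verdict per chain, which is easier to filter than the flat hop list in `redirect_chains.csv`.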
What You Have Now
You can audit thousands of pages for technical issues without Screaming Frog’s GUI. You can version control your audits in CSV format and automate them via cron or GitHub Actions.
But technical health is only half the battle. SEO is competitive. You need to understand what content structure and keywords competitors use to rank. The next section shows how to extract competitor headings, calculate keyword density, and identify schema markup patterns.
Scraping Competitor Websites for Content Analysis
While technical health is the foundation, content drives rankings. To outrank a competitor, you need to understand what they publish and how they structure it. This section extracts heading structures, keyword patterns, and schema markup from top-ranking pages.
Content Structure Analysis (H2/H3 & Word Count)
Search engines value comprehensive coverage. By scraping the heading hierarchy (H1-H6) of top-ranking pages, you can create a “master outline” that covers all subtopics your competitors address.
This function extracts the skeleton of any article, allowing you to visualize its logical flow and depth.
import httpx
from parsel import Selector
import pandas as pd
import time
import os
from urllib.parse import urlparse
OUTPUT_DIR = "output/3_structure"
COMPETITOR_URLS = [
"https://books.toscrape.com/",
"https://hasdata.com/blog/python-for-seo"
]
def extract_content_structure(url: str) -> dict:
"""
Extracts content structure from a single URL.
Returns a dict with summary stats and heading lists.
"""
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}
try:
response = httpx.get(url, headers=headers, follow_redirects=True, timeout=10.0)
if response.status_code != 200:
return {"url": url, "error": f"HTTP {response.status_code}"}
selector = Selector(text=response.text)
return {
"url": url,
"h1": selector.xpath("//h1//text()").get(),
"h2_list": [h.strip() for h in selector.xpath("//h2//text()").getall() if h.strip()],
"h3_list": [h.strip() for h in selector.xpath("//h3//text()").getall() if h.strip()],
"h2_count": len(selector.xpath("//h2").getall()),
"h3_count": len(selector.xpath("//h3").getall()),
"word_count": len(" ".join(selector.xpath("//body//text()").getall()).split()),
"paragraph_count": len(selector.xpath("//p").getall()),
"internal_links": len(selector.xpath("//a[starts-with(@href, '/')]").getall()),
"external_links": len(selector.xpath("//a[starts-with(@href, 'http')]").getall()),
}
except Exception as e:
return {"url": url, "error": str(e)}
def run_content_structure_audit(urls: list = None, output_dir: str = OUTPUT_DIR) -> list:
"""
Analyzes content structure for a list of URLs.
For each URL saves two files in {output_dir}/{domain}/:
- summary.csv — one row with counts (word_count, h2_count, etc.)
- headings.csv — one row per heading (level | text)
Also saves a combined structure_summary.csv across all URLs.
Returns list of result dicts.
"""
if urls is None:
urls = COMPETITOR_URLS
os.makedirs(output_dir, exist_ok=True)
results = []
for url in urls:
print(f" Analyzing: {url}")
data = extract_content_structure(url)
results.append(data)
# Per-site subfolder
domain = urlparse(url).netloc.replace(".", "_")
site_dir = os.path.join(output_dir, domain)
os.makedirs(site_dir, exist_ok=True)
# Summary row (no list columns)
summary_row = {k: v for k, v in data.items() if k not in ("h2_list", "h3_list")}
pd.DataFrame([summary_row]).to_csv(os.path.join(site_dir, "summary.csv"), index=False)
# Headings flat (one heading per row)
headings_rows = []
for h2 in data.get("h2_list", []):
headings_rows.append({"level": "H2", "text": h2})
for h3 in data.get("h3_list", []):
headings_rows.append({"level": "H3", "text": h3})
pd.DataFrame(headings_rows).to_csv(os.path.join(site_dir, "headings.csv"), index=False)
print(f"Saved → {site_dir}/summary.csv + headings.csv")
time.sleep(1)
# Combined summary across all URLs
all_summary = [{k: v for k, v in r.items() if k not in ("h2_list", "h3_list")} for r in results]
combined_path = os.path.join(output_dir, "structure_summary.csv")
pd.DataFrame(all_summary).to_csv(combined_path, index=False)
print(f"Combined summary → {combined_path}\n")
return results
if __name__ == "__main__":
    run_content_structure_audit()
What This Reveals:
| Insight | Detection Method | Action |
|---|---|---|
| Missing topic sections | H2 appears in 3+ competitors but not in your content | Add that section to your article |
| Content depth gap | Competitors average 2,500 words, you have 800 | Expand content or add subsections |
| Multimedia deficit | Competitors use 8-12 images, you have 2 | Add diagrams, screenshots, or charts |
| Weak internal linking | Competitors link to 5-10 related pages, you link to 1 | Build topic cluster with internal links |
For JavaScript-rendered competitor sites, use HasData’s Web Scraping API with render: true to get the full DOM.
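The "missing topic sections" check above can be automated by counting how many competitors share each normalized H2. A sketch over in-memory heading lists (in practice you would load each site's `headings.csv`; the sample headings are illustrative):

```python
from collections import Counter

def common_headings(heading_lists: list, min_sites: int = 2) -> list:
    """Return (heading, site_count) pairs for headings shared by at least min_sites sites."""
    counts = Counter()
    for headings in heading_lists:
        # Dedupe within a site so one page repeating a heading counts once
        for h in {h.strip().lower() for h in headings}:
            counts[h] += 1
    return [(h, n) for h, n in counts.most_common() if n >= min_sites]

competitors = [
    ["What Is SEO Scraping", "Rank Tracking", "Legal Considerations"],
    ["Rank Tracking", "SERP Features", "legal considerations"],
    ["Rank Tracking", "Proxies and CAPTCHAs"],
]
print(common_headings(competitors, min_sites=2))
# → [('rank tracking', 3), ('legal considerations', 2)]
```

Any heading on this list that is absent from your own outline is a candidate section for your article.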
Keyword Extraction (N-Gram Analysis)
Competitors use specific phrases repeatedly. These are often called LSI (Latent Semantic Indexing) keywords, though Google does not literally use LSI; in practice, they reveal the semantic context search engines associate with your topic.
This script extracts the 20 most frequent 2-word and 3-word phrases from competitor content. You compare this list against your own content to find missing terminology.
import httpx
from parsel import Selector
from collections import Counter
import re
import pandas as pd
import os
from urllib.parse import urlparse
OUTPUT_DIR = "output/4_keywords"
DEFAULT_URL = "https://hasdata.com/blog/python-for-seo"
STOPWORDS = {
"the", "is", "at", "which", "on", "a", "an", "as", "are", "was", "were",
"be", "been", "in", "of", "to", "for", "with", "and", "or", "but", "not",
"this", "that", "by", "from", "it", "can", "will", "you", "your", "has",
"have", "had", "all", "its", "our", "we", "do", "so", "if", "about",
"what", "how", "more", "than", "when", "their", "also", "into", "other",
}
def extract_ngrams(text: str, n: int = 2, top_n: int = 20) -> list:
"""Extracts top N n-grams from text after filtering stopwords."""
words = re.findall(r"\b[a-z]+\b", text.lower())
words = [w for w in words if w not in STOPWORDS and len(w) > 2]
ngrams = zip(*[words[i:] for i in range(n)])
ngram_list = [" ".join(gram) for gram in ngrams]
return Counter(ngram_list).most_common(top_n)
def analyze_competitor_keywords(url: str) -> dict:
"""
Fetches a page and extracts top bigrams and trigrams from body text.
Returns dict with url, bigrams, trigrams.
"""
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}
try:
response = httpx.get(url, headers=headers, timeout=15.0)
selector = Selector(text=response.text)
body_text = " ".join(selector.xpath("//article//text() | //main//text()").getall())
if not body_text.strip():
body_text = " ".join(selector.xpath("//body//p//text()").getall())
return {
"url": url,
"bigrams": extract_ngrams(body_text, n=2, top_n=20),
"trigrams": extract_ngrams(body_text, n=3, top_n=15),
}
except Exception as e:
return {"url": url, "error": str(e)}
def run_keyword_extraction(url: str = None, output_dir: str = OUTPUT_DIR) -> dict:
"""
Runs keyword extraction for a URL.
Saves bigrams and trigrams to {output_dir}/keywords_{domain}.csv
Returns result dict.
"""
if url is None:
url = DEFAULT_URL
os.makedirs(output_dir, exist_ok=True)
result = analyze_competitor_keywords(url)
if "error" in result:
print(f"Error: {result['error']}")
return result
domain = urlparse(url).netloc.replace(".", "_")
rows = []
for phrase, count in result.get("bigrams", []):
rows.append({"type": "bigram", "phrase": phrase, "count": count})
for phrase, count in result.get("trigrams", []):
rows.append({"type": "trigram", "phrase": phrase, "count": count})
df = pd.DataFrame(rows)
out_path = os.path.join(output_dir, f"keywords_{domain}.csv")
df.to_csv(out_path, index=False)
print(f"Top 2-Word Phrases:")
for phrase, count in result["bigrams"]:
print(f" {phrase}: {count}")
print(f"\nTop 3-Word Phrases:")
for phrase, count in result["trigrams"]:
print(f" {phrase}: {count}")
print(f"\nSaved → {out_path}\n")
return result
if __name__ == "__main__":
    run_keyword_extraction()
Example Output:
Top 2-Word Phrases:
web scraping: 47
search results: 34
rank tracking: 28
serp features: 22
organic rankings: 19
Top 3-Word Phrases:
google search results: 18
people also ask: 15
web scraping api: 12
How to Use This:
- Run this script on your top 3 competitors.
- Export their keyword lists to CSV.
- Compare against your own content using Ctrl+F in your draft.
- If competitors mention ‘serp features’ frequently and you do not, you may be underweighting a core topic.
This is not keyword stuffing. This is semantic alignment. Google expects certain terminology in comprehensive content. Missing it signals incomplete coverage.
This is frequency-based analysis, not TF-IDF. For weighted keyword importance across multiple competitors, see Script 4 in our Python SEO guide.
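The manual Ctrl+F comparison can also be scripted. A minimal sketch that diffs a competitor's phrase list against your own draft; the phrase counts and draft text here are illustrative, and the phrase list is shaped like the (phrase, count) pairs `extract_ngrams` returns:

```python
import re

def keyword_gaps(competitor_phrases: list, your_text: str) -> list:
    """Return competitor phrases that never appear in your content."""
    # Collapse whitespace so multi-word phrases match across line breaks
    text = re.sub(r"\s+", " ", your_text.lower())
    return [phrase for phrase, _count in competitor_phrases if phrase not in text]

competitor_bigrams = [("web scraping", 47), ("rank tracking", 28), ("serp features", 22)]
your_draft = "Our guide covers web scraping and rank tracking basics."
print(keyword_gaps(competitor_bigrams, your_draft))
# → ['serp features']
```

Substring matching is deliberately crude; it surfaces candidates for review rather than a definitive gap list.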
Schema Markup Extraction
Structured data (schema.org markup) helps pages appear in rich snippets, PAA boxes, and knowledge panels. If your competitors implement FAQPage or HowTo schema and you do not, you are invisible in those SERP features.
This script extracts all JSON-LD structured data from a competitor page.
import httpx
from parsel import Selector
import json
import os
from urllib.parse import urlparse
OUTPUT_DIR = "output/5_markup"
DEFAULT_URL = "https://hasdata.com/blog/python-for-seo"
def extract_schema(url: str) -> dict:
"""
Extracts all JSON-LD schema blocks from a page.
Returns dict with url, schemas list, and schema_types list.
"""
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}
try:
response = httpx.get(url, headers=headers, timeout=15.0)
selector = Selector(text=response.text)
raw_schemas = selector.xpath("//script[@type='application/ld+json']/text()").getall()
parsed_schemas = []
for raw in raw_schemas:
try:
parsed_schemas.append(json.loads(raw))
except json.JSONDecodeError:
continue
return {
"url": url,
"schema_count": len(parsed_schemas),
"schema_types": [s.get("@type") for s in parsed_schemas if "@type" in s],
"schemas": parsed_schemas,
}
except Exception as e:
return {"url": url, "error": str(e)}
def run_markup_extraction(url: str = None, output_dir: str = OUTPUT_DIR) -> dict:
"""
Extracts schema markup from a URL and saves to JSON.
Saves to {output_dir}/schema_{domain}.json
Returns result dict.
"""
if url is None:
url = DEFAULT_URL
os.makedirs(output_dir, exist_ok=True)
result = extract_schema(url)
if "error" in result:
print(f"Error: {result['error']}")
return result
print(f"Schema Types Found: {result['schema_types']}")
for schema in result["schemas"]:
schema_type = schema.get("@type", "Unknown")
print(f"\n--- {schema_type} ---")
print(json.dumps(schema, indent=2)[:400])
domain = urlparse(url).netloc.replace(".", "_")
out_path = os.path.join(output_dir, f"schema_{domain}.json")
with open(out_path, "w", encoding="utf-8") as f:
json.dump(result, f, indent=2, ensure_ascii=False)
print(f"\nSaved → {out_path}\n")
return result
if __name__ == "__main__":
    run_markup_extraction()
Example Output:
Schema Types Found: ['Article', 'FAQPage', 'BreadcrumbList']
--- Article Schema ---
{
"@context": "https://schema.org",
"@type": "Article",
"headline": "The Complete Guide to SEO Scraping",
"author": {
"@type": "Person",
"name": "John Doe"
},
"datePublished": "2026-01-15"
}
--- FAQPage Schema ---
{
"@context": "https://schema.org",
"@type": "FAQPage",
"mainEntity": [
{
"@type": "Question",
"name": "What is SEO scraping?",
"acceptedAnswer": {
"@type": "Answer",
"text": "SEO scraping is..."
}
}
]
}
What This Tells You:
| Schema Type | Purpose | Action |
|---|---|---|
| Article | Marks content as editorial article | Add to your blog posts |
| FAQPage | Enables PAA-style rich snippets | Add if you have a Q&A section |
| HowTo | Shows step-by-step instructions in SERP | Use for tutorial content |
| Product | Displays price, reviews, availability | Required for e-commerce |
| BreadcrumbList | Shows site hierarchy in SERP | Improves site navigation signals |
If your top-ranking competitor has FAQPage schema and appears in the “People Also Ask” box, their schema is working. You should implement the same structure.
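Comparing schema coverage reduces to a set difference over the `schema_types` lists the extractor returns. A sketch; the example type lists are illustrative, and the flattening step accounts for the fact that JSON-LD `@type` may be a string or a list:

```python
def schema_gap(competitor_types: list, your_types: list) -> set:
    """Schema types the competitor declares that you do not."""
    def flatten(types):
        out = set()
        for t in types:
            if isinstance(t, list):  # "@type": ["Article", "BlogPosting"]
                out.update(t)
            else:
                out.add(t)
        return out
    return flatten(competitor_types) - flatten(your_types)

print(schema_gap(["Article", "FAQPage", "BreadcrumbList"], ["Article"]))
# A set, so print order may vary: {'FAQPage', 'BreadcrumbList'}
```

Run this against each top-ranking competitor and prioritize the types that recur across several of them.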
What You Have Now
You can audit competitor content structure, extract their keyword patterns, and reverse-engineer their schema strategy. This data tells you exactly what topics to cover, which phrases to emphasize, and which structured data to implement.
But content lives on the SERP. The next section shows how to scrape Google Search results directly to see what ranks, what AI Overviews say, and what questions users ask.
Scraping Google Search Results for Strategic Intelligence
Analyzing your own pages is useful, but the real ROI comes from analyzing the search results themselves. You cannot optimize for rankings you do not measure. Google SERP is the most valuable data source in SEO. It reveals three layers of intelligence: what ranks (organic), what Google synthesizes (AI Overviews), and what users want to know next (PAA + Related Searches).
We will use the HasData Google SERP API. It returns structured JSON, eliminating the need to parse volatile HTML or manage headless browsers for JavaScript rendering.
Extracting Organic Rankings
This is your baseline rank tracker. For every keyword, you need position, URL, title, and snippet to detect title rewrites and identify competitors.
import requests
import pandas as pd
import os
OUTPUT_DIR = "output/6_rankings"
API_KEY = "HASDATA_API_KEY" # https://app.hasdata.com/
KEYWORDS = [
"python web scraping tutorial",
"seo audit tools",
]
def scrape_organic_results(keyword: str, api_key: str = None) -> list:
"""
Fetches organic search results for a keyword via HasData SERP API.
Returns list of result dicts (position, title, url, snippet, domain).
"""
if api_key is None:
api_key = API_KEY
url = "https://api.hasdata.com/scrape/google/serp"
headers = {"x-api-key": api_key}
params = {
"q": keyword,
"gl": "us",
"hl": "en",
"location": "New York,New York,United States",
}
try:
response = requests.get(url, headers=headers, params=params, timeout=30)
if response.status_code != 200:
print(f" Error {response.status_code}: {response.text[:200]}")
return []
data = response.json()
organic = data.get("organicResults", [])
return [
{
"keyword": keyword,
"position": item.get("position"),
"title": item.get("title"),
"url": item.get("link"),
"snippet": item.get("snippet"),
"domain": item.get("displayedLink", "").split("/")[0],
}
for item in organic
]
except Exception as e:
print(f" Exception: {e}")
return []
def run_organic_rankings(
keywords: list = None, api_key: str = None, output_dir: str = OUTPUT_DIR
) -> list:
"""
Scrapes organic rankings for a list of keywords.
Saves one CSV per keyword to {output_dir}/organic_{keyword}.csv
Also saves a combined file: organic_all.csv
Returns flat list of all result dicts.
"""
if keywords is None:
keywords = KEYWORDS
os.makedirs(output_dir, exist_ok=True)
all_data = []
for kw in keywords:
print(f" Scraping: {kw}")
results = scrape_organic_results(kw, api_key=api_key)
all_data.extend(results)
if results:
safe_kw = kw.replace(" ", "_")
out_path = os.path.join(output_dir, f"organic_{safe_kw}.csv")
pd.DataFrame(results).to_csv(out_path, index=False)
print(f" {len(results)} results → {out_path}")
if all_data:
combined_path = os.path.join(output_dir, "organic_all.csv")
pd.DataFrame(all_data).to_csv(combined_path, index=False)
print(f"\nCombined: {len(all_data)} rows → {combined_path}\n")
return all_data
if __name__ == "__main__":
    run_organic_rankings()
What this tells you:
- Your current position for target keywords.
- Title tag patterns Google rewards (length, structure, keyword placement).
- Snippet length distribution (short vs. detailed).
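The title-pattern question above is easy to quantify once the rankings CSV exists. A sketch over a few in-memory rows shaped like the scraper's output (the rows are illustrative; in practice load `organic_all.csv`):

```python
import pandas as pd

rows = [
    {"keyword": "seo audit tools", "position": 1, "title": "10 Best SEO Audit Tools (2025)"},
    {"keyword": "seo audit tools", "position": 2, "title": "SEO Audit Tools Compared"},
    {"keyword": "seo audit tools", "position": 8, "title": "Free SEO Audit Checklist"},
]
df = pd.DataFrame(rows)
df["title_length"] = df["title"].str.len()
df["top_3"] = df["position"] <= 3  # split results into top-3 vs the rest

# Average title length for top-3 results vs lower positions
print(df.groupby("top_3")["title_length"].mean())
```

With real data across many keywords, this kind of grouping shows whether top-ranking titles in your niche skew shorter, longer, or front-load the keyword.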
Extracting AI Overviews
AI Overviews appear on roughly 30% of Google searches, depending on query intent. If Google synthesizes an answer above your organic result, you must analyze its content and citations.
Monitor whether your domain is cited in AI Overviews for your target keywords. Track the narrative Google is presenting.
import requests
import json
import os
OUTPUT_DIR = "output/7_ai_overviews"
API_KEY = "HASDATA_API_KEY" # https://app.hasdata.com/
DEFAULT_KEYWORD = "what is seo scraping"
def get_ai_overview(keyword: str, api_key: str = None) -> dict | None:
"""
Fetches the AI Overview block for a keyword from Google SERP.
Returns the aiOverview dict if present, otherwise None.
"""
if api_key is None:
api_key = API_KEY
url = "https://api.hasdata.com/scrape/google/serp"
params = {"q": keyword, "location": "United States", "gl": "us", "hl": "en"}
headers = {"x-api-key": api_key}
try:
response = requests.get(url, params=params, headers=headers, timeout=30)
if response.status_code != 200:
print(f" Error {response.status_code}: {response.text[:200]}")
return None
data = response.json()
return data.get("aiOverview") or None
except Exception as e:
print(f" Exception: {e}")
return None
def run_ai_overviews(
keyword: str = None, api_key: str = None, output_dir: str = OUTPUT_DIR
) -> dict | None:
"""
Fetches and displays the AI Overview for a keyword.
Saves result to {output_dir}/ai_overview_{keyword}.json
Returns the aiOverview dict or None.
"""
if keyword is None:
keyword = DEFAULT_KEYWORD
os.makedirs(output_dir, exist_ok=True)
ai_overview = get_ai_overview(keyword, api_key=api_key)
if ai_overview:
print(f"\n--- AI Overview for '{keyword}' ---")
print(f"Summary: {ai_overview.get('text', 'N/A')}\n")
sources = ai_overview.get("sources", [])
if sources:
print("Cited Sources:")
for src in sources:
print(f" - {src.get('title')}: {src.get('link')}")
safe_kw = keyword.replace(" ", "_")
out_path = os.path.join(output_dir, f"ai_overview_{safe_kw}.json")
with open(out_path, "w", encoding="utf-8") as f:
json.dump({"keyword": keyword, "ai_overview": ai_overview}, f, indent=2, ensure_ascii=False)
print(f"\nSaved → {out_path}\n")
else:
print(f"No AI Overview found for '{keyword}'")
return ai_overview
if __name__ == "__main__":
    run_ai_overviews()
Use Case:
- Audit which competitor domains are cited most frequently across your keyword set.
- Reverse-engineer their content patterns (depth, citation density, schema usage).
- Track when AI Overviews appear or disappear for your keywords (indicates query intent shifts).
For tracking AI Overview citations at scale across hundreds of keywords, see our guide on Monitoring AI Overviews.
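Auditing which domains get cited most often is a counting exercise over the `sources` lists. A sketch with illustrative source dicts, shaped like the `aiOverview` payload the script saves per keyword:

```python
from collections import Counter
from urllib.parse import urlparse

def citation_counts(overviews: list) -> Counter:
    """Count cited domains across a list of aiOverview dicts."""
    counts = Counter()
    for ov in overviews:
        for src in ov.get("sources", []):
            domain = urlparse(src.get("link", "")).netloc
            if domain:
                counts[domain] += 1
    return counts

# One dict per keyword's AI Overview (illustrative links)
overviews = [
    {"sources": [{"link": "https://hasdata.com/blog/a"}, {"link": "https://moz.com/x"}]},
    {"sources": [{"link": "https://hasdata.com/blog/b"}]},
]
print(citation_counts(overviews).most_common())
# → [('hasdata.com', 2), ('moz.com', 1)]
```

Domains that dominate this count across your keyword set are the content patterns worth reverse-engineering first.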
Extracting People Also Ask (PAA)
PAA questions are intent signals. They tell you what subtopics users expect in a complete answer.
import requests
import pandas as pd
import os
OUTPUT_DIR = "output/8_paa"
API_KEY = "HASDATA_API_KEY" # https://app.hasdata.com/
DEFAULT_KEYWORD = "how to do seo for a website"
def scrape_paa(keyword: str, api_key: str = None) -> list:
"""
Fetches People Also Ask questions for a keyword.
Returns list of dicts: keyword, question, snippet, source.
"""
if api_key is None:
api_key = API_KEY
url = "https://api.hasdata.com/scrape/google/serp"
headers = {"x-api-key": api_key}
params = {"q": keyword, "gl": "us", "hl": "en"}
try:
response = requests.get(url, headers=headers, params=params, timeout=30)
if response.status_code != 200:
print(f" Error {response.status_code}: {response.text[:200]}")
return []
data = response.json()
questions = []
for item in data.get("relatedQuestions", []):
questions.append(
{
"keyword": keyword,
"question": item.get("question"),
"snippet": item.get("snippet"),
"source": item.get("link"),
}
)
return questions
except Exception as e:
print(f" Exception: {e}")
return []
def run_people_also_ask(
keyword: str = None, api_key: str = None, output_dir: str = OUTPUT_DIR
) -> list:
"""
Fetches PAA questions for a keyword, prints and saves them.
Saves to {output_dir}/paa_{keyword}.csv
Returns list of question dicts.
"""
if keyword is None:
keyword = DEFAULT_KEYWORD
os.makedirs(output_dir, exist_ok=True)
paa_data = scrape_paa(keyword, api_key=api_key)
if not paa_data:
print(f"No PAA questions found for '{keyword}'")
return []
print(f"\n--- People Also Ask: '{keyword}' ---")
for q in paa_data:
print(f"Q: {q['question']}")
snippet = (q.get("snippet") or "")[:120]
print(f"A: {snippet}...\n")
safe_kw = keyword.replace(" ", "_")
out_path = os.path.join(output_dir, f"paa_{safe_kw}.csv")
pd.DataFrame(paa_data).to_csv(out_path, index=False)
print(f"Saved {len(paa_data)} questions → {out_path}\n")
return paa_data
if __name__ == "__main__":
    run_people_also_ask()
Use Case:
- Map PAA questions to H2/H3 sections in your content.
- Identify semantic gaps (questions competitors answer but you do not).
- Monitor PAA volatility (new questions appearing indicates shifting user interest).
To build a recursive PAA topic graph (scraping questions within questions), refer to Script 3 in this guide.
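Mapping PAA questions to your existing headings can be approximated with simple token overlap, with no NLP dependencies. A sketch; the headings, questions, and overlap threshold are all illustrative:

```python
import re

def covers(question: str, headings: list, min_overlap: int = 2) -> bool:
    """True if any heading shares at least min_overlap content words with the question."""
    def tokens(s):
        # Keep only longer words so stopwords ("is", "a", "to") don't inflate overlap
        return {w for w in re.findall(r"[a-z]+", s.lower()) if len(w) > 3}
    q = tokens(question)
    return any(len(q & tokens(h)) >= min_overlap for h in headings)

headings = ["How to Audit Website Metadata", "Tracking Keyword Rankings"]
questions = ["How do I audit metadata on my website?", "What is a good SEO budget?"]
uncovered = [q for q in questions if not covers(q, headings)]
print(uncovered)
# → ['What is a good SEO budget?']
```

Uncovered questions are direct candidates for new H2/H3 sections (or an FAQ block).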
Extracting Related Searches
Related Searches appear at the bottom of the SERP. They represent semantic variations and adjacent intents. These are keyword expansion opportunities.
Use Case: Discover long-tail modifiers. Identify cluster topics for pillar pages.
import requests
import pandas as pd
import os
OUTPUT_DIR = "output/9_related"
API_KEY = "HASDATA_API_KEY" # https://app.hasdata.com/
DEFAULT_KEYWORD = "seo scraping tools"
def get_related_searches(keyword: str, api_key: str = None) -> list:
"""
Fetches related search queries for a keyword from Google SERP.
Returns a list of query strings.
"""
if api_key is None:
api_key = API_KEY
url = "https://api.hasdata.com/scrape/google/serp"
params = {"q": keyword, "location": "United States", "gl": "us", "hl": "en"}
headers = {"x-api-key": api_key}
try:
response = requests.get(url, params=params, headers=headers, timeout=30)
if response.status_code != 200:
print(f" Error {response.status_code}: {response.text[:200]}")
return []
data = response.json()
return [item.get("query") for item in data.get("relatedSearches", []) if item.get("query")]
except Exception as e:
print(f" Exception: {e}")
return []
def run_related_searches(
keyword: str = None, api_key: str = None, output_dir: str = OUTPUT_DIR
) -> list:
"""
Fetches related searches for a keyword, prints and saves them.
Saves to {output_dir}/related_{keyword}.csv
Returns list of query strings.
"""
if keyword is None:
keyword = DEFAULT_KEYWORD
os.makedirs(output_dir, exist_ok=True)
queries = get_related_searches(keyword, api_key=api_key)
if not queries:
print(f"No related searches found for '{keyword}'")
return []
print(f"\n--- Related Searches for '{keyword}' ---")
for q in queries:
print(f" - {q}")
safe_kw = keyword.replace(" ", "_")
out_path = os.path.join(output_dir, f"related_{safe_kw}.csv")
pd.DataFrame({"keyword": keyword, "related_query": queries}).to_csv(out_path, index=False)
print(f"\nSaved {len(queries)} queries → {out_path}\n")
return queries
if __name__ == "__main__":
    run_related_searches()
Use Case:
- Identify content clusters (group related terms with high overlap).
- Discover informational vs. transactional modifiers.
- Build internal linking architecture around these semantic connections.
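A first pass at clustering related queries is a modifier count: strip the seed keyword's words and tally what remains. A sketch with illustrative queries (in practice, feed it the list `get_related_searches` returns):

```python
from collections import Counter

def modifier_counts(seed: str, queries: list) -> Counter:
    """Count the leftover words once the seed keyword's words are removed."""
    seed_words = set(seed.lower().split())
    counts = Counter()
    for q in queries:
        for w in q.lower().split():
            if w not in seed_words:
                counts[w] += 1
    return counts

queries = [
    "best seo scraping tools",
    "free seo scraping tools",
    "seo scraping tools python",
    "best free serp scraper",
]
print(modifier_counts("seo scraping tools", queries).most_common(3))
```

Frequent modifiers like "best" or "free" signal commercial and cost-sensitive intents, each a candidate cluster page under the pillar topic.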
Complete Workflow Example
Combine all four data sources into one audit function.
import json
import os
from datetime import datetime
from urllib.parse import urlparse
from _1_metadata_extractor import scrape_metadata, run_metadata_audit
from _2_redirect_chain_detector import check_redirect_chain, run_redirect_audit
from _3_content_structure_analyzes import extract_content_structure, run_content_structure_audit
from _4_keyword_extraction import analyze_competitor_keywords, run_keyword_extraction
from _5_markup_extraction import extract_schema, run_markup_extraction
from _6_organic_rankings import scrape_organic_results, run_organic_rankings
from _7_ai_overviews import get_ai_overview, run_ai_overviews
from _8_people_also_ask import scrape_paa, run_people_also_ask
from _9_related_searches import get_related_searches, run_related_searches
OUTPUT_DIR = "output"
API_KEY = "HASDATA_API_KEY" # https://app.hasdata.com/
# MODE 1 — SERP AUDIT (keyword → scripts 6, 7, 8, 9)
def serp_audit(keyword: str, api_key: str = None) -> dict:
"""
Runs a full SERP audit for a keyword.
"""
if api_key is None:
api_key = API_KEY
_header(f"SERP AUDIT: '{keyword}'")
print("[6/9] Organic Rankings")
organic = run_organic_rankings(keywords=[keyword], api_key=api_key)
print("[7/9] AI Overview")
ai_overview = run_ai_overviews(keyword=keyword, api_key=api_key)
print("[8/9] People Also Ask")
paa = run_people_also_ask(keyword=keyword, api_key=api_key)
print("[9/9] Related Searches")
related = run_related_searches(keyword=keyword, api_key=api_key)
summary = {
"mode": "serp_audit",
"keyword": keyword,
"timestamp": datetime.now().isoformat(),
"organic_results": len(organic),
"ai_overview_triggered": bool(ai_overview),
"paa_questions": len(paa),
"related_searches": len(related),
"top_organic_urls": [r.get("url") for r in organic[:5]],
"paa_questions_list": [q.get("question") for q in paa],
"related_queries": related,
}
_save_summary(summary, label=keyword)
_footer_serp(summary)
return summary
# MODE 2 — SITE AUDIT (url → scripts 1, 2, 3, 4, 5)
def site_audit(url: str) -> dict:
"""
Runs a full on-page audit for a single URL.
"""
domain = urlparse(url).netloc
_header(f"SITE AUDIT: {url}")
print("[1/5] Metadata")
metadata_list = run_metadata_audit(urls=[url])
metadata = metadata_list[0] if metadata_list else {}
print("[2/5] Redirect Chain")
chain = check_redirect_chain(url)
    redirect_hops = len(chain)
    has_redirect_issue = redirect_hops > 2  # flag anything beyond a single redirect (origin → destination)
print("[3/5] Content Structure")
structure_list = run_content_structure_audit(urls=[url])
structure = structure_list[0] if structure_list else {}
print("[4/5] Keyword Extraction")
keywords = run_keyword_extraction(url=url)
print("[5/5] Schema Markup")
schema = run_markup_extraction(url=url)
summary = {
"mode": "site_audit",
"url": url,
"domain": domain,
"timestamp": datetime.now().isoformat(),
# Metadata
"title": metadata.get("title"),
"title_length": metadata.get("title_length"),
"meta_desc_length": metadata.get("meta_desc_length"),
"h1": metadata.get("h1"),
"word_count": structure.get("word_count"),
# Redirects
"redirect_hops": redirect_hops,
"redirect_warning": has_redirect_issue,
# Structure
"h2_count": structure.get("h2_count"),
"h3_count": structure.get("h3_count"),
"internal_links": structure.get("internal_links"),
"external_links": structure.get("external_links"),
# Schema
"schema_types": schema.get("schema_types", []),
# Keywords
"top_bigrams": [phrase for phrase, _ in (keywords.get("bigrams") or [])[:5]],
"top_trigrams": [phrase for phrase, _ in (keywords.get("trigrams") or [])[:5]],
}
_save_summary(summary, label=domain)
_footer_site(summary)
return summary
# MODE 3 — FULL AUDIT (keyword + url → all 9 scripts)
def full_audit(keyword: str, url: str, api_key: str = None) -> dict:
"""
Runs both serp_audit and site_audit.
Returns a combined summary dict with both results.
"""
if api_key is None:
api_key = API_KEY
serp_summary = serp_audit(keyword=keyword, api_key=api_key)
site_summary = site_audit(url=url)
combined = {
"mode": "full_audit",
"keyword": keyword,
"url": url,
"timestamp": datetime.now().isoformat(),
"serp": serp_summary,
"site": site_summary,
}
_save_summary(combined, label=f"full_{keyword.replace(' ', '_')}")
return combined
# Helpers
def _header(title: str) -> None:
print(f"\n{'═' * 60}")
print(f" {title}")
print(f"{'═' * 60}\n")
def _save_summary(summary: dict, label: str) -> None:
os.makedirs(OUTPUT_DIR, exist_ok=True)
safe = label.replace(" ", "_").replace("/", "_").replace(".", "_")
path = os.path.join(OUTPUT_DIR, f"summary_{safe}.json")
with open(path, "w", encoding="utf-8") as f:
json.dump(summary, f, indent=2, ensure_ascii=False)
print(f"\nSummary saved → {path}")
def _footer_serp(s: dict) -> None:
print(f"\n{'═' * 60}")
print(f" SERP AUDIT DONE — '{s['keyword']}'")
print(f" Organic : {s['organic_results']} results")
print(f" AI Overview : {'Yes' if s['ai_overview_triggered'] else 'No'}")
print(f" PAA : {s['paa_questions']} questions")
print(f" Related : {s['related_searches']} queries")
print(f"{'═' * 60}\n")
def _footer_site(s: dict) -> None:
print(f"\n{'═' * 60}")
print(f" SITE AUDIT DONE — {s['url']}")
print(f" Title : {s['title']} ({s['title_length']} chars)")
print(f" Words : {s['word_count']}")
print(f" Headings : {s['h2_count']} H2 / {s['h3_count']} H3")
print(f" Redirects : {s['redirect_hops']} hop(s)" + (" ⚠" if s['redirect_warning'] else ""))
print(f" Schema : {s['schema_types'] or 'none'}")
print(f" Top bigrams : {', '.join(s['top_bigrams'])}")
print(f"{'═' * 60}\n")
# Entry point — edit mode and values below
if __name__ == "__main__":
DEMO_KEYWORD = "python web scraping"
DEMO_URL = "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
# Pick one:
# Mode 1 — SERP only (requires API key)
# serp_audit(keyword=DEMO_KEYWORD, api_key=API_KEY)
# Mode 2 — Site only (no API key needed)
# site_audit(url=DEMO_URL)
# Mode 3 — Full audit (requires API key)
    full_audit(keyword=DEMO_KEYWORD, url=DEMO_URL, api_key=API_KEY)
This gives you a snapshot of the entire competitive landscape in under 5 seconds. No manual copying. No screenshots. Just structured data ready for analysis.
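Because each run writes a `summary_*.json` file into `output/`, you can aggregate audits across keywords with a few lines of stdlib code. The sketch below writes one sample summary (hypothetical values, in the same shape `full_audit()` saves) and then flattens every saved summary into rows ready for CSV export or a dashboard.

```python
import glob
import json
import os

OUTPUT_DIR = "output"
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Sample summary in the shape full_audit() saves (values are illustrative).
sample = {
    "mode": "full_audit",
    "keyword": "python web scraping",
    "serp": {"organic_results": 10, "paa_questions": 4, "related_searches": 8},
}
with open(os.path.join(OUTPUT_DIR, "summary_full_python_web_scraping.json"), "w", encoding="utf-8") as f:
    json.dump(sample, f)

# Flatten every saved full-audit summary into one row per keyword.
rows = []
for path in glob.glob(os.path.join(OUTPUT_DIR, "summary_full_*.json")):
    with open(path, encoding="utf-8") as f:
        s = json.load(f)
    rows.append({"keyword": s["keyword"], **s["serp"]})

print(rows)
```

From here, `pd.DataFrame(rows).to_csv(...)` turns the whole audit history into a single analysis-ready table.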
For more advanced SERP automation (intent clustering, trend monitoring, content gap analysis), see our guide on 7 Python Scripts to Automate SEO in 2026.
When to Use Scraping vs Traditional Tools
Use scraping when you need:
- Real-time data (not cached databases)
- Custom location or device targeting
- Bulk analysis across thousands of keywords
- Integration with your internal dashboards
Use traditional tools (Ahrefs, Semrush) when you need:
- Historical trend data
- Backlink analysis
- Competitor domain overviews
SEO scraping transforms you from a passive user of third-party data into an owner of your own intelligence pipeline. Whether you are auditing technical health, reverse-engineering competitor content, or tracking real-time rankings, the scripts in this guide give you the raw capability to see what others miss.