
7 Python Scripts to Automate SEO in 2026

Sergey Ermakovich
6 Jan 2026

Writing Python code is no longer the primary challenge for SEOs in 2026. AI agents like Claude and Gemini generate functional scraping scripts in seconds. The real challenge for Technical SEOs has shifted from syntax to scale and access.

Most SEO scripts fail in production. They crash when processing large datasets or get blocked by anti-bot systems. Google’s updates in early 2025 made JavaScript rendering mandatory for accurate SERP data. Standard libraries like requests or urllib fail to retrieve complete data due to TLS fingerprinting and client-side rendering.

This guide provides seven Python scripts for production-grade SEO automation. We move beyond basic status code checkers. These scripts solve specific engineering challenges. You will learn to automate SERP intent clustering, monitor AI Overviews, and detect real-time search trends. 

The Tech Stack

We minimize dependencies to ensure stability and ease of deployment.

  • Python 3.12+: Leveraging modern features for concurrency and type safety.
  • HasData APIs: Handles headless browsing, proxy rotation, and CAPTCHA management. This replaces the maintenance-heavy Selenium/Puppeteer stack.
  • Requests & urllib3: Used with Session objects and HTTPAdapters for high-performance, persistent TCP connections.
  • Pandas & Scikit-Learn: For dataframe manipulation and N-Gram vectorization without heavy NLP pipelines.
  • Trafilatura: For extracting clean main body text from raw HTML, stripping ads and boilerplate.
  • Seaborn & Matplotlib: For generating similarity heatmaps and visual data analysis.

Why HasData Instead of Direct Requests?

Direct scraping is expensive to maintain. Google updates DOM selectors and anti-bot logic frequently. A SERP API offloads this complexity: you receive structured JSON instead of raw HTML, so your scripts keep working through core updates.

The API's JSON response provides the foundation for most of the scripts in this guide. It lets you extract AI Overviews, organic positions, and related queries without writing custom parsing logic.
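
For orientation, here is a minimal sketch of such a call. It uses the same SERP endpoint, header, and parameters the scripts below rely on; which top-level keys appear in the response depends on what the SERP actually contains.

import requests

API_KEY = "HASDATA_API_KEY"

# Minimal SERP request using the endpoint and headers shared by the scripts below
response = requests.get(
    "https://api.hasdata.com/scrape/google/serp",
    params={"q": "coffee", "location": "United States", "deviceType": "desktop"},
    headers={"x-api-key": API_KEY},
    timeout=30,
)
data = response.json()

# Structured blocks referenced throughout this guide (present only when the SERP contains them)
print(len(data.get("organicResults", [])))    # organic results as structured dicts
print(len(data.get("relatedQuestions", [])))  # People Also Ask questions
print("aiOverview" in data)                   # AI Overview block, if triggered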

Script 1: Recursive Google Autosuggest Scraper

Most SEOs rely on third-party databases (Ahrefs, Semrush) for keyword research. These databases are built on historical clickstream data, so they often miss long-tail queries and emerging search patterns.

Google Autosuggest contains real-time intent data, but extracting it at scale is difficult. Querying Google’s suggest endpoint 500 times in a minute from a local IP to build a semantic tree will trigger an immediate block.

The Code

We bypass this by combining the Google Suggest XML endpoint with the HasData API. By iterating through the alphabet (e.g., “keyword + a”, “keyword + b”), we force the expansion of the suggestion tree.

import requests
import xml.etree.ElementTree as ET
import string
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.parse import quote
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# --- CONFIGURATION ---
API_KEY = "HASDATA_API_KEY"
BASE_KEYWORD = "coffee"
MAX_DEPTH = 2  # Depth of recursion (e.g., 1 = 'coffee a', 2 = 'coffee ab')

# Set this according to your HasData plan's concurrency limit.
# For paid plans, this is 15–1500. For the trial plan, keep it at 1.
MAX_WORKERS = 1 

# --- SESSION SETUP (PERFORMANCE OPTIMIZATION) ---
# We use a global Session object to enable TCP connection reuse (Keep-Alive).
# This significantly reduces the overhead of establishing new SSL handshakes for every request.
session = requests.Session()

# Configure the HTTP Adapter with a connection pool.
# pool_maxsize must be greater than or equal to MAX_WORKERS to prevent blocking.
adapter = HTTPAdapter(
    pool_connections=1, 
    pool_maxsize=MAX_WORKERS + 5,  # Adding a small buffer
    max_retries=Retry(total=3, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
)

# Mount the adapter to both HTTP and HTTPS protocols
session.mount("https://", adapter)
session.mount("http://", adapter)

def fetch_suggestions(query):
    """
    Sends a request to Google Suggest via HasData API using a rotating proxy.
    """
    # 1. URL Encode the query to handle spaces and special characters correctly
    # 'coffee ad' becomes 'coffee%20ad'
    encoded_query = quote(query)
    target_url = f"https://suggestqueries.google.com/complete/search?output=toolbar&hl=en&q={encoded_query}"
    
    # HasData requires the API key in the headers
    headers = {
        "x-api-key": API_KEY,
        "Content-Type": "application/json"
    }
    
    # payload configuration
    payload = {
        "url": target_url,
        "jsRendering": False,      # JS is not needed for the XML endpoint, saves credits/time
        "outputFormat": ["html"]   # HasData will return the raw response body
    }

    try:
        # IMPORTANT: Use session.post() instead of requests.post() to leverage the connection pool
        response = session.post(
            "https://api.hasdata.com/scrape/web",
            headers=headers,
            json=payload,
            timeout=30
        )
        
        if response.status_code != 200:
            # Log errors for debugging
            print(f"Error {response.status_code} for '{query}': {response.text}")
            return []
        
        # 2. Parse the Response
        # Google Suggest returns XML. HasData passes this raw content in the response body.
        xml_content = response.content
        root = ET.fromstring(xml_content)
        
        suggestions = []
        for child in root:
            if len(child) > 0 and 'data' in child[0].attrib:
                suggestions.append(child[0].attrib['data'])
                
        return suggestions

    except ET.ParseError:
        return [] # Ignore parsing errors (e.g., malformed XML)
    except Exception as e:
        print(f"Exception for '{query}': {e}")
        return []

def generate_search_terms_suffix(base, current_suffix, depth, max_depth):
    """
    Recursively generates search terms by appending characters to a suffix.
    Example: 
    Depth 1: coffee a ... coffee z
    Depth 2: coffee aa ... coffee zz
    """
    terms = []
    for char in string.ascii_lowercase:
        new_suffix = current_suffix + char
        term = f"{base} {new_suffix}"
        terms.append(term)
        
        # Recursive call to go deeper if max_depth is not reached
        if depth < max_depth:
            terms.extend(generate_search_terms_suffix(base, new_suffix, depth + 1, max_depth))
            
    return terms

def run_harvest():
    print(f"Generating queries for Depth {MAX_DEPTH}...")
    queries = generate_search_terms_suffix(BASE_KEYWORD, "", 1, MAX_DEPTH)
    print(f"Total queries to process: {len(queries)}")

    results = set()
    start_time_global = time.time()

    # ThreadPoolExecutor manages concurrent execution
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        future_to_query = {executor.submit(fetch_suggestions, q): q for q in queries}
        
        count = 0
        total = len(queries)
        
        print(f"\nStarting harvest with {MAX_WORKERS} workers...")
        
        for future in as_completed(future_to_query):
            query = future_to_query[future]
            count += 1
            
            # Progress logging every 20 requests to monitor speed
            if count % 20 == 0:
                elapsed = time.time() - start_time_global
                rps = count / elapsed if elapsed > 0 else 0
                print(f"Progress: {count}/{total} | Speed: {rps:.2f} req/s")

            try:
                suggestions = future.result()
                if suggestions:
                    for s in suggestions:
                        results.add(s)
            except Exception:
                pass # Skip errors during mass scraping

    total_time = time.time() - start_time_global
    print(f"\nFinished in {total_time:.2f} seconds. Average Speed: {len(queries)/total_time:.2f} req/s")
    return results

if __name__ == "__main__":
    unique_keywords = run_harvest()
    
    filename = "long_tail_keywords_hasdata.csv"
    
    # Save results to CSV
    with open(filename, "w", encoding="utf-8") as f:
        f.write("Keyword\n")
        f.write("\n".join(unique_keywords))

    print(f"Done. Collected {len(unique_keywords)} unique keywords.")
    print(f"Saved to {filename}")

Execution Log

Generating queries for Depth 2...
Total queries to process: 702

Starting harvest with 15 workers...
Progress: 20/702 | Speed: 17.32 req/s

Progress: 680/702 | Speed: 33.84 req/s
Progress: 700/702 | Speed: 33.94 req/s

Finished in 19.30 seconds. Average Speed: 33.96 req/s
Done. Collected 5107 unique keywords.
Saved to long_tail_keywords_hasdata.csv

We optimize throughput by combining ThreadPoolExecutor with a global requests.Session. This enables TCP connection pooling (Keep-Alive), eliminating the latency of repeated SSL handshakes. With 15 concurrent workers, this architecture collects over 5,000 keywords in under 20 seconds.

Script 2: Google Trends Breakout Detector (Rising Queries)

While standard keyword tools provide lagging indicators, Google Trends offers leading indicators. To capture traffic before competitors do, you need to identify what users are searching for right now.

The “Rising” metric in related queries identifies terms with breakout growth (e.g., +3,450%), often before they register in volume databases.

The Code

We utilize the HasData Google Trends endpoint to bypass the instability of libraries like pytrends and avoid CAPTCHAs triggered by strict rate limiting. We request relatedQueries to extract specific search terms associated with the seed keyword.

import requests

# Configuration
API_KEY = "HASDATA_API_KEY"
SEED_TOPIC = "Coffee"

def get_breakout_trends(topic):
    """
    Fetches 'Rising' and 'Top' related queries from Google Trends.
    """
    # Endpoint documentation: https://docs.hasdata.com/apis/google-trends/search
    url = "https://api.hasdata.com/scrape/google-trends/search"
    
    # HasData requires the API key in headers
    headers = { "x-api-key": API_KEY, "Content-Type": "application/json" }

    params = {
        "q": topic,
        "dataType": "relatedQueries", # Requests both 'top' and 'rising' lists
        "geo": "US",                  # Optional: Target specific region
        "date": "now 7-d"             # Past 7 days
    }

    try:
        response = requests.get(url, headers=headers, params=params)
        
        if response.status_code == 200:
            return response.json()
        else:
            print(f"Error {response.status_code}: {response.text}")
            return None
            
    except Exception as e:
        print(f"Connection Error: {e}")
        return None

if __name__ == "__main__":
    print(f"Fetching breakout trends for: '{SEED_TOPIC}'...")
    data = get_breakout_trends(SEED_TOPIC)

    if data and "relatedQueries" in data:
        # 1. Rising Queries (Breakout topics)
        # These are terms with the highest growth rate recently
        rising = data["relatedQueries"].get("rising", [])
        print(f"\n--- Rising Queries (The Opportunity) ---")
        for item in rising[:10]: # Top 10 rising
            print(f"[Growth: {item['value']}] {item['query']}")

        # 2. Top Queries (The Base)
        # These are the most popular queries overall
        top = data["relatedQueries"].get("top", [])
        print(f"\n--- Top Queries (The Volume) ---")
        for item in top[:5]:
            print(f"[Index: {item['value']}] {item['query']}")
    else:
        print("No trend data found.")

Output Example

--- Rising Queries (The Opportunity) ---
[Growth: +3,450%] javvy protein coffee
[Growth: +1,100%] anomalous coffee machine
[Growth: +400%] the coffee boy movie
[Growth: +350%] good coffee great coffee
[Growth: +350%] us coffee market

--- Top Queries (The Volume) ---
[Index: 100] the coffee
[Index: 75] coffee shop
[Index: 54] coffee machine

The script separates “Top” queries (evergreen volume like “coffee shop”) from “Rising” queries (viral intent like “javvy protein coffee”). By automating this retrieval daily, you can detect shifting market interests before they appear in mainstream SEO tools.
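
Below is a minimal sketch of that daily automation, assuming you persist each run's rising queries to a local CSV (the filename and schedule are placeholders) and flag only terms that have not appeared before.

import csv
import os
from datetime import date

HISTORY_FILE = "rising_history.csv"  # hypothetical local log of previously seen rising queries

def detect_new_breakouts(rising_items):
    """Returns rising queries that did not appear in any previous run."""
    file_exists = os.path.exists(HISTORY_FILE)
    seen = set()
    if file_exists:
        with open(HISTORY_FILE, newline="", encoding="utf-8") as f:
            seen = {row["query"] for row in csv.DictReader(f)}

    new_items = [item for item in rising_items if item["query"] not in seen]

    # Append today's snapshot so tomorrow's run only flags genuinely new terms
    with open(HISTORY_FILE, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["date", "query", "value"])
        if not file_exists:
            writer.writeheader()
        for item in rising_items:
            writer.writerow({"date": date.today(), "query": item["query"], "value": item["value"]})

    return new_items

# Usage with the script above, e.g. from a daily cron job:
# data = get_breakout_trends(SEED_TOPIC)
# for item in detect_new_breakouts(data["relatedQueries"].get("rising", [])):
#     print(f"NEW BREAKOUT: {item['query']} ({item['value']})")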

To break these trends down by region (geoMap) or pull related concepts (relatedTopics), refer to our guide on How to Scrape Google Trends.

Script 3: The PAA Topic Graph (Recursive Extraction)

Standard keyword research tools show you what people type. Google’s “People Also Ask” (PAA) shows you what people want to know.

By scraping PAA questions recursively, we build a Topic Authority Tree. Answering the root question along with its 2nd and 3rd-level derivatives signals deep expertise to the ranking algorithm.

The Code

This script performs a Depth-First Search (DFS) on PAA questions. It accepts a root keyword, extracts the relatedQuestions array from the JSON response, and treats each question as a new search query to find its related questions.

import requests
import json

# Configuration
API_KEY = "HASDATA_API_KEY"
ROOT_KEYWORD = "coffee"
MAX_DEPTH = 2  # How deep to go into the rabbit hole

def get_paa_questions(query):
    """
    Fetches PAA questions for a given query using HasData's Google SERP API.
    """
    url = "https://api.hasdata.com/scrape/google/serp"
    
    # The API returns structured JSON that includes the 'relatedQuestions' array
    params = {
        "q": query,
        "location": "United States",
        "deviceType": "desktop",
    }
    
    headers = {
        "x-api-key": API_KEY,
        "Content-Type": "application/json"
    }

    try:
        response = requests.get(url, params=params, headers=headers, timeout=20)
        
        if response.status_code == 200:
            data = response.json()
            # Extract list of questions from the 'relatedQuestions' key
            # Check if key exists to prevent errors on pages without PAA
            questions = [item['question'] for item in data.get("relatedQuestions", [])]
            return questions
        else:
            print(f"Error {response.status_code}: {response.text}")
            return []
            
    except Exception as e:
        print(f"Connection Failed: {e}")
        return []

def build_topic_tree(query, current_depth, max_depth, visited):
    """
    Recursively builds a tree of questions.
    """
    # Avoid infinite loops or reprocessing the same question
    if current_depth > max_depth or query in visited:
        return {}

    print(f"Scanning Level {current_depth}: {query}")
    visited.add(query)
    
    # Get questions for the current query
    paa_list = get_paa_questions(query)
    
    tree = {}
    
    for question in paa_list:
        # Recursion: Treat this question as a new query
        sub_tree = build_topic_tree(question, current_depth + 1, max_depth, visited)
        tree[question] = sub_tree
        
    return tree

def print_tree(tree, level=0):
    for question, sub_questions in tree.items():
        indent = "    " * level
        print(f"{indent}- {question}")
        print_tree(sub_questions, level + 1)

if __name__ == "__main__":
    visited_queries = set()
    print(f"Building PAA Tree for: '{ROOT_KEYWORD}'...\n")
    
    topic_tree = build_topic_tree(ROOT_KEYWORD, 1, MAX_DEPTH, visited_queries)
    
    print(f"\n--- Topic Authority Tree ---")
    print(f"- {ROOT_KEYWORD}")
    print_tree(topic_tree)

We leverage the SERP API to receive structured JSON, eliminating the need to parse Google’s accordion DOM elements manually.

Output Example

Running this script for “coffee” generates a hierarchy ready for H2 and H3 tags:

- Coffee
    - What are the health benefits of coffee?
        - Does drinking coffee have any health benefits?
        - What are the 10 benefits of drinking coffee?
        - Is it good to drink coffee every day?
        - What is the healthiest way to drink coffee?
    - What are the different types of coffee?
        - What are the four types of coffee?
        - Which coffee type is best?
        - What are the top 5 best coffees?
        - What different types of coffee are there?
    - What is the history of coffee?
        - What is coffee and its history?
        - Why was coffee called Satan's drink?
        - Why did Japan take 20 years to drink coffee?
        - When did humans start drinking coffee?

Instead of a flat list, we generate a semantic graph. If “coffee” leads to “Health benefits,” querying that node reveals nuances like “Does coffee raise blood pressure?” This depth ensures your content covers sub-topics that generic tools miss.
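
To turn the tree into a working brief, a small helper (hypothetical, not part of the script above) can flatten the nested dictionary returned by build_topic_tree into the H2/H3 outline mentioned earlier.

def tree_to_outline(tree, level=2):
    """Converts the nested dict from build_topic_tree() into Markdown headings (H2, H3, ...)."""
    lines = []
    for question, sub_questions in tree.items():
        lines.append(f"{'#' * min(level, 6)} {question}")
        lines.extend(tree_to_outline(sub_questions, level + 1))
    return lines

# Usage after running the script:
# outline = "\n".join(tree_to_outline(topic_tree))
# with open("content_brief.md", "w", encoding="utf-8") as f:
#     f.write(f"# {ROOT_KEYWORD.title()}\n\n{outline}\n")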

Script 4: SERP Intent Qualifier (Page Type Analyzer)

“Keyword Difficulty” metrics are often abstract. A keyword may show low difficulty but feature ten enterprise homepages in the top results. In this scenario, a “How-to” blog post will fail because Google has determined the intent is navigational (Branded) rather than informational.

You must match the SERP Intent. If the top results are product pages, you need an e-commerce page. If they are blogs, you need content.

The Code

This script scrapes the Top 10 organic results and classifies each URL with simple domain and path heuristics. It buckets every result into Informational (blogs, guides), Transactional (products, shops), Encyclopedic, Video, UGC, or Navigational (homepages).

import requests
import pandas as pd
from urllib.parse import urlparse

# Configuration
API_KEY = "HASDATA_API_KEY"
KEYWORD = "instant coffee"

def get_serp_links(query):
    url = "https://api.hasdata.com/scrape/google/serp"
    params = {
        "q": query,
        "gl": "us",
        "hl": "en",
        "deviceType": "desktop"
    }
    headers = {"x-api-key": API_KEY}

    try:
        response = requests.get(url, params=params, headers=headers, timeout=20)
        if response.status_code == 200:
            data = response.json()
            # Extract organic links
            return [result['link'] for result in data.get('organicResults', [])]
        return []
    except Exception as e:
        print(f"Error: {e}")
        return []

def classify_url(url):
    """
    Classifies a URL based on domain reputation, path keywords, and structure.
    """
    try:
        parsed = urlparse(url)
        domain = parsed.netloc.lower()
        path = parsed.path.lower()
        
        # 1. Specialized Platforms (Domain-based)
        if "wikipedia.org" in domain:
            return "Encyclopedic (Wikipedia)"
        
        if "youtube.com" in domain or "youtu.be" in domain:
            return "Video Content (YouTube)"
            
        if any(x in domain for x in ['reddit.com', 'quora.com', 'stackoverflow.com']):
            return "UGC (Forum/Discussion)"

        # 2. General Forum Detection (Path-based)
        if any(x in path for x in ['/forum', '/threads', '/community', '/board']):
            return "UGC (Forum/Discussion)"


        # 3. Homepage Detection
        if path == "" or path == "/":
            return "Homepage (Brand)"
            
        # 4. Transactional Keywords
        if any(x in path for x in ['/product', '/shop', '/item', '/collections', '/buy', '/pricing', '/store']) or "amazon.com" in domain:
            return "Transactional (Product)"
            
        # 5. Informational Keywords
        if any(x in path for x in ['/blog', '/guide', '/news', '/article', '/how-to', '/tips', '/wiki']):
            return "Informational (Blog)"
            
        # 6. Fallback (General Page)
        return "General Page"
    except Exception:
        return "Unknown"

if __name__ == "__main__":
    print(f"Analyzing SERP for: '{KEYWORD}'...\n")
    links = get_serp_links(KEYWORD)
    
    if links:
        # Create a DataFrame for clean visualization
        df = pd.DataFrame(links, columns=['URL'])
        df['Type'] = df['URL'].apply(classify_url)
        
        # Calculate percentages
        breakdown = df['Type'].value_counts(normalize=True) * 100
        
        # Display the Data
        print("--- SERP Composition ---")
        print(df[['Type', 'URL']].to_string(index=True))
        
        print("\n--- Strategic Verdict ---")
        # Handle cases where multiple types might have the same top percentage
        if not breakdown.empty:
            dominant_type = breakdown.idxmax()
            percent = breakdown.max()
            
            print(f"Dominant Type: {dominant_type} ({percent:.1f}%)")
            
            if "Informational" in dominant_type:
                print("Action: Create a long-form Guide or Blog Post.")
            elif "Transactional" in dominant_type:
                print("Action: Create a Product Page or Collection Page.")
            elif "Homepage" in dominant_type:
                print("Action: High difficulty. Requires strong Brand Authority.")
            elif "Video" in dominant_type:
                print("Action: Text alone won't rank. Produce a high-quality Video.")
            elif "UGC" in dominant_type:
                print("Action: Engage in community discussions or create 'Real Review' style content.")
            elif "Encyclopedic" in dominant_type:
                print("Action: Definitional intent. Very hard to outrank Wikipedia directly.")
            else:
                print("Action: Mixed SERP. Manual review recommended.")
        else:
            print("Action: No valid classification data available.")
    else:
        print("No results found.")

Output Example

The output provides an instant distribution table.

--- SERP Composition ---
                       Type                                                                                 URL
0  Encyclopedic (Wikipedia)                                        https://en.wikipedia.org/wiki/Instant_coffee
1              General Page              https://www.thetakeout.com/1917415/instant-coffee-store-bought-ranked/
2   Transactional (Product)                            https://www.happyproducts.com/collections/instant-coffee
3    UGC (Forum/Discussion)   https://www.reddit.com/r/Coffee/comments/1nblr8x/what_is_the_best_instant_coffee/
4   Transactional (Product)              https://onyxcoffeelab.com/collections/instant-specialty-soluble-coffee
5   Transactional (Product)                                       https://cafely.com/collections/instant-coffee
6   Transactional (Product)          https://www.amazon.com/Best-Sellers-Instant-Coffee/zgbs/grocery/2251594011
7   Transactional (Product)                        https://www.vervecoffee.com/collections/instant-craft-coffee
8              General Page                                https://www.bonappetit.com/story/best-instant-coffee

--- Strategic Verdict ---
Dominant Type: Transactional (Product) (55.6%)
Action: Create a Product Page or Collection Page.

Google defines the rules, and this script reveals them. If 56% of results are transactional, the user wants to buy, not learn. This tool aligns strategy with algorithmic evidence rather than abstract difficulty scores.

Script 5: Google Keyword Similarity Analyzer (SERP Overlap)

Targeting two similar keywords with one page is a risk. Guessing wrong leads to cannibalization or diluted relevance.

The solution is in the SERP itself. If Google ranks the same set of URLs for both queries, the intent is identical. If the results differ, the topics require separate pages.

The Code

This script calculates the Jaccard Index (overlap percentage) between search results for a list of keywords. It visualizes the data as a heatmap for instant clustering decisions.

import requests
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter
import time

# --- CONFIGURATION ---
API_KEY = "HASDATA_API_KEY"
KEYWORDS = [
    "health benefits of coffee",
    "benefits of coffee for men",
    "benefits of coffee for women",
    "health benefits of black coffee",
    "health benefits of mushroom coffee",
    "health benefits of decaf coffee",
]
# ---------------------

def get_top_urls_set(query):
    """
    Requests the API and returns a SET of organic URLs.
    Using a set is convenient for mathematical intersection operations.
    """
    url = "https://api.hasdata.com/scrape/google/serp"
    params = {
        "q": query,
        "gl": "us",
        "hl": "en",
        "deviceType": "desktop"
      }
    headers = {"x-api-key": API_KEY}

    print(f"Scanning SERP for: '{query}'...")
    try:
        response = requests.get(url, params=params, headers=headers, timeout=25)
        if response.status_code == 200:
            data = response.json()
            organic = data.get('organicResults', [])
            # Extract links only. Taking the first 10 if more are returned.
            links = [result['link'] for result in organic[:10] if 'link' in result]
            return set(links)
        else:
            print(f"Error {response.status_code} for query '{query}'")
            return set()
    except Exception as e:
        print(f"Exception for query '{query}': {e}")
        return set()

def calculate_jaccard(set1, set2):
    """Calculates Jaccard Index between two sets."""
    intersection = len(set1.intersection(set2))
    union = len(set1.union(set2))
    if union == 0:
        return 0.0
    return intersection / union

def visualize_similarity_heatmap(similarity_df):
    """Builds and displays the heatmap."""
    plt.figure(figsize=(10, 8))
    sns.set_theme(context='notebook', style='whitegrid', font_scale=1.1)
    
    # Create heatmap
    # annot=True shows numbers in cells
    # fmt=".0%" formats numbers as percentages
    # cmap="YlGnBu" - color scheme (Yellow to Blue)
    ax = sns.heatmap(
        similarity_df, 
        annot=True, 
        fmt=".0%", 
        cmap="YlGnBu", 
        vmin=0, 
        vmax=1,
        cbar_kws={'label': 'Jaccard Similarity Score'}
    )
    
    plt.title(f"SERP Similarity Heatmap", fontsize=16, pad=20)
    plt.xticks(rotation=45, ha='right')
    plt.yticks(rotation=0)
    plt.tight_layout()
    print("Visualizing heatmap...")
    plt.show()

def show_top_common_urls(results_data):
    """Counts and prints the most frequently occurring URLs."""
    all_urls_flat = []
    # Collect all URLs from all sets into one list
    for url_set in results_data.values():
        all_urls_flat.extend(list(url_set))
    
    # Count frequency
    url_counts = Counter(all_urls_flat).most_common(15)
    
    print(f"\n### Top Recurring URLs (across {len(KEYWORDS)} queries)")
    print(f"{'Frequency':<10} | URL")
    print("-" * 80)
    for url, count in url_counts:
        # Show only if it appears more than once, unless we have very few queries
        if count > 1 or len(KEYWORDS) <= 2: 
             print(f"{count:<10} | {url}")      

# --- MAIN LOGIC ---
if __name__ == "__main__":
    # 1. Data Collection
    results_data = {}
    print("--- Start SERP Collection ---\n")
    for keyword in KEYWORDS:
        results_data[keyword] = get_top_urls_set(keyword)
        # Short pause between requests
        time.sleep(1) 
    print("\n--- Collection Finished ---")

    # 2. Calculation of Similarity Matrix (Jaccard)
    n = len(KEYWORDS)
    similarity_matrix = pd.DataFrame(
        index=KEYWORDS, 
        columns=KEYWORDS, 
        dtype=float
    )

    # Double loop to compare every query against every other query
    for i in range(n):
        for j in range(n):
            kw1 = KEYWORDS[i]
            kw2 = KEYWORDS[j]
            score = calculate_jaccard(results_data[kw1], results_data[kw2])
            similarity_matrix.iloc[i, j] = score

    # 3. Output Results to Console (Text Table)
    print("\n### Similarity Matrix (Jaccard Index)")
    pd.set_option('display.max_columns', None)
    pd.set_option('display.width', 1000)
    # Format output as percentages
    print(similarity_matrix.style.format("{:.1%}").to_string())
    print("\n" + "="*50 + "\n")
    
    # 4. Common URL Analysis
    show_top_common_urls(results_data)

    # 5. Visualization (Heatmap)
    visualize_similarity_heatmap(similarity_matrix)

Scaling this script lets you cluster thousands of keywords into groups automatically (see the clustering sketch at the end of this section).

Output Example

### Similarity Matrix (Jaccard Index)
 health benefits of coffee benefits of coffee for men benefits of coffee for women health benefits of black coffee health benefits of mushroom coffee health benefits of decaf coffee
health benefits of coffee 100.0% 38.5% 30.8% 18.8% 0.0% 0.0%
benefits of coffee for men 38.5% 100.0% 36.4% 21.4% 0.0% 0.0%
benefits of coffee for women 30.8% 36.4% 100.0% 23.1% 0.0% 0.0%
health benefits of black coffee 18.8% 21.4% 23.1% 100.0% 0.0% 0.0%
health benefits of mushroom coffee 0.0% 0.0% 0.0% 0.0% 100.0% 0.0%
health benefits of decaf coffee 0.0% 0.0% 0.0% 0.0% 0.0% 100.0%

==================================================

### Top Recurring URLs (across 6 queries)
Frequency  | URL
--------------------------------------------------------------------------------
4          | https://www.hopkinsmedicine.org/health/wellness-and-prevention/9-reasons-why-the-right-amount-of-coffee-is-good-for-you
4          | https://www.mayoclinic.org/healthy-lifestyle/nutrition-and-healthy-eating/expert-answers/coffee-and-health/faq-20058339
3          | https://www.nhlbi.nih.gov/news/2025/when-it-comes-health-benefits-coffee-timing-may-count
3          | https://www.rush.edu/news/health-benefits-coffee
3          | https://www.healthline.com/nutrition/top-evidence-based-health-benefits-of-coffee
2          | https://nutritionsource.hsph.harvard.edu/food-features/coffee/

Visualizing heatmap...

SERP similarity heatmap showing Jaccard overlap between Google search results for coffee-related keywords to identify topic clustering and cannibalization risk.

The data is definitive. “Health benefits of coffee” overlaps significantly with “benefits for men” (38.5%) and “benefits for women” (30.8%). Google treats them as the same topic. You should merge these into one comprehensive guide.

Conversely, “Mushroom coffee” and “decaf coffee” have a 0% overlap. They share no URLs with the main cluster. These topics require dedicated, separate articles.

This approach removes the guesswork from site architecture.
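
One way to scale this beyond a handful of keywords, as noted above, is to turn the similarity matrix into explicit groups. The sketch below uses a simple greedy rule with an arbitrary 30% overlap threshold; hierarchical clustering would work equally well.

def cluster_keywords(similarity_df, threshold=0.3):
    """Greedy grouping: keywords whose SERP overlap meets the threshold join the same cluster."""
    clusters = []
    assigned = set()

    for kw in similarity_df.index:
        if kw in assigned:
            continue
        cluster = [kw]  # seed a new cluster with the first unassigned keyword
        assigned.add(kw)
        for other in similarity_df.columns:
            if other not in assigned and similarity_df.loc[kw, other] >= threshold:
                cluster.append(other)
                assigned.add(other)
        clusters.append(cluster)
    return clusters

# Usage with the matrix built above:
# for i, group in enumerate(cluster_keywords(similarity_matrix), 1):
#     print(f"Cluster {i}: {group}")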

Script 6: Content Gap Analyzer (N-Gram Extraction)

Search algorithms rely on vector space models to determine document relevance. If top-ranking results share a vocabulary your document lacks, you have a semantic distance problem.

Manual gap analysis is subjective. We solve this by treating the SERP as a training corpus. By vectorizing competitor content and comparing it against the target URL using N-Gram analysis, we mathematically define the “market vocabulary.”

The Code

This script utilizes scikit-learn to perform Bag-of-Words (BoW) vectorization and frequency analysis. It bypasses the noise of standard HTML parsing by using trafilatura to extract only the main body text, stripping ads, navigation, and boilerplate.

The algorithm constructs a feature matrix based on competitor consensus (Document Frequency). It then projects your content onto this matrix to identify missing n-grams. This process highlights precise semantic concepts absent from your content but present across the market leaders.

import trafilatura
import requests
import pandas as pd
import json
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# --- CONFIGURATION ---
API_KEY = "HASDATA_API_KEY"
TARGET_KEYWORD = "health benefits of decaf coffee"
MY_URL = "https://www.telegraph.co.uk/health-fitness/diet/nutrition/is-decaf-coffee-good-or-bad-for-you/"
TOP_N_COMPETITORS = 10

# Extend standard stop words to filter out conversational noise and contractions
CUSTOM_STOP_WORDS = list(ENGLISH_STOP_WORDS) + ['ve', 'll', 're', 'don', 'won', 't', 's', 'm', 'd']

def get_serp_links(query):
    """Retrieves top organic search results to establish the competitor baseline."""
    url = "https://api.hasdata.com/scrape/google/serp"
    try:
        response = requests.get(url, params={"q": query, "gl": "us", "hl": "en", "deviceType": "desktop", "num": TOP_N_COMPETITORS}, headers={"x-api-key": API_KEY}, timeout=20)
        return [res['link'] for res in response.json().get('organicResults', [])] if response.status_code == 200 else []
    except Exception:
        return []

def scrape_content(target_url):
    """
    Extracts main content from a URL using Trafilatura for DOM parsing.
    Bypasses standard bot protections via HasData proxies.
    """
    api_url = "https://api.hasdata.com/scrape/web"
    # Compact payload for brevity
    payload = {"url": target_url, "proxyType": "datacenter", "proxyCountry": "US", "jsRendering": True, "outputFormat": ["html"]}
    headers = {'x-api-key': API_KEY, 'Content-Type': 'application/json'}

    try:
        response = requests.post(api_url, data=json.dumps(payload), headers=headers, timeout=60)
        if response.status_code == 200:
            # Trafilatura is superior to newspaper3k for extracting core text from modern DOMs
            text = trafilatura.extract(response.text, include_comments=False, include_tables=False, no_fallback=True)
            return text if text else ""
        return ""
    except Exception as e:
        print(f"Scrape Error: {e}")
        return ""

def analyze_ngrams_sklearn(my_text, competitor_texts, n_gram_range=(1, 1)):
    """
    Performs N-Gram gap analysis using CountVectorizer (Bag-of-Words document frequency).
    1. Fits vocabulary on the Competitor Corpus (Market Standard).
    2. Transforms Target Text against that vocabulary to identify gaps.
    """
    if not competitor_texts: return pd.DataFrame()

    # Establish significance threshold: phrase must appear in at least 2 competitors (unless corpus is small)
    min_freq = 2 if len(competitor_texts) >= 3 else 1
   
    vec = CountVectorizer(ngram_range=n_gram_range, stop_words=CUSTOM_STOP_WORDS, min_df=min_freq)
   
    try:
        # Generate feature matrix for competitors
        X_comp = vec.fit_transform(competitor_texts)
    except ValueError:
        return pd.DataFrame() # Handle empty vocabulary cases

    feature_names = vec.get_feature_names_out()
   
    # Binarize occurrence (document frequency) rather than raw term frequency
    competitors_count = (X_comp > 0).astype(int).sum(axis=0).A1
   
    # Map target content against the established market vocabulary
    my_counts = vec.transform([my_text]).toarray()[0] if my_text else [0] * len(feature_names)

    # Aggregate and sort by market popularity
    df = pd.DataFrame({'Phrase': feature_names, 'Competitors_Count': competitors_count, 'My_Count': my_counts})
    return df.sort_values(by=['Competitors_Count', 'Phrase'], ascending=[False, True])

def run_analysis():
    # 1. Competitor Discovery
    competitor_urls = get_serp_links(TARGET_KEYWORD)
    if not competitor_urls:
        print("No competitors found via SERP API.")
        return

    print(f"\n--- CONTENT EXTRACTION & ANALYSIS ---")
   
    # 2. Target Site Extraction (Normalized Logging)
    my_text = scrape_content(MY_URL)
    my_text = re.sub(r'\s+', ' ', my_text).strip()
    my_wc = len(my_text.split())
   
    status_icon = "✅" if my_wc > 100 else "⚠️"
    print(f"{status_icon} [TARGET] {my_wc} words | {MY_URL}")

    # 3. Competitor Extraction
    competitor_texts = []
    print(f"\nScanning {len(competitor_urls)} competitors...")
   
    for url in competitor_urls:
        if url == MY_URL: continue
       
        txt = scrape_content(url)
        txt = re.sub(r'\s+', ' ', txt).strip()
        wc = len(txt.split())
       
        if wc > 200:
            print(f"✅ [COMP]   {wc} words | {url}")
            competitor_texts.append(txt)
        else:
            print(f"⚠️ [SKIP]   {wc} words | {url} (Insufficient content)")

    total_competitors = len(competitor_texts)
    if total_competitors == 0:
        print("No valid competitor content available for analysis.")
        return

    print(f"\nAnalyzing gaps against {total_competitors} validated competitors...\n")

    # 4. N-Gram Calculation
    dfs = {
        "MISSING KEYWORDS (1-Gram)": analyze_ngrams_sklearn(my_text, competitor_texts, (1, 1)),
        "MISSING PHRASES (2-Grams)": analyze_ngrams_sklearn(my_text, competitor_texts, (2, 2)),
        "MISSING LONG-TAIL (3-Grams)": analyze_ngrams_sklearn(my_text, competitor_texts, (3, 3))
    }

    # 5. Reporting
    print("="*60 + "\nGAP ANALYSIS REPORT\n" + "="*60)

    if my_wc <= 50:
        print("❌ Critical: Target site content could not be parsed. Analysis aborted.")
        return

    for title, df in dfs.items():
        if df.empty: continue
       
        # Filter for gaps (Frequency on target = 0) and take top 10
        subset = df[df['My_Count'] == 0].head(10)
       
        if subset.empty:
            print(f"\n### {title}: No significant gaps detected.")
            continue

        print(f"\n### {title}")
        print(f"{'Phrase':<30} | {'Competitors':<15} | {'My Count':<10}")
        print("-" * 65)
        for _, row in subset.iterrows():
            print(f"{row['Phrase']:<30} | {row['Competitors_Count']} of {total_competitors:<10} | ❌")

    print("\n" + "="*60 + "\nSHARED TERMS OVERVIEW\n" + "="*60)
    # Show terms where we align with competitors (using 1-gram data as proxy)
    shared = dfs["MISSING KEYWORDS (1-Gram)"]
    shared_subset = shared[shared['My_Count'] > 0].head(5)
   
    if not shared_subset.empty:
        for _, row in shared_subset.iterrows():
            print(f"{row['Phrase']:<30} | {row['Competitors_Count']} of {total_competitors:<10} | {row['My_Count']}")

if __name__ == "__main__":
    run_analysis()

Output Example

--- CONTENT EXTRACTION & ANALYSIS ---
✅ [TARGET] 990 words | https://www.telegraph.co.uk/health-fitness/diet/nutrition/is-decaf-coffee-good-or-bad-for-you/

Scanning 8 competitors...
✅ [COMP]   1264 words | https://www.healthline.com/nutrition/decaf-coffee-good-or-bad
✅ [COMP]   1603 words | https://www.webmd.com/diet/what-to-know-decaf-coffee
⚠️ [SKIP]   24 words | https://pubmed.ncbi.nlm.nih.gov/32551832/ (Insufficient content)
✅ [COMP]   1966 words | https://orleanscoffee.com/health-benefits-of-decaf-coffee/?srsltid=AfmBOopii8VU-Rg2xvF2WxJ7VQhP0vp3ymHw4eNiM0BEiYMD_jQKEGqx
✅ [COMP]   460 words | https://www.uclahealth.org/news/article/health-benefits-of-coffee-remain-in-decaf-version
✅ [COMP]   738 words | https://www.swisswater.com/blogs/sw/three-big-questions-about-decaf-and-your-health-swiss-water-process
✅ [COMP]   927 words | https://www.souterbros.co.uk/blogs/news/7-surprising-benefits-of-decaffeinated-coffee
✅ [COMP]   977 words | https://www.aboutcoffee.org/beans/decaf-coffee/

Analyzing gaps against 7 validated competitors...

============================================================
GAP ANALYSIS REPORT
============================================================

### MISSING KEYWORDS (1-Gram)
Phrase                         | Competitors Use | My Count  
-----------------------------------------------------------------
antioxidants                   | 6 of 7          |         
reduce                         | 6 of 7          |         
diabetes                       | 5 of 7          |         

### MISSING PHRASES (2-Grams)
Phrase                         | Competitors Use | My Count  
-----------------------------------------------------------------
caffeine intake                | 5 of 7          |         
coffee decaf                   | 5 of 7          |         
swiss water                    | 5 of 7          |         

### MISSING LONG-TAIL (3-Grams)
Phrase                         | Competitors Use | My Count  
-----------------------------------------------------------------
coffee decaf coffee            | 5 of 7          |         
swiss water process            | 4 of 7          |         
benefits regular coffee        | 3 of 7          |         

============================================================
SHARED TERMS
============================================================
benefits                       | 7 of 7          | 8
caffeine                       | 7 of 7          | 10
coffee                         | 7 of 7          | 35
decaf                          | 7 of 7          | 13
diabetes                       | 7 of 7          | 3

The data highlights a clear relevance gap. While the target page covers the general topic, it misses specific scientific and technical terms found in the majority of competitors.

Script 7: AI Overview (SGE) Visibility Monitor

Rank trackers are often blind to AI Overviews. You may rank #1 organically yet lose traffic to an AI summary citing a competitor. This creates a “Phantom Traffic Loss” scenario where rankings remain stable but clicks evaporate.

The Code

This script audits SGE visibility by parsing the aiOverview object from the SERP API. It validates target domain presence within the citation array to derive two KPIs: AI Coverage (trigger frequency) and Citation Share of Voice.

import requests
import pandas as pd
import time
from urllib.parse import urlparse

# --- CONFIGURATION ---
API_KEY = "HASDATA_API_KEY"
TARGET_DOMAIN = "webmd.com" # The domain we are monitoring for citations

KEYWORDS = [
    "health benefits of coffee",
    "benefits of coffee for men",
    "benefits of coffee for women",
    "health benefits of black coffee",
    "health benefits of mushroom coffee",
    "health benefits of decaf coffee"
]

def normalize_domain(url):
    """
    Extracts the base domain from a URL to ensure accurate matching.
    e.g., 'https://www.sub.example.com/page' -> 'sub.example.com'
    """
    try:
        parsed = urlparse(url)
        # Returns netloc (e.g., www.example.com). We strip 'www.' for broader matching.
        return parsed.netloc.lower().replace('www.', '')
    except Exception:
        return ""

def fetch_serp_data(query):
    """
    Executes a SERP API request targeting Google's AI Overview.
    Note: AI Overviews are volatile; successful triggering depends on
    location (US) and device type (Desktop/Mobile).
    """
    endpoint = "https://api.hasdata.com/scrape/google/serp"
    params = {
        "q": query,
        "gl": "us",       # Geo-location: US is critical for SGE consistency
        "hl": "en",       # Language: English
        "deviceType": "desktop"
    }
    headers = {"x-api-key": API_KEY}
   
    try:
        response = requests.get(endpoint, params=params, headers=headers, timeout=30)
        if response.status_code == 200:
            return response.json()
        else:
            print(f"[API Error] Status: {response.status_code} for query '{query}'")
            return None
    except Exception as e:
        print(f"[Network Error] {e}")
        return None

def extract_ai_citations(serp_data):
    """
    Parses the JSON response to locate the 'aiOverview' block.
    Extracts structured references (citations) if present.
    """
    # HasData typically returns SGE data in the 'aiOverview' key
    ai_overview = serp_data.get('aiOverview')
   
    if not ai_overview:
        return None, [] # AI Overview not triggered for this query

    # Extract the 'references' array which contains the source links
    references = ai_overview.get('references', [])
   
    # Map into a cleaner format for analysis
    citations = []
    for ref in references:
        citations.append({
            'index': ref.get('index'),
            'title': ref.get('title'),
            'url': ref.get('link'),
            'source_name': ref.get('source')
        })
       
    return ai_overview, citations

def run_monitor():
    results = []
    print(f"--- Starting AI Overview Monitor for: {TARGET_DOMAIN} ---")
    print(f"Processing {len(KEYWORDS)} keywords...\n")

    for query in KEYWORDS:
        print(f"Analyzing: '{query}'...")
       
        data = fetch_serp_data(query)
       
        if not data:
            continue

        # 1. Check for AI Overview existence
        ai_block, citations = extract_ai_citations(data)
       
        is_triggered = ai_block is not None
        is_cited = False
        citation_rank = None
        found_url = None
       
        # 2. Analyze Citations if AI Overview exists
        if is_triggered:
            target_clean = normalize_domain(f"https://{TARGET_DOMAIN}")
           
            for cit in citations:
                cited_domain = normalize_domain(cit['url'])
               
                # Check for substring match (handles subdomains and exact matches)
                if target_clean in cited_domain:
                    is_cited = True
                    citation_rank = cit['index'] # 0-based index in the citation carousel
                    found_url = cit['url']
                    break # Stop after finding the first occurrence
       
        # 3. Aggregate Data
        results.append({
            "Keyword": query,
            "AI Triggered": is_triggered,
            "Is Cited": is_cited,
            "Citation Index": citation_rank if is_cited else "-",
            "Cited URL": found_url if is_cited else "-"
        })
       
        # Respect API rate limits
        time.sleep(1)

    # --- REPORTING ---
    df = pd.DataFrame(results)
   
    print("\n" + "="*60)
    print("AI OVERVIEW VISIBILITY REPORT")
    print("="*60)
   
    # formatting booleans for readability
    df['AI Triggered'] = df['AI Triggered'].map({True: '✅ Yes', False: '❌ No'})
    df['Is Cited'] = df['Is Cited'].map({True: '✅ YES', False: '❌ No'})
   
    # Handle display if dataframe is empty
    if not df.empty:
        # Use to_markdown if available, otherwise string
        try:
            print(df.to_markdown(index=False))
        except ImportError:
            print(df.to_string(index=False))
           
        # Summary Metrics
        total = len(df)
        triggered = len(df[df['AI Triggered'] == '✅ Yes'])
        cited = len(df[df['Is Cited'] == '✅ YES'])
       
        print("\n--- SUMMARY METRICS ---")
        print(f"AI Coverage: {triggered}/{total} keywords ({(triggered/total)*100:.1f}%)")
        if triggered > 0:
            print(f"Share of Voice: {cited}/{triggered} AI Overviews ({(cited/triggered)*100:.1f}%)")
        else:
            print("Share of Voice: N/A (No AI Overviews generated)")
           
    else:
        print("No data collected.")

if __name__ == "__main__":
    run_monitor()

Output Example

============================================================
AI OVERVIEW VISIBILITY REPORT
============================================================
| Keyword                            | AI Triggered   | Is Cited   | Citation Index   | Cited URL                                                                                                                                                                                  |
|:-----------------------------------|:---------------|:-----------|:-----------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| health benefits of coffee          | ✅ Yes         | ❌ No      | -                | -                                                                                                                                                                                          |
| benefits of coffee for men         | ✅ Yes         | ✅ YES     | 11               | https://www.webmd.com/diet/health-benefits-black-coffee#:~:text=Coffee%20is%20rich%20in%20several,antioxidants%20in%20most%20people's%20diets.                                             |
| benefits of coffee for women       | ✅ Yes         | ❌ No      | -                | -                                                                                                                                                                                          |
| health benefits of black coffee    | ✅ Yes         | ✅ YES     | 1                | https://www.webmd.com/diet/health-benefits-black-coffee#:~:text=Research%20shows%20that%20drinking%20coffee,builds%20up%20in%20your%20blood.                                               |
| health benefits of mushroom coffee | ✅ Yes         | ✅ YES     | 1                | https://www.webmd.com/diet/mushroom-coffee-health-benefits#:~:text=These%20have%20major%20antioxidant%20properties,to%20back%20this%20claim%20up.                                          |
| health benefits of decaf coffee    | ✅ Yes         | ✅ YES     | 1                | https://www.webmd.com/diet/what-to-know-decaf-coffee#:~:text=doctor%20allows%20it.-,What%20Are%20the%20Benefits%20of%20Drinking%20Decaf%20Coffee?,best%20option%20for%20health%20benefits. |

--- SUMMARY METRICS ---
AI Coverage: 6/6 keywords (100.0%)
Share of Voice: 4/6 AI Overviews (66.7%)

This provides the feedback loop missing from Search Console. A high AI trigger rate combined with a low citation share indicates your content lacks the “liftability” (clear definitions and direct answers) required for LLM extraction.
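
To turn this snapshot into a trend line, you could append each run to a dated log and chart citation share over time. A hypothetical helper (the filename is a placeholder):

import os
from datetime import date

import pandas as pd

LOG_FILE = "ai_overview_log.csv"  # hypothetical history file

def log_run(results):
    """Appends the raw results list from run_monitor() to a dated CSV for trend tracking."""
    df = pd.DataFrame(results)
    df["Date"] = date.today().isoformat()
    df.to_csv(LOG_FILE, mode="a", header=not os.path.exists(LOG_FILE), index=False)

# Call log_run(results) inside run_monitor() before the booleans are mapped to emojis,
# then aggregate the log by Date to plot AI Coverage and Citation Share over time.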

Conclusion

The scripts above are not just isolated snippets. Together, they form the backend of a custom, headless SEO platform.

Enterprise suites cost $500-$2,000 per month. You pay for the UI, the overhead, and features you never use. By leveraging Python and a SERP API (starting at $49/mo), you can replicate the critical 80% of functionality, such as Rank Tracking, Intent Analysis, and SGE Monitoring.

We have packaged all seven scripts, including a configured requirements.txt and environment templates, into a single repository.

[GitHub Repo: HasData Python SEO Suite 2026]
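
The repository ships its own pinned requirements.txt; as a rough sketch based on the libraries used in this guide, the environment needs roughly the following (version pins omitted):

requests
urllib3
pandas
scikit-learn
trafilatura
seaborn
matplotlib
tabulate   # optional: enables DataFrame.to_markdown() in Script 7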

Your Next Step:

  1. Get your API Key: Sign up for HasData to get 1,000 free credits (no credit card required for the trial).
  2. Clone the Repo: Pull the code to your local machine or Google Colab.
  3. Execute: Insert your key and run any script to instantly perform a live keyword analysis or page audit.

Search algorithms evolve faster than enterprise SaaS roadmaps. Python gives you the agility to adapt your tooling in real time.

Sergey Ermakovich
Sergey is the Co-founder and CMO at HasData, a web scraping API handling billions of requests. He specializes in web data extraction infrastructure, anti-bot evasion strategies, and technical SEO. Sergey writes extensively on headless browser orchestration, API development, and scaling data pipelines for enterprise applications.