
Scraping Google News: The 2025 Python Guide

Valentina Skakun
Last update: 25 Nov 2025

To scrape Google News effectively, you need to choose the extraction method that matches your scale: RSS Feeds for lightweight monitoring, Google Search (tbm=nws) for specific keyword tracking, or a dedicated Google News API for high-volume collection that bypasses rate limits and CAPTCHAs.

Available Data Points

Primary metadata extractable from Google News articles includes:

  • Core Data: Headlines, Summaries, Descriptions, Links.
  • Timing: Publication Dates (UTC), Timestamps.
  • Source: Publisher Name, Source URL.
  • Media: Thumbnails, Images.

Setup and Prerequisites

Ensure you have a Python environment with the following libraries:

pip install requests feedparser pandas
# Optional: For sentiment analysis and visualization later
pip install nltk matplotlib

Extracting Structured Data from Google News RSS Feed

Google News RSS feeds utilize undocumented but predictable URL structures. This method is ideal for lightweight, real-time monitoring without heavy dependencies.

To get a feed, insert /rss into the standard URL path.

  • Original:
https://news.google.com/topics/CAAqIAgKIhpDQkFTRFFvSEwyMHZNRzFyZWhJQ1pXNG9BQVAB
  • RSS Feed:
https://news.google.com/rss/topics/CAAqIAgKIhpDQkFTRFFvSEwyMHZNRzFyZWhJQ1pXNG9BQVAB

Be aware that XML feeds may occasionally lack complete metadata or encounter rate limits during high-volume fetching. Since URL structures change without notice, ensure your parser includes error handling.
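
Here is a minimal defensive-parsing sketch. It assumes feedparser's status and bozo fields, which are set when a feed is fetched over HTTP:

import feedparser

# Fetch the top-news feed and fail loudly on HTTP or XML problems
feed = feedparser.parse("https://news.google.com/rss?hl=en-US&gl=US&ceid=US:en")

if feed.get("status", 200) >= 400:
    raise RuntimeError(f"Feed request failed with HTTP {feed['status']}")
if feed.bozo:  # set when the XML was malformed or truncated
    print(f"Warning: feed parsed with errors: {feed.bozo_exception}")

print(f"Fetched {len(feed.entries)} entries")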

Four main endpoints:

| Type | URL format | Description |
| --- | --- | --- |
| Top news | https://news.google.com/rss | Main news feed, optionally filtered by language and region |
| By topic | https://news.google.com/rss/topics/<TOPIC_ID> | News for a specific topic (requires topic ID) |
| By topic section | https://news.google.com/rss/topics/<TOPIC_ID>/sections/<SECTION_ID> | News for a specific section within a topic (requires section ID) |
| Search | https://news.google.com/rss/search?q=<QUERY> | RSS feed for a custom keyword search, supports modifiers like site:, when: |

Search feed parameters:

| Parameter | Example | Description |
| --- | --- | --- |
| q | q=site:bbc.com when:1d | Search query for keywords, phrases, site filters, or time modifiers |
| hl | hl=en-US | Interface language (controls localization of results) |
| gl | gl=US | Geographical region (country code for results) |
| ceid | ceid=US:en | Country and language code for the content feed, usually matches hl and gl |

Modifiers:

| Modifier | Example | Description |
| --- | --- | --- |
| site: | site:bbc.com | Limit search to a specific website or domain |
| when: | when:1d | Restrict results to a time range (1h, 12h, 1d, 7d) |
| -word | -rumor | Exclude a word from results |
| "phrase" | "new release" | Exact match for a phrase |
| OR | apple OR samsung | Logical OR between terms |

To scrape Google News via RSS:

  1. Create a script to generate the RSS URL with the desired parameters.
  2. Parse the XML from the feed. It’s best to use a specialized library like feedparser for handling RSS feeds.

Here’s a small Python example to generate a search-based RSS feed URL:

import urllib.parse


# Base parameters
base_url = "https://news.google.com/rss/search"
hl = "en-US"
gl = "US"
ceid = "US:en"


# Search modifiers
keyword = "tech"
site_filter = "site:bbc.com"
time_filter = "when:1d"   # when:1h, when:12h, when:1d, when:7d
exclude_word = "-rumor"
exact_phrase = '"new release"'

query_parts = [
    keyword,
    site_filter,
    time_filter,
    exclude_word,
    exact_phrase
]

query = " ".join(part for part in query_parts if part)

encoded_query = urllib.parse.quote(query)
encoded_ceid = urllib.parse.quote(ceid)

rss_url = f"{base_url}?q={encoded_query}&hl={hl}&gl={gl}&ceid={encoded_ceid}"

print(rss_url)

Parse the RSS feed and extract structured data:

import feedparser

# Parse the generated feed URL and pull the fields we need from each entry
feed = feedparser.parse(rss_url)

items = []
for entry in feed.entries:
    items.append({
        "title": entry.get("title", ""),
        "link": entry.get("link", ""),
        "pubDate": entry.get("published", ""),
        "source": entry.get("source", {}).get("title", "") if entry.get("source") else "",
        "description": entry.get("description", "")
    })
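
Since pandas is already in the prerequisites, you can optionally load the parsed items into a DataFrame for quick inspection or export (the output filename below is just an example):

import pandas as pd

# Turn the parsed RSS items into a DataFrame for inspection or export
df = pd.DataFrame(items)
print(df.head())
df.to_csv("google_news_rss.csv", index=False)  # example output filename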

A working Streamlit wrapper of this approach is publicly available and can be adapted for custom projects.

Scraping News Results from Google Search (Using tbm=nws)

The tbm=nws parameter filters standard Google Search results to show only news items. This method is essential for collecting historical data or specific keyword trends that RSS feeds miss.

Three main approaches:

  • Headless Browsers (Selenium/Playwright): Flexible but resource-heavy. Requires constant maintenance to evade bot detection.
  • LLM + Headless: Abstracts selector handling (e.g., Crawl4AI) but scales poorly due to cost and latency.
  • SERP APIs: The enterprise standard. APIs like HasData handle IP rotation, CAPTCHAs, and HTML parsing server-side.

For production environments requiring high availability and IP rotation, we recommend dedicated scraping APIs over headless browsers to eliminate maintenance overhead.

This script uses HasData’s Google SERP API to bypass blocking and extract structured CSV/JSON data immediately. Retrieve your API Key from the HasData dashboard.

import requests
import json
import csv


# Set up HasData’s Google SERP API endpoint and query parameters
url = "https://api.hasdata.com/scrape/google/serp"
params = {
    "q": "technology",  # search keyword
    "location": "Austin,Texas,United States",  # geolocation for search
    "tbm": "nws",  # news search mode
    "deviceType": "desktop",  # emulate desktop browser
}


# API headers with your HasData API key
headers = {
    "Content-Type": "application/json",
    "x-api-key": "HASDATA-API-KEY"
}


# Send request and parse JSON response
response = requests.get(url, params=params, headers=headers)
data = response.json()


news_items = []


# Extract relevant fields from each news result
for item in data.get("newsResults", []):
    news_items.append({
        "position": item.get("position"),
        "title": item.get("title"),
        "link": item.get("link"),
        "source": item.get("source"),
        "snippet": item.get("snippet"),
        "date": item.get("date"),
        "thumbnail": item.get("thumbnail")
    })


# Save results to JSON
with open("news.json", "w", encoding="utf-8") as f:
    json.dump(news_items, f, ensure_ascii=False, indent=2)


# Define CSV fields
csv_fields = ["position", "title", "link", "source", "snippet", "date", "thumbnail"]


# Save results to CSV
with open("news.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=csv_fields)
    writer.writeheader()
    writer.writerows(news_items)

Google limits results to 10 per page. To paginate, increment the start parameter (start=10, start=20, and so on) on successive requests, as in the sketch below.
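
A minimal pagination sketch, reusing the url, params, and headers from the script above and assuming the API forwards Google's start offset as described:

# Collect up to five pages of news results by stepping the `start` offset
all_items = []
for start in range(0, 50, 10):
    page_params = {**params, "start": start}
    resp = requests.get(url, params=page_params, headers=headers)
    resp.raise_for_status()
    page = resp.json().get("newsResults", [])
    if not page:
        break  # stop when a page comes back empty
    all_items.extend(page)

print(f"Collected {len(all_items)} news results")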

The full code and a live Streamlit demo are available, so you can try it instantly or fork the project and adapt it to your own workflow.

Scraping Google News via the Google News API

Direct scraping via the Google News API allows precise filtering by topic, section, or location without the parsing overhead of generic search results.

1. Scraping “Top Stories”

The “Top Stories” section is the default feed displayed on the Google News homepage.

To collect data from this specific feed using HasData’s API, you must provide its unique identifier, the topicToken.

import requests
import json
import pandas as pd

API_KEY = "HASDATA-API-KEY"

# Google News API endpoint and topicToken for specific section
url = "https://api.hasdata.com/scrape/google/news"
params = {
    "topicToken": "CAAqIggKIhxDQkFTRHdvSkwyMHZNRGxqTjNjd0VnSmxiaWdBUAE"  # U.S. news example
}

# Headers including API key
headers = {
    "Content-Type": "application/json",
    "x-api-key": API_KEY
}

# Request data from API
response = requests.get(url, params=params, headers=headers)

if response.status_code == 200:
    data = response.json()

    # Save raw JSON response
    with open("news.json", "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)

    news = data.get("newsResults", [])
    clean_rows = []

    # Extract and structure relevant fields for each news item
    for item in news:
        h = item.get("highlight", {})
        source = h.get("source", {})

        row = {
            "position": item.get("position"),
            "title": h.get("title"),
            "link": h.get("link"),
            "date": h.get("date"),
            "thumbnail": h.get("thumbnail"),
            "thumbnailSmall": h.get("thumbnailSmall"),
            "source_name": source.get("name"),
            "source_icon": source.get("icon"),
            "source_authors": ", ".join(source.get("authors", [])),
            "stories": json.dumps(item.get("stories", []), ensure_ascii=False)  
        }

        clean_rows.append(row)

    # Convert structured data to CSV
    df = pd.DataFrame(clean_rows)
    df.to_csv("news.csv", index=False)
else:
    print(f"Request failed: HTTP {response.status_code}")

The API also supports keyword search, country, language, sort order, and other parameters listed in the official documentation.
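
For example, a keyword search against the same endpoint might look like the sketch below. It reuses the url and headers from the script above; the parameter names (q, gl, hl) mirror the scripts later in this guide, and anything else (such as sort order) should be checked against the documentation:

# Keyword search instead of a topicToken
search_params = {
    "q": "electric vehicles",  # keyword query
    "gl": "us",                # country
    "hl": "en",                # language
    # "so": ...                # sort order; see the documentation for accepted values
}
response = requests.get(url, params=search_params, headers=headers)
for item in response.json().get("newsResults", [])[:5]:
    print(item.get("highlight", {}).get("title"))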

2. Scraping Specific Topics

To target niche verticals (e.g., Technology, Business), replace the topicToken in the script above.

Common Topic Tokens:

| Category (Title) | topicToken |
| --- | --- |
| Top Stories | CAAqJggKIiBDQkFTRWdvSUwyMHZNRFZxYUdjU0FtVnVHZ0pWVXlnQVAB |
| U.S. | CAAqIggKIhxDQkFTRHdvSkwyMHZNRGxqTjNjd0VnSmxiaWdBUAE |
| World | CAAqJggKIiBDQkFTRWdvSUwyMHZNRGx1YlY4U0FtVnVHZ0pWVXlnQVAB |
| Local | CAAqHAgKIhZDQklTQ2pvSWJHOWpZV3hmZGpJb0FBUAE |
| Business | CAAqJggKIiBDQkFTRWdvSUwyMHZNRGx6TVdZU0FtVnVHZ0pWVXlnQVAB |
| Technology | CAAqJggKIiBDQkFTRWdvSUwyMHZNRGRqTVhZU0FtVnVHZ0pWVXlnQVAB |
| Entertainment | CAAqJggKIiBDQkFTRWdvSUwyMHZNREpxYW5RU0FtVnVHZ0pWVXlnQVAB |
| Sports | CAAqJggKIiBDQkFTRWdvSUwyMHZNRFp1ZEdvU0FtVnVHZ0pWVXlnQVAB |
| Science | CAAqJggKIiBDQkFTRWdvSUwyMHZNRFp0Y1RjU0FtVnVHZ0pWVXlnQVAB |
| Health | CAAqIQgKIhtDQkFTRGdvSUwyMHZNR3QwTlRFU0FtVnVLQUFQAQ |

A complete example with all available parameters supports dynamic topic selection via a dropdown or a custom token value.
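
As a minimal sketch of that token selection, reusing the url and headers from the script above (tokens copied from the table):

# Map human-readable categories to the topicTokens listed above
TOPIC_TOKENS = {
    "Business":   "CAAqJggKIiBDQkFTRWdvSUwyMHZNRGx6TVdZU0FtVnVHZ0pWVXlnQVAB",
    "Technology": "CAAqJggKIiBDQkFTRWdvSUwyMHZNRGRqTVhZU0FtVnVHZ0pWVXlnQVAB",
    "Science":    "CAAqJggKIiBDQkFTRWdvSUwyMHZNRFp0Y1RjU0FtVnVHZ0pWVXlnQVAB",
}

chosen = "Technology"
response = requests.get(url, params={"topicToken": TOPIC_TOKENS[chosen]}, headers=headers)
print(f"{chosen}: {len(response.json().get('newsResults', []))} results")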

3. Scraping Full Article Content

The Google News API provides links, but not the full article text. To build a complete dataset, you must pass the extracted links to a Web Scraping API that handles JavaScript rendering and content extraction.

Workflow:

  1. Fetch Links: Use the News API to get a list of URLs.
  2. Crawl Content: Pass URLs to the Web Scraping API with outputFormat: ["markdown"].
  3. Store: Save individual files for RAG pipelines or analysis.

import requests
import json
import os
import time

API_KEY = "HASDATA-API-KEY"

# Prepare parameters for Google News API request
params_raw = {
    "q": "",  # optional search query
    "gl": "us",  # geographic location
    "hl": "en",  # language
    "topicToken": "CAAqJggKIiBDQkFTRWdvSUwyMHZNRFZxYUdjU0FtVnVHZ0pWVXlnQVAB",
    "sectionToken": "",
    "publicationToken": "",
    "storyToken": "",
    "so": ""
}

# Remove empty parameters
params = {k: v for k, v in params_raw.items() if v}

news_url = "https://api.hasdata.com/scrape/google/news"
news_headers = {"Content-Type": "application/json", "x-api-key": API_KEY}

# Fetch news metadata from Google News API
resp = requests.get(news_url, params=params, headers=news_headers)
resp.raise_for_status()
data = resp.json()

# Extract links to full articles
news_links = [item.get("highlight", {}).get("link") for item in data.get("newsResults", []) if item.get("highlight", {}).get("link")]

web_url = "https://api.hasdata.com/scrape/web"
web_headers = {"Content-Type": "application/json", "x-api-key": API_KEY}

# Create folder to save markdown files
os.makedirs("news_md", exist_ok=True)

# Loop through each news link and scrape full article in Markdown
for idx, link in enumerate(news_links, start=1):
    payload = {
        "url": link,
        "proxyType": "datacenter",  # choose proxy type
        "proxyCountry": "US",       # set proxy country
        "jsRendering": True,        # enable JS rendering
        "outputFormat": ["markdown"]  # get content as Markdown
    }
    
    try:
        resp = requests.post(web_url, headers=web_headers, data=json.dumps(payload))
        resp.raise_for_status()
        md_content = resp.text
        filename = os.path.join("news_md", f"news_{idx}.md")
        with open(filename, "w", encoding="utf-8") as f:
            f.write(md_content)
        print(f"[{idx}/{len(news_links)}] Saved: {filename}")
    except Exception as e:
        print(f"[{idx}/{len(news_links)}] Error: {link} -> {e}")
    time.sleep(1)  # short pause between articles for polite pacing

print("All news saved in 'news_md' folder")

The loop prints progress to the console as each article is saved, so you can track long-running jobs.

Practical Applications: From Data to Insights

Raw data is valuable only when transformed into insights. Below are two Python pipelines to convert scraping results into actionable intelligence: Trend Detection and Sentiment Monitoring.

Pipeline 1: Topic Frequency Analysis

This script identifies dominant narratives by tokenizing headlines, filtering out noise (stop words), and visualizing the top keywords.

Instead of hardcoding stop words, we use nltk.corpus for a robust, standardized list.

import requests
import json
from collections import Counter
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
import re
import string

nltk.download('stopwords')

API_KEY = "HASDATA-API-KEY"

params_raw = {
    "q": "",
    "gl": "us",
    "hl": "en",
    "topicToken": "CAAqJggKIiBDQkFTRWdvSUwyMHZNRFp0Y1RjU0FtVnVHZ0pWVXlnQVAB",
    "sectionToken": "",
    "publicationToken": "",
    "storyToken": "",
    "so": ""
}

params = {k: v for k, v in params_raw.items() if v}

news_url = "https://api.hasdata.com/scrape/google/news"
news_headers = {"Content-Type": "application/json", "x-api-key": API_KEY}

resp = requests.get(news_url, params=params, headers=news_headers)
resp.raise_for_status()
data = resp.json()

# Extract titles
titles = [
    item.get("highlight", {}).get("title", "")
    for item in data.get("newsResults", [])
]

# Prepare stopwords
stop_words = set(stopwords.words('english'))

# Collect words
words = []
for title in titles:
    for word in re.findall(r'\w+', title.lower()):
        if (
            word not in stop_words
            and len(word) > 2
            and word not in string.punctuation
        ):
            words.append(word)

# Count frequency
counter = Counter(words)
most_common = counter.most_common(20)

# Handle empty case
if not most_common:
    print("No meaningful words.")
else:
    labels, counts = zip(*most_common)
    plt.figure(figsize=(12, 6))
    plt.bar(labels, counts, color='skyblue')
    plt.xticks(rotation=45, ha='right')
    plt.title("Top 20 meaningful words in news headlines")
    plt.ylabel("Frequency")
    plt.tight_layout()
    plt.show()

For deeper insight, upgrade from single words (unigrams) to bigrams (e.g., “Artificial Intelligence” instead of “Artificial” and “Intelligence”).
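
A small sketch of that upgrade, continuing the script above (it reuses the titles and stop_words variables already defined):

from collections import Counter
import re

from nltk.util import ngrams

# Count two-word phrases (bigrams) across the headlines collected above
bigram_counter = Counter()
for title in titles:
    tokens = [
        w for w in re.findall(r"\w+", title.lower())
        if w not in stop_words and len(w) > 2
    ]
    bigram_counter.update(" ".join(pair) for pair in ngrams(tokens, 2))

print(bigram_counter.most_common(10))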

Pipeline 2: Sentiment Analysis with VADER

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a rule-based model specifically optimized for social media and short headlines. It requires far less computational power than an LLM while remaining effective for directional sentiment (Positive/Negative).

Logic:

  • Compound >= 0.05: Positive
  • Compound <= -0.05: Negative
  • Else: Neutral

Short text may produce false positives/negatives, especially for headlines with numbers or factual statements without emotional tone.

import requests
import json
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import matplotlib.pyplot as plt
import nltk

# Download VADER lexicon for sentiment analysis
nltk.download('vader_lexicon')

API_KEY = "HASDATA-API-KEY"

# Parameters for Google News API request
params_raw = {
    "q": "",  # optional search query
    "gl": "us",  # geographic location
    "hl": "en",  # language
    "topicToken": "CAAqJggKIiBDQkFTRWdvSUwyMHZNREpxYW5RU0FtVnVHZ0pWVXlnQVAB",
    "sectionToken": "",
    "publicationToken": "",
    "storyToken": "",
    "so": ""
}

# Remove empty parameters
params = {k: v for k, v in params_raw.items() if v}

news_url = "https://api.hasdata.com/scrape/google/news"
news_headers = {"Content-Type": "application/json", "x-api-key": API_KEY}

# Fetch news data
resp = requests.get(news_url, params=params, headers=news_headers)
resp.raise_for_status()
data = resp.json()

# Initialize VADER sentiment analyzer
sid = SentimentIntensityAnalyzer()

# Prepare containers for sentiment groups
grouped_news = {"positive": [], "neutral": [], "negative": []}

# Analyze sentiment for each news item
for item in data.get("newsResults", []):
    highlight = item.get("highlight", {})
    text_to_analyze = (highlight.get("title") or "") + " " + (highlight.get("snippet") or "")  # guard against null fields

    score = sid.polarity_scores(text_to_analyze)
    news_item = {
        "title": highlight.get("title"),
        "link": highlight.get("link"),
        "snippet": highlight.get("snippet"),
        "source_name": highlight.get("source", {}).get("name"),
        "date": highlight.get("date"),
        "thumbnail": highlight.get("thumbnail")
    }

    # Classify news item based on compound score
    if score['compound'] >= 0.05:
        grouped_news["positive"].append(news_item)
    elif score['compound'] <= -0.05:
        grouped_news["negative"].append(news_item)
    else:
        grouped_news["neutral"].append(news_item)

# Save sentiment-classified news to JSON
with open("news_sentiment.json", "w", encoding="utf-8") as f:
    json.dump(grouped_news, f, ensure_ascii=False, indent=2)

VADER excels at emotional tone but may struggle with financial nuance. A headline like “Profits fell by 20%” might be rated neutral if the dictionary lacks context for specific economic terms. For high-precision financial sentiment, consider fine-tuning a BERT model.
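
If you go that route, a pretrained finance model is a quick baseline before fine-tuning your own. This is only a sketch: it assumes the transformers library and the publicly available ProsusAI/finbert checkpoint, neither of which is part of the setup above:

from transformers import pipeline

# Finance-tuned sentiment model as a drop-in alternative to VADER (assumed checkpoint)
finbert = pipeline("sentiment-analysis", model="ProsusAI/finbert")

headlines = [
    "Profits fell by 20% in the second quarter",
    "The company reported record revenue growth",
]

for text, result in zip(headlines, finbert(headlines)):
    print(f"{result['label']:>8}  {result['score']:.2f}  {text}")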

Legal Considerations

When scraping news data, understanding the distinction between data access and content usage is critical for compliance.

  • Public Facts vs. Copyrighted Work: generally, metadata (headlines, timestamps, factual snippets) is considered public information and is safe to index for analytics. However, the full body text of an article is the intellectual property of the publisher.
  • Permissible Use:
    • Analytics & AI: Using full text internally to train sentiment models, summaries, or trend dashboards is typically considered fair use in many jurisdictions.
    • Republishing: Displaying full articles on your own public website without a license is a copyright violation.
  • The API Advantage: Direct scraping can inadvertently trigger anti-bot measures or violate server resource policies. By using HasData, you offload the responsibility of polite crawling (rate limiting, headers management) to our infrastructure, ensuring that data is gathered without disrupting publisher ecosystems.

Disclaimer: This guide is for informational purposes only and does not constitute legal advice. Data scraping laws (such as the GDPR in Europe or the CFAA in the US) vary by region. Always consult legal counsel regarding your specific use case.

Final Thoughts

Collecting and analyzing news from Google News opens a wide range of possibilities, from tracking trends to monitoring brand mentions or building custom news dashboards. Each approach has its strengths and limitations: 

  • RSS feeds are structured and reliable but offer limited flexibility;
  • scraping Google Search results provides broader coverage but requires constant maintenance to handle layout changes, proxies, and CAPTCHAs;
  • direct Google News scraping through the API delivers clean, structured data with minimal overhead.

If you need to collect large volumes of news or bypass complex blocks, try the HasData Google News API: it delivers the data in a structured format ready for analysis and integration into your workflow.

Valentina Skakun
Valentina is a software engineer who builds data extraction tools before writing about them. With a strong background in Python, she also leverages her experience in JavaScript, PHP, R, and Ruby to reverse-engineer complex web architectures. If data renders in a browser, she will find a way to script its extraction.