How to Do Your Own Content Gap Analysis

Sergey Ermakovich
Last update: 19 Dec 2024

In this guide, I’d like to share a small internal project that initially aimed to solve a specific in-house problem, but ultimately, I believe, could be useful to a wider audience. I’m an SEO specialist at HasData, where we develop various web scraping tools, including a SERP API and a Web Scraping API. Let’s face it − working at a company that provides tools for extracting web data but not using them yourself is a bit like being a professional chef who lives off nothing but crackers.

For those who want to dive right in or prefer to explore tools hands-on, here’s the link to our application: Page Content Gap Analyzer. This tool lets you easily compare your content against your competitors’ pages.

One of the tasks I aimed to automate was analyzing our own articles and comparing their content against competitors’ pages in the search results. The objective is to identify entities and concepts we may have overlooked but that consistently appear among the top 10 Google results for our target queries. This analysis helps refine our semantic framework, expand our content coverage, and ultimately enhance visibility and appeal. In the following sections, I’ll detail how this mini-project was developed and the insights we gained from it.

What Is Google NLP and What Are Entities?

Before we get into the technical details, let’s briefly review these foundational concepts.

Google NLP (Natural Language Processing) is a service provided by Google that leverages machine learning and text analysis to understand and break down written content. It can identify various characteristics, such as sentiment, syntax, categories, and, most importantly for our purposes, entities. In this context, entities include proper names, brands, organizations, products, places, and other significant elements mentioned in the text. Beyond simply extracting these entities, Google NLP also attempts to determine their type and their salience, that is, how central each entity is to the text as a whole.

Entities play a crucial role in SEO content analysis. They effectively highlight the core topics, individuals, or objects that your text emphasizes. By comparing your content to that of your competitors and identifying the key entities they feature, you can develop a more comprehensive list of aspects to address. This approach enhances the usefulness, informativeness, and relevance of your material to users.

In our mini-project, we use Google NLP to extract entities from the content automatically. Then we compare which entities are mentioned by competitors but not covered in our own text. This approach allows us to swiftly identify “gaps” − topics that should be addressed to make our content more complete.

How to Use It

Now, let’s transition to the practical aspect. The interface is intuitive, making it easy to get started even if you’ve never used similar tools before.

Page Content Gap Analyzer Interface

Getting Started

  1. Enter your page URL.
    Input the address of the page you want to analyze. This could be your article, a landing page, or any other content whose quality you want to improve.
  2. Enter the main search query.
    Input the keyword or phrase that you want to rank in the top 10 Google results for. The application will analyze competitors’ content based on this query.
  3. Insert your API keys.
    You will need two keys − one from HasData and one from Google NLP. Instructions on how to obtain them are provided below.
  4. Click “Start Analysis” and relax.
    The application then performs all the necessary steps automatically, from scraping search results to extracting entities with Google NLP and running the comparative analysis.

How to Get Your HasData API Key

After registering for an account on HasData, you’ll be directed to your Dashboard, where you can obtain your API key. Upon activation, you receive 1,000 free API credits, which is enough to analyze approximately 100 keywords.

How to Get Your Google NLP API Key

  1. Go to Google Cloud Platform and sign in to your account.
  2. Create a new project or select an existing one from the dashboard.
  3. Navigate to “APIs & Services” and click on “Enable APIs and Services”. Search for Google Cloud Natural Language API and enable it.
  4. Go to the “Credentials” section and click on “Create Credentials” > “API Key”. If you already have an API key, you can use that instead.

Regarding Limits:
Google NLP pricing is based on the number of Unicode characters processed per request, billed in units (per the pricing page, one unit covers up to 1,000 characters). For detailed pricing information, refer to the Google Cloud Natural Language API Pricing page. Google NLP offers 5,000 free units per month, which is typically enough for initial testing and small projects. If your usage exceeds this limit, you can scale up by moving to a paid plan.
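
To get a rough feel for how far the free tier goes, the sketch below estimates billable units from text length. It assumes one unit per 1,000 Unicode characters and at least one unit per document, so check the pricing page before relying on the numbers. `estimateUnits` is a hypothetical helper name, not part of any Google library.

```javascript
// Rough estimate of Google NLP billable units for a set of documents.
// Assumption: 1 unit = up to 1,000 Unicode characters, minimum 1 unit per document.
const CHARS_PER_UNIT = 1000;

function estimateUnits(texts) {
  return texts.reduce((total, text) => {
    // Count Unicode code points rather than UTF-16 code units.
    const chars = Array.from(text).length;
    return total + Math.max(1, Math.ceil(chars / CHARS_PER_UNIT));
  }, 0);
}

// Example: ten competitor pages plus your own, each about 8,000 characters long.
const pages = Array(11).fill('x'.repeat(8000));
console.log(estimateUnits(pages)); // 88
```

Under these assumptions, one keyword with pages of that size uses 88 units, so the free tier would cover a few dozen analyses per month.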

How Our Page Content Gap Analyzer Works

Now that you understand the purpose of our mini-tool, let’s explore its mechanics. I’ll detail how the application processes your input and generates valuable SEO insights.

Step 1: Fetching Search Results

Fetching Search Results

We begin by sending a search query through HasData’s SERP API. The app retrieves the top 10 results from Google and displays them in a table with columns for:

  • Position: The page’s rank in the search results.
  • Source: The source or domain.
  • Link: The direct link to the page.
  • Snippet: A brief description of the page’s content.
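
As an illustration of this step, here is a minimal sketch that builds a SERP request and maps a response into the four columns above. The endpoint URL, the `x-api-key` header, and the `organicResults` field are assumptions made for the example and should be checked against the HasData documentation. `buildSerpRequest` and `toSerpRows` are hypothetical helper names.

```javascript
// Assumptions (not from the article): the endpoint URL, the `x-api-key` header,
// and the `organicResults` response field are illustrative; verify them in the HasData docs.
const SERP_ENDPOINT = 'https://api.hasdata.com/scrape/google/serp';

// Build the URL and fetch options for a search query (no request is sent here).
function buildSerpRequest(query, apiKey) {
  const url = `${SERP_ENDPOINT}?${new URLSearchParams({ q: query })}`;
  return { url, options: { headers: { 'x-api-key': apiKey } } };
}

// Convert a response object into Position / Source / Link / Snippet rows.
function toSerpRows(response, limit = 10) {
  return (response.organicResults || []).slice(0, limit).map((r, i) => ({
    position: r.position ?? i + 1,                    // fall back to list order
    source: r.source ?? new URL(r.link).hostname,     // fall back to the domain
    link: r.link,
    snippet: r.snippet ?? '',
  }));
}

// Usage with a mocked response (no network call is made here).
const sample = {
  organicResults: [
    { position: 1, link: 'https://example.com/guide', snippet: 'A guide.' },
  ],
};
console.log(buildSerpRequest('content gap analysis', 'YOUR_API_KEY').url);
console.log(toSerpRows(sample));
```

In a real run, you would pass `url` and `options` to `fetch` and feed the parsed JSON into `toSerpRows`.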

Step 2: Harvesting Page Content

Harvesting Page Content

Next, the app extracts the content from each of those top 10 pages. Using HasData’s Web Scraping API, it automatically collects the main text from every page. This results in a table with columns for:

  • Position: The page’s rank.
  • Link: The direct link to the page.
  • Content: The page’s text content.
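
The Web Scraping API handles text extraction for you. If you fetch raw HTML some other way, a very rough stand-in for “main text” might look like the sketch below. It is a naive regex approach and not how the HasData API works; `htmlToText` and `toContentRows` are hypothetical helpers.

```javascript
// Naive HTML-to-text conversion: good enough for a quick experiment,
// not a replacement for a real HTML parser or main-content extractor.
function htmlToText(html) {
  return html
    .replace(/<(script|style|noscript)\b[^>]*>[\s\S]*?<\/\1>/gi, ' ') // drop non-content blocks
    .replace(/<[^>]+>/g, ' ')   // drop the remaining tags
    .replace(/&nbsp;/g, ' ')    // decode two common HTML entities only
    .replace(/&amp;/g, '&')
    .replace(/\s+/g, ' ')       // collapse whitespace
    .trim();
}

// Build Position / Link / Content rows from already-fetched pages.
function toContentRows(pages) {
  return pages.map(({ position, link, html }) => ({
    position,
    link,
    content: htmlToText(html),
  }));
}

// Usage with an inline HTML string (no network call is made here).
const rows = toContentRows([
  { position: 1, link: 'https://example.com/guide', html: '<h1>Guide</h1><p>Hello &amp; welcome</p>' },
]);
console.log(rows[0].content); // Guide Hello & welcome
```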

Step 3: Analyzing Content with Google NLP

Analyzing Content with Google NLP

The collected text from each of the top 10 results is then fed into Google’s Natural Language Processing (NLP) API. This service processes the text and extracts key entities, determining their importance and role within the content. You’ll end up with a table showing:

  • Entity: The name of the identified entity.
  • Salience: The significance of the entity within the text, expressed as a score between 0 and 1.

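For readers who want to see what this call looks like, the following sketch builds a request for the `documents:analyzeEntities` REST method and reduces a response to the Entity and Salience columns. It is a minimal sketch rather than the analyzer’s actual code. `buildAnalyzeEntitiesRequest` and `topEntities` are hypothetical helper names, and the default of 30 mirrors the top-30 cutoff used later in the comparison.

```javascript
// Build the REST request for the Cloud Natural Language analyzeEntities method.
function buildAnalyzeEntitiesRequest(text, apiKey) {
  return {
    url: `https://language.googleapis.com/v1/documents:analyzeEntities?key=${encodeURIComponent(apiKey)}`,
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        document: { type: 'PLAIN_TEXT', content: text },
        encodingType: 'UTF8',
      }),
    },
  };
}

// Reduce the API response to Entity / Salience rows, most salient first.
function topEntities(response, limit = 30) {
  return (response.entities || [])
    .map(({ name, salience = 0 }) => ({ entity: name, salience }))
    .sort((a, b) => b.salience - a.salience)
    .slice(0, limit);
}

// Usage with a mocked response (no network call is made here).
const mock = {
  entities: [
    { name: 'proxy', type: 'OTHER', salience: 0.12 },
    { name: 'web scraping', type: 'OTHER', salience: 0.41 },
  ],
};
console.log(topEntities(mock, 1)); // [ { entity: 'web scraping', salience: 0.41 } ]
```

In a real run, `url` and `options` would go to `fetch`, and the parsed JSON would be passed to `topEntities`.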
Step 4: Processing Your Target Page

Processing Your Target Page

The content of your target page goes through the same Google NLP analysis. The resulting entities are then compared with your competitors’ entities to identify potential gaps.

Step 5: Analyzing Entities and Identifying Content Gaps

Analyzing Entities and Identifying Content Gaps

Finally, we compare the top 30 entities from your page to those of your competitors. The results are presented in a table showing:

  • Entity: The name of the entity.
  • Count: The number of competitors using that entity.
  • URLs: A list of links to competitor pages where the entity is mentioned.

Entities present in your competitors’ content but missing from yours are highlighted in orange. This visual cue makes it easy to spot content gaps and take actionable steps to enhance your content accordingly.
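
The comparison itself is essentially a set difference. Here is a minimal sketch, assuming entity names are matched exactly (no normalization), as in the Google Sheets script later in this article. `findContentGaps` is a hypothetical helper name.

```javascript
// ownEntities: array of entity names from your page.
// competitors: object mapping each competitor URL to its array of entity names.
function findContentGaps(ownEntities, competitors) {
  const own = new Set(ownEntities);
  const urlsByEntity = new Map();

  for (const [url, entities] of Object.entries(competitors)) {
    // A Set ensures each competitor is counted once per entity.
    for (const entity of new Set(entities)) {
      if (own.has(entity)) continue; // already covered on your page
      if (!urlsByEntity.has(entity)) urlsByEntity.set(entity, []);
      urlsByEntity.get(entity).push(url);
    }
  }

  // Produce Entity / Count / URLs rows, most widely used entities first.
  return [...urlsByEntity]
    .map(([entity, urls]) => ({ entity, count: urls.length, urls }))
    .sort((a, b) => b.count - a.count);
}

// Usage with made-up data.
const gaps = findContentGaps(['web scraping', 'API'], {
  'https://a.example/post': ['web scraping', 'proxy', 'CAPTCHA'],
  'https://b.example/guide': ['proxy', 'API', 'proxy'],
});
console.log(gaps.map((g) => `${g.entity}: ${g.count}`)); // [ 'proxy: 2', 'CAPTCHA: 1' ]
```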

Bonus: DIY Content Analysis Scripts

For those who have programming skills or enjoy customizing their processes, I’ve prepared a couple of useful scripts. These scripts were the foundation for creating our analyzer and can assist you in conducting your own SEO analyses.

Python Script for Google Colab

This script extracts content from a list of URLs, analyzes entities using Google NLP, and saves the results to a Google Sheet. Unlike our mini-service, it accepts a list of URLs as input, so you are not limited to the top 10 results and can analyze any reasonable number of pages. This is particularly useful for deeper competitor analysis or investigating specific market segments.

🔗 Access the Google Colab Script

JavaScript Script for Google Sheets

This script is written in JavaScript (Google Apps Script) for Google Sheets and compares the entities of your target page with those of your competitors. The results are automatically output to a new sheet, simplifying the comparative analysis process.

Before running the script, make sure to specify your own domain within the script, as it searches for the column containing your domain to locate your entities.

function onOpen() {
  const ui = SpreadsheetApp.getUi();
  ui.createMenu('Custom Scripts')
    .addItem('Compare Entities', 'compareEntities')
    .addToUi();
}

function compareEntities() {
  const sheet = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet();
  const data = sheet.getDataRange().getValues();  // Get all data
  
  const userDomain = 'yourdomain.com';  // Replace with your own domain
  
  let yourEntities = {};  // Object for your entities
  let competitorsColumns = [];  // Array for competitor columns
  let entityUrls = {};  // Object to store URLs where each entity appears
  let highlightedEntities = {};  // Object for entities highlighted in orange

  // Step 1: Find the column with the user's domain and extract top-30 entities
  let yourUrlColumn = -1;
  for (let i = 0; i < data[0].length; i++) {
    if (String(data[0][i]).includes(userDomain)) {  // String() guards against non-text header cells
      yourUrlColumn = i;
      break;
    }
  }

  if (yourUrlColumn === -1) {
    SpreadsheetApp.getUi().alert(`URL with "${userDomain}" not found.`);
    return;
  }

  // Extract your entities
  let startRow = -1;
  for (let i = 0; i < data.length; i++) {
    if (data[i][yourUrlColumn] === 'Entity' && data[i][yourUrlColumn + 1] === 'Salience') {
      startRow = i + 1;  // Start from the first data row after headers
      break;
    }
  }

  if (startRow === -1) {
    SpreadsheetApp.getUi().alert('Entity and Salience not found.');
    return;
  }

  // Extract top-30 entities for your URL (stop early if the sheet has fewer rows)
  const endRow = Math.min(startRow + 30, data.length);
  for (let i = startRow; i < endRow; i++) {  // Top-30 entities
    const entity = data[i][yourUrlColumn];
    const salience = data[i][yourUrlColumn + 1];
    if (entity && salience) {
      yourEntities[entity] = salience;  // Save entities with their importance
      // Remember where this entity appears
      if (!entityUrls[entity]) {
        entityUrls[entity] = [userDomain];
      } else {
        entityUrls[entity].push(userDomain);
      }
    }
  }

  // Step 2: Find all columns that contain http, excluding the user's domain
  for (let i = 0; i < data[0].length; i++) {
    const header = String(data[0][i]);  // Coerce in case the cell holds a non-text value
    if (header.includes('http') && !header.includes(userDomain)) {
      competitorsColumns.push(i);  // Save competitor column indices
    }
  }

  // Step 3: Process competitors by iterating through each column
  competitorsColumns.forEach((competitorColumn) => {
    let competitorEntities = [];
    
    // Find the row with "Entity" and "Salience" for the competitor
    let foundEntities = false;
    for (let row = 0; row < data.length; row++) {
      if (data[row][competitorColumn] === 'Entity' && data[row][competitorColumn + 1] === 'Salience') {
        foundEntities = true;

        // After the "Entity" and "Salience" row, competitor data begins
        for (let i = row + 1; i < Math.min(row + 31, data.length); i++) {  // Top-30 entities for the competitor, bounded by the sheet length
          const entity = data[i][competitorColumn];
          const salience = data[i][competitorColumn + 1];
          if (entity && salience) {
            competitorEntities.push({
              entity: entity,  // Exact match without normalization
              salience: salience
            });

            // Add competitor URL to the list
            if (!entityUrls[entity]) {
              entityUrls[entity] = [data[0][competitorColumn]];  // Add competitor URL
            } else {
              entityUrls[entity].push(data[0][competitorColumn]);
            }

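            // If the entity is not in your list, record it for highlighting in orange
            if (!yourEntities.hasOwnProperty(entity)) {
              if (!highlightedEntities[entity]) {
                highlightedEntities[entity] = [];
              }
              // Record each competitor once so that "Count" reflects the number of competitors
              const competitorUrl = data[0][competitorColumn];
              if (!highlightedEntities[entity].includes(competitorUrl)) {
                highlightedEntities[entity].push(competitorUrl);
              }
            }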
          }
        }
        break;
      }
    }
  });

  // Step 4: Highlight entities missing from your list in orange
  Object.keys(highlightedEntities).forEach((entity) => {
    const competitorUrls = highlightedEntities[entity];  // URLs of competitors where this entity appears
    competitorUrls.forEach((url) => {
      // Iterate through all competitor columns and highlight the corresponding cell
      for (let i = 0; i < data.length; i++) {
        for (let col = 0; col < data[0].length; col++) {
          if (data[i][col] === entity && data[0][col] === url) {  // Exact header match avoids hits on URLs that merely share a prefix
            const range = sheet.getRange(i + 1, col + 1);  // Indexing starts at 1 for rows and columns
            range.setBackground("orange");  // Color the cell orange
          }
        }
      }
    });
  });

  // Notify the user that the analysis is complete
  // SpreadsheetApp.getUi().alert('Entities comparison completed!');
  
  // Generate a summary
  generateSummary(highlightedEntities);
}

function generateSummary(highlightedEntities) {
  const ss = SpreadsheetApp.getActiveSpreadsheet();
  let summarySheet = ss.getSheetByName('summary');
  
  // If the sheet already exists, clear it
  if (summarySheet) {
    summarySheet.clear();
  } else {
    // If the sheet does not exist, create a new one
    summarySheet = ss.insertSheet('summary');
  }

  // Headers for the summary
  summarySheet.appendRow(['Entity', 'Count', 'URLs']);

  // Populate the summary with only highlighted entities
  const entitiesArray = [];
  Object.keys(highlightedEntities).forEach((entity) => {
    const count = highlightedEntities[entity].length;
    const urls = highlightedEntities[entity].join(', ');  // Combine all URLs into a comma-separated string
    entitiesArray.push([entity, count, urls]);
  });

  // Sort by Count in descending order
  entitiesArray.sort((a, b) => b[1] - a[1]);

  // Write the sorted rows below the header in a single batch call.
  // The array is already sorted, so no extra sheet-level sort is needed
  // (sorting the whole data range would also include the header row).
  if (entitiesArray.length > 0) {
    summarySheet.getRange(2, 1, entitiesArray.length, 3).setValues(entitiesArray);
  }
}
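
The script expects a particular sheet layout. The sketch below shows that layout as inferred from the code above: page URLs in the first row, an `Entity` / `Salience` header pair somewhere below each URL, and one entity per row after that. The sample data and the `readEntities` helper are made up for illustration, and the helper runs in plain JavaScript without the Sheets API.

```javascript
// Layout assumed by compareEntities(), inferred from the script:
//   row 1: a page URL in the first column of each two-column block
//   a later row: 'Entity' | 'Salience' headers under each URL
//   following rows: one entity per row
const data = [
  ['https://yourdomain.com/post', '', 'https://rival.example/guide', ''],
  ['Entity', 'Salience', 'Entity', 'Salience'],
  ['web scraping', 0.41, 'web scraping', 0.38],
  ['API', 0.2, 'proxy', 0.15],
];

// Read up to `limit` entity names from the block that starts at column `col`.
function readEntities(data, col, limit = 30) {
  const headerRow = data.findIndex(
    (row) => row[col] === 'Entity' && row[col + 1] === 'Salience'
  );
  if (headerRow === -1) return [];
  return data
    .slice(headerRow + 1, headerRow + 1 + limit) // slice() stops safely at the last row
    .map((row) => row[col])
    .filter(Boolean);                            // skip empty cells
}

console.log(readEntities(data, 0)); // [ 'web scraping', 'API' ]
console.log(readEntities(data, 2)); // [ 'web scraping', 'proxy' ]
```

With this sample, `proxy` is the only entity that would be highlighted, because it appears in the competitor column but not in yours.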

These scripts provide basic functionality for content analysis and can be adapted to meet your specific needs. If you’re interested in more advanced features or want to integrate these tools into your workflows, they serve as an excellent starting point.

Conclusion

The SEO tools market is saturated with comprehensive solutions, each offering unique features and capabilities. Our analyzer doesn’t claim to be a universal tool. Instead, it serves as one of many instruments that provide a straightforward and accessible way to identify content gaps. By incorporating it into your toolkit alongside other SEO solutions, you can adopt a more holistic approach to analyzing and enhancing your website, making it more competitive and appealing to users.

I hope this tool proves to be a valuable addition to your SEO workflows and becomes a regular part of your optimization toolkit. If you have any questions or ideas for expanding its functionality, feel free to connect with me on LinkedIn and reach out directly. I’d love to hear your feedback and discuss ways to enhance the tool further. Happy optimizing!
