Web Crawling With Python

Valentina Skakun
Last update: 30 Apr 2024

Web scraping and crawling are two different ways to collect data from websites. Scraping is extracting specific data from a website, while crawling is visiting all the pages. Web crawling is used for various purposes, such as indexing a website by search engines or creating a site map for its owners. You can read our comprehensive comparison of them to understand their difference better.

The versatility of the Python language and its community support make it a popular choice for data collection projects, whether you are a beginner or an experienced developer. Python simplifies building web scrapers and crawlers and working with the collected data, making it an ideal language for anyone who wants to obtain and analyze data from the Internet.

Understanding Python Web Crawlers

As we said earlier, web crawling is the process of collecting data from all the pages of a website. As each page is crawled, every link on it is collected, and the crawler then visits each collected link that has not been crawled yet.

Types of Web Crawlers

While all web crawlers have similar goals, they can be divided into several types:

  1. General Purpose Crawlers.

  2. Focused Crawlers.

  3. Incremental Crawlers.

  4. Deep Web Crawlers.

General Purpose Crawlers are the most common type of web crawler and are used by search engines like Google, Bing, and Yahoo. Their primary purpose is to index web pages across the internet, making them searchable. Focused crawlers, in contrast, are designed to index a specific subset of websites or web pages.

Incremental crawlers are responsible for regularly updating the indexed data by re-crawling websites to find and index new or updated content. Search engines use incremental crawlers to keep their search results current.

The last type is deep web crawlers. They are designed to access and index content not typically accessible through traditional search engines. They can crawl databases, password-protected sites, and other dynamically generated content.

How Python Web Crawlers Work

Web crawlers use two main crawling methods: depth-first and breadth-first. These methods differ in the way they follow links. We will discuss these methods in more detail later. For now, let’s focus on the general principles of web crawling.

Crawling flowchart that illustrates the general principle of crawling all existing pages on a website, saving all unique links to a set.

Basic Structure of Web Crawler

To start, a web crawler needs a starting URL from which it begins crawling. On the starting page, all internal links are found and added to a link pool. Once all links on the web page have been collected, the web crawler follows the next link in the pool that has not yet been crawled. The collection process then repeats for this link. This process continues until all links on the website have been crawled.
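This loop can be sketched in a few lines of Python. The snippet below is only an illustration of the general idea; get_links is a hypothetical placeholder for whatever page-fetching and parsing logic you use, and working versions follow later in the article.

start_url = 'https://example.com'
to_visit = [start_url]   # pool of links waiting to be crawled
visited_urls = set()     # links that have already been crawled

while to_visit:
    url = to_visit.pop()              # take the next link from the pool
    if url in visited_urls:
        continue
    visited_urls.add(url)
    for link in get_links(url):       # hypothetical: fetch the page, return its internal links
        if link not in visited_urls:
            to_visit.append(link)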

Web Crawling Use Cases

Web crawling has applications in many industries and fields, including search engine indexing, data mining, and content aggregation.

Web crawlers index web pages and build search engine databases, enabling users to search and find information online. They are also used to collect content from various sites and present it in one place. News aggregators, blog aggregators, and content syndication platforms use web crawlers to collect articles and information from different sources.

Researchers and data scientists use web crawling to gather web data for analysis and research. This could be sentiment analysis of social media posts, tracking the spread of disease through news articles, or collecting information for academic research. A Python crawler can also collect real-time data, such as stock market prices, weather conditions, sports scores, and live-streaming events.

Depth-First vs. Breadth-First Crawling

Depth-first and breadth-first crawling are two fundamental strategies web crawlers use to navigate and index the web. They differ in how they prioritize and traverse web pages.

A flowchart illustrating the Depth-First Crawling process, where each branch is traversed to the end of its depth before moving on to the next branch.

Depth-First Crawling

In depth-first crawling, the focus is on exploring a single branch of a website’s link structure as profoundly as possible before moving on to another branch. It prioritizes going down one path of links before branching out.

A flowchart illustrating the Breadth-First Crawling process, in which all the branches on one level are traversed before moving on to the next level.

Breadth-First Crawling

Breadth-first crawling, on the other hand, focuses on exploring a wide range of web pages at the same level of depth before going deeper. It prioritizes breadth over depth.  

In practice, many web crawlers use a combination of these strategies, implementing a hybrid approach that balances depth and breadth to achieve the desired results.
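The difference between the two strategies often comes down to a single data structure: whether newly discovered links are taken from the pool last-in-first-out (a stack, which gives depth-first order) or first-in-first-out (a queue, which gives breadth-first order). Here is a minimal sketch of that idea, again using a hypothetical get_links helper:

from collections import deque

def crawl(start_url, breadth_first=True):
    pool = deque([start_url])
    visited = set()
    while pool:
        # popleft() -> FIFO queue -> breadth-first; pop() -> LIFO stack -> depth-first
        url = pool.popleft() if breadth_first else pool.pop()
        if url in visited:
            continue
        visited.add(url)
        pool.extend(get_links(url))  # hypothetical page-parsing helper
    return visited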

Prerequisites for Crawling Web Pages

Web crawling requires a library that can request different links and extract data from the pages. The scraping libraries we have considered earlier are suitable for this purpose. The most common choices are:

  1. Requests with BeautifulSoup. This is a good choice for scraping, even if you are a beginner. However, these libraries have some limitations because they only let you make simple requests and parse data. In this case, we will also need the urllib.parse module from the standard library to work with different parts of a link.

  2. Selenium. Lets you drive a headless browser, which increases the chances of avoiding blocking while crawling pages. Because it imitates the behavior of a real user, it reduces the risk of blocking and addresses most of the limitations of the first option.

  3. Scrapy. This is not a single library but a whole framework for scraping and crawling. It has a set of features designed specifically for this task, although it may feel trickier to use than the other libraries.

In this article, we will consider all three options so you can find the most suitable one for your needs and tasks. To get started, make sure you have Python 3 installed and a text editor, preferably with syntax highlighting (we recommend Sublime Text or Visual Studio Code). While you don’t need a full-fledged IDE to work with Python, one can make coding easier. Now, let’s install the libraries we’ll need:

pip install requests
pip install beautifulsoup4
pip install selenium
pip install scrapy

The urllib module is part of Python’s standard library, so it doesn’t need to be installed separately. To use Selenium, you’ll also need a web driver. For instructions on installing all of the Selenium components, including the web driver, you can follow our previous article.

Web Crawling using Requests and BeautifulSoup

To make this example more useful and understandable, let’s set a goal: we will build a web crawler and use it to generate a sitemap. Create a file with the .py extension in which you will work. The first thing we need to do is import all the modules and libraries that we will use in the script.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

Define a starting URL from which the web crawling process will start, a set to store visited URLs (to avoid processing duplicate URLs), and a list to store the collected URLs.

start_url = 'https://example.com'
visited_urls = set()
sitemap = []

The simplest way to iterate over all links is to create a separate function that is called recursively for each link found.

def crawl(url):

First, check if the URL has been visited before to avoid duplicates, and mark it as visited right away so that pages linking back to each other don’t cause endless recursion:

    if url in visited_urls:
        return
    visited_urls.add(url)

If the link has not yet been processed, then process it. However, to avoid unexpected errors and interruptions, wrap the processing in a try/except block.

    try:
        # The URL processing code will go here

    except Exception as e:
        print(f'Error crawling URL: {url}')
        print(e)

Next, we’ll continue in the try block. Follow the link and check that it returns a status code of 200. This means that the site returned a successful response. If the site returns a different status code, we’ll skip that link.

        response = requests.get(url)
        if response.status_code == 200:

If the status code is successful, use BeautifulSoup to parse the web page and extract all links.

            soup = BeautifulSoup(response.text, 'html.parser')
            links = soup.find_all('a')

Extract the URL from each link found on the page, add it to the sitemap list, and then recursively crawl it.

            for link in links:
                href = link.get('href')
                if href:
                    full_url = urljoin(url, href)
                    sitemap.append(full_url)
                    crawl(full_url)

Since the URL was already marked as visited at the start of the function, there is nothing left to do after the try/except block.

We have now finished defining the function. However, we still need to call it. We couldn’t call it before defining it because Python executes a script from top to bottom, so a function must be defined before it is used. Let’s call the function for the starting URL that we declared at the beginning.

crawl(start_url)

With this, the crawling process is complete. You can now do whatever you need with the collected links. For example, let’s save the generated sitemap to a text file and display a message on the screen that the sitemap was successfully generated.

with open('sitemap.txt', 'w') as file:
    for url in sitemap:
        file.write(url + '\n')

print('Sitemap created and saved to sitemap.txt')

Now, when you run the script, it will generate a sitemap.txt file that contains all the URLs from the website. This sitemap can be helpful for search engine optimization (SEO) and website organization.
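For convenience, here is the whole script assembled in one place. It is a minimal sketch that follows the steps above; the only additions are a same-domain check and a duplicate check before appending to the sitemap, which keep the crawler on the target site and the file free of repeated entries.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

start_url = 'https://example.com'
visited_urls = set()
sitemap = []

def crawl(url):
    if url in visited_urls:
        return
    visited_urls.add(url)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            for link in soup.find_all('a'):
                href = link.get('href')
                if not href:
                    continue
                full_url = urljoin(url, href)
                # Stay on the same domain and skip URLs that are already recorded
                if urlparse(full_url).netloc == urlparse(start_url).netloc and full_url not in sitemap:
                    sitemap.append(full_url)
                    crawl(full_url)
    except Exception as e:
        print(f'Error crawling URL: {url}')
        print(e)

crawl(start_url)

with open('sitemap.txt', 'w') as file:
    for url in sitemap:
        file.write(url + '\n')

print('Sitemap created and saved to sitemap.txt')

Keep in mind that a recursive approach like this can hit Python’s recursion limit on very large sites; the iterative loop with a link pool shown earlier avoids that problem.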

Web Crawling with Scrapy

Scrapy is an open-source web crawling and scraping Python framework. It provides a powerful and flexible set of tools for extracting data from websites. It allows you to define custom spiders to navigate websites, extract data, and store it in various formats.

Scrapy also handles request throttling, concurrent crawling, and other advanced features. We have already covered the installation and usage of Scrapy in a previous post, so we will not repeat ourselves here. Let’s build a crawler that generates a sitemap using Scrapy. Use this command to create a new project:

scrapy startproject my_sitemap

Inside your Scrapy project, create a spider defining how to crawl and collect links from your website. In your spider, you can define the starting URLs and how to follow links:

import scrapy

class SitemapSpider(scrapy.Spider):
    name = 'sitemap'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com']

    def parse(self, response):
        for href in response.css('a::attr(href)'):
            url = response.urljoin(href.get())
            yield {
                'url': url
            }
            # Follow each link so the whole site is crawled, not just the start page;
            # allowed_domains keeps the spider on example.com
            yield response.follow(url, callback=self.parse)

To run the spider, use the following command:

scrapy crawl sitemap -o sitemap.json

As a result, you will get an output file called sitemap.json containing the collected URLs.
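If you prefer not to follow links manually in parse(), Scrapy also ships with the CrawlSpider class and a LinkExtractor that handle the traversal for you. A minimal sketch of the same sitemap spider built on top of them could look like this:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SitemapCrawlSpider(CrawlSpider):
    name = 'sitemap_crawl'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com']

    # Follow every internal link and pass each visited page to parse_item
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {'url': response.url}

Run it the same way: scrapy crawl sitemap_crawl -o sitemap.json.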

Use Selenium for Crawling Dynamic Websites

To use Selenium to traverse all links on a website and gather them for sitemap generation, you can create a recursive function that navigates through the site, collects links, and follows them, as in the first example. First, import the necessary modules and configure the headless browser:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

DRIVER_PATH = r'C:\chromedriver.exe'  # or any other path to the web driver
options = webdriver.ChromeOptions()
options.add_argument("--headless")  # run the browser without a visible window
options.add_argument("user-agent=Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36")

driver = webdriver.Chrome(service=Service(DRIVER_PATH), options=options)

Now, specify the starting URL and create a set to store visited URLs to avoid crawling the same page twice.

start_url = 'https://example.com'
visited_urls = set()

Let’s create a function to process URLs that performs a recursive link traversal using Selenium.

def collect_links(url):
    if url in visited_urls:
        return

    visited_urls.add(url)
    driver.get(url)

    links = [a.get_attribute('href') for a in driver.find_elements(By.TAG_NAME, 'a')]

    for link in links:
        if link and link.startswith('https://example.com'):
            collect_links(link)

All that’s left is to call the function to start the link collection process.

collect_links(start_url)

To complete the script execution, close the browser:

driver.quit()
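If you also want to save the collected links to a sitemap file, as in the first example, the visited_urls set already contains every page the crawler reached, so a few extra lines are enough:

with open('sitemap.txt', 'w') as file:
    for url in sorted(visited_urls):
        file.write(url + '\n')

print('Sitemap created and saved to sitemap.txt')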

Selenium is a perfect tool for scraping dynamic websites, but it may require more code and resources than traditional web crawlers. Its ability to interact with web pages and execute JavaScript makes it a go-to choice for scraping content that relies on real-time updates and user interactions.

Avoiding Anti-Bot Measures

Web scrapers and crawlers often face anti-bot measures implemented by websites to prevent automated access. These measures are taken to protect the website from abuse or data theft. Crawler operators must know about these bot protection mechanisms and how to work around them.

Identifying and Dealing with Anti-Bot Mechanisms

Websites can block requests that look suspicious. To reduce the chance of this happening, you can follow a few simple practices. For example, send a user agent that mimics a typical browser; ideally, use a real browser’s user-agent string.

One of the most common anti-bot measures is blocking requests from specific IP addresses. To avoid this, use proxy servers or set up your own proxy pool. To avoid triggering rate-limiting mechanisms, control the number of requests over time by adding delays between them.

Websites can also use session markers to track user interaction. To avoid detection, mimic user behavior and manage sessions properly; headless browsers can help imitate real user interaction. Another technique websites rely on is cookies to track user sessions, so keep sessions alive by handling cookies in your web crawling code.
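With the Requests-based crawler from earlier, several of these recommendations can be applied in just a few lines. The sketch below is illustrative only: the proxy address is a placeholder, and the delay range is an arbitrary example.

import time
import random
import requests

session = requests.Session()  # a Session keeps cookies between requests
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0 Safari/537.36'
})

# Placeholder proxy address; replace it with a real proxy or a pool of them
proxies = {
    'http': 'http://your-proxy-address:8080',
    'https': 'http://your-proxy-address:8080',
}

def polite_get(url):
    time.sleep(random.uniform(1, 3))  # random delay to avoid rate limiting
    return session.get(url, proxies=proxies, timeout=10)

response = polite_get('https://example.com')
print(response.status_code)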

CAPTCHAs and How to Bypass Them

CAPTCHAs are challenge-response tests used in computing to determine whether the user is a human. They are designed to be easy for humans to solve but difficult for bots.

There are a few ways to bypass CAPTCHA. One way is to use a third-party CAPTCHA solver service. These services use various methods to solve CAPTCHAs, including image recognition and machine learning.

Another way to bypass CAPTCHA is to solve it manually. This is rarely feasible for large-scale scraping, but it can work in some cases.

Websites typically only show CAPTCHAs when a user’s behavior is suspicious. By following the recommendations in the previous section, you can reduce the chance of being shown a CAPTCHA.

Conclusion

Web crawling is an essential tool for collecting data from the internet, and Python provides a robust environment for its implementation. Understanding the principles, types, and methods associated with web crawling will allow you to leverage its potential in various fields, from SEO to research and analysis.

It is important to understand web crawlers, since they form the foundation of web scraping. Web crawlers fall into different types, such as general-purpose crawlers, focused crawlers, incremental crawlers, and deep web crawlers. Each type performs a specific task, from indexing the entire internet to accessing content not typically visible through standard search engines.

There are a variety of libraries and tools available for performing web crawling, including Requests with BeautifulSoup, Selenium, and Scrapy. These tools allow developers to extract data from websites efficiently. The choice of tool depends on the specific requirements of the project.
