Web Crawling With Python
Web scraping and crawling are two different ways to collect data from websites. Scraping means extracting specific data from a page, while crawling means systematically visiting all of a site's pages by following links. Web crawling is used for various purposes, such as indexing a website for search engines or creating a site map for its owners. You can read our comprehensive comparison of the two to better understand the difference.
The versatility of the Python language and its community support make it a popular choice for data collection projects, whether you are a beginner or an experienced developer. It simplifies building web scrapers and crawlers and working with the collected data, making it an ideal language for obtaining and analyzing data from the Internet.
Understanding Python Web Crawlers
As we said earlier, web crawling is the process of collecting web data from all of a website's pages. As pages are crawled, the links found on each page are collected, and those links are then crawled in turn.
Types of Web Crawlers
While all web crawlers have similar goals, they can be divided into several types:
General Purpose Crawlers.
Focused Crawlers.
Incremental Crawlers.
Deep Web Crawlers.
General purpose crawlers are the most common type of web crawler and are used by search engines like Google, Bing, and Yahoo. Their primary purpose is to index web pages across the internet, making them searchable. Focused crawlers, by contrast, are designed to index a specific subset of websites or web pages.
Incremental crawlers are responsible for regularly updating the indexed data by re-crawling websites to find and index new or updated content. Search engines use incremental crawlers to keep their search results current.
The last type is deep web crawlers. They are designed to access and index content not typically accessible through traditional search engines. They can crawl databases, password-protected sites, and other dynamically generated content.
How Python Web Crawlers Work
Web crawlers use two main crawling methods: depth-first and breadth-first. These methods differ in the way they follow links. We will discuss these methods in more detail later. For now, let’s focus on the general principles of web crawling.
To start, a web crawler needs a starting URL from which it begins crawling. On the starting page, all internal links are found and added to a link pool. Once all links on the web page have been collected, the web crawler follows the next link in the pool that has not yet been crawled. The collection process then repeats for this link. This process continues until all links on the website have been crawled.
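To make the idea concrete, here is a minimal sketch of that loop. The get_links() helper is a placeholder that stands in for whatever request-and-parse logic you use, and https://example.com is just an example starting point:
from collections import deque

def get_links(url):
    # Placeholder: request the page and return the internal links found on it
    return []

start_url = 'https://example.com'
pool = deque([start_url])   # links waiting to be crawled
visited = set()             # links that have already been crawled

while pool:
    url = pool.popleft()
    if url in visited:
        continue
    visited.add(url)
    for link in get_links(url):
        if link not in visited:
            pool.append(link)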
Web Crawling Use Cases
Web crawling has applications across many industries and fields, including search engine indexing, data mining, and content aggregation.
Web crawlers index web pages and build search engine databases, enabling users to search and find information online. They are also used to collect content from various sites and present it in one place. News aggregators, blog aggregators, and content syndication platforms use web crawlers to collect articles and information from different sources.
Researchers and data scientists use web crawling to collect web data for analysis and research. This could be sentiment analysis of social media posts, tracking the spread of disease through news articles, or gathering information for academic research. A Python crawler can also collect real-time data, such as stock market prices, weather conditions, sports scores, and live-streaming events.
Depth-First vs. Breadth-First Crawling
Depth-first and breadth-first crawling are two fundamental strategies web crawlers use to navigate and index the web. They differ in how they prioritize and traverse web pages.
In depth-first crawling, the focus is on exploring a single branch of a website’s link structure as deeply as possible before moving on to another branch. It prioritizes going down one path of links before branching out.
Breadth-first crawling, on the other hand, focuses on exploring a wide range of web pages at the same level of depth before going deeper. It prioritizes breadth over depth.
In practice, many web crawlers use a combination of these strategies, implementing a hybrid approach that balances depth and breadth to achieve the desired results.
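In implementation terms, the difference usually comes down to how the pool of pending links is managed: a LIFO stack yields depth-first behavior, while a FIFO queue yields breadth-first behavior. A tiny illustration, reusing the deque-based pool from the sketch above (the URLs are placeholders):
from collections import deque

pool = deque(['https://example.com/a', 'https://example.com/b'])

next_url = pool.pop()        # depth-first: take the most recently added link (stack)
# next_url = pool.popleft()  # breadth-first: take the oldest link in the pool (queue)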
Prerequisites for Crawling Web Pages
Web crawling requires a library that can request pages and extract data from them. The scraping libraries we covered earlier are suitable for this; the most common choices are:
Requests with BeautifulSoup. A good choice for scraping, even if you are a beginner. However, these libraries only let you make simple requests and parse data. For this example, we will also use the standard-library urllib module to work with the different parts of a URL.
Selenium. Lets you drive a headless browser, which increases your chances of avoiding blocks while crawling pages. Because it imitates the behavior of a real user, it reduces the risk of blocking and addresses most of the limitations of the first option.
Scrapy. Not just a library but a full scraping framework, with a set of features designed specifically for this task. However, it can feel harder to pick up than the other options.
In this article, we will consider all three options so you can find the most suitable one for your needs and tasks. To get started, make sure you have Python 3 installed and a text editor, preferably with syntax highlighting (we recommend using Sublime or Visual Studio Code). While you don’t need a full-fledged IDE to work with Python, they can make coding easier. Now, let’s install the libraries we’ll need:
pip install requests
pip install beautifulsoup4
pip install selenium
pip install scrapy
The urllib module ships with Python's standard library, so it doesn't need to be installed separately. You'll also need a web driver to use Selenium; for instructions on installing all of the Selenium components, including the web driver, you can follow our previous article.
Web Crawling using Requests and BeautifulSoup
To make this example more useful and understandable, let’s set a goal: we will build a web crawler that generates a sitemap. Create a file with the .py extension to work in. The first thing we need to do is import the modules and libraries the script will use.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
Define a starting URL from which crawling will begin, a set to store visited URLs so we don't process duplicates, and a list to store the collected URLs.
start_url = 'https://example.com'
visited_urls = set()
sitemap = []
The simplest way to iterate over all the links is to create a separate function that we call for each link we find.
def crawl(url):
First, check whether the URL has already been visited, and mark it as visited right away so that recursive calls don't process it again:
if url in visited_urls:
    return
visited_urls.add(url)
If the link has not yet been processed, then process it. However, to avoid unexpected errors and interruptions, wrap the processing in a try/except block.
try:
    # URL processing will go here
except Exception as e:
    print(f'Error crawling URL: {url}')
    print(e)
Next, we’ll continue in the try block. Follow the link and check that it returns a status code of 200. This means that the site returned a successful response. If the site returns a different status code, we’ll skip that link.
response = requests.get(url)
if response.status_code == 200:
If the status code is successful, use BeautifulSoup to parse the web page and extract all links.
soup = BeautifulSoup(response.text, 'html.parser')
links = soup.find_all('a')
Extract the URL from each link found on the page, add it to the sitemap list, and then recursively process it.
for link in links:
    href = link.get('href')
    if href:
        full_url = urljoin(url, href)
        sitemap.append(full_url)
        crawl(full_url)
Note that we marked the current URL as visited at the very top of the function, before following any links. Doing it there (rather than after the try/except block) prevents pages that link back to each other from being crawled endlessly.
We have now finished defining the function. However, we still need to call it. We couldn't call it before defining it because Python executes a script from top to bottom, so a function must be defined before the line that calls it. So, let’s call the function for the starting URL we declared at the beginning.
crawl(start_url)
With this, the crawling process is complete. You can now do whatever you need with the collected links. For example, let’s save the generated sitemap to a text file and display a message on the screen that the sitemap was successfully generated.
with open('sitemap.txt', 'w') as file:
    for url in sitemap:
        file.write(url + '\n')

print('Sitemap created and saved to sitemap.txt')
Now, when you run the script, it will generate a sitemap.txt file that contains all the URLs from the website. This sitemap can be helpful for search engine optimization (SEO) and website organization.
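Note that the function above follows every link it finds, including links that lead to other websites. If you want the sitemap to contain only URLs from the site you started on, one possible refinement (a sketch, not the only way to do it) is to compare domains before recursing; the loop over links would then look roughly like this:
from urllib.parse import urljoin, urlparse

start_domain = urlparse(start_url).netloc

for link in links:
    href = link.get('href')
    if href:
        full_url = urljoin(url, href)
        # Only keep and follow links that stay on the starting domain
        if urlparse(full_url).netloc == start_domain:
            sitemap.append(full_url)
            crawl(full_url)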
Web Crawling with Scrapy
Scrapy is an open-source web crawling and scraping Python framework. It provides a powerful and flexible set of tools for extracting data from websites. It allows you to define custom spiders to navigate websites, extract data, and store it in various formats.
Scrapy also handles request throttling, concurrent crawling, and other advanced features out of the box. We have already covered the installation and usage of Scrapy in a previous post, so we won't repeat ourselves here. Let's build a sitemap crawler with Scrapy. Use this command to create a new project:
scrapy startproject my_sitemap
Inside your Scrapy project, create a spider defining how to crawl and collect links from your website. In your spider, you can define the starting URLs and how to follow links:
import scrapy
class SitemapSpider(scrapy.Spider):
    name = 'sitemap'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com']

    def parse(self, response):
        for href in response.css('a::attr(href)'):
            url = response.urljoin(href.extract())
            yield {
                'url': url
            }
            # Follow the link so the whole site is crawled;
            # Scrapy filters out duplicate requests automatically.
            yield response.follow(url, callback=self.parse)
To run the spider, use the following command:
scrapy crawl sitemap -o sitemap.json
As a result, you will get an output file called sitemap.json that contains the collected URLs.
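As an alternative, Scrapy ships with a CrawlSpider class and LinkExtractor that handle link following declaratively, so you don't have to yield follow-up requests yourself. A minimal sketch of the same sitemap spider using them (still assuming the example.com domain) might look like this:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SitemapCrawlSpider(CrawlSpider):
    name = 'sitemap_crawl'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com']

    # A single rule: extract every link, record it, and keep following links
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {'url': response.url}
You would run it the same way, for example with scrapy crawl sitemap_crawl -o sitemap.json.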
Use Selenium for Crawling Dynamic Websites
To use Selenium to traverse all links on a website and gather them for sitemap generation, you can create a recursive function that navigates through the site, collects links, and follows them, as in the first example. First, import the library and configure the headless browser:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

DRIVER_PATH = r'C:\chromedriver.exe'  # or any other path to the web driver
options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument("user-agent=Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36")
driver = webdriver.Chrome(service=Service(DRIVER_PATH), options=options)
Now, specify the starting URL and create a set to store visited URLs to avoid crawling the same page twice.
start_url = 'https://example.com'
visited_urls = set()
Let’s create a function to process URLs that performs a recursive link traversal using Selenium.
def collect_links(url):
    if url in visited_urls:
        return
    visited_urls.add(url)
    driver.get(url)
    links = [a.get_attribute('href') for a in driver.find_elements(By.TAG_NAME, 'a')]
    for link in links:
        if link and link.startswith('https://example.com'):
            collect_links(link)
All that’s left is to call the function to start the link collection process.
collect_links(start_url)
To complete the script execution, close the browser:
driver.quit()
Selenium is a perfect tool for scraping dynamic websites, but it may require more code and resources than traditional web crawlers. Its ability to interact with web pages and execute JavaScript makes it a go-to choice for scraping content that relies on real-time updates and user interactions.
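One practical detail worth adding: dynamic pages often render their links only after the initial HTML has loaded, so it can help to wait for them explicitly before collecting anything. A small sketch using Selenium's explicit waits, reusing the driver created earlier (the ten-second timeout is an arbitrary choice):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get('https://example.com')
# Wait up to 10 seconds for at least one link to appear in the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.TAG_NAME, 'a'))
)
links = [a.get_attribute('href') for a in driver.find_elements(By.TAG_NAME, 'a')]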
Avoiding Anti-Bot Measures
Web scrapers and crawlers often face anti-bot measures implemented by websites to prevent automated access. These measures are taken to protect the website from abuse or data theft. Crawler operators must know about these bot protection mechanisms and how to work around them.
Identifying and Dealing with Anti-Bot Mechanisms
Websites can block requests that look suspicious. A few simple habits help you avoid this. For example, use a User-Agent header that mimics a typical browser; ideally, use a real one.
One of the most common anti-bot measures is blocking requests from specific IP addresses. To avoid this, use proxy servers or set up your own proxy pool. To avoid triggering rate-limiting mechanisms, control the number of requests over time by adding delays between requests.
Websites can use session markers to track user interaction, so mimic user behavior and manage sessions to avoid detection; headless browsers can help imitate real user interaction. Websites also use cookies to track user sessions, so keep cookies in your web crawling code to keep sessions alive.
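To illustrate several of these tips together, here is a minimal sketch with requests: a browser-like User-Agent header, a Session object so cookies persist between requests, an optional proxy (the proxy address below is a placeholder, not a real server), and a delay between requests:
import time
import requests

session = requests.Session()  # reusing a session keeps cookies between requests
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/120.0 Safari/537.36'
})

# Placeholder proxy address; replace it with a proxy you actually have access to
proxies = {
    'http': 'http://user:pass@proxy.example.com:8080',
    'https': 'http://user:pass@proxy.example.com:8080',
}

for url in ['https://example.com/page1', 'https://example.com/page2']:
    response = session.get(url, proxies=proxies, timeout=10)
    print(url, response.status_code)
    time.sleep(2)  # simple rate limiting between requests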
CAPTCHAs and How to Bypass Them
A CAPTCHA is a challenge-response test used in computing to determine whether the user is human. CAPTCHAs are designed to be easy for humans to solve but difficult for bots.
There are a few ways to bypass CAPTCHA. One way is to use a third-party CAPTCHA solver service. These services use various methods to solve CAPTCHAs, including image recognition and machine learning.
Another way to bypass CAPTCHA is to solve the CAPTCHA manually. This is often not feasible for large-scale scraping, but it can be done in some cases.
Websites typically only show CAPTCHAs when a user’s behavior is suspicious. By following the recommendations in the previous section, you can reduce the chance of being shown a CAPTCHA.
Conclusion
Web crawling is an essential tool for collecting data from the internet, and Python provides a robust environment for its implementation. Understanding the principles, types, and methods associated with web crawling will allow you to leverage its potential in various fields, from SEO to research and analysis.
It is important to understand web crawlers, the foundation of web scraping. Web crawlers fall into different types, such as general purpose crawlers, focused crawlers, incremental crawlers, and deep web crawlers. Each type performs a specific task, from indexing the entire internet to accessing content that is not typically visible through standard search engines.
There are a variety of libraries and tools available for performing web crawling, including Requests with BeautifulSoup, Selenium, and Scrapy. These tools allow developers to extract data from websites efficiently. The choice of tool depends on the specific requirements of the project.