Web Scraping with Java: a Comprehensive Guide from Beginner to Expert
Last edit: Apr 30, 2024

Java is one of the most popular programming languages, so it's natural to wonder whether it's a good choice for web scraping. Although Java can be used for scraping data, it may not be the best choice for small projects where speed is critical. On the other hand, Java can be an excellent choice for large or scalable projects that require multithreading for data collection and processing.

In this article, we'll provide a comprehensive tutorial on the pros and cons of using Java for web scraping, when to choose Java for scraping, how to install and configure the necessary components, and how to create your first scraper. We'll also cover more advanced techniques, such as creating a full-fledged crawler to crawl all website pages, concurrency and parallelism principles, and using headless browsers in Java.

Using Java for Web Scraping

Java is one of the most popular and oldest programming languages. It is a versatile language that can be used to develop various applications, including web scraping.

| Pros of Java for Web Scraping | Cons of Java for Web Scraping |
|---|---|
| Large and active community | Relatively complex language |
| Rich ecosystem and libraries (e.g., Jsoup, HtmlUnit) | Requires explicit declaration of data types (static typing) |
| Extensive documentation, tutorials, and forums | Involves boilerplate code, making it more verbose |
| "Write once, run anywhere" approach | Applications may be slower than those in other languages |
| Reliable support for parallel programming | Compilation into bytecode adds an extra step to development |

The table above gives a general overview of the pros and cons of web scraping in Java. Now, let's discuss them in more detail.

Advantages of Java for Web Scraping

Java has a large and active community, a rich ecosystem, and many libraries that make web scraping easier, such as Jsoup and HtmlUnit. If you run into questions or difficulties, you can always turn to that community and its extensive documentation, tutorials, and forums. Such support makes it easier for developers to find solutions to problems and share experience.

Java's "write once, run anywhere" approach allows developers to develop and deploy web scraping applications on various platforms without modification. In addition, thanks to its reliable support for parallel programming, Java is beneficial for solving large-scale web scraping tasks, increasing efficiency and performance.

Disadvantages of Java for Web Scraping

Java is a powerful and versatile programming language with a wide range of applications. However, it also has some drawbacks that should be considered before choosing it as a language to learn.

One of the main challenges of learning Java is that it is a relatively complex language. This is because it is a statically typed language, meaning the data types of variables and expressions must be declared explicitly. This can make it more challenging to learn than dynamically typed languages, such as Python.
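
For instance, this explicitness looks like the following in practice (a trivial illustrative snippet, unrelated to any scraping library):

// Every variable carries an explicit type declaration
String title = "Web Scraping with Java";
int pageCount = 42;
List<String> links = new ArrayList<>(); // requires java.util.List and java.util.ArrayList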

Another challenge of Java is that it relies on a lot of boilerplate code, that is, code repeated in many places with little variation, such as the code needed to create objects or access resources. This can make Java code more verbose than code written in other languages.

Finally, Java applications can be slower than applications written in some other languages. In addition, Java code must be compiled into bytecode before it can be executed, which adds an extra step to the development process.

Prerequisites for Web Scraping using Java

Before you start developing a web scraper in Java, you need to set up a Java development environment. This includes the following:

  1. Java JDK. It is recommended to use the latest LTS release from the official website. Without it, you will not be able to build Java projects.
  2. Build system. Build systems like Maven and Gradle let you manage dependencies and compile and run your project from the command line. We will discuss installation later.
  3. Java IDE. This is not required, but it is highly recommended. IDEs make it easier to develop and run projects. The most popular IDE is IntelliJ IDEA, which can build projects with Maven, Gradle, or its own build system. This is the recommended route for beginners.

Once you have these components, you can create your Java web scraper.


How to Install and Use Java Build Systems

The choice of build system is up to you, but if you are a beginner, we recommend using IntelliJ IDEA and its built-in build system. Maven and Gradle can be used without an IDE via the command line.

Maven

To install Maven, download the latest release from the official website and extract the archive to the C: drive. Then, add the bin directory of the extracted Maven folder to your system's PATH environment variable.

[Screenshot: adding the Maven bin directory to the system's PATH environment variable]

To verify that the package is configured correctly, run the following in the command prompt:

mvn -version

If Maven is installed correctly, you should see information about the Maven version and related details.

[Screenshot: output of 'mvn -version' showing the installed Maven version]

To build a project with Maven, you will need a command line. Navigate to your project's directory and run the following command to build the project:

mvn clean install

This command will build and install the project in your local Maven repository. After a successful build, run the following command to run the application:

java -cp target/project_name.jar MainClassName

Note that you must specify the name of the main class, which is the entry point for your application.

Gradle

To use Gradle, you also need to download the archive from the official website and add the path to the bin directory of the extracted archive to your environment variables.

[Screenshot: adding the Gradle bin directory to the system's PATH environment variable]

Once you have done this, you can verify the installation by running the following command in a terminal:

gradle -v

If Gradle is installed correctly, you should see output similar to the following:

[Screenshot: output of 'gradle -v' showing the installed Gradle version]

Alternatively, you can install Gradle with SDKMAN by running the following command:

sdk install gradle 8.4

To build a project, open a command prompt and navigate to the project directory. Then, run the build command:

gradle build

If your project includes the Gradle wrapper, you can run gradlew build (or ./gradlew build on Linux/macOS) instead.

After a successful build, run the command to launch the application:

java -jar build/libs/your-project.jar

As you can see, using Maven and Gradle is very similar. Both can be used to build from the command line or from IntelliJ IDEA, which we will discuss below.

IntelliJ IDEA

To use the IntelliJ build system, you need to download and install IntelliJ IDEA from the official website. Once installed, you can choose the desired programming language and build system when creating a new project.

[Screenshot: creating a new Java project in IntelliJ IDEA and choosing the build system]

After that, you can launch the project at any time using the relevant buttons in the top toolbar.

Libraries for Web Scraping in Java

Java has two libraries that are most commonly used for web scraping: Jsoup and HtmlUnit. Both are suitable for web scraping and HTML parsing but have different purposes, strengths, and weaknesses. Let's look at both Java libraries and choose the more appropriate one for our subsequent examples.

Jsoup

Jsoup is a simple and lightweight Java library for working with HTML and the DOM. It's a great choice for beginners and provides a convenient API for extracting and manipulating data from HTML documents.

Jsoup uses CSS selectors to query and select HTML elements, which makes it easy to understand. Although using CSS selectors provides a smaller feature set than XPath, it is more popular because of its simplicity.
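
As a quick illustration (a minimal sketch using the Jsoup classes imported later in this article), a few typical selector queries look like this:

// Parse an HTML snippet held in a string
Document doc = Jsoup.parse("<div class='product'><h4><a href='/phone'>Phone</a></h4></div>");

// CSS selectors: by class, by nesting, and by attribute
Element title = doc.selectFirst("div.product h4 > a");
System.out.println(title.text());          // Phone
System.out.println(title.attr("href"));    // /phone
Elements allLinks = doc.select("a[href]"); // every element with an href attribute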

Unfortunately, this library cannot collect and process data from dynamic web pages that load content using JavaScript. It also cannot simulate the behavior of a real user the way headless browsers can.


HtmlUnit

The HtmlUnit library, on the other hand, addresses the shortcomings of Jsoup, although it is not as lightweight or easy to use. It provides a headless browser, which lets you interact with web pages as if through a real browser, simulating a real user's behavior.

Additionally, HtmlUnit has a JavaScript engine, which allows you to run JavaScript on web pages. You can interact with web pages by submitting forms, following links, and navigating pages.
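
As a rough sketch of such interaction (assuming the HtmlUnit 2.x package names used later in this article, plus the com.gargoylesoftware.htmlunit.html.HtmlForm class; the URL and the field and button names are hypothetical), submitting a login form might look like this:

try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
    HtmlPage page = webClient.getPage("https://example.com/login"); // hypothetical URL

    // Field and button names below are assumptions for illustration
    HtmlForm form = page.getForms().get(0);
    form.getInputByName("email").setValueAttribute("user@example.com");
    form.getInputByName("password").setValueAttribute("secret");
    HtmlPage result = form.getInputByName("submit").click();

    System.out.println(result.asText());
} catch (Exception e) {
    e.printStackTrace();
}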

Jsoup vs HtmlUnit

Based on our experience, we recommend using the Jsoup library for scraping simple pages and HtmlUnit if you need to use a headless browser. However, the best library for your needs will depend on your skills, specific goals, and requirements. To simplify the selection process, we have created a table that lists the conditions under which you should choose one library or the other.

| Jsoup | HtmlUnit |
|---|---|
| Suitable for parsing static HTML documents. | Can parse static HTML, but this is not its primary focus. |
| Limited support for dynamic content rendered by JavaScript. | Capable of scraping dynamic content. |
| Not designed for simulating user interactions on web pages. | Suitable for simulating browser actions, submitting forms, and clicking links. |
| Does not execute JavaScript. | Has a built-in JavaScript engine for executing JavaScript on web pages. |
| No support for navigating through pages like a browser. | Provides headless browser capabilities for navigation and interaction. |
| More suitable for simple parsing tasks. | Well-suited for testing scenarios requiring browser interaction. |
| Lightweight library with a smaller footprint. | Larger footprint due to headless browser capabilities. |
| Simple and easy-to-use API. | More complex API due to additional browser simulation features. |

Web Scraping with Java using Jsoup

As we discussed earlier, Jsoup is a good choice for scraping simple pages and for beginners. Therefore, we will use it as our primary example. We will cover HtmlUnit later when we discuss advanced techniques and headless browsers.

Import Library to Java Project

The way you connect a library may vary depending on the build system you use. Maven requires describing the project and its dependencies in a pom.xml file, Gradle uses a build.gradle script, while IntelliJ IDEA's own build system lets you import the library's JAR file directly through the IDE interface.

To use Maven, you need to create a pom.xml file that specifies the latest Jsoup library as a dependency:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.example</groupId>
    <artifactId>web-scraper</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.16.2</version>
        </dependency>
    </dependencies>

</project>

To add a library to your project in IntelliJ IDEA, download the JAR file of the library and add it to the project by navigating to File > Project Structure > Libraries > Add.

[Screenshot: adding a library JAR via File > Project Structure > Libraries > Add in IntelliJ IDEA]

After that, you can import the required library modules into your project:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

In addition, we need to import the I/O exception class so we can handle errors when making requests:

import java.io.IOException;

With these imports in place, we can start scraping data.

Make a Request to Get HTML

To scrape a website, we first need to define the structure of our project. This includes the main class, which serves as the entry point, and the function that will perform the scraping.

public class Main {
    public static void main(String[] args) {
                  // Here will be code
            }
}

To demonstrate the process of scraping in Java, we will use the OpenCart demo site, which has products we can collect. To find the CSS selectors that describe the titles and links, open DevTools (press F12 or right-click the page and choose Inspect) and select the desired element.

[Screenshot: finding CSS selectors for titles and links using DevTools]

Now, let's get back to our project and get the HTML code for this page:

Document document = null;
try {
    document = Jsoup.connect("https://demo.opencart.com/").get();
} catch (IOException e) {
    throw new RuntimeException(e);
}

We used a try-catch block so that a failed request does not crash the script and so we can get information about the error. Let's move on to the next part and extract all titles and links.

Data Extraction in Java

Now, let's study the website structure we previously reviewed and parse the HTML structure we obtained. To do this, we will select the necessary elements:

Elements links = document.select("a[href]");

Then, iterate over all the found elements and print each title and link to the screen:

for (Element link : links) {
    System.out.println("Title: " + link.text());
    System.out.println("Link: " + link.attr("href"));
}

Launch the project and see the results:

[Screenshot: IntelliJ IDEA output showing the extracted titles and links]
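
If you only need the product cards rather than every link on the page, you can narrow the selector. The class name below is our assumption about the demo store's markup, so verify it in DevTools before relying on it:

Elements productLinks = document.select(".product-thumb h4 > a"); // assumed markup of the demo store
for (Element link : productLinks) {
    System.out.println("Product: " + link.text() + " -> " + link.attr("href"));
}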

Although displaying data on the screen is a convenient option during development and testing, it is not very convenient for data storage. Therefore, let's consider ways to save data in CSV and JSON.

Export to CSV

To use CSV, you should download and import the opencsv library into your project. You can do this the same way you imported Jsoup previously. In your code, you need to import the following module:

import com.opencsv.CSVWriter;

We also need the standard Java collection classes to store information about all the elements on the page, and FileWriter to write the output file:

import java.io.FileWriter;
import java.util.ArrayList;
import java.util.List;

Next, in the main function, we will add a variable to store the list of rows. To define the column headers, add them as the first entry:

List<String[]> dataList = new ArrayList<>();
dataList.add(new String[]{"Title", "Link"});

In the for loop, we will add code to store the elements in the variable instead of printing them to the screen:

for (Element link : links) {
    String title = link.text();
    String linkUrl = link.attr("href");
   
    dataList.add(new String[]{title, linkUrl});
}

Save the collected data to a CSV file:

try (CSVWriter writer = new CSVWriter(new FileWriter("data.csv"))) {
    writer.writeAll(dataList);
} catch (IOException e) {
    e.printStackTrace();
}

Running the resulting project will generate a data.csv file in the project folder with the rest of the project files:

[Screenshot: the resulting data.csv file in the project folder]

You can now quickly process the data in the file. However, CSV is not the most convenient format for transferring data between applications; JSON is often a better fit, so let's look at another export option.

Export to JSON

To work with JSON, you need the GSON library. Import the necessary module into the project:

import com.google.gson.Gson;

As a data source, we will use the list we created in the CSV export example. First, we need to create a new Gson object and convert the data from the list to a JSON string:

Gson gson = new Gson();
String json = gson.toJson(dataList);

Then save the data in JSON format:

try (FileWriter writer = new FileWriter("data.json")) {
    writer.write(json);
} catch (IOException e) {
    e.printStackTrace();
}

You can now use the resulting JSON file in other programs or pass it on to others.

Advanced Topics in Java Web Scraping

The example we have considered is quite basic and uses only relatively simple functions that even beginners can handle. However, the Java programming language allows for much more. For example, you can use the skills you have learned to write a full-fledged crawler that builds site maps and crawls all pages.

If you want to learn more advanced techniques, we will cover the ones that are most useful: using concurrency and parallelism to improve performance, and using headless browsers to simulate user behavior, control elements on the page, and avoid blocking.

Web Crawling Strategies

In previous articles, we have discussed the difference between web crawling and web scraping. We have also covered how to create a web crawler in Python. In this article, we will build a similar web crawler in Java using the skills and knowledge you have learned. First, we need to import the necessary modules and Java web scraping libraries.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Queue;
import java.util.Set;

Next, we'll create a base class and function and specify the initial URL from which we'll start crawling the site.

public class WebCrawler {
    public static void main(String[] args) {
        String startUrl = "https://demo.opencart.com/";

        //Here will be code
    }
}

Then, we'll define variables to track visited URLs, add new URLs to a queue, and store the unique addresses of visited pages.

Set<String> visitedUrls = new HashSet<>();
Queue<String> queue = new LinkedList<>();
queue.add(startUrl);
Set<String> allVisitedPages = new HashSet<>();

After that, we'll create a loop to iterate over the queue of addresses.

while (!queue.isEmpty()) {
    //Here will be code
}

When we visit a page, we'll remove it from the queue.

String currentUrl = queue.poll();

If the URL has already been visited, we'll skip it.

if (visitedUrls.contains(currentUrl)) {
    continue;
}

We'll put the request and data collection in a try/catch block.

try {
    Document document = Jsoup.connect(currentUrl).get();
    //Here will be code

} catch (IOException e) {
    visitedUrls.add(currentUrl);
    System.err.println("Error fetching URL: " + currentUrl);
    e.printStackTrace();
}

We'll add the current page to the list of visited pages, find all links on the page, and add the unique ones to the queue for subsequent visits.

allVisitedPages.add(currentUrl);
Elements links = document.select("a[href]");
for (Element link : links) {
    String linkText = link.text();
    String linkUrl = link.absUrl("href");

    if (!visitedUrls.contains(linkUrl)) {
        queue.add(linkUrl);
    }
}

visitedUrls.add(currentUrl);

Finally, we'll print the list of all visited pages.

System.out.println("\nAll Visited Pages:");
for (String page : allVisitedPages) {
    System.out.println(page);
}

This crawler can be used to solve most tasks, and you can easily add the necessary functionality to it as needed.
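
For reference, here is the crawler assembled from the fragments above into a single class. The only addition of ours is a check that keeps the queue limited to links on the start domain, so the crawl does not wander off the demo site:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Queue;
import java.util.Set;

public class WebCrawler {
    public static void main(String[] args) {
        String startUrl = "https://demo.opencart.com/";

        Set<String> visitedUrls = new HashSet<>();
        Queue<String> queue = new LinkedList<>();
        queue.add(startUrl);
        Set<String> allVisitedPages = new HashSet<>();

        while (!queue.isEmpty()) {
            String currentUrl = queue.poll();

            // Skip URLs we have already processed
            if (visitedUrls.contains(currentUrl)) {
                continue;
            }

            try {
                Document document = Jsoup.connect(currentUrl).get();

                allVisitedPages.add(currentUrl);
                Elements links = document.select("a[href]");
                for (Element link : links) {
                    String linkUrl = link.absUrl("href");

                    // Our addition: stay on the start domain so the queue stays bounded
                    if (!visitedUrls.contains(linkUrl) && linkUrl.startsWith(startUrl)) {
                        queue.add(linkUrl);
                    }
                }
            } catch (IOException e) {
                System.err.println("Error fetching URL: " + currentUrl);
                e.printStackTrace();
            }

            visitedUrls.add(currentUrl);
        }

        System.out.println("\nAll Visited Pages:");
        for (String page : allVisitedPages) {
            System.out.println(page);
        }
    }
}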

Concurrency and Parallelism

Concurrency and parallelism are concepts related to executing multiple tasks or processes in a computing environment. Although these concepts are often used interchangeably, they describe different aspects of multitasking.

Concurrency describes the ability to execute multiple tasks at the same time. However, the tasks are not required to be executed simultaneously, so this can be achieved by rapidly switching between tasks.

Parallelism, conversely, implies the physical execution of multiple tasks simultaneously. Each task is decomposed into smaller subtasks and executed simultaneously. This is typically implemented by distributing data across multiple processors or cores.

To use concurrency, you will need to define such tasks in separate thread classes, for example:

public class MyThread extends Thread {
    public void run() {
        // Do something
    }
}

After that, you can create and start these threads in the main function:

MyThread thread1 = new MyThread();
MyThread thread2 = new MyThread();

thread1.start();
thread2.start();

To use the principles of parallelism, you can use the Fork/Join framework from java.util.concurrent to create separate task objects that are executed in parallel:

ForkJoinPool forkJoinPool = new ForkJoinPool();
ForkJoinTask<Integer> task = new MyForkJoinTask(); // your own subclass of RecursiveTask<Integer>, see the sketch below
forkJoinPool.submit(task);
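
MyForkJoinTask in the snippet above stands in for a task class you define yourself. A minimal sketch of such a class, which splits a list of URLs in half until the chunks are small enough to scrape directly (the threshold, the link counting, and the constructor taking a list of URLs are all our assumptions), could look like this:

import java.util.List;
import java.util.concurrent.RecursiveTask;

import org.jsoup.Jsoup;

// Hypothetical task: counts links across a list of URLs by splitting the work in half
public class MyForkJoinTask extends RecursiveTask<Integer> {
    private final List<String> urls;

    public MyForkJoinTask(List<String> urls) {
        this.urls = urls;
    }

    @Override
    protected Integer compute() {
        if (urls.size() <= 2) {
            // Small enough: scrape sequentially and return the number of links found
            int count = 0;
            for (String url : urls) {
                try {
                    count += Jsoup.connect(url).get().select("a[href]").size();
                } catch (java.io.IOException e) {
                    System.err.println("Error fetching URL: " + url);
                }
            }
            return count;
        }
        // Otherwise split the work into two subtasks and run them in parallel
        int mid = urls.size() / 2;
        MyForkJoinTask left = new MyForkJoinTask(urls.subList(0, mid));
        MyForkJoinTask right = new MyForkJoinTask(urls.subList(mid, urls.size()));
        left.fork();                          // run the left half asynchronously
        return right.compute() + left.join(); // compute the right half in this thread
    }
}

You can then obtain the combined result with forkJoinPool.invoke(new MyForkJoinTask(urls)).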

Learning these approaches will improve your Java web scraper's performance and increase data processing speed.

Headless Browsers in Java

Headless browsers are essential for scraping websites with more complex protection or sites that require specific actions to obtain data, for example, if you need to log in before you can extract data, or if the data on the website is loaded using JavaScript. Additionally, headless browsers can help reduce the risk of blocking by simulating the behavior of a real user.


HtmlUnit

As we mentioned earlier, HtmlUnit is a headless web browser library that provides a wide range of functionality for web scraping. To use it, you must download the library's JAR file and import the necessary modules into your project.

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;

Once you have imported the library, you can create a class and function to handle your scraping tasks.

public class HtmlUnitHeadlessExample {
    public static void main(String[] args) {
       // Here will be code
    }
}

To open a web page in a headless browser, you will need to create a new instance of the WebClient class and configure additional settings, such as the emulated browser version and whether CSS is processed.

        WebClient webClient = new WebClient(BrowserVersion.CHROME);
        webClient.getOptions().setCssEnabled(false); 

Once you have opened the web page, you can use the resulting HtmlPage object to interact with it, such as clicking links, filling out forms, and parsing the HTML.

        try {
            HtmlPage page = webClient.getPage("https://demo.opencart.com/");
            java.util.List<HtmlAnchor> anchors = page.getByXPath("//a");
            for (HtmlAnchor anchor : anchors) {
                System.out.println("Title: " + anchor.asText());
                System.out.println("Link: " + anchor.getHrefAttribute());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }

When you are finished scraping the page, you should close the WebClient object to release resources.

finally {
    webClient.close();
}

Using a headless browser is a good option for scraping web pages when you do not need a graphical user interface. It can be faster and more efficient than a full web browser, and it also makes tasks easier to automate.

Selenium WebDriver

Selenium is a cross-platform framework that supports most programming languages, including Java. We've already covered its usage and setup in Python, so let's look at how to use it in Java.

To use Selenium, you'll need two things:

  1. The Selenium library. Import it into your project just like any other library.
  2. A WebDriver. It must match the version of your installed browser. You can download the necessary WebDriver from the official websites for Chrome, Firefox, and Edge.

To use Selenium with Java, import the necessary modules into your project.

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

Next, create a main class and function. Then, specify the path to the WebDriver file and set the desired settings and parameters.

public class SeleniumExample {
    public static void main(String[] args) {
        System.setProperty("webdriver.chrome.driver", "path/chromedriver");
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless");
        WebDriver driver = new ChromeDriver(options);

        // Extract data
    }
}
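
As an illustration of the "Extract data" step, here is a minimal sketch that reuses the OpenCart demo page from earlier; we assume the standard Selenium By and WebElement classes, so adapt the selector to your own target page:

// Additional imports assumed: org.openqa.selenium.By, org.openqa.selenium.WebElement, java.util.List
driver.get("https://demo.opencart.com/");

List<WebElement> links = driver.findElements(By.cssSelector("a[href]"));
for (WebElement link : links) {
    System.out.println("Title: " + link.getText());
    System.out.println("Link: " + link.getAttribute("href"));
}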

Finally, don't forget to close the WebDriver instance when you're finished.

        driver.quit();

Selenium is a powerful library for both scraping and automating tasks. It supports various features for finding and interacting with elements, making it a versatile tool for many applications.

Challenges in Web Scraping with Java

Web scraping is one of the most common tasks for automatically collecting data. However, the process is also associated with some challenges. The challenges of web scraping in Java can be divided into two types:

  • Challenges related to bypassing website protection. These are general web scraping challenges that are not specific to Java. To address them, you can use the following methods: proxies, headless browsers, or ready-made web scraping APIs that take care of these challenges for you.
  • Challenges related to Java. These challenges include the shortcomings of the language itself, which we discussed at the beginning of the article: the difficulty of learning Java and the relative heaviness of the resulting web scraping programs. As we said earlier, Java is not a good choice for small projects but can be a good option for large or scalable web scrapers.

Therefore, before creating a web scraper in Java, you must ensure that it is the best solution for your project and that you can address all the challenges involved. If you want to simplify the task and speed up the performance of your program, you may want to consider using a ready-made web scraping API that will handle the data collection for you.

Conclusion and Further Exploration

In this comprehensive guide, we answered all the questions we posed at the beginning of the article. We also demonstrated how to install the Java components and create a scraper and a crawler. Additionally, we showed how to retrieve the necessary data and save it in a suitable format.

We hope this Java web scraping tutorial will be useful for both beginners and more experienced Java programmers. Even experienced programmers may find something interesting in the section on advanced techniques.

Using the skills you have learned from this article, you can easily create a scraper to automatically collect data from a website, optimize it, and even simulate the behavior of real users using a headless browser.

Valentina Skakun

I'm a technical writer who believes that data parsing can help in getting and analyzing data. I'll tell you what parsing is and how to use it.