jsoup – Basic web crawler example
A Web Crawler is a program that navigates the Web and finds new or updated pages for indexing. The Crawler starts with seed websites or a wide range of popular URLs (also known as the frontier) and searches in depth and width for hyperlinks to extract.
A Web Crawler must be kind and robust. Kindness means that it respects the rules set by robots.txt and avoids visiting a website too often. Robustness refers to the ability to avoid spider traps and other malicious behavior. Other good attributes for a Web Crawler are the ability to be distributed across multiple machines, expandability, continuity, and the ability to prioritize based on page quality.
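The examples later in this article skip the robots.txt check for brevity, but the "kindness" rule is easy to sketch. The snippet below is a minimal, hypothetical check (the class and method names are our own, not part of jsoup): it assumes the robots.txt content has already been fetched as a string, honors only Disallow lines under User-agent: *, and ignores Allow rules, wildcards, and Crawl-delay, which a production parser must handle.

```java
import java.util.ArrayList;
import java.util.List;

public class RobotsCheck {

    // Collect the Disallow prefixes listed under "User-agent: *" and
    // report whether a given path is allowed. This deliberately ignores
    // Allow rules, wildcards and Crawl-delay.
    static boolean isAllowed(String robotsTxt, String path) {
        List<String> disallowed = new ArrayList<>();
        boolean inStarGroup = false;
        for (String line : robotsTxt.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.toLowerCase().startsWith("user-agent:")) {
                inStarGroup = trimmed.substring(11).trim().equals("*");
            } else if (inStarGroup && trimmed.toLowerCase().startsWith("disallow:")) {
                String rule = trimmed.substring(9).trim();
                if (!rule.isEmpty()) {
                    disallowed.add(rule);
                }
            }
        }
        for (String rule : disallowed) {
            if (path.startsWith(rule)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String robotsTxt = "User-agent: *\nDisallow: /private/\nDisallow: /tmp/";
        System.out.println(isAllowed(robotsTxt, "/private/page.html")); // false
        System.out.println(isAllowed(robotsTxt, "/articles/java8.html")); // true
    }
}
```

A kind crawler would run a check like this before every fetch and also throttle its request rate per host.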
1. Steps to create a web crawler
The basic steps to write a Web Crawler are:
- Pick a URL from the frontier
- Fetch the HTML code
- Parse the HTML to extract links to other URLs
- Check whether you have already crawled the URL and/or whether you have seen the same content before; if not, add it to the index
- For each extracted URL, confirm that it agrees to be checked (robots.txt, crawling frequency), then repeat from the check step
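Before the jsoup implementation, the steps above can be sketched as a plain queue-driven loop. The LinkExtractor interface is a made-up stub for the fetch-and-parse steps so the sketch stays self-contained; the real jsoup calls appear in the next section.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CrawlLoop {

    // Stub for "fetch the HTML and parse it for links" (steps 2 and 3).
    interface LinkExtractor {
        List<String> extract(String url);
    }

    // Runs the crawl loop: pick a URL from the frontier, extract its links,
    // index unseen URLs and push them back onto the frontier.
    static Set<String> crawl(String seed, LinkExtractor extractor) {
        Set<String> index = new LinkedHashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(seed);
        index.add(seed);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();                 // 1. pick a URL from the frontier
            for (String link : extractor.extract(url)) {  // 2+3. fetch and parse
                if (index.add(link)) {                    // 4. skip already-crawled URLs
                    frontier.add(link);                   // 5. schedule the new URL
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // A tiny in-memory "web" instead of real HTTP requests.
        Map<String, List<String>> web = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("a", "c"),
                "c", List.of());
        System.out.println(crawl("a", url -> web.getOrDefault(url, List.of()))); // [a, b, c]
    }
}
```

Because the frontier is a FIFO queue, this variant visits pages breadth-first; the jsoup examples below recurse instead, which is effectively depth-first.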
Truth be told, developing and maintaining one Web Crawler across all pages on the internet is difficult, if not impossible, considering that there are over 1 billion websites online right now. If you are reading this article, chances are you are not looking for a guide to create a Web Crawler but a Web Scraper. Why is the article called ‘Basic Web Crawler’ then? Well… because it’s catchy! Really, few people know the difference between crawlers and scrapers, so we all tend to use the word “crawling” for everything, even for offline data scraping. Also, to build a Web Scraper you need a crawl agent too. And finally, this article intends to inform as well as provide a viable example.
2. The skeleton of a crawler
For HTML parsing we will use jsoup. The examples below were developed using jsoup version 1.10.2.
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.10.2</version>
</dependency>
So let’s start with the basic code for a Web Crawler.
package com.mkyong.basicwebcrawler;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.HashSet;

public class BasicWebCrawler {

    private HashSet<String> links;

    public BasicWebCrawler() {
        links = new HashSet<String>();
    }

    public void getPageLinks(String URL) {
        //4. Check if you have already crawled the URL
        //(we are intentionally not checking for duplicate content in this example)
        if (!links.contains(URL)) {
            try {
                //4. (i) If not, add it to the index
                if (links.add(URL)) {
                    System.out.println(URL);
                }

                //2. Fetch the HTML code
                Document document = Jsoup.connect(URL).get();
                //3. Parse the HTML to extract links to other URLs
                Elements linksOnPage = document.select("a[href]");

                //5. For each extracted URL... go back to Step 4.
                for (Element page : linksOnPage) {
                    getPageLinks(page.attr("abs:href"));
                }
            } catch (IOException e) {
                System.err.println("For '" + URL + "': " + e.getMessage());
            }
        }
    }

    public static void main(String[] args) {
        //1. Pick a URL from the frontier
        new BasicWebCrawler().getPageLinks("https://mkyong.com/");
    }
}
Don’t let this code run for too long; it can run for hours without finishing.
Sample Output:
https://mkyong.com/
Android Tutorial
Android Tutorial
Java I/O Tutorial
Java I/O Tutorial
Java XML Tutorial
Java XML Tutorial
Java JSON Tutorials
Java JSON Tutorials
Java Regular Expression Tutorial
Java Regular Expression Tutorial
Java JDBC Tutorials
...(+ many more links)
As we mentioned before, a Web Crawler searches in width and depth for links. If we imagine the links on a website as a tree-like structure, the root node or level zero would be the link we start with, the next level would be all the links found on level zero, and so on.
3. Taking crawling depth into account
We will modify the previous example to set a depth limit for link extraction. Notice that the only real difference between this example and the previous one is that the recursive getPageLinks() method takes an integer argument representing the depth of the link, which is also added as a condition in the if statement.
package com.mkyong.depthwebcrawler;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.HashSet;

public class WebCrawlerWithDepth {

    private static final int MAX_DEPTH = 2;
    private HashSet<String> links;

    public WebCrawlerWithDepth() {
        links = new HashSet<>();
    }

    public void getPageLinks(String URL, int depth) {
        if (!links.contains(URL) && (depth < MAX_DEPTH)) {
            System.out.println(">> Depth: " + depth + " [" + URL + "]");
            try {
                links.add(URL);

                Document document = Jsoup.connect(URL).get();
                Elements linksOnPage = document.select("a[href]");

                depth++;
                for (Element page : linksOnPage) {
                    getPageLinks(page.attr("abs:href"), depth);
                }
            } catch (IOException e) {
                System.err.println("For '" + URL + "': " + e.getMessage());
            }
        }
    }

    public static void main(String[] args) {
        new WebCrawlerWithDepth().getPageLinks("https://mkyong.com/", 0);
    }
}
Feel free to run the above code. It only took a few minutes on my laptop with the depth set to 2. Keep in mind that the higher the depth, the longer it will take to finish.
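One thing to note: the recursive version adds a stack frame for every link it follows, which can get deep on heavily linked sites. A queue-based variant that stores the depth alongside each URL applies the same cut-off iteratively. As before, fetching is stubbed behind a made-up LinkExtractor interface so the sketch stays self-contained; swap in the jsoup calls from the listing above for real use.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class IterativeDepthCrawler {

    // Stub for the jsoup fetch-and-parse step.
    interface LinkExtractor {
        List<String> extract(String url);
    }

    // Each frontier entry carries the URL together with its depth.
    record Entry(String url, int depth) {}

    static Set<String> crawl(String seed, int maxDepth, LinkExtractor extractor) {
        Set<String> visited = new HashSet<>();
        Deque<Entry> frontier = new ArrayDeque<>();
        frontier.add(new Entry(seed, 0));
        while (!frontier.isEmpty()) {
            Entry e = frontier.poll();
            // Same condition as the recursive version: skip seen URLs
            // and anything at or beyond the maximum depth.
            if (e.depth() < maxDepth && visited.add(e.url())) {
                System.out.println(">> Depth: " + e.depth() + " [" + e.url() + "]");
                for (String link : extractor.extract(e.url())) {
                    frontier.add(new Entry(link, e.depth() + 1));
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = Map.of(
                "a", List.of("b"),
                "b", List.of("c"),
                "c", List.of("d"));
        // With maxDepth = 2, "a" (depth 0) and "b" (depth 1) are visited,
        // while "c" (depth 2) is cut off.
        System.out.println(crawl("a", 2, url -> web.getOrDefault(url, List.of())).size());
    }
}
```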
Sample Output:
...
>> Depth: 1 [https://docs.gradle.org/current/userguide/userguide.html]
>> Depth: 1 [http://hibernate.org/orm/]
>> Depth: 1 [https://jax-ws.java.net/]
>> Depth: 1 [http://tomcat.apache.org/tomcat-8.0-doc/index.html]
>> Depth: 1 [http://www.javacodegeeks.com/]
For 'http://www.javacodegeeks.com/': HTTP error fetching URL
>> Depth: 1 [http://beust.com/weblog/]
>> Depth: 1 [https://dzone.com]
>> Depth: 1 [https://wordpress.org/]
>> Depth: 1 [http://www.liquidweb.com/?RID=mkyong]
>> Depth: 1 [https://mkyong.com/privacy-policy/]
4. Data Scraping vs. Data Crawling
So far so good for a theoretical approach to the matter. The fact is that you will hardly ever build a generic crawler, and if you need a “real” one, you should use tools that already exist. Most of what the average developer does is extract specific information from specific websites, and even though that includes building a Web Crawler, it’s actually called Web Scraping.
There is a very good article by Arpan Jha for PromptCloud on Data Scraping vs. Data Crawling which personally helped me a lot to understand this distinction and I would suggest reading it.
To summarize it with a table taken from this article:
| Data Scraping | Data Crawling |
|---|---|
| Involves extracting data from various sources, including the web | Refers to downloading pages from the web |
| Can be done at any scale | Mostly done at a large scale |
| Deduplication is not necessarily a part | Deduplication is an essential part |
| Needs crawl agent and parser | Needs only crawl agent |
Time to move out of theory and into a viable example, as promised in the intro. Let’s imagine a scenario in which we want to get all the URLs of articles related to Java 8 from mkyong.com. Our goal is to retrieve that information in the shortest time possible and thus avoid crawling through the whole website, which would waste not only the server’s resources, but our time as well.
5. Case Study – Extract all articles for ‘Java 8’ on mkyong.com
5.1 The first thing we should do is look at the code of the website. Taking a quick look at mkyong.com, we can easily notice the paging on the front page and that it follows a /page/xx pattern for each page.
That brings us to the realization that the information we are looking for is easily accessed by retrieving all the links that include /page/. So instead of running through the whole website, we will limit our search using document.select("a[href^=\"https://mkyong.com/page/\"]"). With this CSS selector we collect only the links that start with https://mkyong.com/page/.
5.2 The next thing we notice is that the titles of the articles (which is what we want) are wrapped in <h2></h2> and <a href=""></a> tags.
So to extract the article titles, we will access that specific information using a CSS selector that restricts our select method to exactly that information: document.select("h2 a[href^=\"https://mkyong.com/\"]")
5.3 Finally, we will only keep the links whose title contains ‘Java 8’ and save them to a file.
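A side note on the title filter used in the final listing: the alternation (Java 8|java 8|JAVA 8) only covers three capitalizations, so a title like ‘jAvA 8’ would slip through. A case-insensitive pattern handles them all in one expression; a small standalone sketch (the class and method names are ours):

```java
import java.util.List;
import java.util.regex.Pattern;

public class TitleFilter {

    // Case-insensitive equivalent of matches("^.*?(Java 8|java 8|JAVA 8).*$")
    private static final Pattern JAVA8 =
            Pattern.compile("java 8", Pattern.CASE_INSENSITIVE);

    static boolean mentionsJava8(String title) {
        return JAVA8.matcher(title).find();
    }

    public static void main(String[] args) {
        List<String> titles = List.of(
                "Java 8 Streams filter examples",
                "JAVA 8 forEach print with Index",
                "Spring Boot Hello World");
        for (String title : titles) {
            System.out.println(title + " -> " + mentionsJava8(title));
        }
        // prints true, true, false respectively
    }
}
```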
package com.mkyong.extractor;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

public class Extractor {

    private HashSet<String> links;
    private List<List<String>> articles;

    public Extractor() {
        links = new HashSet<>();
        articles = new ArrayList<>();
    }

    //Find all URLs that start with "https://mkyong.com/page/" and add them to the HashSet
    public void getPageLinks(String URL) {
        if (!links.contains(URL)) {
            try {
                links.add(URL);
                //Remove the comment from the line below if you want to see it running on your editor
                //System.out.println(URL);

                Document document = Jsoup.connect(URL).get();
                Elements otherLinks = document.select("a[href^=\"https://mkyong.com/page/\"]");

                for (Element page : otherLinks) {
                    getPageLinks(page.attr("abs:href"));
                }
            } catch (IOException e) {
                System.err.println(e.getMessage());
            }
        }
    }

    //Connect to each link saved in links and find all the articles on the page
    public void getArticles() {
        links.forEach(x -> {
            Document document;
            try {
                document = Jsoup.connect(x).get();
                Elements articleLinks = document.select("h2 a[href^=\"https://mkyong.com/\"]");
                for (Element article : articleLinks) {
                    //Only retrieve the titles of the articles that contain Java 8
                    if (article.text().matches("^.*?(Java 8|java 8|JAVA 8).*$")) {
                        //Remove the comment from the line below if you want to see it running on your editor,
                        //or wait for the File at the end of the execution
                        //System.out.println(article.attr("abs:href"));
                        ArrayList<String> temporary = new ArrayList<>();
                        temporary.add(article.text()); //The title of the article
                        temporary.add(article.attr("abs:href")); //The URL of the article
                        articles.add(temporary);
                    }
                }
            } catch (IOException e) {
                System.err.println(e.getMessage());
            }
        });
    }

    public void writeToFile(String filename) {
        FileWriter writer;
        try {
            writer = new FileWriter(filename);
            articles.forEach(a -> {
                try {
                    String temp = "- Title: " + a.get(0) + " (link: " + a.get(1) + ")\n";
                    //display to console
                    System.out.println(temp);
                    //save to file
                    writer.write(temp);
                } catch (IOException e) {
                    System.err.println(e.getMessage());
                }
            });
            writer.close();
        } catch (IOException e) {
            System.err.println(e.getMessage());
        }
    }

    public static void main(String[] args) {
        Extractor bwc = new Extractor();

        bwc.getPageLinks("https://mkyong.com");
        bwc.getArticles();
        bwc.writeToFile("Java 8 Articles");
    }
}
Each matching article is printed to the console and saved to the file as a line of the form - Title: … (link: …).