Java – Check if web request is from Google crawler


If a web request comes from the Google crawler (Googlebot), the “user agent” header of the request should look similar to this:


Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
or
(rarely used): Googlebot/2.1 (+http://www.google.com/bot.html)

Source : Google crawlers

1. Java Example

In Java, you can get the “user agent” from HttpServletRequest.

Example : Service hosted at abcdefg.com

	@Autowired
	private HttpServletRequest request;

	//...
	String userAgent =  request.getHeader("user-agent");
		
	System.out.println("User Agent : " + userAgent);
		
	if(!StringUtils.isEmpty(userAgent)){
		if(userAgent.toLowerCase().contains("googlebot")){
			System.out.println("This is Google bot");
		}else{
			System.out.println("Not from Google");
		}
	
	}
Note
The above solution works, but it cannot detect a fake or spoofed user agent.

2. Fake User Agent

It’s easy to send a request with a fake (spoofed) user agent. For example:

Example : Send a fake user agent request to abcdefg.com

package com.mkyong.web;

import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.HttpClientBuilder;

public class test {

	public static void main(String[] args) throws Exception {

		HttpClient client = HttpClientBuilder.create().build();
		// the URI needs a scheme, otherwise the client cannot resolve the target host
		HttpGet request = new HttpGet("http://abcdefg.com");
		request.setHeader("user-agent", "fake googlebot");
		HttpResponse response = client.execute(request);

	}

}

Output at abcdefg.com.


User Agent : fake googlebot

This is Google bot
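The output above follows directly from the substring test in section 1: any header that contains the token passes. Here is a minimal sketch that isolates the test so it can be tried without a servlet container. The class and method names are hypothetical, not part of the original example.

```java
import java.util.Locale;

public class UserAgentCheck {

    // Same substring test as in section 1, made null-safe.
    // It only tells us what the client CLAIMS to be.
    public static boolean claimsToBeGooglebot(String userAgent) {
        return userAgent != null
                && userAgent.toLowerCase(Locale.ROOT).contains("googlebot");
    }

    public static void main(String[] args) {
        // Format published by Google
        System.out.println(claimsToBeGooglebot(
                "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")); // true
        // Spoofed header from the example above
        System.out.println(claimsToBeGooglebot("fake googlebot")); // true
        // Ordinary browser
        System.out.println(claimsToBeGooglebot("Mozilla/5.0 (X11; Linux x86_64)")); // false
    }
}
```

The genuine and the spoofed header are indistinguishable at this level, which is why the user agent alone cannot be trusted.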

3. Verifying Googlebot

To verify a real Googlebot, you can manually run a “reverse DNS lookup” on the request IP, followed by a forward DNS lookup on the returned name, like this:


> host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

> host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1

Source : Verifying Googlebot
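The two manual lookups above can probably be reproduced in plain Java with java.net.InetAddress, without calling an external command. The following is a minimal sketch under a few assumptions: the class and method names are hypothetical, the caller passes an IP literal (not a host name), and only the googlebot.com domain named in this article is accepted.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Locale;

public class GooglebotDnsCheck {

    // True when the host name lies inside the googlebot.com domain.
    public static boolean isGooglebotHost(String hostName) {
        if (hostName == null) {
            return false;
        }
        String h = hostName.toLowerCase(Locale.ROOT);
        if (h.endsWith(".")) {
            h = h.substring(0, h.length() - 1); // drop the trailing root dot
        }
        return h.endsWith(".googlebot.com");
    }

    // Reverse lookup, domain check, then forward lookup back to the same IP.
    public static boolean isVerifiedGooglebot(String ip) {
        try {
            InetAddress address = InetAddress.getByName(ip);
            // Reverse lookup; returns the textual IP when no name is found
            String hostName = address.getCanonicalHostName();
            if (!isGooglebotHost(hostName)) {
                return false;
            }
            // Forward lookup: one of the returned addresses must equal the original IP
            for (InetAddress forward : InetAddress.getAllByName(hostName)) {
                if (forward.equals(address)) {
                    return true;
                }
            }
            return false;
        } catch (UnknownHostException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Pass an IP literal as the first argument; needs network access,
        // and the result depends on live DNS.
        if (args.length == 1) {
            System.out.println(isVerifiedGooglebot(args[0]));
        }
    }
}
```

DNS lookups are slow compared to a header check, so in a real service the result per IP would likely be cached.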

4. Verifying Googlebot – Java Example

Based on the above theory, we can simulate the first part of the “reverse DNS lookup”: use the host command to find out which host name the request IP points to.

If the request comes from Googlebot, the output ends with a host name that matches this pattern: *.googlebot.com.

P.S. The host command is available on *nix systems only.
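A plain substring search for "googlebot.com" in the command output would also accept a name such as googlebot.com.evil.example, so it may be worth extracting the pointer name and checking its suffix. Here is a minimal sketch, assuming the host output format shown in section 3. The class name, the method name and the evil.example line are hypothetical.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HostOutputParser {

    // Matches "... domain name pointer crawl-66-249-66-1.googlebot.com."
    private static final Pattern PTR_LINE =
            Pattern.compile("domain name pointer\\s+(\\S+)");

    // True when a reported pointer name ends with .googlebot.com
    public static boolean pointsToGooglebot(String hostOutput) {
        if (hostOutput == null) {
            return false;
        }
        Matcher m = PTR_LINE.matcher(hostOutput);
        while (m.find()) {
            String name = m.group(1).toLowerCase(Locale.ROOT);
            if (name.endsWith(".")) {
                name = name.substring(0, name.length() - 1); // drop the trailing root dot
            }
            if (name.endsWith(".googlebot.com")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(pointsToGooglebot(
                "1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.")); // true
        // Hypothetical lookalike name
        System.out.println(pointsToGooglebot(
                "4.3.2.1.in-addr.arpa domain name pointer googlebot.com.evil.example.")); // false
        System.out.println(pointsToGooglebot(
                "Host 142.1.168.192.in-addr.arpa. not found: 3(NXDOMAIN)")); // false
    }
}
```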

Example : Detect fake user agent

	@Autowired
	private HttpServletRequest request;

	//...
	String requestIp = getRequestIp();
	String userAgent = request.getHeader("user-agent");
		
	System.out.println("User Agent : " + userAgent);
		
	if(!StringUtils.isEmpty(userAgent)){
		
		if(userAgent.toLowerCase().contains("googlebot")){
				
			//check fake user agent
			String output = executeCommand("host " + requestIp);
			System.out.println("Output : " + output);
				
			if(output.toLowerCase().contains("googlebot.com")){
				System.out.println("This is Google bot");
			}else{
				System.out.println("This is fake user agent");
			}
				
		}else{
			System.out.println("Not from Google");
		}
	}
		
	//get requested IP
	private String getRequestIp() {
		String ipAddress = request.getHeader("X-FORWARDED-FOR");
		if (ipAddress == null) {
			ipAddress = request.getRemoteAddr();
		}
		return ipAddress;
	}

	// execute external command
	private String executeCommand(String command) {

		StringBuilder output = new StringBuilder();

		try {
			Process p = Runtime.getRuntime().exec(command);

			// read the output first, then wait, so a full pipe buffer cannot block the process
			try (BufferedReader reader = 
				new BufferedReader(new InputStreamReader(p.getInputStream()))) {

				String line;
				while ((line = reader.readLine()) != null) {
					output.append(line).append("\n");
				}
			}
			p.waitFor();

		} catch (Exception e) {
			e.printStackTrace();
		}

		return output.toString();

	}

Try the “step 2” fake user agent example again. Now, you get this output:


User Agent : fake googlebot
Output : Host 142.1.168.192.in-addr.arpa. not found: 3(NXDOMAIN) //this output may vary.
This is fake user agent
Note
This simple solution may not be able to stop fake/spoofed user agents 100%, but this extra security layer should be able to stop most basic user agent spoofing attempts.
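One weak spot is that getRequestIp() reads X-FORWARDED-FOR, a header the client controls, and the value is concatenated into the command string. Here is a minimal sketch of a more defensive variant: it accepts only characters that can appear in an IP literal and passes the value as a separate argument through ProcessBuilder. The class and method names and the sample IPv6 address are hypothetical, and the whitelist is a character filter, not a full IP address validator.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

public class SafeHostLookup {

    // Hex digits, dots and colons only: enough for IPv4/IPv6 literals,
    // and it excludes spaces, dashes and shell metacharacters.
    private static final Pattern IP_CHARS = Pattern.compile("[0-9A-Fa-f:.]{2,45}");

    public static boolean looksLikeIpLiteral(String value) {
        return value != null && IP_CHARS.matcher(value).matches();
    }

    // Runs "host <ip>" and returns its combined output.
    public static String hostLookup(String ip) throws IOException, InterruptedException {
        if (!looksLikeIpLiteral(ip)) {
            throw new IllegalArgumentException("Not an IP literal: " + ip);
        }
        // Arguments are passed as a list, so the value is never split into extra arguments
        Process p = new ProcessBuilder("host", ip).redirectErrorStream(true).start();
        StringBuilder output = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append('\n');
            }
        }
        p.waitFor();
        return output.toString();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeIpLiteral("66.249.66.1"));          // true
        System.out.println(looksLikeIpLiteral("2001:4860:4801:10::1")); // true
        System.out.println(looksLikeIpLiteral(" bla | rm -rf /"));      // false
    }
}
```

If the application is not behind a proxy you control, using request.getRemoteAddr() alone avoids the header problem altogether.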

If you have a better solution, do share below, thanks.

References

  1. Verifying Googlebot
  2. Google crawlers
  3. Execute shell command from Java

mkyong

Founder of Mkyong.com, love Java and open source stuff.

Comments

Andrei

I wouldn’t use X-FORWARDED-FOR because anyone can spoof it. If you do use it, it shouldn’t be passed directly to .exec() as a string. What would happen if X-FORWARDED-FOR is ” bla | rm -rf /” ?

Teeriq

Definitely a good catch on your part, Andrei. I think a good solution to prevent that from happening would be to use a regex and to capture only the part that matched a correctly formed IP address.

Andrei

«Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.» 😉

* I wouldn’t start an executable for every request (and the result should of course be cached)
* I would use X-Forwarded-For only if it is set on my own network (I doubt Google uses (visible) proxies for crawling)
* The X-Forwarded-For value can list several proxies: “X-Forwarded-For: client, proxy1, proxy2”. What happens when one of those is IPv6? What happens if a port is appended (not very common, it seems, but it may happen)?
* The code misses the second part of the verification:
> host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1
One can spoof the reverse DNS lookup ( http://en.wikipedia.org/wiki/Forward-confirmed_reverse_DNS )

Teeriq

All I meant was that it would prevent someone from running an arbitrary command on your system shell and limit the input to the correct form. Sometimes a regex is not appropriate, but matching an arbitrary IP is simple enough.

So now the new question at hand, how would you go about actually verifying a google bot? Are you saying that it is not possible in any case at all?

Andrei

You can. I said that the steps two and three are missing (steps from wikipedia):

1. First a reverse DNS lookup (PTR query) is performed on the IP address, which returns a list of zero or more PTR records.
2. For each domain name returned in the PTR query results, a regular ‘forward’ DNS lookup (type A or AAAA query) is then performed on that domain name.
3. Any A or AAAA record returned by the second query is then compared against the original IP address, and if there is a match, then the FCrDNS check passes.

Or from original article from Google: “You can verify that a bot accessing your server really is Googlebot (or another Google user-agent) by using a reverse DNS lookup, verifying that the name is in the googlebot.com domain, **and then doing a forward DNS lookup using that googlebot name** ”

I don’t know if it’s such a good idea to verify if the bot comes from Google though.

Levan

Nice catch! The author still mentions that this is just a SIMPLE solution (not to stop fake user agents 100%). I guess it’s more to showcase the idea… But again your example is really nice ))

MUSA JOSEPH

This really help, but i wanted to know how to write a java code to detect fake websites.

Levan

Thanks for this short and easy to understand (as always) tutorial ))

Alagusundar

nice. Thanks for this short and easy to understand tutorial. Thanks for sharing this information.