jsoup : Send search query to Google
This example shows you how to use jsoup to send a search query to Google.
Document doc = Jsoup
    .connect("https://www.google.com/search?q=mario")
    .userAgent("Mozilla/5.0")
    .timeout(5000).get();
Unusual traffic from your computer network
Don’t use this example to spam Google. If you do, you will get the above message from Google; read this Google answer.
1. jsoup example
This example sends a “mario” search query to Google, parses the search results, and extracts the domain name from each result link.
FunnyCrawler.java
package com.mkyong;

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class FunnyCrawler {

    private static Pattern patternDomainName;
    private Matcher matcher;

    private static final String DOMAIN_NAME_PATTERN
        = "([a-zA-Z0-9]([a-zA-Z0-9\\-]{0,61}[a-zA-Z0-9])?\\.)+[a-zA-Z]{2,6}";

    static {
        patternDomainName = Pattern.compile(DOMAIN_NAME_PATTERN);
    }

    public static void main(String[] args) {

        FunnyCrawler obj = new FunnyCrawler();
        Set<String> result = obj.getDataFromGoogle("mario");
        for (String temp : result) {
            System.out.println(temp);
        }
        System.out.println(result.size());
    }

    public String getDomainName(String url) {

        String domainName = "";
        matcher = patternDomainName.matcher(url);
        if (matcher.find()) {
            domainName = matcher.group(0).toLowerCase().trim();
        }
        return domainName;
    }

    private Set<String> getDataFromGoogle(String query) {

        Set<String> result = new HashSet<String>();
        String request = "https://www.google.com/search?q=" + query + "&num=20";
        System.out.println("Sending request..." + request);

        try {

            // need http protocol, set this as a Google bot agent :)
            Document doc = Jsoup
                .connect(request)
                .userAgent(
                    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
                .timeout(5000).get();

            // get all links
            Elements links = doc.select("a[href]");
            for (Element link : links) {

                String temp = link.attr("href");
                if (temp.startsWith("/url?q=")) {
                    // use regex to get domain name
                    result.add(getDomainName(temp));
                }
            }

        } catch (IOException e) {
            e.printStackTrace();
        }

        return result;
    }
}
Output
Sending request...https://www.google.com/search?q=mario&num=20
www.imdb.com
www.mariobatali.com
www.freemario.org
www.mariogames.be
mario.wikia.com
stabyourself.net
webcache.googleusercontent.com
www.youtube.com
www.huffingtonpost.com
www.mariowiki.com
mario.lancashire.gov.uk
amirulhafiz.deviantart.com
www.mariohugo.com
mariofoods.com
mario.nintendo.com
www.mario2u.com
www.botta.ch
en.wikipedia.org
www.mariotestino.com
www.hubmario.com
www.mariolemieux.org
pouetpu.pbworks.com
23
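To see what the domain-name pattern does on its own, here is a minimal sketch that applies the same regex to a hand-written href, with no network access. The class name DomainRegexDemo, the helper firstDomain, and the sample href are assumptions made up for illustration; they are not real Google output.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DomainRegexDemo {

    // Same pattern as DOMAIN_NAME_PATTERN in FunnyCrawler.
    private static final Pattern DOMAIN = Pattern.compile(
        "([a-zA-Z0-9]([a-zA-Z0-9\\-]{0,61}[a-zA-Z0-9])?\\.)+[a-zA-Z]{2,6}");

    // Returns the first domain-like match in lower case, or "" if none is found.
    static String firstDomain(String text) {
        Matcher m = DOMAIN.matcher(text);
        return m.find() ? m.group(0).toLowerCase() : "";
    }

    public static void main(String[] args) {
        // Hypothetical href in the "/url?q=..." shape the crawler looks for.
        String href = "/url?q=http://www.imdb.com/title/&sa=U";
        System.out.println(firstDomain(href));
    }
}
```

Note that the helper returns an empty string when nothing matches, just like getDomainName, so a caller that only wants real domains may want to skip empty values before adding them to the set.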
Hi, thank you so much for the useful program. I am curious to know whether jsoup returns the results in the same order as an incognito search would. That way, while iterating, we could record the position at which a particular link was found, which could be treated as equivalent to its page rank.
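One thing worth noting on ordering: the program above collects domains in a HashSet, which does not guarantee iteration order, so the printed order need not match the order of links on the page. A minimal sketch of one way to keep first-seen order while still removing duplicates is shown below. The class name OrderedResultsDemo, the helper dedupeKeepingOrder, and the sample domains are assumptions for illustration, and whether the page order matches an incognito search is not something this code can confirm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

class OrderedResultsDemo {

    // LinkedHashSet drops duplicates but remembers insertion order.
    static List<String> dedupeKeepingOrder(List<String> domains) {
        return new ArrayList<>(new LinkedHashSet<>(domains));
    }

    public static void main(String[] args) {
        // Hypothetical domains in the order they were found on the page.
        List<String> found = Arrays.asList(
            "www.imdb.com", "en.wikipedia.org", "www.imdb.com", "www.youtube.com");

        List<String> ordered = dedupeKeepingOrder(found);
        for (int i = 0; i < ordered.size(); i++) {
            // Position in the list, starting at 1.
            System.out.println((i + 1) + ". " + ordered.get(i));
        }
    }
}
```

In FunnyCrawler the equivalent change would be to create the result as a LinkedHashSet instead of a HashSet. The position obtained this way is only the position on the fetched page.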
Great
Hi,
org.jsoup.HttpStatusException: HTTP error fetching URL. Status=400. I get this error all the time.
Hi,
It’s very interesting, but I want to ask you a question, if possible. I want to get search results from Google using an Arabic keyword with the Java API, but I get an error when I run the program. I think it’s because the UTF-8 charset doesn’t work. Can you help me? Thanks in advance.
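One possible cause (an assumption, since the comment does not include the error message) is that the keyword is concatenated into the URL without being percent-encoded. A minimal sketch of encoding the query with java.net.URLEncoder before building the request string follows. The class name QueryEncodingDemo and the helper buildRequest are hypothetical, and the Arabic sample word is written with Unicode escapes so the source file's encoding does not matter.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class QueryEncodingDemo {

    // Builds the same request URL as the article, but percent-encodes the query first.
    static String buildRequest(String query) {
        try {
            String encoded = URLEncoder.encode(query, "UTF-8");
            return "https://www.google.com/search?q=" + encoded + "&num=20";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is supported on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // A query with a space.
        System.out.println(buildRequest("super mario"));
        // An Arabic keyword (five letters, given as Unicode escapes).
        System.out.println(buildRequest("\u0645\u0627\u0631\u064A\u0648"));
    }
}
```

The resulting string can then be passed to Jsoup.connect in place of the unencoded request.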
Hi,
Thanks for sharing this. But I am in a dilemma about whether to use their interface, such as “/search”, because according to Google it’s considered illegal.
I have also checked their robots.txt file: http://www.google.com/robots.txt
The /search interface is disallowed.
So if I call this interface 10 million times from my Java program, it will definitely create network congestion for Google (particularly on this exposed interface) and then cause problems for me, won’t it?
But before that, please assist me with the screen scraping I am doing.
I am trying to use jsoup to fetch data that Google shows up front, such as a word’s meaning:
https://www.google.in/?gws_rd=ssl$#q=pretend+meaning
Thanks in anticipation.
Apparently what you’re looking for is http://www.faroo.com. Maybe not as good as Google, but at least it’s free and allows 1 million queries/month.
Very nice and comprehensive. I have one question, though, about this line of code:
String request = "https://www.google.com/search?q=" + query + "&num=20";
I assume `num=20` is the number of URLs retrieved, yet when I set it to 3 I get 11. Why is there
no direct correspondence, and how can I retrieve only 3 URLs? Thanks in advance.
I found the reason: multiple URLs are returned for each result. Try to narrow the output by telling jsoup:
Elements tag = doc.getElementsByTag("h3");
Elements links = tag.select("a[href]");
because Google uses an h3 tag for each title. That way you will get exactly the URLs you are looking for.
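Independently of which selector is used, the number of results can also be capped on the client side after the links are collected. The sketch below keeps at most a given number of distinct entries in the order they were first seen. The class name FirstResultsDemo, the helper firstDistinct, and the sample URLs are hypothetical and only illustrate the idea.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class FirstResultsDemo {

    // Returns at most `limit` distinct entries, in the order first seen.
    static List<String> firstDistinct(List<String> urls, int limit) {
        Set<String> seen = new LinkedHashSet<>();
        for (String url : urls) {
            if (seen.size() >= limit) {
                break; // enough results collected
            }
            seen.add(url); // duplicates are ignored by the set
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Hypothetical links, including a duplicate.
        List<String> urls = Arrays.asList(
            "http://a.example", "http://a.example", "http://b.example",
            "http://c.example", "http://d.example");
        System.out.println(firstDistinct(urls, 3));
    }
}
```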
How do you retrieve the full link from the search, e.g. https://mkyong.com/java/jsoup-send-search-query-to-google/ instead of http://www.mkyong.com? I’d appreciate any light on the matter 😀
I use this, but it does not work for all websites.
Replace getDomainName with this:
public String getDomainName(String url) {
    String domainName = url.replace("/url?q=", "");
    int d = domainName.indexOf("&");
    // indexOf returns -1 when there is no "&"; keep the whole string in that case
    if (d >= 0) {
        domainName = domainName.substring(0, d);
    }
    return domainName;
}
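A slightly more defensive variant of the method above is sketched here. It assumes (this is not confirmed by the article) that the target URL inside the q parameter may be percent-encoded, so it decodes the value with java.net.URLDecoder. It also copes with hrefs that contain no "&". The class name TargetUrlDemo, the helper extractTargetUrl, and the sample hrefs are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class TargetUrlDemo {

    private static final String PREFIX = "/url?q=";

    // Returns the decoded target URL from a "/url?q=...&..." href, or "" if the prefix is missing.
    static String extractTargetUrl(String href) {
        if (!href.startsWith(PREFIX)) {
            return "";
        }
        String rest = href.substring(PREFIX.length());
        int amp = rest.indexOf('&');
        // Keep everything when there is no further parameter.
        String raw = (amp >= 0) ? rest.substring(0, amp) : rest;
        try {
            // May throw IllegalArgumentException on malformed percent sequences.
            return URLDecoder.decode(raw, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is supported on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical hrefs for illustration.
        System.out.println(extractTargetUrl("/url?q=https://www.youtube.com/watch%3Fv%3Dabc&sa=U&ved=0"));
        System.out.println(extractTargetUrl("/url?q=https://mkyong.com/"));
    }
}
```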
Hi,
I’m unable to use the jsoup.jar file with the source code above.
I have downloaded jsoup and have it on the desktop along with the class file.
Would you be so kind as to outline the steps in using the library with this code?
Kind regards.
If you are using NetBeans, just right-click the Libraries folder and then click Add JAR/Folder.
Very nice google scrape of a specific number of results with jsoup java. Thanks!
Works great