Get Google results as a list of clean URLs

I wrote a Perl script to run a given search on Google, parse the results and save all the harvested URLs in a text file. After a few improvements, I finally built a PHP Google scraper that, with an HTML parser at its core, lets us get unlimited Google results and then apply data-mining techniques to obtain valuable information for SEO and business intelligence.

This is extremely useful for a lot of things. For e-marketing and SEO purposes, for example, you can get a huge number of Google results for different keywords and then analyze PageRank, SERP positions, competing companies/domains and much more.

One time I prepared a search string for Google to find sites with a known security vulnerability, then ran an exploit against all those sites and found the vulnerable ones, for research purposes only. That’s another interesting use of this online SEO tool.

These are only a couple of examples… use your imagination and you will see how many things you can do…

It’s basically a parser of the Google results page, so I can get the results in any format.

I have now written the algorithm (a Google parser) in PHP and published it online. You can find it in the online SEO tools section along with other interesting tools, or go directly to the Google parser online tool (GooParser).

    • Chii
    • September 23rd, 2007

    Would you be able to post the source code too? I’m interested in how it was done =)

    PS: please ignore this if you already did (I hope I didn’t miss it).

    • Mr.Kyle Sho
    • September 24th, 2007

    Hi,

    that’s a really great Google parser tool, and it makes it easy to find links on any related topic.

    thanks a lot

    • goohackle
    • October 10th, 2007

    Thanks for your comments! You’re welcome.

    Chii, now I’m extending the tool to parse more data and return the results in different formats… when I finish I’m gonna write a post and upload the upgraded online tool.

    I’m not thinking of publishing the source code right now… but it isn’t difficult or complex, it just parses the HTML.
    The only tricky part is keeping Google from banning my server’s IP, but with basic knowledge of the HTTP protocol, HTTP headers, browsers, web pages and how people behave, it’s an easy task.
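
    The idea, roughly, looks like this (a minimal sketch, not my production code; the header values are illustrative, copy them from your own browser):

    <?php
    // Fetch a results page while sending the same headers a real browser sends.
    $ch = curl_init('http://www.google.com/search?q=test');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_USERAGENT,
        'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7');
    curl_setopt($ch, CURLOPT_HTTPHEADER, array(
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language: en-us,en;q=0.5',
    ));
    $html = curl_exec($ch);
    curl_close($ch);
    ?>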

  1. Good idea. Any news about the source code or the regex you used?
    Check out another way to get it with JavaScript, using Google’s API:
    http://gadelkareem.com/2007/01/28/using-google-ajax-api-as-an-array/

  2. At first I parsed Google results with a regex, but now I get them using an XML parser.

    I tried the Google AJAX API too, but with it I can only get the first results, not all of them.
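
    To give an idea of the XML-parser approach, here is a minimal sketch (the href filter is an assumption; the real markup changes over time, so you have to adapt it):

    <?php
    // Parse a fetched results page with the DOM extension instead of a regex.
    $html = file_get_contents('http://www.google.com/search?q=test');
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // real-world HTML is rarely well formed
    $doc->loadHTML($html);
    libxml_clear_errors();

    // Walk every link and keep the ones that look like result URLs.
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//a[@href]') as $a) {
        $href = $a->getAttribute('href');
        if (strpos($href, 'http') === 0) {
            echo $href, "\n";
        }
    }
    ?>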

    • The big one
    • January 14th, 2008

    Hey man,
    you are talking too much, just paste the code or shut up! Don’t waste our time.

  3. Hey “The big one”, I’m not gonna paste the code right now just because you want it.
    It’s very easy to write if you have basic knowledge of the HTTP protocol and basic programming skills. Read a basic Perl tutorial, or C, or whatever language you want… READ, LEARN and write it.

    If you have any questions or want to talk about the Google parser, the method used, the language, etc., we can talk here, but if you want someone else to do your job, go somewhere else… and don’t waste YOUR time. Nobody here is working for you.

    • John
    • January 17th, 2008

    Hello goohackle,

    thanks for the post.

    I’m writing my own Google parser in PHP.
    I tried to use a list of data centers, but I got banned.

    Could you please explain this part:

    >> HTTP protocol, HTTP headers, browsers, web pages and people behavior

    I mean… how do I use that? How do I turn it into usable code?
    Please explain.

    Thanks a lot.

    PS: please update your online parser so it can give full results on one page,

    for example 1000 URLs for the word “web”.

    Thanks.

    • Google ban
    • February 1st, 2008

    Yes. The Google ban is the only tricky part.

    • ValleyGeek
    • February 8th, 2008

    Nice tool.

    I did not understand your example of finding sites with vulnerabilities. Is the search string you came up with something that appears in the HTML, or in some script?

    Have you thought about extending your tool to include sponsored links as well? That would probably be interesting to everyone advertising on Google.

  4. That search string finds servers running Webmin using Google. You can refine it further to avoid false positives.

    It works because Webmin generally runs on port 10000, which shows up in the URL, and the other words appear in the HTML body of the login page.
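
    Purely as an illustration of the shape of such a query (this is not the exact string I used), think of something like:

    inurl:10000 intitle:"Webmin" "login"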

    • Sly Stone
    • February 22nd, 2008

    Hello, I managed easily enough to do the same thing as GooParser.
    I know that I can avoid the Google ban by running in a “stealthy” mode:
    I usually wait a random period between queries so my script looks human. But the problem is that I want to parse many things, and unfortunately that will take a lot of time.

    I want you to tell me whether there is a way to speed up the whole process. I don’t want you to tell me the way, just whether there is one.

    • John11
    • February 23rd, 2008

    I myself need the URLs of the Google search results. I was reading through the AJAX API and didn’t find it to be of use. Is such parsing in any way illegal? Have you considered using some other approach or API? And how are you avoiding the ban, using proxies?

    Would appreciate a reply.

  5. Hi Sly Stone, in the beginning I used the Google parser to get a lot of data without being banned and without doing anything strange: just making the requests exactly like your browser does, with 1 or 2 seconds between requests. With that I could get thousands of search results without being banned.
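
    As a sketch of the pacing (the URL and parameters are just illustrative):

    <?php
    // Fetch several result pages with a human-like pause between requests.
    // Google's 'start' parameter pages through the results 10 at a time.
    for ($start = 0; $start < 100; $start += 10) {
        $page = file_get_contents('http://www.google.com/search?q=test&start=' . $start);
        if ($page === false) {
            break; // stop on the first failed request (it may be a ban page)
        }
        // ... parse $page here ...
        sleep(rand(1, 2)); // 1 or 2 seconds between requests
    }
    ?>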

    A very useful tip: use tcpdump, Wireshark or something similar to check whether your script is really sending the requests exactly like your browser does. I solved several errors that way.

    After that you can use several IPs, randomize a lot of things so the requests look human, etc., but at a really high number of requests per minute these methods only delay the Google ban, depending on how many requests you make.

    So if you want a lot more results, I still don’t have a totally automatic method. Now the online Google parser shows the Google captcha when Google bans it and lets the user type the captcha in; then it can continue, having bypassed the ban.

    You caught me inspired to write ;)

  6. John11, after reading the Google terms of use I think that merely parsing their results is permitted, but they also say that making automated requests is not… you should probably read them yourself to figure out whether what you are going to do is “illegal” or not…

    I don’t think I’m making automated requests with this online Google parser; the users make the requests… you can be more or less strict in interpreting “automated requests”…

    I didn’t find the Google APIs useful for this, and the best approach I found was this one.

    And regarding the ban… I wrote a lot in my previous comment.

    • Sly Stone
    • February 26th, 2008

    Thank you, you are very helpful. I will check it with Wireshark :).
    I appreciate that you answered.

    Best Regards,
    Sly ;)

    • mobyhunr
    • March 7th, 2008

    How can I make it country-specific? I want to use it for marketing research. I need this tool; SEO Elite doesn’t do search-engine queries by country. Thanks, nice tool.

  7. mobyhunr,

    If you need some particular development or tool, I can help you. It isn’t difficult to modify my Google parser to make the searches country-specific; just contact me through the mail form here: http://goohackle.com/contact/

    If you can wait (I don’t have much spare time for this these days), I can upgrade my tool to make it country-specific. It’s a good idea, thanks.
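
    As a rough sketch of what the change would look like (the ‘gl’ country parameter is the one I would try first; verify it against Google’s current parameters):

    <?php
    // Build a country-specific search URL: use the local Google domain
    // and/or the 'gl' (country) query parameter.
    $query = urlencode('web hosting');
    $url = 'http://www.google.co.uk/search?q=' . $query . '&gl=uk';
    echo $url, "\n";
    ?>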

    • toxy
    • July 14th, 2008

    Hi, I cannot parse Google any more, but as I can see, you still can.

    I am using this code:

    <?php
    header('Content-type: text/html; charset=utf-8');
    $s = file_get_contents("http://www.google.com.tr/search?hl=tr&q=test&meta=");
    preg_match("/(.*?)/Us", $s, $d);
    echo ""; print_r($d);
    ?>

    What is wrong here? Can you please help me?

    • Ryan
    • September 3rd, 2008

    Definitely a cool idea, but I agree it’s pointless without source code. What is it, a government secret or something??

  8. toxy, some weeks ago Google changed some details of their HTML structure; check your regular expressions ;)

  9. Haha… thanks for the first part of your comment, Ryan… and for the second too ;) … I think the point is that I explain how I did it and, when I have time, I answer all the questions here.

    The point is to “teach” how to build this parser, or basically any web-page parser… not just to post the source code here.

    It’s simple enough, and all the functions you could need are very well documented at php.net.

    • Merimac
    • October 14th, 2008

    Hello goohackle,
    Nice job you have done.

    I’m interested in doing something like yours for my website, but I don’t see the difference between your method and using the Google API.

    What are the advantages and disadvantages of each method?

    Thanks

  10. Hi Merimac,

    When I wrote this tool the Google API had important limitations: you could only get the first 30 results for any search, and the number of searches per day was limited too.
    I think it still has these limitations, but you can check anyway.

    The Google API is easy, and you can quickly do small things that look good with it, but if you want more freedom to parse anything without limitations it isn’t very useful.

    On the other hand, if you write your own Google parser you can get any number of Google search results without any limitation and parse all the URLs or information that you need.
    This method requires more development at the beginning, and if some day Google changes their HTML then you need to modify your parser code too. But if you write good code, that will be very easy.

    Thanks for your comment.

  11. Hello,

    I think this parser is good for getting a list of clean URLs. But I want to know how much time it takes to crawl those URLs, and I would like to see this parser’s source code, if you agree and it is free.

    Thanks,
    Viren

    • Effy
    • December 1st, 2008

    I’m trying to use Perl to browse Google, but Google’s robots.txt forbids me from doing it. How do I bypass that?

  12. Your parser doesn’t handle certain URLs that are sometimes returned by Google when backtracking is enabled.

    e.g. the search for UCHSCP “University of Colorado Health Science Center Police” gave me the following URL:

    /url?q=http://www.uchsc.edu/police/&ei=sxSmSbrJBonOsAOgybDaDw&sa=X&oi=spellmeleon_result&resnum=1&ct=result&cd=1&usg=AFQjCNGIvG8PCF7LZKN6xD6VXLLm18CNlg
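
    Unwrapping those redirect links is simple once you notice the real URL sits in the q parameter; a minimal sketch, assuming the /url?q=… form shown above:

    <?php
    // Unwrap a Google redirect link of the form /url?q=<real-url>&...
    function unwrap_google_url($href) {
        if (strpos($href, '/url?') !== 0) {
            return $href; // not a redirect link, leave it alone
        }
        parse_str(parse_url($href, PHP_URL_QUERY), $params);
        return isset($params['q']) ? $params['q'] : $href;
    }

    echo unwrap_google_url('/url?q=http://www.uchsc.edu/police/&sa=X&resnum=1'), "\n";
    // prints: http://www.uchsc.edu/police/
    ?>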

  13. Does your parser work with regional Google results? If not, are you planning to add this?

    • Jitesh Sachdeva
    • May 2nd, 2009

    Are you sure your site works? Check it. It’s not working and gives some error code. It’s really not easy to parse Google; I’ve been trying for the last week and haven’t been able to do it. Let’s see whether you can do it or not.

  14. Hi Jitesh, thanks for your comment.

    It’s working fine now.
    I had a bot attack, and that was what caused the errors.

    I’ve dealt with this attack now, but it isn’t the first one and it won’t be the last, so feel free to tell me when problems like this happen.

    Thanks

    • glj12
    • July 10th, 2009

    So, are you going to post the source, or not? Are you willing to e-mail it?

    • rami
    • July 19th, 2009

    Thank you for sharing, but what do you mean by a bot attack? Is this Google trying to prevent you from parsing their results?

    • Dheeraj
    • July 20th, 2009

    Hi Goohackle,
    I wrote a similar tool in Java, using regular expressions to parse the results and save them to a text file. It was fine until Google detected the automated queries and I started getting error 503. I tried different methods, like changing user agents, putting sleep() in my program, and proxy servers, but to no avail. With no solution, I had to use the Google API. Can you tell me how you avoid the 503 error? (sorry.google.com: “your request looks like an automated request”)

    • Dheeraj
    • July 20th, 2009

    I tried using your tool. It’s giving me an error code and a captcha.

    • phil74
    • September 9th, 2009

    It worked last week, but I’m assuming your site was attacked again. Will it be fixed soon?

    Thanks,
    Phil

    • sandrar
    • September 10th, 2009

    Hi! I was surfing and found your blog post… nice! I love your blog. :) Cheers! Sandra. R.

  15. Hi!

    The problems you are experiencing with my tool are, like phil74 said, caused by the “bot attack” again… it isn’t a new one, it’s the same as last time with some modifications.

    But lately I have had no time to deal with it.
    When I have some time I’m going to try to solve it.

    Thanks all.

    • Yasser
    • October 5th, 2009

    Can I have the source code of the HTML parser?

  16. Any word on how to get around the 503 at URL: _http://sorry.google.com/sorry ?

    I find that if I mimic my Firefox headers it works perfectly for a basic search, but if I put in a complex search string I get a 503.

  17. For those asking about a way to get around the Google Sorry page and the 503 error, I wrote an article about how I did it:
    How to break/bypass Google 503 Sorry error Captcha

    I hope that helps!

    • Amber Carver
    • July 27th, 2011

    A link to this blog was on Christian Dillstrom’s list of recommended web pages; you are doing an amazing job when a mobile + social media marketing mastermind links to you.
