Google Crawling Behavior Exposed: 16 Different Googlebots?


Must.....Document.....Google's.....Crawlers!

Of the major disciplines in SEO, architecture is perhaps the most important of all. If you don’t take Google’s crawling behavior into account and Google can’t properly spider your website and discover your content, then it won’t matter whether it’s the right content, whether it’s optimized, or whether you have obtained any links to it. Content may as well not exist if Google can’t discover it.

You’re probably familiar with Google’s spider program, Googlebot, but there are many other ways that Google visits your website. This post attempts to pull them all together, and it raises some open questions about Google crawling behavior for the community, because some aspects of Google’s numerous spiders are still relatively mysterious.

Google Crawling Behavior Problem: Whitelisting User-Agents

A client of mine recently had some issues with Google Instant Preview: they had been whitelisting browsers based on user-agent detection, and I was surprised to see that at some point, Google’s bot that grabs the preview of their web pages must have failed the whitelist check.

Their SERP preview images clearly showed the message “Browser not supported”; this was the message displayed by their website, so it had to have been triggered by Google’s bot. Even worse, Google Instant Preview was highlighting this large message as the key text on the page. This lends some credence, in my opinion, to Joshua Giardino’s theory that Google is using Chrome as a rendering front-end for its spiders – and it could simply be that, for a short time, the Instant Preview bot’s user-agent was reporting something that was not on my client’s whitelist.
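To make the failure mode concrete, here is a minimal sketch in Python of the kind of user-agent whitelist that can produce exactly this problem. The allowed tokens are invented for illustration, and the preview bot’s user-agent string is what webmasters have reported seeing – treat the exact value as an assumption, since Google can change it at any time:

```python
# A sketch of user-agent whitelisting gone wrong: the preview bot's UA
# matches nothing on the (made-up) whitelist, so the bot is served the
# "Browser not supported" page -- and that page is what ends up in the
# SERP preview screenshot.

ALLOWED_TOKENS = ("Chrome/", "Firefox/", "MSIE 8", "MSIE 9")

def is_supported(user_agent: str) -> bool:
    # Whitelist logic: anything not explicitly matched gets rejected.
    return any(token in user_agent for token in ALLOWED_TOKENS)

# Reported UA of the Instant Preview fetcher (an assumption; see above).
preview_bot = ("Mozilla/5.0 (en-us) AppleWebKit/525.13 (KHTML, like Gecko; "
               "Google Web Preview) Safari/525.13")

if not is_supported(preview_bot):
    # This is the page Google ends up screenshotting for the preview.
    print("Browser not supported")
```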

What is This Google Instant Preview Crawler Anyway?

Digging into this, I found that Google Instant Preview (although according to Google it is essentially Googlebot) reports a different user-agent, and in addition to running automatically, it appears to update (or perhaps schedule an update) on demand when users request previews. I knew Google had a separate crawler for Google News and a few other things, but I had never really thought much about Google crawling behavior as it relates to Instant Preview. To me, if something reports a separate user-agent and behaves differently, it merits classification as a separate spider. Google could easily have one massive program with numerous if-then statements at the start and technically call it one crawler, but the essential functions would still be somewhat different.

This got me wondering…just how many ways does Google crawl your website?

“How do I crawl thee?  Let me count the ways…”

It turns out – quite a few. Google itself lists around 8 or 9 in a few places, but I was able to identify *16* different ways Google could potentially grab content from your website, with a variety of different user-agents and behaviors.

Arguably, the single-page-fetch ones are not technically complete “crawlers” or “spiders,” but they do register in your web server log with a user-agent or referrer string, and they do pull data from your website; in my opinion they can be considered at least partially human-powered crawlers, since humans initiate each fetch on demand.
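Since each of these registers in your logs with its own string, a quick way to see which Google programs actually visit your site is to tally Google-related user-agents from your access log. A rough sketch, assuming an Apache/Nginx combined log format; the log path and the marker list are placeholders to adjust for your own setup:

```python
import re
from collections import Counter

# Substrings that identify some of Google's fetchers in user-agent strings.
# This list is illustrative, not exhaustive.
GOOGLE_MARKERS = ("Googlebot", "Google Web Preview", "Mediapartners-Google",
                  "AdsBot-Google", "Feedfetcher-Google")

# In combined log format, the user-agent is the last quoted field on the line.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

counts = Counter()
with open("/var/log/apache2/access.log") as log:  # hypothetical path
    for line in log:
        match = UA_PATTERN.search(line)
        if not match:
            continue
        for marker in GOOGLE_MARKERS:
            if marker in match.group(1):
                counts[marker] += 1

for marker, n in counts.most_common():
    print(f"{marker}: {n}")
```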

Take “Fetch as Googlebot” for instance – even though it reports the same user-agent as Googlebot, it appears to be a different program, by Google’s own admission:

“While the tool reflects what Googlebot sees, there may be some differences. For example, Fetch as Googlebot does not follow redirects”

One could argue that Google Translate is more of a proxy than a spider, but Google is fetching your web page – who knows what it is doing with it? Google clearly states that the translated content will not be indexed, but that doesn’t mean the retrieved page isn’t utilized in some way – perhaps even re-indexed in Google’s main index, faster than it would have been otherwise.

Table 1 details what I found. For “SERP Change Timeframe,” items in quotes are from various Google statements; the rest are rough estimates from my own experience:

Table 1 – All of Google’s Crawlers, or Crawler-like Programs and Services

The links with background information on each of these are listed below in Table 2, in case you’re interested in drilling into any particular area of Google crawling behavior. It’s important to note that Video, Mobile, News, and Images each have their own sitemap format. If you have significant amounts of content in these areas (i.e., on the order of thousands of images or hundreds of videos), it’s advisable to create separate sitemaps to disclose that content to Google’s crawlers as efficiently as possible, rather than relying on them to discover it on their own (a sketch of generating an image sitemap follows Table 2):

Crawler Name – Further Details from Google

Googlebot
    http://support.google.com/webmasters/bin/answer.py?hl=en&answer=182072
Googlebot News
    http://support.google.com/news/publisher/bin/answer.py?hl=en&answer=68331
    http://support.google.com/webmasters/bin/topic.py?hl=en&topic=10078
Googlebot Images
    http://support.google.com/webmasters/bin/answer.py?hl=en&answer=35308
    http://www.google.com/support/forum/p/Webmasters/thread?tid=780c9832c4012ded&hl=en
    http://support.google.com/webmasters/bin/answer.py?hl=en&answer=178636
Googlebot Video
    http://support.google.com/webmasters/bin/answer.py?hl=en&answer=80472
    http://www.youtube.com/watch?v=rdPnRn_qQHw
    http://support.google.com/webmasters/bin/topic.py?hl=en&topic=10079
Google Mobile
    http://googlewebmastercentral.blogspot.com/2011/12/introducing-smartphone-googlebot-mobile.html
Google AdSense
    http://support.google.com/adsense/bin/answer.py?hl=en&answer=99376
    http://support.google.com/adsense/bin/answer.py?hl=en&answer=9715
Google Mobile AdSense
    http://support.google.com/adsense/bin/answer.py?hl=en&answer=68731
Google AdsBot
    http://www.google.com/adsbot.html
Google Feedfetcher for iGoogle Gadgets
    http://support.google.com/webmasters/bin/answer.py?hl=en&answer=178852
Google Feedfetcher for RSS
    http://support.google.com/webmasters/bin/answer.py?hl=en&answer=178852
Google Translate
    Could not locate a statement from Google clarifying the string; others have noted it.
Google Site Verification
    http://support.google.com/webmasters/bin/answer.py?hl=en&answer=35179
Google Instant Preview
    http://support.google.com/webmasters/bin/answer.py?hl=en&answer=1062498
Google Wireless Transcoder
    Could not locate a statement from Google clarifying the string; others have noted it.
Google Cache
    Could not locate a statement from Google clarifying the string; others have noted it.
Fetch as Googlebot
    http://support.google.com/webmasters/bin/answer.py?hl=en&answer=158587

Table 2 – Background Information
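To make the sitemap advice above concrete, here is a minimal sketch of generating a Google image sitemap with Python’s standard library. The page and image URLs are placeholders, and the Googlebot Images links in Table 2 document the optional tags omitted here:

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"

# Placeholder data: each page URL maps to the image URLs it contains.
pages = {
    "http://www.example.com/gallery.html": [
        "http://www.example.com/images/photo1.jpg",
        "http://www.example.com/images/photo2.jpg",
    ],
}

ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("image", IMAGE_NS)

urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
for page_url, image_urls in pages.items():
    url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
    ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page_url
    for img_url in image_urls:
        image = ET.SubElement(url, f"{{{IMAGE_NS}}}image")
        ET.SubElement(image, f"{{{IMAGE_NS}}}loc").text = img_url

# Writes sitemap-images.xml, ready to disclose via Webmaster Tools.
ET.ElementTree(urlset).write("sitemap-images.xml",
                             xml_declaration=True, encoding="UTF-8")
```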

Some Interesting Learnings I Took Away From This Research

1.) You Can Easily Provide Your Website in Other Languages

Totally incidental to this post, but *really* neat – in researching Google Translate’s behavior, I discovered that Google Translate can provide you with widget code to put on your blog – look at the rightmost part of the TOP-NAV menu – Coconut Headphones is now available in other languages! You can get the code snippet here if you’d like to do the same:

http://translate.google.com/translate_tools

2.) Whitelisting User-Agents is Probably a *Bad* Idea

One takeaway is that, if you think about it, whitelisting is a *very* bad idea, although it’s easy to see why people are often tempted to do it. One might use whitelists with the intent of excluding old versions of IE, or perhaps a few difficult mobile clients. One could also easily envision large-company QA groups arbitrarily imposing whitelists based on their testing schedule (i.e., for CYA purposes – since it’s impossible to truly test all browsers).

But I think it’s *far* better to blacklist user-agents than to whitelist them, simply because it’s impossible to predict what Google will use for user-agent strings in the future. Google could change its user-agent strings or crawling behavior at any time, or come out with a new type of crawler, and you wouldn’t want to be left behind when that happens.

It’s important to “future-proof” your website against Google changes – algorithmic and otherwise – so I think that, on the whole, blacklisting is the wiser approach.
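A minimal sketch of the difference, with made-up tokens – the point is simply that a blacklist defaults to “allow,” so a crawler with a user-agent you’ve never seen still gets your real pages:

```python
# Blacklist: only explicitly listed user-agents are turned away.
BLOCKED_TOKENS = ("BadScraperBot", "ContentThief/1.0")  # invented examples

def is_blocked(user_agent: str) -> bool:
    return any(token in user_agent for token in BLOCKED_TOKENS)

# A hypothetical future Google crawler you couldn't have predicted:
future_ua = "Mozilla/5.0 (compatible; Google-NewCrawler/1.0)"
print(is_blocked(future_ua))  # False -- it gets served normally
```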

Some Interesting Questions Still Unanswered

1.) Is it possible to “game” Google using this information?

…by using the on-demand services to get indexed more quickly? I have seen *many* SEO bloggers talking about how Fetch as Googlebot will speed up your indexing, but I have found no evidence of anyone who actually performed an A/B test on it with two new sites or pages. What about Instant Preview, Google Cache, and Submit a URL? Does submitting a single URL to Google result in any speedier indexing, or is it only helpful for alerting Google to new websites?

Note: A few days after this article was originally published, Matt Cutts announced new features in “Fetch as Googlebot”: you’re now allowed to instruct Googlebot to more quickly index up to 50 individual pages a week, and 10 pages plus all the pages they point to per month. So it appears that rather than having people figure out how to “game” the system, Google has given us a direct line into the indexing functionality. Whether any of the other methods listed above have any effect remains to be seen.

See video here:
http://www.searchenginejournal.com/googles-matt-cutts-explains-fetch-as-googlebot-in-submit-to-index-via-google-webmaster-tools/40452/

2.) How are cached pages retrieved?

…is it similar to Instant Previews, where in some cases the page is retrieved automatically based on a crawl, but sometimes the retrieval is initiated on demand when users click on a Cache entry? Is it just Googlebot itself, or something else? There is little to nothing on the web or on Google’s support pages about this, other than people noting or complaining about the listed referrer string in their web server logs, and Google doesn’t seem to acknowledge much about this at all, from what I could find.

You’d think that Google’s function of essentially copying the entire web (1 trillion+ documents) and storing the copies for display from its own website would be a little better documented, or at least would have been investigated by the SEO community! (I’m hoping I’ve missed something on this one – by all means, straighten me out if you know anything.)
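In the meantime, one way to start investigating on your own site is to scan your access log for cache-related referrer strings. A rough sketch; the hostname below reflects what webmasters have reported seeing in their logs and should be treated as an assumption, as should the log path:

```python
# Markers that may indicate traffic related to Google's cached copies.
# Both values are assumptions based on webmaster reports, not Google docs.
CACHE_MARKERS = ("webcache.googleusercontent.com", "/search?q=cache:")

with open("/var/log/apache2/access.log") as log:  # hypothetical path
    for line in log:
        # Print any request whose referrer (or URL) mentions the cache.
        if any(marker in line for marker in CACHE_MARKERS):
            print(line.rstrip())
```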

Conclusion

Clearly, Google sometimes moves in mysterious ways. If you run across any additional ways that Google examines your website, let me know and I’ll update the tables above. Also – if you know of any quality research, or even if you’ve just done some anecdotal tests of your own on any of the issues above regarding Google crawling behavior, by all means let us know your thoughts below!

Comments

  1. Adam Audette says:

    Good stuff, Ted. Good timing, too. We recently debugged a weird issue with Adwords instant previews for our client National Geographic and found the culprit was “Google Web Preview.” More here: http://www.rimmkaufman.com/blog/google-preview-user-agent-redirection-curl-command-user-agent/09022012/

    Until now, we were the only ones who’d done any research here – thank you for adding even more insight.

  2. IncrediBILL says:

    Whitelisting is a *GREAT* idea, far from bad. I’ve been doing it since 2006 with no backlash whatsoever. You just need to monitor activity and add to your whitelist now and then, which is no different from monitoring activity to add to your blacklist. Blacklists are the bad idea: you can never protect your content or server resources if you wait until after the damage has happened. Blacklisting is REACTIVE, as opposed to whitelisting, which is PROACTIVE and stops the problems before they start.

  3. Ted Ives says:

    Ah, you’re the CrawlWall guy, I ran across your website recently.

    I wonder whether it’s wise, though, since Google clearly tries to figure out whether you’re presenting different content to users than to Googlebot; I would think it must have a special version of Googlebot that quietly goes in, pretends to be a normal user, and double-checks things.

    Whitelisting could perhaps risk blocking that (presumably with some sort of penalty resulting). Then again, the Fantomaster guy has been around for a long time; maybe there is something to it…

    Feel free to drop a commercial-sounding pitch in a comment here, because I for one certainly don’t “get it” and could probably use some “edumication” on this front.

  4. IncrediBILL says:

    You could be right – I never claimed to know everything Google does – but I’ve never seen anyone penalized for whitelisting *yet*.

    Maybe when tens of thousands of sites, big and small, are all whitelisting, we’ll find out for sure what Google truly does when push comes to shove, because it’ll become more obvious.

    However, the way I whitelist is that I just whitelist and validate the things claiming to be spiders. Other things coming from Google – and there are proxy services involved – are treated on a case-by-case basis, browser or crawler, so I’m not treating everything exactly the same, which could mean avoiding a penalty for blocking the wrong thing.

    Besides, if Google were truly checking for bad behavior, wouldn’t you think they’d avoid using actual Google IPs? I would think Google would check from outside their official network, just so it wouldn’t be so obvious.
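    (A minimal sketch of that validation step – the reverse-then-forward DNS check Google recommends for confirming that something claiming to be Googlebot really is; the sample IP is just a commonly-cited Googlebot address:)

    ```python
    import socket

    def is_real_googlebot(ip: str) -> bool:
        try:
            # Reverse lookup: genuine Googlebot IPs resolve under
            # googlebot.com or google.com.
            host = socket.gethostbyaddr(ip)[0]
            if not host.endswith((".googlebot.com", ".google.com")):
                return False
            # Forward-confirm: the hostname must resolve back to the same IP.
            return ip in socket.gethostbyname_ex(host)[2]
        except OSError:
            return False

    print(is_real_googlebot("66.249.66.1"))  # commonly-cited Googlebot IP
    ```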

  5. Yousaf says:

    This is a brilliant post! Very insightful, and to be honest, if I were Google I would use Chrome as a renderer too – it makes complete sense, especially if you take their latest “layout” signal into account.

  6. Ben Acheson says:

    Thanks Ted, this is a fantastic post. Google often experiments and mixes things up, especially for mobile devices, etc.

    You have to think carefully about how you treat user agents and the best word to keep in mind is “test!”




