How to Get Googlebot to “Teach You” Advanced SEO

by Hamlet Batista | January 30, 2012 | 9 Comments

I recently worked on an enterprise-level client’s non-SEO related project where the goal was to confirm or deny that their new product:
1)  Was not doing anything that could be considered black hat.
2)  Was providing any SEO benefit for their clients.
The problem you face with projects like this is that Google doesn’t provide enough information, and you cannot post corner-case questions like this in public webmaster forums. Doing so would violate your NDA and potentially reveal your client’s intellectual property. So, what option do you have left? Well, you set up a honeypot!
“Honeypot” is a term that comes from the information security industry. Honeypots are a set of files that, to an automated program, look like regular files, but they allow for the monitoring and “capturing” of specific viruses, e-mail harvesters, etc. In our case, we set up a honeypot with the purpose of detecting and tracking search engine bot behavior in specific circumstances. We also wanted to track the outcome (positive, neutral or negative) in the search engine results pages (SERPs).
Let me walk you through a few ways you can learn advanced SEO by using a honeypot.
Goals of the honeypot
First, let’s define the goals in terms of questions for which we don’t have public answers. Here are some interesting questions you and I might have:
1. Which search bots support the if-modified-since and/or the if-unmodified-since headers?
2. Is Googlebot really a headless browser?
3. Which search bots crawl AJAX URLs? Which ones support Google’s crawlable scheme?
4. Does Google follow links inside PDFs? Do they count for indexation and rankings?
5. Does the in-page canonical tag carry more weight than the canonical link header?
Add your own questions to this list. For the purpose of this post, I’m going to explain how you go about answering the first question. The recent work I did for a client was related to AJAX-style fragment URLs. Unfortunately, I can’t share any details.
Setting up the Honeypot
The first thing you need to do is understand the problem really well. In our case, if-modified-since is a header that browsers and bots can send to a webserver, and the webserver will avoid resending a resource (image, document, page, video, etc.) if it hasn’t changed since the last time it was requested. The primary goal is to save bandwidth.
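If-unmodified-since does the opposite: the server only returns the resource if it hasn’t changed since the given date. If it has changed, the server answers with a 412 (Precondition Failed) error instead.
There is a technical protocol that HTTP clients and servers must obey, and a typical conversation looks like this: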
CLIENT/BOT Request:
[RAW]GET / HTTP/1.1
Host: hamletbatista.com
If-Modified-Since: Thu, 26 Jan 2012 17:32:59 GMT[/RAW]
SERVER Response:
[RAW]HTTP/1.1 304 Not Modified
Date: Thu, 26 Jan 2012 17:32:59 GMT[/RAW]
CLIENT/BOT Request:
[RAW]GET / HTTP/1.1
Host: hamletbatista.com
If-Unmodified-Since: Thu, 26 Jan 2012 17:32:59 GMT[/RAW]
SERVER Response:
[RAW]HTTP/1.1 412 Precondition Failed[/RAW]
You can learn more about this here.
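To make the exchange above concrete, here is a minimal Python sketch of the decision a server makes when it sees these headers. The function name `conditional_status` and its arguments are made up for illustration, and the sketch ignores ETag-based headers such as If-None-Match, which a real server would also take into account.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def conditional_status(last_modified, request_headers):
    """Return the status code for a GET, given the resource's last change.

    last_modified: timezone-aware datetime of the last change (hypothetical input).
    request_headers: dict mapping request header names to values.
    """
    headers = {name.lower(): value for name, value in request_headers.items()}

    # If-Unmodified-Since: fail with 412 when the resource changed after the date.
    ius = headers.get("if-unmodified-since")
    if ius and last_modified > parsedate_to_datetime(ius):
        return 412

    # If-Modified-Since: answer 304 when nothing changed after the date.
    ims = headers.get("if-modified-since")
    if ims and last_modified <= parsedate_to_datetime(ims):
        return 304

    # No conditional header, or the condition allows a full response.
    return 200


# A resource that changed the day after the date sent by the client.
changed = datetime(2012, 1, 27, 9, 0, 0, tzinfo=timezone.utc)
stamp = "Thu, 26 Jan 2012 17:32:59 GMT"
print(conditional_status(changed, {"If-Modified-Since": stamp}))    # 200
print(conditional_status(changed, {"If-Unmodified-Since": stamp}))  # 412
```

With a resource that has not changed since the date, the same two calls return 304 and 200 respectively, which is why only 304 and 412 tell you for sure that a conditional header was sent.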
The most common way to follow these conversations between servers and bots is to set up and analyze traffic logs. However, the typical format of a traffic log does not store ‘if-modified-since’ header information. Sometimes it is practical to set up a custom log to track this information, but other times it isn’t.
Here is what a typical log entry looks like for a valid Googlebot request.
[RAW]66.249.67.9 - - [26/Jan/2011:02:29:32 -0500] "GET / HTTP/1.1" 200 157 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"[/RAW]
Getting the answers
One simple alternative is to look at the response code. In the case of a request that includes the ‘if-modified-since’ header, the web server will return status code 200 if the page changed, and status code 304 if it hasn’t changed. On the other hand, it will return 412 if the resource changed and the client sent an ‘if-unmodified-since’ header.
Because 200 is a code that can also be returned when the ‘if-modified-since’/‘if-unmodified-since’ headers are not sent, the most reliable way to tell if a request included the header we want to check is to track responses that returned 304 (the response that says nothing changed) or 412 (something changed).
You also want to make sure your webserver supports the corresponding headers. You can use Firebug for this.
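If you prefer a script to a browser tool, the same check can be sketched in Python with only the standard library. The helper name `probe_conditional` is made up for this example; it sends one GET carrying the conditional header stamped with the current time and returns the status code it gets back.

```python
import http.client
from email.utils import formatdate
from urllib.parse import urlsplit


def probe_conditional(url, header_name):
    """Send one GET with a conditional header set to 'now'; return the status."""
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=10)
    try:
        # formatdate(usegmt=True) yields an HTTP date such as
        # "Thu, 26 Jan 2012 17:32:59 GMT" for the current time.
        headers = {header_name: formatdate(usegmt=True)}
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path, headers=headers)
        response = conn.getresponse()
        response.read()  # drain any body before closing the connection
        return response.status
    finally:
        conn.close()
```

For a page that has not changed in the last few seconds, `probe_conditional(url, "If-Modified-Since")` should come back as 304 on a server that honors the header. A 200 suggests the header is being ignored for that URL, which is common for dynamically generated pages.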

As you may have guessed by now, it is easy to check if Googlebot supports this header by checking the traffic log for entries coming from Googlebot and seeing if the responses include the 304 or 412 status codes.
I wrote a simple log parsing script in Python to look for response codes 304 or 412 and see if any entry came from Googlebot. In order to make it work, you will need the excellent Python log parser, apachelog.
[python]
import glob

import apachelog  # third-party Apache log parser

# The first field is X-Forwarded-For instead of the usual %h.
log_format = r'%{X-Forwarded-For}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"'
parser = apachelog.parser(log_format)

for log in glob.glob("access_log*"):
    with open(log) as handle:
        for line in handle:
            try:
                data = parser.parse(line)
            except apachelog.ApacheLogParserError:
                # Skip lines that don't match the expected format
                continue
            status = data['%>s']
            ua = data['%{User-agent}i']
            rq = data['%r']
            # Skip feed requests; remove this test to list feed readers too
            if '/feed/' not in rq and status in ('304', '412'):
                print(rq)
                print(ua)
                print(status)
[/python]
This is the partial output.
[RAW]GET /feed/ HTTP/1.1
Netvibes (http://www.netvibes.com/; 58 subscribers; feedID: 1503582)
304
GET /feed/ HTTP/1.1
Alltop/1.1
304
GET /feed/ HTTP/1.1
Xianguo.com 1 Subscribers
304
GET /feed/ HTTP/1.1
Alltop/1.1
304
GET /feed/ HTTP/1.1
Alltop/1.1
304
GET /feed/ HTTP/1.1
Alltop/1.1
304
GET /feed/ HTTP/1.1
NetNewsWire/3.2.15 (Mac OS X; http://netnewswireapp.com/mac/; gzip-happy)
304[/RAW]
All entries came from newsreaders and related bots. There wasn’t a single entry from Googlebot or any other search bot.
Conclusion: No evidence of support.
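If installing apachelog is not an option, the same tally can be sketched with the standard library alone. The regular expression below assumes the common "combined" log layout of the Googlebot entry shown earlier, the `SEARCH_BOTS` tuple is only an illustrative, incomplete list of user-agent substrings, and the second sample line is made up for the demonstration.

```python
import re
from collections import Counter

# Combined log format: host ident user [time] "request" status bytes "referer" "agent"
LINE_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Illustrative, incomplete list of search bot user-agent substrings.
SEARCH_BOTS = ("Googlebot", "bingbot", "Slurp", "Baiduspider", "YandexBot")


def conditional_hits(lines):
    """Count 304/412 responses per (user agent, status) pair."""
    hits = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group("status") in ("304", "412"):
            hits[(match.group("agent"), match.group("status"))] += 1
    return hits


def search_bot_hits(hits):
    """Keep only the entries whose user agent looks like a search bot."""
    return {key: count for key, count in hits.items()
            if any(bot in key[0] for bot in SEARCH_BOTS)}


sample = [
    '66.249.67.9 - - [26/Jan/2011:02:29:32 -0500] "GET / HTTP/1.1" 200 157 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    # Made-up feed reader request, for illustration only.
    '10.0.0.1 - - [26/Jan/2011:02:30:00 -0500] "GET /feed/ HTTP/1.1" 304 - "-" '
    '"Alltop/1.1"',
]
hits = conditional_hits(sample)
print(hits)
print(search_bot_hits(hits))
```

An empty result from `search_bot_hits` matches the conclusion above: conditional responses went to feed readers, not to search bots.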
I know I said I would only cover one example, but I feel like I need to give you a little bit more to get you really excited about this stuff.
Let’s say you didn’t think about looking at the response codes to track ‘if-modified-since’, or you need to track which search bots support the canonical link header, or you want to know if Googlebot asks for compression when making requests. In order to track this easily, you need to log extra header information that is not part of the typical log setup.
This is how you do it:

  1. You create a separate log file so you don’t mess up the ability to use log analysis tools that rely on standard log formats.
  2. You filter this separate log so it only records the traffic you want to track. In our case, we want to track search bot traffic.
  3. You change the log format so it records the additional fields.

Here is the partial configuration I used to perform the tests for this post:
[RAW]SetEnvIf User-Agent ".*Googlebot/2.1.*" gbot
LogFormat "%{X-Forwarded-For}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\" \"%{Accept-encoding}i\"" proxy2
# I use CloudFlare to speed up this blog, so I need to record the X-Forwarded-For instead of the reverse proxy IP address
CustomLog "|/usr/sbin/rotatelogs -l /var/www/hamletbatista/logs/googlebot_log.%Y-%m-%d 86400" proxy2 env=gbot[/RAW]
You don’t need to wait for Googlebot to come to the site to test your honeypot. You can use Google Webmaster Tools’ ‘Fetch as Googlebot,’ and Googlebot will come right away. The main difference I’ve seen using this method is that if you provide a URL with a redirect, Googlebot won’t follow it. The regular Googlebot crawler, however, will.
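Once the custom log starts filling up, reading the extra field is straightforward. Here is a minimal sketch that parses lines in the `proxy2` format above and reports whether each request advertised gzip support. The names `PROXY2_RE` and `accepts_gzip` are made up for this example, and the sample line is invented for illustration; it is not real log data.

```python
import re

# Matches the proxy2 LogFormat: the combined fields plus a quoted Accept-Encoding.
# The client field is matched lazily because X-Forwarded-For may hold several
# comma-separated addresses.
PROXY2_RE = re.compile(
    r'^(?P<client>.+?) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)" "(?P<encoding>[^"]*)"'
)


def accepts_gzip(line):
    """Return True/False for a parsed line, or None if it does not match."""
    match = PROXY2_RE.match(line)
    if match is None:
        return None
    # Accept-Encoding is a comma-separated list, e.g. "gzip,deflate".
    # Any ";q=..." weight is dropped, so this sketch ignores weights.
    codings = [coding.split(";")[0].strip().lower()
               for coding in match.group("encoding").split(",")]
    return "gzip" in codings


# Made-up sample line in the proxy2 layout.
sample = ('66.249.67.9 - - [26/Jan/2012:02:29:32 -0500] "GET / HTTP/1.1" 200 157 '
          '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; '
          '+http://www.google.com/bot.html)" "gzip,deflate"')
print(accepts_gzip(sample))
```

Apache writes a dash when the header is missing, so a request without Accept-Encoding shows up as `"-"` in the last field and the function returns False for it.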
This post is just scratching the surface of all the possible insights you can gain by setting up honeypots to answer your more complex technical SEO questions. If you use this approach and get some really useful results, please make sure to share them in the comments.

Hamlet Batista

Chief Executive Officer

Hamlet Batista is CEO and founder of RankSense, an agile SEO platform for online retailers and manufacturers. He holds US patents on innovative SEO technologies, started doing SEO as a successful affiliate marketer back in 2002, and believes great SEO results should not take 6 months
