How to Get Googlebot to “Teach You” Advanced SEO

by Hamlet Batista | January 30, 2012 | 9 Comments

I recently worked on an enterprise-level client’s non-SEO related project where the goal was to confirm or deny that their new product:
1)  Was not doing anything that could be considered black hat.
2)  Was providing any SEO benefit for their clients.
The problem you face with projects like this is that Google doesn’t provide enough information, and you cannot post corner-case questions like these in public Webmaster forums. To do so would violate your NDA and potentially reveal your client’s intellectual property. So, what option do you have left? Well, you set up a honeypot!
Honeypot is a term that comes from the information security industry. A honeypot is a set of files that, to an automated program, look like regular files, but that allow for the monitoring and “capturing” of specific viruses, e-mail harvesters, etc. In our case, we set up a honeypot with the purpose of detecting and tracking search engine bot behavior in specific circumstances. We also wanted to track the outcome (positive, neutral or negative) in the search engine results pages (SERPs).
Let me walk you through a few ways you can learn advanced SEO by using a honeypot.
Goals of the honeypot
First, let’s define the goals in terms of questions for which we don’t have public answers. Here are some interesting questions you and I might have:
1. Which search bots support the if-modified-since and/or the if-unmodified-since headers?
2. Is Googlebot really a headless browser?
3. Which search bots crawl AJAX URLs? Which ones support Google’s crawlable scheme?
4. Does Google follow links inside PDFs? Do they count for indexation and rankings?
5. Does the in-page canonical tag carry more weight than the canonical link header?
Add your own questions to this list. For the purpose of this post, I’m going to explain how you go about answering the first question. The recent work I did for a client was related to AJAX-style fragment URLs. Unfortunately, I can’t share any details.
Setting up the Honeypot
The first thing you need to do is understand the problem really well. In our case, if-modified-since is a header that browsers and bots can send to a webserver, and the webserver will avoid resending a resource (image, document, page, video, etc.) if it hasn’t changed since the last time it was requested. The primary goal is to save bandwidth.
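If-unmodified-since does the opposite: the webserver returns the resource only if it hasn’t changed since the given date, and otherwise answers with a 412 (Precondition Failed) status code.
There is a technical protocol that HTTP clients and servers must obey, and a typical conversation looks like this: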
CLIENT/BOT Request:
[RAW]GET / HTTP/1.1
Host: hamletbatista.com
If-Modified-Since: Thu, 26 Jan 2012 17:32:59 GMT[/RAW]
SERVER Response:
[RAW]HTTP/1.1 304 Not Modified
Date: Thu, 26 Jan 2012 17:32:59 GMT[/RAW]
CLIENT/BOT Request:
[RAW]GET / HTTP/1.1
Host: hamletbatista.com
If-Unmodified-Since: Thu, 26 Jan 2012 17:32:59 GMT[/RAW]
SERVER Response:
[RAW]HTTP/1.1 412 Precondition Failed[/RAW]
You can learn more about this here.
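To make the two exchanges above concrete, here is a minimal Python sketch of the decision a webserver makes when it sees these headers. The function name conditional_status and its arguments are my own, chosen for illustration; real servers also handle ETags and other preconditions that this sketch leaves out.

```python
from email.utils import parsedate_to_datetime


def conditional_status(last_modified, if_modified_since=None, if_unmodified_since=None):
    """Return the status code a server would send for a conditional GET.

    All arguments are HTTP-date strings such as
    'Thu, 26 Jan 2012 17:32:59 GMT'. This is an illustrative helper,
    not part of any webserver.
    """
    modified = parsedate_to_datetime(last_modified)

    # If-Unmodified-Since: fail with 412 when the resource changed after the date.
    if if_unmodified_since is not None:
        if modified > parsedate_to_datetime(if_unmodified_since):
            return 412

    # If-Modified-Since: answer 304 when the resource has not changed since the date.
    if if_modified_since is not None:
        if modified <= parsedate_to_datetime(if_modified_since):
            return 304

    # No precondition applied: send the full resource.
    return 200
```

With a resource last modified on Friday, 27 Jan 2012, a request carrying the If-Modified-Since date from the conversation above gets a 200, while the same date sent as If-Unmodified-Since gets a 412.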
The most common way to follow these conversations between servers and bots is to set up and analyze traffic logs. However, the typical format of a traffic log does not store ‘if-modified-since’ header information. Sometimes it is practical to set up a custom log to track this information, but other times it isn’t.
Here is what a typical log entry looks like for a valid Googlebot request.
[RAW]66.249.67.9 - - [26/Jan/2011:02:29:32 -0500] "GET / HTTP/1.1" 200 157 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"[/RAW]
Getting the answers
One simple alternative is to look at the response code. For a request that includes the ‘if-modified-since’ header, the web server returns status code 200 if the page changed and status code 304 if it hasn’t. For a request that includes the ‘if-unmodified-since’ header, it returns 412 if the resource changed.
Because 200 is also the code returned when the ‘if-modified-since’/‘if-unmodified-since’ headers are not sent, the most reliable way to tell whether a request included the header we want to check is to track responses that returned 304 (the response that says nothing changed) or 412 (something changed).
You also want to make sure your webserver supports the corresponding headers. You can use Firebug for this.
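If you prefer a script for this check, here is a rough sketch using only Python’s standard library. The helper names build_conditional_request and probe are mine, and the URL you pass in is whatever page you want to test; treat it as one possible way to run the check, not the only one.

```python
import urllib.error
import urllib.request
from email.utils import formatdate


def build_conditional_request(url, header="If-Modified-Since", timestamp=None):
    """Build a GET request carrying a conditional header.

    timestamp is seconds since the epoch; None means "now".
    """
    date = formatdate(timestamp, usegmt=True)  # e.g. 'Thu, 26 Jan 2012 17:32:59 GMT'
    return urllib.request.Request(url, headers={header: date})


def probe(url, header="If-Modified-Since", timestamp=None):
    """Send the conditional request and return the status code the server answers with."""
    request = build_conditional_request(url, header, timestamp)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib reports 304 and 412 as HTTPError, so read the code from the exception.
        return error.code
```

A server that honors the header should answer probe(url) with 304 for a page that has not changed recently. For ‘if-unmodified-since’, passing an old timestamp such as 0 should produce a 412 on a server that supports it.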

As you should have guessed by now, it is easy to check if Googlebot supports this header by checking the traffic log for entries coming from Googlebot and seeing if the responses include the 304 or 412 status codes.
I wrote a simple log parsing script in Python to look for response codes 304 or 412 and see if any entry came from Googlebot. In order to make it work, you will need the excellent Python log parser, apachelog.
[python]
import apachelog, sys, glob

# Format of the access log; X-Forwarded-For is recorded in place of the client IP.
log_format = r'%{X-Forwarded-For}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"'
files = glob.glob("access_log*")
p = apachelog.parser(log_format)

for log in files:
    for line in open(log):
        try:
            data = p.parse(line)
            status = data['%>s']
            ua = data['%{User-agent}i']
            rq = data['%r']
            referrer = data['%{Referer}i']
            # Skip feed requests; keep only 304 and 412 responses.
            if rq.find('/feed/') < 0 and status in ('304', '412'):
                #print(referrer)
                print(rq)
                print(ua)
                print(status)
        except Exception:
            # Ignore lines that do not match the log format.
            #sys.stderr.write("Unable to parse %s" % line)
            pass
[/python]
This is the partial output.
[RAW]GET /feed/ HTTP/1.1
Netvibes (http://www.netvibes.com/; 58 subscribers; feedID: 1503582)
304
GET /feed/ HTTP/1.1
Alltop/1.1
304
GET /feed/ HTTP/1.1
Xianguo.com 1 Subscribers
304
GET /feed/ HTTP/1.1
Alltop/1.1
304
GET /feed/ HTTP/1.1
Alltop/1.1
304
GET /feed/ HTTP/1.1
Alltop/1.1
304
GET /feed/ HTTP/1.1
NetNewsWire/3.2.15 (Mac OS X; http://netnewswireapp.com/mac/; gzip-happy)
304[/RAW]
All entries came from newsreaders and related bots. There wasn’t a single entry from Googlebot or any other search bot.
Conclusion: No evidence of support.
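When the log is large, a per-user-agent tally makes this kind of conclusion easier to check than reading the raw output. Here is a dependency-free sketch; the regular expression assumes the standard combined log layout shown earlier, and the names LINE_RE and conditional_hits are mine.

```python
import re
from collections import Counter

# Combined-log style line:
# ip ident user [time] "request" status bytes "referer" "user agent"
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def conditional_hits(lines, statuses=("304", "412")):
    """Count responses with the given status codes, grouped by user agent."""
    hits = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        # Lines that do not match the expected layout are skipped.
        if match and match.group("status") in statuses:
            hits[match.group("agent")] += 1
    return hits
```

Calling conditional_hits(open("access_log")) and looking through the keys for “Googlebot” or other search bot names gives the same answer as the script above, with counts attached.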
I know I said I would only cover one example, but I feel like I need to give you a little bit more to get you really excited about this stuff.
Let’s say you didn’t think about looking at the response codes to track ‘if-modified-since’, or that you need to track which search bots support the canonical header element, or that you want to know if Googlebot requests compression when making requests. In order to track this easily, you need to log extra header information that is not part of the typical log setup.
This is how you do it:

  1. You create a separate log file so you don’t mess up the ability to use log analysis tools that rely on standard log formats.
  2. You filter this separate log so it only records the traffic you want to track. In our case, we want to track search bot traffic.
  3. You change the log format so it records the additional fields.

Here is the partial configuration I used to perform the tests for this post:
[RAW]SetEnvIf User-Agent ".*Googlebot/2.1.*" gbot
LogFormat "%{X-Forwarded-For}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\" \"%{Accept-encoding}i\"" proxy2
# I use CloudFlare to speed up this blog, so I need to record the X-Forwarded-For instead of the reverse proxy IP address
CustomLog "|/usr/sbin/rotatelogs -l /var/www/hamletbatista/logs/googlebot_log.%Y-%m-%d 86400" proxy2 env=gbot[/RAW]
You don’t need to wait for Googlebot to come to the site to test your honeypot. You can use Google Webmaster Tools’ ‘Fetch as Googlebot,’ and Googlebot will come right away. The main difference I’ve seen using this method is that if you provide a URL with a redirect, Googlebot won’t follow it. The regular Googlebot crawler, however, will.
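Once the custom log has collected a few entries, the extra field can be read back with a short script. The sketch below assumes the proxy2 format from the configuration above; the regular expression, the function name requested_encodings, and the sample line are mine. The Accept-Encoding value in the sample is made up for illustration and is not an observed Googlebot request.

```python
import re

# proxy2 layout: the combined format plus a trailing quoted Accept-Encoding field.
PROXY2_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)" '
    r'"(?P<accept_encoding>[^"]*)"'
)


def requested_encodings(line):
    """Return the encodings named in the Accept-Encoding field of one log line.

    Returns None when the line does not match the proxy2 layout, and an
    empty list when the header was not sent (Apache logs that as "-").
    """
    match = PROXY2_RE.match(line)
    if match is None:
        return None
    value = match.group("accept_encoding")
    if value in ("", "-"):
        return []
    # Drop any ";q=..." weights and normalize case.
    return [part.split(";")[0].strip().lower() for part in value.split(",") if part.strip()]


# Hypothetical log line, not real data.
SAMPLE = ('66.249.67.9 - - [26/Jan/2012:02:29:32 -0500] "GET / HTTP/1.1" 200 157 "-" '
          '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" '
          '"gzip,deflate"')
print(requested_encodings(SAMPLE))
```

If “gzip” or “deflate” shows up for real entries in googlebot_log, the bot asked for a compressed response on that request.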
This post is just scratching the surface of all the possible insights you can gain by setting up honeypots to answer your more complex technical SEO questions. If you use this approach and get some really useful results, please make sure to share them in the comments.

Hamlet Batista

Chief Executive Officer

Hamlet Batista is CEO and founder of RankSense, an agile SEO platform for online retailers and manufacturers. He holds US patents on innovative SEO technologies, started doing SEO as a successful affiliate marketer back in 2002, and believes great SEO results should not take 6 months
