How to Get Googlebot to “Teach You” Advanced SEO

by Hamlet Batista | January 30, 2012 | 9 Comments

I recently worked on a non-SEO-related project for an enterprise-level client, where the goal was to determine two things about their new product:
1)  Whether it was doing anything that could be considered black hat.
2)  Whether it was providing any SEO benefit for their clients.
The problem you face with projects like this is that Google doesn’t provide enough information, and you cannot post corner-case questions like this in public Webmaster forums. To do so would violate your NDA, and potentially reveal your client’s intellectual property. So, what option do you have left? Well, you set up a honeypot!
A honeypot is a term that comes from the information security industry. Honeypots are a set of files that, to an automated program, appear like regular files, but they allow for the monitoring and “capturing” of specific viruses, e-mail harvesters, etc. In our case, we set up a honeypot with the purpose of detecting and tracking search engine bot behavior in specific circumstances. We also wanted to track the outcome (positive, neutral or negative) in the search engine results pages (SERPs).
Let me walk you through a few ways you can learn advanced SEO by using a honeypot.
Goals of the honeypot
First, let’s define the goals in terms of questions for which we don’t have public answers. Here are some interesting questions you and I might have:
1. Which search bots support the if-modified-since and/or the if-unmodified-since headers?
2. Is Googlebot really a headless browser?
3. Which search bots crawl AJAX URLs? Which ones support Google’s crawlable scheme?
4. Does Google follow links inside PDFs? Do they count for indexation and rankings?
5. Does the in-page canonical tag carry more weight than the canonical link header?
Add your own questions to this list. For the purpose of this post, I’m going to explain how you go about answering the first question. The recent work I did for a client was related to AJAX style fragment URLs. Unfortunately, I can’t share any details.
Setting up the Honeypot
The first thing you need to do is understand the problem really well. In our case, if-modified-since is a header that browsers and bots can send to a webserver, and the webserver will avoid resending a resource (image, document, page, video, etc.) if it hasn’t changed since the last time it was requested. The primary goal is to save bandwidth.
If-unmodified-since does the opposite: the server fulfills the request only if the resource hasn’t changed since the given date.
There is a technical protocol that HTTP clients and servers must obey, and a typical conversation looks like this:
CLIENT/BOT Request:
[RAW]GET / HTTP/1.1
Host: hamletbatista.com
If-Modified-Since: Thu, 26 Jan 2012 17:32:59 GMT[/RAW]
SERVER Response:
[RAW]HTTP/1.1 304 Not Modified
Date: Thu, 26 Jan 2012 17:32:59 GMT[/RAW]
CLIENT/BOT Request:
[RAW]GET / HTTP/1.1
Host: hamletbatista.com
If-Unmodified-Since: Thu, 26 Jan 2012 17:32:59 GMT[/RAW]
SERVER Response:
[RAW]HTTP/1.1 412 Precondition Failed[/RAW]
You can learn more about this here.
The most common way to follow these conversations between servers and bots is to set up and analyze traffic logs. However, the typical traffic log format does not store ‘if-modified-since’ header information. Sometimes it is practical to set up a custom log to track this information, but other times it isn’t.
Here is what a typical log entry looks like for a valid Googlebot request:
[RAW]66.249.67.9 - - [26/Jan/2011:02:29:32 -0500] "GET / HTTP/1.1" 200 157 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"[/RAW]
Getting the answers
One simple alternative is to look at the response code. In the case of a request that includes the ‘if-modified-since’ header, the web server will return status code 200 if the page changed, and status code 304 if it hasn’t. On the other hand, it will return 412 if the client sent an ‘if-unmodified-since’ header and the resource has changed.
Because 200 is a code that can be returned even when the ‘if-modified-since’/’if-unmodified-since’ headers are not sent, the most reliable way to tell whether a request included the header we want to check is to track responses that returned 304 (nothing changed) or 412 (something changed).
You also want to make sure your webserver supports the corresponding headers. You can use Firebug for this.
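If you prefer to script this check rather than use Firebug, here is a minimal sketch (Python 2, like the log parsing script below) that sends both conditional headers and prints the status codes your server returns. The host, path, and date are just placeholders to adapt to your own site.
[python]
import httplib

# Minimal check that the webserver honors conditional request headers.
# Host, path and date are placeholders -- point them at a real resource on your site.
HOST = 'hamletbatista.com'
PATH = '/'
DATE = 'Thu, 26 Jan 2012 17:32:59 GMT'

for header in ('If-Modified-Since', 'If-Unmodified-Since'):
    conn = httplib.HTTPConnection(HOST)
    conn.request('GET', PATH, headers={header: DATE})
    response = conn.getresponse()
    # Expect 304 for If-Modified-Since and 412 for If-Unmodified-Since
    # when the resource has not changed since DATE.
    print header, '->', response.status, response.reason
    conn.close()
[/python]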

As you should have guessed by now, it is easy to check whether Googlebot supports this header by checking the traffic log for entries coming from Googlebot and seeing if the responses include the 304 or 412 status codes.
I wrote a simple log parsing script in Python to look for response codes 304 or 412 and see if any entry came from Googlebot. In order to make it work, you will need the excellent Python log parser, apachelog.
[python]
import apachelog, sys, glob

# Standard combined log format, with X-Forwarded-For in place of the client IP
format = r'%{X-Forwarded-For}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"'
files = glob.glob("access_log*")
p = apachelog.parser(format)

for log in files:
    for line in open(log):
        try:
            data = p.parse(line)
            status = data['%>s']
            ua = data['%{User-agent}i']
            rq = data['%r']
            referrer = data['%{Referer}i']
            if rq.find('/feed/') < 0 and (status == '304' or status == '412'):
                #print referrer
                print rq
                print ua
                print status
        except:
            #sys.stderr.write("Unable to parse %s" % line)
            pass
[/python]
This is the partial output:
[RAW]GET /feed/ HTTP/1.1
Netvibes (http://www.netvibes.com/; 58 subscribers; feedID: 1503582)
304
GET /feed/ HTTP/1.1
Alltop/1.1
304
GET /feed/ HTTP/1.1
Xianguo.com 1 Subscribers
304
GET /feed/ HTTP/1.1
Alltop/1.1
304
GET /feed/ HTTP/1.1
Alltop/1.1
304
GET /feed/ HTTP/1.1
Alltop/1.1
304
GET /feed/ HTTP/1.1
NetNewsWire/3.2.15 (Mac OS X; http://netnewswireapp.com/mac/; gzip-happy)
304[/RAW]
All entries came from newsreaders and related bots. There wasn’t a single entry from Googlebot or any other search bot.
Conclusion: No evidence of support.
I know I said I would only cover one example, but I feel like I need to give you a little bit more to get you really excited about this stuff.
Let’s say you didn’t think about looking at the response codes to track ‘if-modified-since’, or you need to track which search bots support the canonical link header, or you want to know if Googlebot requests compression. To track any of these easily, you need to log extra header information that is not part of the typical log setup.
This is how you do it:

  1. You create a separate log file so you don’t mess up the ability to use log analysis tools that rely on standard log formats.
  2. You filter this separate log so it only records the traffic you want to track. In our case, we want to track search bot traffic.
  3. You change the log format so it records the additional fields.

Here is the partial configuration I used to perform the tests for this post:
[RAW]SetEnvIf User-Agent ".*Googlebot/2.1.*" gbot
LogFormat "%{X-Forwarded-For}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\" \"%{Accept-encoding}i\"" proxy2
# I use CloudFlare to speed up this blog, so I need to record the X-Forwarded-For instead of the reverse proxy IP address
CustomLog "|/usr/sbin/rotatelogs -l /var/www/hamletbatista/logs/googlebot_log.%Y-%m-%d 86400" proxy2 env=gbot[/RAW]
You don’t need to wait for Googlebot to come to the site to test your honeypot. You can use Google Webmaster Tools’ ‘Fetch as Googlebot,’ and Googlebot will come right away. The main difference I’ve seen using this method is that if you provide a URL with a redirect, Googlebot won’t follow it. The regular Googlebot crawler, however, will.
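As a quick illustration of what the extra field buys you, here is a minimal sketch that reads the custom Googlebot log defined above and prints the Accept-Encoding value sent with each request, which answers the compression question from earlier. The log path is taken from the rotatelogs line and is an assumption to adjust to your own setup.
[python]
import apachelog, glob

# Same format string as the LogFormat directive above, including the extra Accept-encoding field
format = r'%{X-Forwarded-For}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\" \"%{Accept-encoding}i\"'
p = apachelog.parser(format)

# Path taken from the rotatelogs directive above -- adjust to your own setup
for log in glob.glob('/var/www/hamletbatista/logs/googlebot_log.*'):
    for line in open(log):
        try:
            data = p.parse(line)
            # Request line, status code, and which encodings Googlebot says it accepts
            print data['%r'], data['%>s'], data['%{Accept-encoding}i']
        except:
            pass
[/python]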
This post is just scratching the surface of all the possible insights you can gain by setting up honeypots to answer your more complex technical SEO questions. If you use this approach and get some really useful results, please make sure to share them in the comments.

Hamlet Batista

Chief Executive Officer

Hamlet Batista is CEO and founder of RankSense, an agile SEO platform for online retailers and manufacturers. He holds US patents on innovative SEO technologies, started doing SEO as a successful affiliate marketer back in 2002, and believes great SEO results should not take 6 months.
