Controlling Your Robots: Using the X-Robots-Tag HTTP header with Googlebot

by Hamlet Batista | August 01, 2007 | 10 Comments

We have discussed before how to control Googlebot via robots.txt and robots meta tags. Both methods have limitations. With robots.txt you can block the crawling of any page or directory, but you cannot control indexing, caching, or snippets. With the robots meta tag you can control indexing, caching, and snippets, but only for HTML files, since the tag is embedded in the document itself. You have no granular control over binary and non-HTML files.

Until now. Google recently introduced another clever solution to this problem: you can now specify robots meta directives via an HTTP header. The new header is the X-Robots-Tag, and it behaves like the regular robots meta tag and supports the same directives: index/noindex, archive/noarchive, snippet/nosnippet, and the new unavailable_after directive. This technique makes it possible to have granular control over indexing, caching, and snippets for any page on your website, no matter what type of content it is: PDF, Word document, Excel file, zip archive, and so on. This is all possible because we are using an HTTP header instead of a meta tag. For non-technical readers, let me use an analogy to explain this better.
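To make the directives concrete, here is a hypothetical pair of response headers (the date is just a placeholder) telling Google not to keep a cached copy and to drop the page from the index after a certain date. If I recall correctly, the unavailable_after date should be given in a standard format such as RFC 850:

X-Robots-Tag: noarchive
X-Robots-Tag: unavailable_after: 23-Aug-2007 15:00:00 EST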

A web crawler behaves much like a web browser: it opens pages hosted on web servers using a communications method called the Hypertext Transfer Protocol (HTTP). Each HTTP request and response has two elements: 1) the headers and 2) the content (a web page, for example). Think of each request/response like an e-mail, where the headers are the envelope that contains, among other things, the address of the requested page or the status of the request.

Here are a couple of examples of what an HTTP request and response look like. You normally don't see this, but it is a routine conversation your browser has every time you request a page.

Request ->

GET / HTTP/1.1
Host: hamletbatista.com
User-Agent: Mozilla/5.0 … Gecko/20070713 Firefox/2.0.0.5
Connection: close

Response ->

HTTP/1.1 200 OK
Date: Wed, 01 Aug 2007 00:41:47 GMT
Server: Apache
X-Robots-Tag: index,archive
X-Powered-By: PHP/5.0.3
X-Pingback: http://hamletbatista.com/xmlrpc.php
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8

There are many standard headers, and the beauty of the HTTP protocol is that you can define your own proprietary headers. You only need to make sure you start them with X- to avoid name collisions with future standard headers. This is the approach Google takes, and it is a wise one.

How can you implement this?

This is the interesting part. You know I love this.

The simplest way to add the header is to have all your pages written in a dynamic language, such as PHP, and include one line of code at the top that sets the X-Robots-Tag header. For example:

<?php header('X-Robots-Tag: index,archive'); ?>

In order to work, that code needs to be at the very top of the dynamic page, before anything is output to the browser.
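As a minimal sketch (not from the original example), you can guard against calling header() too late by checking PHP's headers_sent() function first:

<?php
// Sketch only: header() works only before any output has been sent,
// so check headers_sent() before setting the tag.
if (!headers_sent()) {
    header('X-Robots-Tag: noindex, noarchive');
}
?>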

Unfortunately, this strategy does not help us much, as we want to add the headers to non-text files, like PDFs, Word documents, and so on. I think I have a better solution.

Using Apache's mod_headers and mod_setenvif, we can control which files we add the header to as easily as we do with mod_rewrite for controlling redirects. Here is the trick.

SetEnvIf Request_URI "\.pdf$" is_pdf=yes

Header add X-Robots-Tag "index, noarchive" env=is_pdf

The first line sets an environment variable if the requested file is a PDF. We can check any request attribute (such as a header or the request URI) and use any regular expression to match the files we want to add the header to.

The second line adds the header only if the is_pdf environment variable (you can name the variable anything you want) is set. We can add these rules to our .htaccess file. And voilà: we can now very easily control which files we add the header to.
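If you want to cover several binary formats at once, a sketch along these lines should also work (assuming mod_headers is enabled and your host allows these directives in .htaccess); it uses a FilesMatch block instead of an environment variable:

# Sketch: apply the tag to several non-HTML formats with one rule
<FilesMatch "\.(pdf|doc|xls|zip)$">
    Header set X-Robots-Tag "index, noarchive"
</FilesMatch>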

There are a lot of real-world uses for this technique. Let's say you offer a free PDF e-book on your site, but users have to subscribe to your feed to get it. It is very likely that Google will be able to reach the file, and smart visitors will pull the e-book from the Google cache to avoid subscribing. One way to avoid this is to let Google index the file but not keep a cached copy: index, noarchive. This is not possible to control with robots.txt, and we can't implement robots meta tags because the e-book is a PDF file.
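Using the same trick, a hypothetical rule for that e-book (the file name free-ebook.pdf is just a placeholder) could look like this:

SetEnvIf Request_URI "free-ebook\.pdf$" is_ebook=yes
Header add X-Robots-Tag "index, noarchive" env=is_ebook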

This is only one example, but I am sure users out there have plenty of other practical applications for this. Please share some other uses you can think of.

Hamlet Batista

Chief Executive Officer

Hamlet Batista is CEO and founder of RankSense, an agile SEO platform for online retailers and manufacturers. He holds US patents on innovative SEO technologies, started doing SEO as a successful affiliate marketer back in 2002, and believes great SEO results should not take 6 months.
