
Controlling Your Robots: Using the X-Robots-Tag HTTP header with Googlebot

by Hamlet Batista | August 01, 2007 | 10 Comments

We have discussed before how to control Googlebot via robots.txt and robots meta tags. Both methods have limitations. With robots.txt you can block the crawling of any page or directory, but you cannot control indexing, caching, or snippets. With the robots meta tag you can control indexing, caching, and snippets, but only for HTML files, as the tag is embedded in the files themselves. You have no granular control over binary and non-HTML files.

Until now. Google recently introduced another clever solution to this problem: you can now specify robots meta directives via an HTTP header. The new header is the X-Robots-Tag, and it behaves like the regular robots meta tag and supports the same directives: index/noindex, archive/noarchive, snippet/nosnippet, and the new unavailable_after directive. This new technique makes it possible to have granular control over indexing, caching, and other functions for any page on your website, no matter the type of content it has (PDF, Word document, Excel file, zip archive, etc.). This is all possible because we will be using an HTTP header instead of a meta tag. For non-technical readers, let me use an analogy to explain this better.
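For example, the header carries the same values you would otherwise put in the meta tag. The lines below are illustrative values of my own choosing; the unavailable_after date format follows the style Google used in its announcement:

X-Robots-Tag: noindex
X-Robots-Tag: noarchive, nosnippet
X-Robots-Tag: unavailable_after: 25-Aug-2007 15:00:00 EST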

A web crawler basically behaves very much like a web browser: it opens pages hosted on web servers using a communications method called the Hypertext Transfer Protocol (HTTP). Each HTTP request and response has two elements: 1) the headers and 2) the content (a web page, for example). Think of each request/response like an e-mail, where the headers are the envelope that contains, among other things, the address of the requested page or the status of the request.

Here are a couple of examples of what an HTTP request and response look like. You normally don't see this, but it is a routine conversation your browser has every time you request a page.

Request->

GET / HTTP/1.1
Host: hamletbatista.com
User-Agent: Mozilla/5.0 … Gecko/20070713 Firefox/2.0.0.5
Connection: close

Response->

HTTP/1.1 200 OK
Date: Wed, 01 Aug 2007 00:41:47 GMT
Server: Apache
X-Robots-Tag: index,archive
X-Powered-By: PHP/5.0.3
X-Pingback: http://hamletbatista.com/xmlrpc.php
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8

There are many standard headers, and the beauty of the HTTP protocol is that you can define your own proprietary headers. You only need to make sure you start them with X- to avoid name collisions with future standard names. This is the approach Google takes, and it is a wise one.

How can you implement this?

This is the interesting part. You know I love this.

The simplest way to add the header is to have all your pages written in a dynamic language, such as PHP, and include one line of code at the top that sets the X-Robots-Tag header. For example:

<?php header('X-Robots-Tag: index,archive'); ?>

In order to work, that code needs to be at the very top of the dynamic page, before any output is sent to the browser; HTTP headers cannot be changed once the response body has started.
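To make that concrete, here is a minimal sketch of a dynamic page that excludes itself from the index and the cache. The page and the choice of directives are hypothetical, just for illustration:

<?php
// Hypothetical internal search-results page we don't want indexed or cached.
// header() must run before any output, or PHP will warn that headers
// were already sent and the directive will never reach the crawler.
header('X-Robots-Tag: noindex, noarchive');
?>
<html>
<body>
<!-- page content rendered here -->
</body>
</html>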

Unfortunately, this strategy does not help us much, as we want to add the headers to non-text files, like PDFs, Word documents, and so on. I think I have a better solution.

Using Apache's mod_headers and mod_setenvif, we can control which files we add the header to as easily as we do with mod_rewrite for controlling redirects. Here is the trick.

SetEnvIf Request_URI "\.pdf$" is_pdf=yes

Header add X-Robots-Tag "index, noarchive" env=is_pdf

The first line sets an environment variable if the requested file is a PDF. We can check any request header or the request URI, and we can use any regular expression to match the files to which we want to add the header.

The second line adds the header only if the environment variable is_pdf is set (you can name the variable anything you want). We can add these rules to our .htaccess file. And voilà: we can now control very easily which files get the header.
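The same trick extends to several file types at once. Here is a sketch along the same lines; the grouped extensions and the variable name are my own choices for illustration:

SetEnvIf Request_URI "\.(pdf|doc|xls|zip)$" is_binary=yes
Header add X-Robots-Tag "index, noarchive" env=is_binary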

There are a lot of real-world uses for this technique. Let's say you offer a free PDF e-book on your site, but users have to subscribe to your feed to get it. It is very likely that Google will be able to reach the file, and smart visitors will pull the e-book from the Google cache to avoid subscribing. One way to avoid this is to let Google index the file but withhold the cached copy: index, noarchive. This is not possible to control with robots.txt, and we can’t implement robots meta tags because the e-book is a PDF file.
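If you want to target just that one file rather than every PDF on the site, the header can be scoped with a <Files> container in the same .htaccess file. A minimal sketch, assuming the e-book is named free-ebook.pdf (a made-up filename):

<Files "free-ebook.pdf">
Header set X-Robots-Tag "index, noarchive"
</Files>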

This is only one example, but I am sure users out there have plenty of other practical applications for this. Please share some other uses you can think of.

Hamlet Batista

Chief Executive Officer

Hamlet Batista is CEO and founder of RankSense, an agile SEO platform for online retailers and manufacturers. He holds US patents on innovative SEO technologies, started doing SEO as a successful affiliate marketer back in 2002, and believes great SEO results should not take 6 months.
