We have discussed before how to control Googlebot via robots.txt and robots meta tags. Both methods have limitations. With robots.txt you can block the crawling of any page or directory, but you cannot control indexing, caching or snippets. With the robots meta tag you can control indexing, caching and snippets, but only for HTML files, because the tag is embedded in the files themselves. You have no granular control over binary and other non-HTML files.
Until now. Google recently introduced a clever solution to this problem. You can now specify robots meta directives via an HTTP header. The new header is the X-Robots-Tag, and it supports the same directives as the regular robots meta tag: index/noindex, archive/noarchive, snippet/nosnippet and the new unavailable_after directive. This new technique gives you granular control over indexing, caching, and other functions for any page on your website, no matter what type of content it has (PDF, Word doc, Excel file, zip file, etc.). This is all possible because we will be using an HTTP header instead of a meta tag. For non-technical readers, let me use an analogy to explain this better.
A web crawler behaves much like a web browser: it opens pages hosted on web servers using a communications method called Hypertext Transfer Protocol (HTTP). Each HTTP request and response has two parts: 1) the headers and 2) the content (a web page, for example). Think of each request/response like an e-mail, where the headers are the envelope that carries, among other things, the address of the requested page or the status of the request.
Here is an example of what an HTTP request and response look like. The first two lines are the browser's request; the rest is the server's response. You normally don't see this, but it is a routine conversation your browser has every time you request a page.
GET / HTTP/1.1
User-Agent: Mozilla/5.0 … Gecko/20070713 Firefox/2.0.0.5
HTTP/1.1 200 OK
Date: Wed, 01 Aug 2007 00:41:47 GMT
Content-Type: text/html; charset=UTF-8
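If you want to see the envelope/content split for yourself, here is a minimal Python sketch that separates a raw response into its status line, headers, and content. The function name parse_http_response and the placeholder body are my own inventions for illustration; the headers are the ones from the example above.

```python
def parse_http_response(raw: str):
    """Split a raw HTTP/1.x response into status line, headers, and body."""
    # The headers end at the first blank line; everything after is the content.
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only, since values may contain colons.
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so normalize them to lowercase.
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


# Sample response built from the headers shown above; the body is a placeholder.
raw = (
    "HTTP/1.1 200 OK\r\n"
    "Date: Wed, 01 Aug 2007 00:41:47 GMT\r\n"
    "Content-Type: text/html; charset=UTF-8\r\n"
    "\r\n"
    "<html>...</html>"
)
status_line, headers, body = parse_http_response(raw)
```

A crawler reads the headers before it ever looks at the content, which is why a header can carry instructions for any file type.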
There are many standard headers, and the beauty of HTTP is that you can define your own proprietary headers. You only need to start their names with X- to avoid collisions with future standard names. This is the approach Google takes and it is a wise one.
How can you implement this?
This is the interesting part. You know I love this.
The simplest way to add the header is to have all your pages written in a dynamic language, such as PHP, and include one line of code at the top that sets the X-Robots-Tag header. For example:
<?php header('X-Robots-Tag: index,archive'); ?>
For this to work, the code needs to be at the very top of the dynamic page, before any output is sent to the browser.
Unfortunately, this strategy does not help us much, as we want to add the header to non-HTML files, like PDFs, Word documents, and so on. I think I have a better solution: on an Apache server with the mod_setenvif and mod_headers modules enabled, two configuration lines do the job.
SetEnvIf Request_URI "\.pdf$" is_pdf=yes
Header add X-Robots-Tag "index, noarchive" env=is_pdf
The first line sets an environment variable if the requested file is a PDF. We can check any request header, and we can use any regular expression to match the files we want to add the header to.
The second line adds the header only if the environment variable is_pdf (you can name the variable anything you want) is set. We can add these rules to our .htaccess file. And voilà: we can now easily control which files get the header.
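To make the logic of the two rules concrete, here is a rough Python sketch of the decision they are meant to express. It is only a simulation, not how Apache works internally; the function name extra_headers and the sample paths are hypothetical, and the pattern is the intended one (request paths ending in .pdf).

```python
import re

# Intended match for the first rule: the request path ends in ".pdf".
PDF_PATTERN = re.compile(r"\.pdf$")


def extra_headers(request_uri: str) -> dict:
    """Return the headers the two rules would add for a given request path."""
    env = {}
    # Rule 1: set the is_pdf variable when the path matches the pattern.
    if PDF_PATTERN.search(request_uri):
        env["is_pdf"] = "yes"
    # Rule 2: add the header only when is_pdf is set.
    headers = {}
    if "is_pdf" in env:
        headers["X-Robots-Tag"] = "index, noarchive"
    return headers


pdf_headers = extra_headers("/ebooks/guide.pdf")   # hypothetical path
html_headers = extra_headers("/index.html")        # hypothetical path
```

Swapping in a different pattern, for example one that also matches Word or Excel extensions, extends the same rule to other file types.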
There are a lot of real-world uses for this technique. Let's say you offer a free PDF e-book on your site, but users have to subscribe to your feed to get it. It is very likely that Google will be able to reach the file, and smart visitors will pull the e-book from the Google cache to avoid subscribing. One way to prevent this is to let Google index the file but not provide the cached copy: index, noarchive. This is not possible to control with robots.txt, and we can’t use robots meta tags because the e-book is a PDF file.
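For the curious, here is a small Python sketch of how a value like index, noarchive could be read as a set of permissions. This is my own simplified interpretation, not Google's actual parser: the helper names are hypothetical, and it deliberately ignores unavailable_after, whose value is a date rather than a simple flag.

```python
def parse_robots_directives(value: str) -> set:
    """Split an X-Robots-Tag value into a set of lowercase directive names."""
    return {part.strip().lower() for part in value.split(",") if part.strip()}


def may_index(directives: set) -> bool:
    # Indexing is allowed unless noindex is present.
    return "noindex" not in directives


def may_cache(directives: set) -> bool:
    # A cached copy is allowed unless noarchive is present.
    return "noarchive" not in directives


ebook = parse_robots_directives("index, noarchive")
```

For the e-book example, may_index(ebook) is true and may_cache(ebook) is false, which is exactly the combination we want.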
This is only one example, but I am sure users out there have plenty of other practical applications for this. Please share some other uses you can think of.