Robots.txt 101

by Hamlet Batista | June 04, 2007 | 9 Comments

First, let me thank my beloved reader, SEO Blog.

Thanks to him I got a really nice bump in traffic and several new RSS subscribers.

It is really funny how people who don’t know you start questioning your knowledge, calling you names, etc. I am glad that I don’t take things personally. For me, it was a great opportunity to get my new blog some exposure.

I did not intentionally try to be controversial. I ran a backlink check on John’s site and found the interesting results I reported. I am still more inclined to believe that my theory has more grounds than SEO Blog’s. Keep reading to learn why.

His theory is that John fixed the problem by making some substantial changes to his robots.txt file. I am really glad that he finally decided to dig for evidence. This is far more professional than calling people you don’t know names.

I carefully checked both robots.txt files, and here is what John removed in the new version:

# Disallow all monthly archive pages
Disallow: /2005/12
Disallow: /2006/01
Disallow: /2006/02
Disallow: /2006/03
Disallow: /2006/04
Disallow: /2006/05
Disallow: /2006/06
Disallow: /2006/07
Disallow: /2006/08
Disallow: /2006/09
Disallow: /2006/10
Disallow: /2006/11
Disallow: /2006/12
Disallow: /2007/01
Disallow: /2007/02
Disallow: /2007/03
Disallow: /2007/04
Disallow: /2007/05

# The Googlebot is the main search bot for google
User-agent: Googlebot

# Disallow all files ending with these extensions
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
Disallow: /*.gz$
Disallow: /*.wmv$
Disallow: /*.tar$
Disallow: /*.tgz$
Disallow: /*.cgi$
Disallow: /*.xhtml$

# Disallow Google from parsing indididual post feeds and trackbacks..
Disallow: */feed/
Disallow: */trackback/

# Disallow all files with ? in url
Disallow: /*?*
Disallow: /*?

# Disallow all archived monthlies
Disallow: /2006/0*
Disallow: /2007/0*
Disallow: /2005/1*
Disallow: /2006/1*
Disallow: /2007/1*

In plain English, this means he is now letting Google crawl and index his archived articles, dynamic pages, and files ending with “.php”, “.js”, “.inc”, “.css”, etc. Note that in neither version of the robots.txt file is John preventing the crawler from accessing his home page or his regular posts. WordPress uses PHP, but regular posts and the home page can be reached at URLs that don’t end in “.php”.
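To make the effect of those wildcard rules concrete, here is a minimal sketch of how Googlebot-style pattern matching is generally understood to work: “*” matches any run of characters and a trailing “$” anchors the rule to the end of the URL path. The rules are taken from the removed section above; the sample URLs are hypothetical, not a claim about John’s actual site structure, and real Googlebot matching also weighs Allow rules and rule precedence.

import re

# Rules from the removed section of the file (illustrative subset).
DISALLOW_RULES = [
    "/*.php$",
    "/*?",
    "*/feed/",
    "*/trackback/",
    "/2006/0*",
]

def rule_to_regex(rule):
    # Convert a robots.txt wildcard pattern into a regular expression:
    # "*" becomes ".*" and a trailing "$" anchors the match to the end.
    anchored = rule.endswith("$")
    parts = rule.rstrip("$").split("*")
    pattern = "^" + ".*".join(re.escape(p) for p in parts)
    return re.compile(pattern + ("$" if anchored else ""))

def is_disallowed(path):
    return any(rule_to_regex(r).match(path) for r in DISALLOW_RULES)

# The home page and a pretty-permalink post match no rule, while archives,
# feeds, raw ".php" files and query-string URLs do.
for path in ["/", "/some-regular-post/", "/2006/05/", "/index.php", "/?p=123", "/some-post/feed/"]:
    print(path, "->", "blocked" if is_disallowed(path) else "allowed")

Running the sketch shows “/” and “/some-regular-post/” come back allowed while the archive, feed, “.php” and “?” URLs come back blocked, which is exactly the point above: the home page and regular posts were never shut out.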

If this was the change that fixed the problem, it might be because hiding those internal pages from the spider’s view weakened his internal link structure. His claim is not without merit.

Now, here is one tiny little detail that my friend is missing. To prove his point, he used Google’s cache to show the older version of the robots.txt file. If Google still has that version in its cache, what makes him think that Google is already using the new one? Google should be caching the new version, not the old one. That is why I am still not convinced that this is the reason for the fix.
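If you want to run this check yourself, a quick way is to diff the robots.txt that Google’s cache shows against the file the site serves right now. Here is a rough sketch in modern Python; the URL and filename below are placeholders, not John’s actual files.

import difflib
import urllib.request

LIVE_URL = "http://www.example.com/robots.txt"        # placeholder site
CACHED_COPY = "robots_from_google_cache.txt"          # saved from Google's cache

# Fetch the robots.txt the site serves today.
with urllib.request.urlopen(LIVE_URL) as response:
    live = response.read().decode("utf-8", errors="replace").splitlines()

# Load the copy you saved from Google's cached view.
with open(CACHED_COPY, encoding="utf-8") as f:
    cached = f.read().splitlines()

# An empty diff means Google's cached copy already matches what is live;
# a non-empty diff means Google has not yet picked up the new file.
for line in difflib.unified_diff(cached, live, fromfile="cached", tofile="live", lineterm=""):
    print(line)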

John says he is not telling because a reader said Google might change their algorithm and drop him again. How do the changes John made to his robots.txt file have anything to do with algorithm changes? I am just curious.

In reality, we can theorize all we want, but the only ones who can tell for sure are the folks at the Googleplex. John probably tried many different things and one or several of them worked. He is probably not even sure which one did.

How did I learn SEO?

SEO Blog suggests I visit his forum to learn SEO. Here is the problem with that: I am a technical guy, and I cannot take gut feelings or opinions as truth. I do visit some forums and blogs every now and then, but in my experience the noise-to-signal ratio is too high. I prefer to learn and get my insights from the source: search engine research papers, search engine representatives’ blogs, or my own experiments.

I learned SEO back in 2002 when I read this paper. Back then, nobody was even talking about Google bombs, anchor text, etc. Read the paper; it is all there.

Hamlet Batista

Chief Executive Officer

Hamlet Batista is CEO and founder of RankSense, an agile SEO platform for online retailers and manufacturers. He holds US patents on innovative SEO technologies, started doing SEO as a successful affiliate marketer back in 2002, and believes great SEO results should not take 6 months.

