Preventing duplicate content issues via robots.txt and .htaccess

by Hamlet Batista | June 07, 2007 | 2 Comments

Rand of SEOmoz.org posted an interesting article on duplicate content issues. He uses a typical blog to show the different examples.

In a blog, every post can appear in the home page, pagination, archives, feeds, etc.

Rand suggests using the meta robots “noindex” tag, or the potentially risky option of cloaking, to point robots to the original source.
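For reference (a minimal sketch, not taken from Rand’s post), the meta robots tag is a single line placed in the <head> of each duplicate page:

<!-- Tells compliant robots not to index this copy of the content -->
<meta name="robots" content="noindex" />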

Joost de Valk recommends that WordPress users change some lines in the source code to address these problems.

There are a few items I would like to add to the problem and to the proposed solution.

As willcritchlow asks, there is also the problem of multiple URLs leading to the same content (i.e., www.site.com, site.com, site.com/index.html, etc.). This can be fixed by using HTTP redirects and by telling Google our preferred domain via Webmaster Central.

Reader roadies recalls reading about a robots.txt and .htaccess solution somewhere. That gave me the inspiration to write this post.

After carefully reviewing Google’s official response to the duplicate content issue, it occurred to me that the problem might not be as bad as we think.

What does Google do about it?
During our crawling and when serving search results, we try hard to index and show pages with distinct information. This filtering means, for instance, that if your site has articles in “regular” and “printer” versions and neither set is blocked in robots.txt or via a noindex meta tag, we’ll choose one version to list. In the rare cases in which we perceive that duplicate content may be shown with intent to manipulate our rankings and deceive our users, we’ll also make appropriate adjustments in the indexing and ranking of the sites involved. However, we prefer to focus on filtering — rather than ranking adjustments … so in the vast majority of cases, the worst thing that’ll befall webmasters is to see the “less desired” version of a page shown in our index.

Basically, Google says that unless we are doing something purposely ill-intentioned (like ‘borrowing’ content from other sites), they will simply toss out the duplicate pages. They explain that their algorithm automatically detects the ‘right’ page and uses that one to return results.

The problem is that we might not want Google to choose the ‘right’ page for us. Maybe they are choosing the printer-friendly page and we want them to choose the page that includes our sponsors’ ads! That is one of the main reasons, in my opinion, to address the duplicate content issue. Another thing is that those tossed out pages will likely end up in the infamous supplemental index. Nobody wants them there :-).

One important addition to Rand’s article is the use of robots.txt to address the issue. One advantage this has over the meta robots “noindex” tag is in the case of RSS feeds: web robots index them and they contain duplicate content, but the meta tag is intended for HTML/XHTML documents, while feeds are XML content.
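A robots.txt rule, on the other hand, works regardless of content type. As a minimal sketch (assuming your feeds live under /feed/, as in a default WordPress permalink setup), this keeps every compliant robot out of the feeds without needing any wildcard support:

User-Agent: *
# Keep compliant robots out of the XML feeds, which duplicate the posts
Disallow: /feed/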

If you read my post on John Chow’s robots.txt file, you probably noticed that some of the changes he made to his file were precisely to address duplicate content issues.

Now, let me explain how you can address duplicate content via robots.txt.

One of the nice things about Googlebot is that it supports pattern matching in robots.txt. This is not part of the robots exclusion standard, so other web robots probably don’t support it.

As I am a little bit lazy, I will use Googlebot for the example as it will require less typing.

User-Agent: Googlebot
# Prevents Google's robot from accessing paginated pages
Disallow: /page/*
# Some blogs use dynamic URLs for pagination,
# for example: http://www.seomoz.org/blog?show=5
Disallow: /*?*
# Prevents Googlebot from accessing the archived posts.
# It is not a good idea to use * here, like /2007/*, because that
# would prevent access to the posts as well,
# i.e. /2007/06/06/advanced-link-cloaking-techniques/
Disallow: /2007/05
Disallow: /2007/06
# Prevents Googlebot from accessing the feeds
Disallow: /feed/

To address duplication from printer-friendly pages, I think the best solution is to use CSS print styles on the same URL rather than a separate printer version.
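As an illustration (a minimal sketch; the print.css file name and the element IDs are placeholders, not from the original post), the same URL can serve both screen and print, so there is no second URL for robots to index. In the page’s <head>:

<link rel="stylesheet" href="/style.css" media="screen" />
<link rel="stylesheet" href="/print.css" media="print" />

And in print.css, hide whatever makes no sense on paper:

#sidebar, #navigation, #ads { display: none; }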

Now, let’s see how you can address the problem of the same content being accessible from multiple URLs by using .htaccess and permanent redirects. This assumes you use Apache with mod_alias enabled; more complex manipulation can be achieved via mod_rewrite.

You just need to create a .htaccess file in your website’s root folder with this content:

RedirectPermanent /index.php http://www.site.com/

Or alternatively:

Redirect 301 /index.php http://www.site.com/

Or, in the event that you plan to use regular expressions, try this:

# This matches both Index.php and index.php
RedirectMatch 301 /[Ii]ndex\.php$ http://www.site.com/
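Since the same front page is often also reachable as index.html (one of the duplicate URLs mentioned earlier), the pattern can be widened. A sketch, assuming http://www.site.com/ is the canonical home page as in the examples above:

# Matches Index.php, index.php, Index.html and index.html
RedirectMatch 301 ^/[Ii]ndex\.(php|html)$ http://www.site.com/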

Google allows you to tell them your preferred canonical name (i.e., site.com vs. www.site.com) via Webmaster Central, so the next step is no longer strictly necessary, at least if your only concern is Google.

To force all access to your site to include www in the URL (i.e., http://www.site.com instead of http://site.com), you can use mod_rewrite redirection in your .htaccess file:

RewriteEngine On
RewriteBase /
# Redirect http://site.com to http://www.site.com
RewriteCond %{HTTP_HOST} !^www\.site\.com [NC]
RewriteRule ^(.*)$ http://www.site.com/$1 [L,R=301]

As I said, these additional lines are probably unnecessary, but it doesn’t hurt to add them.

Update: Reader identity correctly pointed out that secure pages (https) can cause duplicate content problems. I was able to confirm that at least Google is indexing secure pages.

To solve this, I removed the redirection lines from the .htaccess file and I recommend you use a separate robots.txt for https://www.site.com with these few lines:

User-Agent: Googlebot
# Prevents Google's robot from accessing all pages
Disallow: /
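The post doesn’t spell out how to serve a different robots.txt over HTTPS. If the secure and non-secure sites share the same document root, one common approach is a mod_rewrite rule that hands robots a second file on secure requests (a sketch, assuming mod_rewrite is enabled; robots_ssl.txt is a hypothetical file in the root containing the lines above):

RewriteEngine On
# When the request comes in over HTTPS, serve robots_ssl.txt instead of robots.txt
RewriteCond %{HTTPS} on
RewriteRule ^robots\.txt$ robots_ssl.txt [L]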
 
