Duplicate content is one of the most common causes of concern among webmasters. We work hard to produce original, useful content, and all it takes is a malicious SERP (Search Engine Results Page) hijacker to copy it and pass it off as his or her own. Not nice.
More troubling still is the way Google handles the issue. In my previous post about CGI hijacking, I made clear that the main problem with hijacking and content scraping is that search engines cannot reliably determine who owns the content and, therefore, which page should stay in the index. When faced with multiple pages that have exactly or nearly the same content, Google's filters flag them as duplicates. Google's usual course of action is to keep only one of the pages in the index, typically the one with the higher PageRank; the rest are tossed out. Unless there is enough evidence to show that the owner or owners are trying to do something manipulative, there is no need to worry about penalties.
Recently, regular reader Jez asked me a thought-provoking question. I'm paraphrasing here, but essentially he wanted to know: "Why doesn’t Google consider the age of the content to determine the original author?” I responded that the task is not as trivial as it may seem at first, and I promised a more thorough explanation. Here it is.
A similar idea was proposed at SMX Advanced. This quote is from Google's Webmaster Central blog:
Providing a way to authenticate ownership of content
This would provide search engines with extra information to help ensure we index the original version of an article, rather than a scraped or syndicated version. Note that we do a pretty good job of this now and not many people in the audience mentioned this to be a primary issue. However, the audience was interested in a way of authenticating content as an extra protection. Some suggested using the page with the earliest date, but creation dates aren't always reliable. Someone also suggested allowing site owners to register content, although that could raise issues as well, as non-savvy site owners wouldn't know to register content and someone else could take the content and register it instead. We currently rely on a number of factors such as the site's authority and the number of links to the page. If you syndicate content, we suggest that you ask the sites who are using your content to block their version with a robots.txt file as part of the syndication arrangement to help ensure your version is served in results.
Why not use the creation date to tell who the owner of the content is?
The creation date is easy to forge. A scraper can modify his or her content management system to backdate anything lifted from an RSS feed. The moment you hit publish, you are also syndicating your content, and any scraper using your feed to power a site will have it. It's frustrating, but that is the unfortunate reality.
The problem with Google's approach
As I mentioned before and as you can read in the above quote, Google favors the page with the most links and authority. This is good for most cases, but let's take an example close to home: my blog.
My blog is relatively new. All my content is original, but I have very few incoming links and any scraper that has been doing it for a while will probably have far more links than I do. Where does that leave me? Many of my pages will be considered duplicate content and be tossed out of the index.
Their suggestion to ask syndicating sites to exclude their copies is a good one, as long as those sites are syndicating with your permission. Policing scrapers over the use of your content is a completely different matter.
One simple solution I decided to use for this blog is to add a copyright notice to my feed with a link back to my blog. The benefit is that every site scraping my content will be linking back, improving my blog's authority and incoming links and reducing the overall probability that my pages will be dropped from the index. This is not perfect, as there may be instances where scrapers have more links than I do, but it is better than what I had before. The other benefit is that readers will see the notice and know where to find the original source. Hopefully some scrapers will be discouraged and will look elsewhere for content to steal. Removing the text in the footer is trivial, but I know most scrapers are lazy. (Otherwise they would create their own content, right?) You can check my copyright message in your feed reader.
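As a rough illustration, here is how a feed plugin might append such a notice to each item before the feed is served. The function and names below are hypothetical (a WordPress blog would do this with a PHP filter), but the logic is the same:

```python
def add_copyright_footer(item_html: str, post_url: str, blog_name: str) -> str:
    """Append a copyright notice with a link back to the original post."""
    notice = (
        '<p><small>This post originally appeared on '
        f'<a href="{post_url}">{blog_name}</a>. All rights reserved.</small></p>'
    )
    return item_html + notice

# Example: what a scraped copy of a feed item would end up containing
feed_item = "<p>My original article text...</p>"
print(add_copyright_footer(feed_item, "https://example.com/my-post", "My Blog"))
```

Even a lazy scraper republishing this item verbatim would then carry the link back to the original post.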
My proposed solution
I think the only way to authenticate content ownership is to register it at a trusted place with digital signatures. I can anticipate your immediate objection: as mentioned in the quote, non-savvy users might not know to register their content, and some smart scraper could register it for them. But I have an answer for that as well.
What if content management systems and blogging platform providers integrated this into their software? When you hit publish, your blogging software pings technorati.com and other services to alert them of your new content. Why not use the same framework to alert a content registration service? When you hit publish, the publishing software would ping the content registration service and identify your content as original with a content checksum, a digital signature, and a registration date.
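To make the idea concrete, here is a minimal Python sketch of what such a registration ping might contain. Everything here is an assumption on my part: the service URL, the payload fields, and the use of an HMAC as a stdlib-only stand-in for a real signature (a production system would sign with an asymmetric key pair so the service could verify the author's public key):

```python
import hashlib
import hmac
import json
import time

# Hypothetical site secret; a real system would use a private key instead.
SITE_SECRET = b"my-site-private-key"

def register_content(content: str, post_url: str) -> dict:
    """Build the registration payload the publishing software would send."""
    # Fingerprint the exact published text
    checksum = hashlib.sha256(content.encode("utf-8")).hexdigest()
    # Prove the ping came from the site that holds the secret
    signature = hmac.new(SITE_SECRET, checksum.encode(), hashlib.sha256).hexdigest()
    payload = {
        "url": post_url,
        "checksum": checksum,
        "signature": signature,
        "registered_at": int(time.time()),
    }
    # A real implementation would now POST json.dumps(payload) to the
    # (hypothetical) registration service *before* the feed goes live.
    return payload

print(json.dumps(register_content("My original article text...",
                                  "https://example.com/my-post"), indent=2))
```

The crucial detail is ordering: the ping fires before the RSS feed is published, so the original author's registration always predates any scraper's.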
If somebody changes your content slightly, they will still be registering it at a later date, and luckily, search engines are good at detecting near duplicates. They would only need to check the service when they find duplicates, and the service could tell them who the original author is. Search engine companies cooperated on the Sitemaps initiative; they can certainly do the same for something like this.
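The near-duplicate check can be sketched with word shingles and Jaccard similarity, a standard duplicate-detection technique (the texts below are made up for illustration):

```python
def shingles(text: str, k: int = 3) -> set:
    """Break text into overlapping k-word phrases."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: str, b: str, k: int = 3) -> float:
    """Similarity of two texts: shared shingles over total shingles."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

original = "search engines are good at detecting near duplicate pages on the web"
scraped  = "search engines are good at detecting near duplicate pages across the web"
print(round(jaccard(original, scraped), 2))  # prints 0.54: still clearly related
```

A scraper's lightly edited copy stays far more similar to the original than two unrelated pages would, which is what lets the engine know the registration service is worth consulting.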
For regular websites things are admittedly a little trickier, but creating modules that detect new content and ping the registration service should not be a big challenge.
What do you think of this solution? I’m opening up the floor to your criticism and your ideas.
July 20, 2007 at 12:48 am
Hi Hamlet, An interesting article. When I mentioned using dates, I was thinking of the date Google first found that content, as opposed to the creation date given by the site. Your point about scrapers is a good one though; a site scraping fresh content has a good chance of being indexed before the original source. Also, your copyright notice may work on some syndicating sites, but could be stripped out by a more determined SE spammer. Your idea of "first ping" is a good one. It would also be possible to time-delay the RSS feed, putting a greater time-distance between your ping and that of a scraper. If this were introduced though, there could be a gold-rush as SE spammers raced to re-publish all the content on the internet using this method :-)
July 20, 2007 at 2:18 pm
Jez, The crawled date is unreliable too. The crawler could visit the scraper's site before yours. <blockquote> It would also be possible to time delay the rss feed, putting a greater time-distance between your ping and that of a scraper.</blockquote> I like your idea of introducing a delay or not releasing the feed until the content is registered. Well done! <blockquote> If this were introduced though, there could be a gold-rush as SE Spammers raced to re-publish all the content on the internet using this method</blockquote> I guess there would need to be some legal consequences for people registering content that is not theirs.
July 20, 2007 at 3:31 am
I like where you are going with this, but what about sites like Wikipedia that license content to other sites like Answers.com? I'm not trying to punch any holes in your idea. I hope something like this comes about. To do so, lots of angles have to be considered. It sounds like you already have a great start.
July 20, 2007 at 5:17 am
Geoff - This is just a raw idea. I want you to find holes, and hopefully together we can come up with a stronger plan. For authorized syndication partners I favor Google's proposal: <blockquote> If you syndicate content, we suggest that you ask the sites who are using your content to block their version with a robots.txt file as part of the syndication arrangement to help ensure your version is served in results. </blockquote>
July 20, 2007 at 9:12 am
Something I mentioned earlier was 'geographic spam pages': pages with duplicate content that use different place names (towns, counties, etc.). These are always a real pain and are extremely common practice for web design companies, but I have noticed them in various industries. Looking at a few searches now, the same companies are still ranking for the same geographic terms using geographic spam pages. Do you have any input on this? Another culprit of geographic SERP littering is business directories that just have lists of businesses on them. P.S. Is it possible to use nl2br on the comments (if you can do that with WordPress) or is the br element enabled?
July 20, 2007 at 2:21 pm
Mutiny - please send me an email with some examples of 'geographic spam pages'. I will try to dig deeper to give a more accurate response. There are a couple of bugs in the comments system that I want to solve. Let me see if I can find some time early next week.
July 21, 2007 at 2:33 am
I'm not sure that speaking these facts to the wide public is a good idea. Or maybe it is - c'mon Hamlet, spread it. Let's get more people interested in spamming and help put the sites of people not interested in SEO straight into the flames of spammy net business. I guess your blog will become more popular ]:->
October 14, 2007 at 11:08 am
My two cents (playing devil's advocate here): it is an interesting proposition to allow registering of content. The problem I see with this approach is the likelihood of false positives (identifying scrapers as legitimate content owners). However, this is a step in establishing trust with a small set of publishers. I would guess that only after some form of algorithmic or human review would a site have access to this mechanism. In that sense, any DMCA or copy-spam complaint would revoke current and future access to such a system. Said system would not be effective in the "pure wild" ;) Filtering and sorting duplicate content at the "wild" level will need another approach. I'll comment on <a href="http://preview.hamletbatista.com/2007/10/11/like-flies-to-project-honeypot-revisiting-the-cgi-proxy-hijack-problem/" rel="nofollow">http://preview.hamletbatista.com/2007/10/11/like-flies-to...</a> about my suggestions for dealing with the "wild" internet.
October 14, 2007 at 3:19 pm
<blockquote>The problem I see with this approach is the likeliness of false positives. (identifying scrapers as legitimate content)</blockquote> Thanks, Richard. Please note that the scrapers access the content via RSS, etc. If the blogging software performs the registration before publishing the content or the RSS feed, the content author will have a head start. There would be no way for scrapers to access the content before it is registered.
October 15, 2007 at 1:50 am
Think like a bad guy: scrape the content and register it, stealing from a site or blog that has not implemented this process. This would create a false positive and put all sites that do not subscribe to this technology at a disadvantage. The possibility of every single legitimate content owner using a uniform method is slim - even today we have sites that don't conform to W3C standards or don't even work in various browsers. It is possible that this might make it easier for illegitimate sites to steal content.
October 15, 2007 at 5:42 am
But don't give up hope entirely. With some human review, it could be possible to whitelist some sites for inclusion in this system. It would only work if the whitelist was limited to these human reviews - or a similar method of content ownership verification. Definitely not a dead idea. But it must have clear boundaries and limits.