Duplicate content is one of the most common causes of concern among webmasters. We work hard to provide original and useful content, and all it takes is a malicious SERP (Search Engine Results Page) hijacker to copy our content and pass it off as his or her own. Not nice.
More troubling still is the way that Google handles the issue. In my previous post about CGI hijacking, I made it clear that the main problem with hijacking and content scraping is that search engines do not reliably determine who owns the content and, therefore, which page should stay in the index. When faced with multiple pages that have exactly the same or nearly the same content, Google's filters flag them as duplicates. Google's usual course of action is to keep only one of the pages in the index: the one with the higher PageRank. The rest are tossed out. Unless there is enough evidence to show that the owner or owners are trying to do something manipulative, there is no need to worry about penalties.
Recently, regular reader Jez asked me a thought-provoking question. I'm paraphrasing here, but essentially he wanted to know: “Why doesn’t Google consider the age of the content to determine the original author?” I responded that the task is not as trivial as it may seem at first, and I promised a more thorough explanation. Here it is.
Providing a way to authenticate ownership of content
This would provide search engines with extra information to help ensure we index the original version of an article, rather than a scraped or syndicated version. Note that we do a pretty good job of this now and not many people in the audience mentioned this to be a primary issue. However, the audience was interested in a way of authenticating content as an extra protection. Some suggested using the page with the earliest date, but creation dates aren't always reliable. Someone also suggested allowing site owners to register content, although that could raise issues as well, as non-savvy site owners wouldn't know to register content and someone else could take the content and register it instead. We currently rely on a number of factors such as the site's authority and the number of links to the page. If you syndicate content, we suggest that you ask the sites who are using your content to block their version with a robots.txt file as part of the syndication arrangement to help ensure your version is served in results.
Why not use the creation date to tell who the owner of the content is?
The creation date can be easily forged. A scraper can modify his or her content management system to backdate any content scraped from RSS feeds. As soon as you hit publish, you are also syndicating your content, and any scraper using your feed to power his or her site will have your content. It’s frustrating, but that is the unfortunate reality.
The problem with Google's approach
As I mentioned before and as you can read in the above quote, Google favors the page with the most links and authority. This is good for most cases, but let's take an example close to home: my blog.
My blog is relatively new. All my content is original, but I have very few incoming links and any scraper that has been doing it for a while will probably have far more links than I do. Where does that leave me? Many of my pages will be considered duplicate content and be tossed out of the index.
Google's suggestion to ask syndicating sites to block their copies is good, as long as those sites are syndicating with your permission. Policing scrapers over the use of your content is a completely different matter.
One simple solution that I decided to use for this blog is to add a copyright notice to my feed with a link back to my blog. The benefit is that every site scraping my content will be linking back, improving my blog’s authority and incoming links and reducing the overall probability that my pages will be dropped from the index. This is not perfect, as there might be instances where scrapers have more links, but it is better than what I had before. The other benefit is that readers will see the notice and know where to find the original source. Hopefully some scrapers will be discouraged and will look for content to steal elsewhere. Removing the text in the footer is trivial, but I know most scrapers are lazy. (Otherwise they would create their own content, right?) You can check my copyright message in your feed reader.
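For readers who build their own feeds, here is a minimal sketch of the idea in Python. It is not the code running on this blog; the function name add_copyright_notice and its parameters are made up for illustration, and it assumes you can post-process each item's HTML before the feed is written out.

```python
from html import escape


def add_copyright_notice(item_html: str, post_url: str, site_name: str, year: int) -> str:
    """Append a copyright footer with a link back to the original post.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # Escape values so an odd character in the URL or site name
    # cannot break the surrounding markup.
    notice = (
        f'<p>Copyright {year} <a href="{escape(post_url, quote=True)}">'
        f"{escape(site_name)}</a>. This is where the article was first published.</p>"
    )
    return item_html + "\n" + notice


if __name__ == "__main__":
    body = "<p>My original article text.</p>"
    print(add_copyright_notice(body, "https://example.com/my-post", "My Blog", 2007))
```

Any site that republishes the feed item unchanged then carries a link to the original URL.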
My proposed solution
I think the only way to authenticate content ownership is by somehow registering it at a trusted place with digital signatures. I know your immediate response. As mentioned in the quote, the problem is that non-savvy users might not know how to register their content and some smart scraper can register it themselves. But I have an answer for that as well.
What if content management systems and blogging platform providers integrated this into their software? When you hit publish, your blogging software pings technorati.com and other services to alert them of your new content. Why not use the same framework to alert a content registration service? When you hit publish, the publishing software would ping the content registration service and identify your content as original with a content checksum, digital signature and registration date.
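To make the idea concrete, here is a rough sketch of the record a publishing tool might build before pinging such a service. No such service exists, so everything here is an assumption: the field names, the build_registration and verify_registration helpers, and especially the signature, which uses an HMAC with a shared secret as a simple stand-in. A real service would more likely use public-key signatures, and it would stamp the registration date itself, since a date supplied by the client can be forged just like a creation date.

```python
import hashlib
import hmac
from datetime import datetime, timezone


def normalize(text: str) -> str:
    """Collapse whitespace so trivial formatting changes keep the same checksum."""
    return " ".join(text.split())


def _message(url: str, checksum: str, registered_at: str) -> bytes:
    # Sign URL, checksum and date together so none can be swapped afterwards.
    return f"{url}\n{checksum}\n{registered_at}".encode("utf-8")


def build_registration(url: str, content: str, secret_key: bytes) -> dict:
    """Build a hypothetical registration record for a newly published post."""
    checksum = hashlib.sha256(normalize(content).encode("utf-8")).hexdigest()
    registered_at = datetime.now(timezone.utc).isoformat()
    # HMAC is a stand-in for a real digital signature (assumption).
    signature = hmac.new(
        secret_key, _message(url, checksum, registered_at), hashlib.sha256
    ).hexdigest()
    return {
        "url": url,
        "checksum": checksum,
        "registered_at": registered_at,
        "signature": signature,
    }


def verify_registration(record: dict, secret_key: bytes) -> bool:
    """Check that the record was signed with the given key and not altered."""
    expected = hmac.new(
        secret_key,
        _message(record["url"], record["checksum"], record["registered_at"]),
        hashlib.sha256,
    ).hexdigest()
    return hmac.compare_digest(expected, record["signature"])


if __name__ == "__main__":
    record = build_registration("https://example.com/my-post", "My original text.", b"demo-key")
    print(record["checksum"], verify_registration(record, b"demo-key"))
```

The publishing software would send this record in the same ping it already sends to update services.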
If somebody changes your content slightly, they will register it at a later date. Luckily, search engines are good at detecting near duplicates. They would only need to check the service when they find duplicates and the service could tell them who the original author is. Search engine companies cooperated on the sitemaps initiative; they certainly can do the same for something like this.
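As a toy illustration of that lookup, the sketch below compares texts using word shingles and Jaccard similarity, then picks the earliest registration among the near duplicates. This is one well-known way to detect near duplicates, not a description of what any search engine actually does. The registry layout, the find_original function and the 0.5 threshold are all assumptions made for the example.

```python
from datetime import datetime


def shingles(text: str, size: int = 3) -> set:
    """Return the set of overlapping word sequences of the given size."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: str, b: str, size: int = 3) -> float:
    """Similarity between 0.0 and 1.0 based on shared shingles."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def find_original(page_text: str, registry: list, threshold: float = 0.5):
    """Among registered near duplicates of page_text, return the earliest one.

    registry is a list of dicts with "url", "text" and "registered_at"
    (ISO 8601 strings); this layout is hypothetical.
    """
    matches = [r for r in registry if jaccard(page_text, r["text"]) >= threshold]
    if not matches:
        return None
    return min(matches, key=lambda r: datetime.fromisoformat(r["registered_at"]))


if __name__ == "__main__":
    registry = [
        {"url": "https://scraper.example/copy",
         "text": "search engines do not reliably determine who is the real owner of the content",
         "registered_at": "2007-06-02T09:00:00+00:00"},
        {"url": "https://example.com/original",
         "text": "search engines do not reliably determine who is the owner of the content",
         "registered_at": "2007-06-01T10:00:00+00:00"},
    ]
    found = find_original(registry[0]["text"], registry)
    print(found["url"])
```

The slightly altered copy still matches the original, and because it was registered later, the lookup points back to the first registration.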
For regular websites, things are admittedly a little bit trickier. But creating modules that detect new content and ping the registration service should not be a big challenge.
What do you think of this solution? I’m opening up the floor to your criticism and your ideas.