Duplicate content is one of the most common causes of concern among webmasters. We work hard to produce original, useful content, and all it takes is a malicious SERP (Search Engine Results Page) hijacker to copy it and pass it off as his or her own. Not nice.
More troubling still is the way Google handles the issue. In my previous post about CGI hijacking, I made clear that the main problem with hijacking and content scraping is that search engines cannot reliably determine who owns the content and, therefore, which page should stay in the index. When faced with multiple pages that have exactly or nearly the same content, Google's filters flag them as duplicates. Google's usual course of action is to keep only one of the pages in the index, typically the one with the higher PageRank; the rest are tossed out. Unless there is enough evidence to show that the owner or owners are trying to do something manipulative, there is no need to worry about penalties.
Recently, regular reader Jez asked me a thought-provoking question. I'm paraphrasing here, but essentially he wanted to know: "Why doesn’t Google consider the age of the content to determine the original author?” I responded that the task is not as trivial as it may seem at first, and I promised a more thorough explanation. Here it is.
A similar idea was proposed at SMX Advanced. This quote is from Google's Webmaster Central blog:
Providing a way to authenticate ownership of content
This would provide search engines with extra information to help ensure we index the original version of an article, rather than a scraped or syndicated version. Note that we do a pretty good job of this now and not many people in the audience mentioned this to be a primary issue. However, the audience was interested in a way of authenticating content as an extra protection. Some suggested using the page with the earliest date, but creation dates aren't always reliable. Someone also suggested allowing site owners to register content, although that could raise issues as well, as non-savvy site owners wouldn't know to register content and someone else could take the content and register it instead. We currently rely on a number of factors such as the site's authority and the number of links to the page. If you syndicate content, we suggest that you ask the sites who are using your content to block their version with a robots.txt file as part of the syndication arrangement to help ensure your version is served in results.
Why not use the creation date to tell who the owner of the content is?
The creation date is easy to forge. A scraper can modify his or her content management system to backdate anything lifted from an RSS feed. The moment you hit publish, you are also syndicating your content, and any scraper using your feed to power a site will have it. It's frustrating, but that is the unfortunate reality.
The problem with Google's approach
As I mentioned before and as you can read in the above quote, Google favors the page with the most links and authority. This is good for most cases, but let's take an example close to home: my blog.
My blog is relatively new. All my content is original, but I have very few incoming links and any scraper that has been doing it for a while will probably have far more links than I do. Where does that leave me? Many of my pages will be considered duplicate content and be tossed out of the index.
Their suggestion to ask syndicating sites to exclude their copies is a good one, as long as those sites are syndicating with your permission. Policing scrapers over the use of your content is a completely different matter.
One simple solution I decided to use for this blog is to add a copyright notice to my feed with a link back to my blog. The benefit is that every site scraping my content will be linking back, improving my blog's authority and incoming links and reducing the overall probability that my pages will be dropped from the index. This is not perfect, as there may be instances where scrapers have more links than I do, but it is better than what I had before. The other benefit is that readers will see the notice and know where to find the original source. Hopefully some scrapers will be discouraged and will look elsewhere for content to steal. Removing the text in the footer is trivial, but I know most scrapers are lazy. (Otherwise they would create their own content, right?) You can check my copyright message in your feed reader.
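As a rough illustration, here is how a feed plugin might append such a notice to each item before the feed is served. The function and names below are hypothetical (a WordPress blog would do this with a PHP filter), but the logic is the same:

```python
def add_copyright_footer(item_html: str, post_url: str, blog_name: str) -> str:
    """Append a copyright notice with a link back to the original post."""
    notice = (
        '<p><small>This post originally appeared on '
        f'<a href="{post_url}">{blog_name}</a>. All rights reserved.</small></p>'
    )
    return item_html + notice

# Example: what a scraped copy of a feed item would end up containing
feed_item = "<p>My original article text...</p>"
print(add_copyright_footer(feed_item, "https://example.com/my-post", "My Blog"))
```

Even a lazy scraper republishing this item verbatim would then carry the link back to the original post.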
My proposed solution
I think the only way to authenticate content ownership is to register it at a trusted place with digital signatures. I can anticipate your immediate objection: as mentioned in the quote, non-savvy users might not know to register their content, and some smart scraper could register it for them. But I have an answer for that as well.
What if content management systems and blogging platform providers integrated this into their software? When you hit publish, your blogging software pings technorati.com and other services to alert them of your new content. Why not use the same framework to alert a content registration service? When you hit publish, the publishing software would ping the content registration service and identify your content as original with a content checksum, a digital signature, and a registration date.
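To make the idea concrete, here is a minimal Python sketch of what such a registration ping might contain. Everything here is an assumption on my part: the service URL, the payload fields, and the use of an HMAC as a stdlib-only stand-in for a real signature (a production system would sign with an asymmetric key pair so the service could verify the author's public key):

```python
import hashlib
import hmac
import json
import time

# Hypothetical site secret; a real system would use a private key instead.
SITE_SECRET = b"my-site-private-key"

def register_content(content: str, post_url: str) -> dict:
    """Build the registration payload the publishing software would send."""
    # Fingerprint the exact published text
    checksum = hashlib.sha256(content.encode("utf-8")).hexdigest()
    # Prove the ping came from the site that holds the secret
    signature = hmac.new(SITE_SECRET, checksum.encode(), hashlib.sha256).hexdigest()
    payload = {
        "url": post_url,
        "checksum": checksum,
        "signature": signature,
        "registered_at": int(time.time()),
    }
    # A real implementation would now POST json.dumps(payload) to the
    # (hypothetical) registration service *before* the feed goes live.
    return payload

print(json.dumps(register_content("My original article text...",
                                  "https://example.com/my-post"), indent=2))
```

The crucial detail is ordering: the ping fires before the RSS feed is published, so the original author's registration always predates any scraper's.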
If somebody changes your content slightly, they will still be registering it at a later date, and luckily, search engines are good at detecting near duplicates. They would only need to check the service when they find duplicates, and the service could tell them who the original author is. Search engine companies cooperated on the Sitemaps initiative; they can certainly do the same for something like this.
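The near-duplicate check can be sketched with word shingles and Jaccard similarity, a standard duplicate-detection technique (the texts below are made up for illustration):

```python
def shingles(text: str, k: int = 3) -> set:
    """Break text into overlapping k-word phrases."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: str, b: str, k: int = 3) -> float:
    """Similarity of two texts: shared shingles over total shingles."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

original = "search engines are good at detecting near duplicate pages on the web"
scraped  = "search engines are good at detecting near duplicate pages across the web"
print(round(jaccard(original, scraped), 2))  # prints 0.54: still clearly related
```

A scraper's lightly edited copy stays far more similar to the original than two unrelated pages would, which is what lets the engine know the registration service is worth consulting.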
For regular websites things are admittedly a little trickier, but creating modules that detect new content and ping the registration service should not be a big challenge.
What do you think of this solution? I’m opening up the floor to your criticism and your ideas.
July 20, 2007 at 12:48 am
Hi Hamlet, An interesting article. When I mentioned using dates, I was thinking of the date Google first found that content, as opposed to the creation date given by the site. Your point about scrapers is a good one though; a site scraping fresh content has a good chance of being indexed before the original source. Also, your copyright notice may work on some syndicating sites, but could be stripped out by a more determined SE spammer. Your idea of "first ping" is a good one. It would also be possible to time-delay the RSS feed, putting a greater time-distance between your ping and that of a scraper. If this were introduced though, there could be a gold-rush as SE spammers raced to re-publish all the content on the internet using this method :-)
July 20, 2007 at 2:18 pm
Jez, The crawled date is unreliable too. The crawler could visit the scraper's site before yours. <blockquote> It would also be possible to time delay the rss feed, putting a greater time-distance between your ping and that of a scraper.</blockquote> I like your idea of introducing a delay or not releasing the feed until the content is registered. Well done! <blockquote> If this were introduced though, there could be a gold-rush as SE Spammers raced to re-publish all the content on the internet using this method</blockquote> I guess there would need to be some legal consequences for people registering content that is not theirs.
July 20, 2007 at 3:31 am
I like where you are going with this, but what about sites like Wikipedia that license content to other sites like Answers.com? I'm not trying to punch any holes in your idea. I hope something like this comes about. To do so, lots of angles have to be considered. It sounds like you already have a great start.
July 20, 2007 at 5:17 am
Geoff - This is just a raw idea. I want you to find holes, and hopefully together we can come up with a stronger plan. For authorized syndication partners I favor Google's proposal: <blockquote> If you syndicate content, we suggest that you ask the sites who are using your content to block their version with a robots.txt file as part of the syndication arrangement to help ensure your version is served in results. </blockquote>
July 20, 2007 at 9:12 am
Something I mentioned earlier was 'geographic spam pages': pages with duplicate content that use different place names (towns, counties, etc.). These are always a real pain and are extremely common practice for web design companies, but I have noticed them in various industries. Looking at a few searches now, the same companies are still ranking for the same geographic terms using geographic spam pages. Do you have any input on this? Another culprit of geographic SERP littering is business directories that just have lists of businesses on them. P.S. Is it possible to use nl2br on the comments (if you can do that with WordPress) or is the br element enabled?
July 20, 2007 at 2:21 pm
Mutiny - please send me an email with some examples of 'geographic spam pages'. I will try to dig deeper to give a more accurate response. There are a couple of bugs in the comments system that I want to solve. Let me see if I can find some time early next week.
July 21, 2007 at 2:33 am
I'm not sure that speaking these facts to the wide public is a good idea. Or maybe it is - c'mon Hamlet, spread it. Let's get more people interested in spamming and help put the sites of people not interested in SEO straight into the flames of spammy net business. I guess your blog will become more popular ]:->
October 14, 2007 at 11:08 am
My two cents (playing devil's advocate here): it is an interesting proposition to allow registering of content. The problem I see with this approach is the likelihood of false positives (identifying scrapers as legitimate content owners). However, this is a step in establishing trust with a small set of publishers. I would guess that only after some form of algorithmic or human review would a site have access to this mechanism. In that sense, any DMCA or copy-spam complaint would revoke current and future access to such a system. Said system would not be effective in the "pure wild" ;) Filtering and sorting duplicate content at the "wild" level will need another approach. I'll comment on <a href="http://preview.hamletbatista.com/2007/10/11/like-flies-to-project-honeypot-revisiting-the-cgi-proxy-hijack-problem/" rel="nofollow">http://preview.hamletbatista.com/2007/10/11/like-flies-to...</a> about my suggestions for dealing with the "wild" internet.
October 14, 2007 at 3:19 pm
<blockquote>The problem I see with this approach is the likeliness of false positives. (identifying scrapers as legitimate content)</blockquote> Thanks, Richard. Please note that the scrapers access the content via RSS, etc. If the blogging software performs the registration before publishing the content or the RSS feed, the content author will have a head start. There would be no way for scrapers to access the content before it is registered.
October 15, 2007 at 1:50 am
Think like a bad guy: scrape the content and register it, stealing from a site or blog that has not implemented this process. This would create a false positive and put all sites that do not subscribe to this technology at a disadvantage. The possibility of every single legitimate content owner using a uniform method is slim - even today we have sites that don't conform to W3C standards or don't even work in various browsers. It is possible that this might make it easier for illegitimate sites to steal content.
October 15, 2007 at 5:42 am
But don't give up hope entirely. With some human review, it could be possible to whitelist some sites for inclusion in this system. It would only work if the whitelist was limited to these human reviews - or a similar method of content ownership verification. Definitely not a dead idea. But it must have clear boundaries and limits.