The Never Ending SERPs Hijacking Problem: Is there a definite solution?

by Hamlet Batista | July 03, 2007 | 6 Comments

In 2005 it was the infamous 302 temporary-redirect page hijacking. That was supposedly fixed, according to Matt Cutts. Now there is an interesting new twist: hijackers have found another exploitable hole in Google, the use of CGI proxies to hijack search engine rankings.

The problem is basically the same: two URLs pointing to the same content. Google's duplicate content filters kick in and drop one of the URLs, normally the page with the lower PageRank. That is Google's core problem: it needs a better way to identify the original author of a page.

When someone blatantly copies your content and hosts it on their site, you can get the offending page taken down by sending a DMCA complaint to Google, et al. The problem with 302 redirects and CGI proxies is that no content is being copied. The hijackers are simply tricking the search engine into believing there are multiple URLs hosting the same content.

What is a CGI proxy anyway? Glad you asked. I love explaining technical things 🙂

A CGI proxy is a type of proxy server that is accessed via URLs. Anonymizer.com is a well-known example. Normal proxy servers are configured in your browser's advanced options. I explained proxy servers briefly in another post.

I am sure most hijackers are using a well-known public CGI proxy called CGIProxy, by James Marshall. I have used it in the past (not for hijacking) and I can say that it is very easy to set up and use. The SSL and connection code is a little complex if you are not familiar with socket programming and encryption, but the fact that it is written in Perl makes a lot of things easy.

The CGI proxy in the hijacking context works like this:

  1. Googlebot finds and fetches a proxied URL (http://cgiproxyserver/http/yoursite.com).

  2. The CGI proxy script pulls the page from your site (http://yoursite.com).

  3. The CGI script rewrites all your internal links (http://yoursite.com/page1.htm becomes http://cgiproxyserver/http/yoursite.com/page1.htm). This makes sure the search engine bot continues to request the pages from the CGI proxy.

  4. The page Google fetches via the CGI proxy is identical to the page on your site.

  5. Duplicate content filters kick in, and the rest is history.
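
To make step 3 concrete, here is a minimal sketch of the link rewriting, written in Python for illustration rather than in CGIProxy's own Perl. The function name `rewrite_links` is hypothetical, the host names are the placeholders from the steps above, and a real proxy handles many more cases (relative links, HTTPS, forms).

```python
import re

# Placeholder proxy address taken from the steps above, not a real host.
PROXY_PREFIX = "http://cgiproxyserver/http/"


def rewrite_links(html: str, site_host: str) -> str:
    """Point every absolute http:// link to site_host back through the proxy."""
    pattern = re.compile(r"http://" + re.escape(site_host))
    return pattern.sub(PROXY_PREFIX + site_host, html)


page = '<a href="http://yoursite.com/page1.htm">Page 1</a>'
print(rewrite_links(page, "yoursite.com"))
```

Because only the link targets change, the text the crawler sees at the proxied URL matches your page, which is what triggers the duplicate content filter.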

Should you care? If you don't mind losing your search engine rankings to a hijacker, there is nothing to worry about. If you do, please keep reading.

How can I prevent this from happening to me?

One solution that I've seen in the forums, and that Google recommends, is to verify that an IP address claiming to be Googlebot actually is. Let me briefly explain the proposed solution and its drawbacks.

The solution is known as a reverse-forward DNS check. Email servers have used it for a while to identify valid, non-spamming SMTP hosts.

The detection code does a reverse DNS lookup (IP(a) -> host(a)), followed by a forward DNS lookup (host(a) -> IP(b)). IP(a) and IP(b) must be the same, and the host name must end in the robot's domain name (googlebot.com or google.com). This is similar to the double opt-in process used to validate leads in email marketing.
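
The check described above could be sketched roughly as follows in Python, using the standard `socket` module. The function name `is_verified_googlebot` and the `ALLOWED_SUFFIXES` tuple are assumptions made for this illustration, and the resolver functions are passed as parameters only so the logic can be exercised without a network.

```python
import socket

# Assumed list of domains the crawler's host name may end in.
ALLOWED_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse=socket.gethostbyaddr,
                          forward=socket.gethostbyname_ex):
    """Reverse-forward DNS check for an IPv4 address claiming to be Googlebot."""
    try:
        host = reverse(ip)[0]             # reverse lookup: IP(a) -> host(a)
    except OSError:                       # no PTR record or lookup failure
        return False
    if not host.lower().endswith(ALLOWED_SUFFIXES):
        return False                      # host is outside the robot's domain
    try:
        addresses = forward(host)[2]      # forward lookup: host(a) -> IP(b)
    except OSError:
        return False
    return ip in addresses                # IP(a) must be among the IP(b) results
```

Matching on the end of the host name rather than anywhere inside it matters: a hijacker controls the reverse DNS for their own IP and could otherwise name a host google.com.example.net.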

This is the best solution I could take from what I read on the forums. Unfortunately there is a problem.

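Doing this for every single hit, or for every new IP address that reaches your site, is not a good idea; the server would be brought to its knees. The proposed solution therefore recommends running the check only for requests whose user agent identifies them as a bot. Unfortunately, that information is very easy for hijackers to fake: the proxy can rewrite the user agent before it forwards Googlebot's request to your site, so the request never triggers the check.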

# Perl: disguise Googlebot's user agent as an ordinary browser
$USER_AGENT =~ s/Googlebot/Mozilla/ if $USER_AGENT =~ /Googlebot/;
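
A small Python sketch shows why that one-line rewrite is enough to slip past a check that is gated on the user agent. Both function names are hypothetical: `needs_dns_check` stands in for the gate on your site, and `proxy_forwarded_user_agent` for the rewrite on the proxy.

```python
def needs_dns_check(user_agent: str) -> bool:
    """Site-side gate: only requests claiming to be Googlebot get the DNS check."""
    return "googlebot" in user_agent.lower()


def proxy_forwarded_user_agent(incoming: str) -> str:
    """Proxy-side rewrite: swap the Googlebot token before contacting the site."""
    return incoming.replace("Googlebot", "Mozilla")


googlebot_ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
forwarded_ua = proxy_forwarded_user_agent(googlebot_ua)

print(needs_dns_check(googlebot_ua))   # the real crawler would be checked
print(needs_dns_check(forwarded_ua))   # the proxied request skips the check
```

The request that reaches your server comes from the proxy's IP with a browser-like user agent, so the DNS check never runs and the page is served as usual.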

The solution needs to be strengthened with IP address detection too. It is a little complicated, but here is the main idea.

  1. Collect the IP addresses of all the hijacking proxy servers by setting up honeypots or similar traps. I will talk about this in more detail in a later post.

  2. Verify each visiting IP against that database. For efficiency it is better to use a DNS server to maintain the IP database.

This is probably best done on a public server run by volunteers, similar to email anti-spam efforts.
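
One way to picture the DNS-backed database is the convention that email anti-spam blocklists use: reverse the octets of the IP, append the blocklist's zone, and treat any successful resolution as "listed". The sketch below assumes that convention; the zone name `hijackers.dnsbl.example` and both function names are made up for illustration, and no such public list is implied to exist.

```python
import socket

# Hypothetical blocklist zone; a real deployment would use its own domain.
BLOCKLIST_ZONE = "hijackers.dnsbl.example"


def blocklist_query_name(ip: str, zone: str = BLOCKLIST_ZONE) -> str:
    """Build the DNS name to query: reversed IPv4 octets followed by the zone."""
    octets = ip.split(".")
    if len(octets) != 4:
        raise ValueError("expected a dotted IPv4 address")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip: str, resolve=socket.gethostbyname) -> bool:
    """Return True when the blocklist zone has a record for this IP."""
    try:
        resolve(blocklist_query_name(ip))
    except OSError:          # name does not resolve: the IP is not listed
        return False
    return True


print(blocklist_query_name("192.0.2.10"))
```

Because DNS answers are cached by resolvers, repeat visits from the same proxy IP cost very little, which is the efficiency argument for keeping the database in DNS.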

Happy anti-hijacking, as we wait for the ideal solution, which is for Google to fix it.

