In 2005 it was the infamous 302 temporary-redirect page hijacking. That was supposedly fixed, according to Matt Cutts. Now there is an interesting new twist. Hijackers have found another exploitable hole in Google: the use of CGI proxies to hijack search engine rankings.
The problem is basically the same: two URLs pointing to the same content. Google's duplicate content filters kick in and drop one of the URLs, normally the one with the lower PageRank. That is Google's core problem. They need to find a better way to identify the original author of the page.
When someone blatantly copies your content and hosts it on their site, you can take the offending page down by sending a DMCA complaint to Google, et al. The problem with 302 redirects and CGI proxies is that no content is being copied. They simply trick the search engine into believing there are multiple URLs hosting the same content.
What is a CGI proxy anyway? Glad you asked. I love explaining technical things 🙂
A CGI proxy is a type of proxy server that is accessed through URLs. Anonymizer.com is a well-known example. Normal proxy servers are configured in your browser's advanced options. I explained proxy servers briefly in another post.
I am sure most hijackers are using a very well-known and public CGI proxy called CGIProxy by James Marshall. I have used it in the past (not for hijacking) and I can say that it is very easy to set up and use. The SSL and connection code is a little complex if you are not familiar with socket programming and encryption, but the fact that it is written in Perl makes a lot of things easy.
In the hijacking context, the CGI proxy works like this:
1. Googlebot finds and fetches a proxied URL (http://cgiproxyserver/http/yoursite.com).
2. The CGI proxy script pulls the page from your site (http://yoursite.com).
3. The CGI script rewrites all your internal links (http://yoursite.com/page1.htm becomes http://cgiproxyserver/http/yoursite.com/page1.htm). This makes sure the search engine bot continues to request the pages from the CGI proxy.
4. The page Google fetches via the CGI proxy is identical to the page on your site.
5. Duplicate content filters kick in and the rest is history.
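The link rewriting in step 3 can be sketched in a few lines of Python. This is only an illustration of the idea, not how CGIProxy itself is implemented; the function name `rewrite_links` and the `PROXY_PREFIX` constant are made up for this example, and it only handles absolute http:// links.

```python
import re

# Hypothetical proxy prefix, following the URL pattern used in the steps above.
PROXY_PREFIX = "http://cgiproxyserver/http/"

def rewrite_links(html: str, site_host: str) -> str:
    """Rewrite absolute links to site_host so they point back through the proxy."""
    # Match "http://<site_host>" only when the host name ends there, so that
    # look-alike hosts such as yoursite.com.example.net are left untouched.
    pattern = re.compile(r"http://(" + re.escape(site_host) + r")(?=[/\"'\s>]|$)")
    return pattern.sub(lambda m: PROXY_PREFIX + m.group(1), html)

page = '<a href="http://yoursite.com/page1.htm">Page 1</a>'
print(rewrite_links(page, "yoursite.com"))
# prints: <a href="http://cgiproxyserver/http/yoursite.com/page1.htm">Page 1</a>
```

Every link the bot follows from the proxied page now leads back to the proxy, which is why a single proxied URL is enough to get a whole site crawled under the proxy's host name.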
Should you care? If you don't mind losing your search engine rankings to a hijacker, there is nothing to worry about. If you do, please keep reading.
How can I prevent this from happening to me?
One solution that I've seen in the forums, and that Google recommends, is to verify that an IP claiming to be Googlebot actually is. Let me briefly explain the solution they propose, as well as its drawbacks.
The solution is known as a reverse-forward DNS check. Email servers have used this for a while to detect valid, non-spamming SMTP hosts.
The detection code does a reverse DNS lookup (IP(a) -> host(a)), followed by a forward DNS lookup (host(a) -> IP(b)). IP(a) and IP(b) must be the same, and the host name must belong to the robot's domain (googlebot.com or google.com). This is similar to the double opt-in process used to validate email leads in email marketing.
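Here is a minimal sketch of that check in Python, using the standard `socket` module. The function name `is_verified_bot` and the `TRUSTED_SUFFIXES` tuple are assumptions for this example. The two resolver parameters exist only so the logic can be tried without touching real DNS; by default they fall back to `socket.gethostbyaddr` and `socket.gethostbyname_ex`.

```python
import socket

# Domains accepted for the crawler's host name. These are an assumption for
# this sketch; adjust them for whichever robot you are verifying.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")

def is_verified_bot(ip, reverse=None, forward=None):
    """Return True if ip passes the reverse-forward DNS check."""
    # Resolvers are injectable so the logic can be exercised without network access.
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse(ip)                      # IP(a) -> host(a)
        if not host.lower().rstrip(".").endswith(TRUSTED_SUFFIXES):
            return False                        # host is outside the robot's domain
        return ip in forward(host)              # host(a) -> IP(b); need IP(a) == IP(b)
    except OSError:
        # Covers socket.herror and socket.gaierror: a failed lookup means unverified.
        return False
```

The sketch checks that the host name ends with the trusted domain rather than merely contains it, since a hijacker controls the reverse DNS for their own IPs and could name a host google.com.example.net. The forward lookup is what makes the check hard to fake: only the real owner of the domain can make the host name resolve back to the same IP.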
This is the best solution I could take from what I read on the forums. Unfortunately there is a problem.
Doing this for every single hit or new IP address that reaches your site is not a good idea. The server would be brought to its knees. The proposed solution recommends that you first identify the bots by user agent and only verify those. Unfortunately, that information is very easy for hijackers to fake. One line of Perl in the proxy script is enough:
# Present the crawler's request to your site as an ordinary browser
$USER_AGENT =~ s/Googlebot/Mozilla/ if $USER_AGENT =~ /Googlebot/;
The solution needs to be strengthened with IP address detection too. It is a little bit complicated but here is the main idea.
1. Collect the IPs of all the hijacking proxy servers by setting up honeypots or similar traps. I will talk about this in more detail in a later post.
2. Verify each visiting IP against that database. For efficiency, it is better to use a DNS server to maintain the IP database.
This is probably better done on a public server run by volunteers, similar to email anti-spam efforts.
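Email anti-spam blocklists are usually queried over DNS by reversing the octets of the IP and appending the blocklist's zone; if the name resolves, the IP is listed. A rough sketch of the same idea for proxy IPs follows. The zone `hijackers.example.org` is purely hypothetical (no such public service is named here), the function names are made up for this example, and only IPv4 addresses are handled.

```python
import socket

# Hypothetical blocklist zone; replace with a real one if such a service exists.
BLOCKLIST_ZONE = "hijackers.example.org"

def blocklist_query_name(ip: str, zone: str = BLOCKLIST_ZONE) -> str:
    """Build the DNS name to query: octets reversed, then the zone appended."""
    return ".".join(reversed(ip.split("."))) + "." + zone

def is_listed(ip: str, resolve=socket.gethostbyname) -> bool:
    """Return True if the blocklist zone has an address record for this IP."""
    try:
        resolve(blocklist_query_name(ip))
        return True
    except OSError:
        # socket.gaierror (name not found) means the IP is not on the list.
        return False

print(blocklist_query_name("192.0.2.10"))
# prints: 10.2.0.192.hijackers.example.org
```

Because the answers come from DNS, they are cached by your resolver, which avoids the per-hit cost that makes the reverse-forward check too expensive to run on every request.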
Happy anti-hijacking, as we wait for the ideal solution, which is for Google to fix it.