In 2005 it was the infamous 302 temporary-redirect page hijacking. That was supposedly fixed, according to Matt Cutts. Now there is an interesting new twist: hijackers have found another exploitable hole in Google, the use of CGI proxies to hijack search engine rankings.
The problem is basically the same: two URLs pointing to the same content. Google's duplicate content filters kick in and drop one of the URLs, normally the page with the lower PageRank. That is Google's core problem. They need to find a better way to identify the original author of a page.
When someone blatantly copies your content and hosts it on their site, you can take the offending page down by sending a DMCA complaint to Google, et al. The problem with 302 redirects and cgi proxies is that there is no content being copied. They are simply tricking the search engine into believing there are multiple URLs hosting the same content.
What is a cgi proxy anyway? Glad you asked. I love explaining technical things 🙂
A CGI proxy is a type of proxy server that is accessible via URLs. Anonymizer.com is a well-known example. Normal proxy servers are configured in your browser's advanced options; I explained proxy servers briefly in another post.
I am sure most hijackers are using a very well known, public CGI proxy script called CGIProxy by James Marshall. I have used it in the past (not for hijacking) and I can say that it is very easy to set up and use. The SSL and connection code is a little complex if you are not familiar with socket programming and encryption, but the fact that it is written in Perl makes a lot of things easy.
The CGI proxy in the hijacking context works like this (a rough Perl sketch of the rewriting step follows the list):
- Googlebot finds and fetches a proxied URL (http://cgiproxyserver/http/yoursite.com).
- The CGI proxy script pulls the page from your site (http://yoursite.com).
- The CGI script rewrites all your internal links (http://yoursite.com/page1.htm becomes http://cgiproxyserver/http/yoursite.com/page1.htm). This ensures the search engine bot keeps requesting your pages through the CGI proxy.
- The page Google fetches via the CGI proxy is identical to the page on your site.
- Duplicate content filters kick in and the rest is history.
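To make this concrete, here is a stripped-down sketch of the fetch-and-rewrite step in Perl (the language CGIProxy itself is written in). This is not CGIProxy's actual code; the script name, the /http/ URL scheme and the hostnames are placeholders that simply follow the pattern in the list above.

#!/usr/bin/perl
# Minimal illustration of the proxy's fetch-and-rewrite step -- not CGIProxy itself.
use strict;
use warnings;
use LWP::Simple qw(get);

# A request such as http://cgiproxyserver/proxy.cgi/http/yoursite.com/page1.htm
# arrives with PATH_INFO set to /http/yoursite.com/page1.htm
my ($target) = ($ENV{PATH_INFO} || '') =~ m{^/http/(.+)$};
exit unless $target;

# Pull the real page from your site
my $page = get("http://$target") or exit;

# Rewrite internal links so the bot keeps requesting pages through the proxy
my $self = "http://$ENV{HTTP_HOST}$ENV{SCRIPT_NAME}/http";
$page =~ s{(href|src)\s*=\s*"http://([^"]+)"}{$1="$self/$2"}gi;

# What Googlebot receives under the proxy's URL is byte-for-byte your content
print "Content-Type: text/html\n\n";
print $page;

A handful of lines is all the rewriting really amounts to, which is why setting up the proxy is the easy part.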
Should you care? If you don't mind losing your search engine rankings to a hijacker, there is nothing to worry about. If you do, please keep reading.
How can I prevent this from happening to me?
One solution that I've seen in the forums, and that Google recommends, is to verify that an IP claiming to be Googlebot actually is Googlebot. Let me briefly explain the solution they propose, as well as its drawbacks.
The solution is what is known as a reverse-forward DNS check. Email servers have used this for a while to detect valid, non-spamming SMTP hosts.
The detection code does a reverse DNS lookup (IP(a) -> host(a)), followed by a forward DNS lookup (host(a) -> IP(b)). IP(a) and IP(b) must be the same, and the host name must include the robot's domain name (google.com). This is similar to the double opt-in process used to validate email leads in email marketing.
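Here is a sketch of that check in Perl. The googlebot.com and google.com suffixes are what Google's own verification instructions say the reverse lookup should resolve to; adjust the pattern for other bots.

use strict;
use warnings;
use Socket;

# Reverse-forward DNS check: IP(a) -> host(a) -> IP(b), and IP(b) must equal IP(a)
sub is_real_googlebot {
    my ($ip) = @_;

    # Reverse lookup: IP(a) -> host(a)
    my $packed = inet_aton($ip)               or return 0;
    my $host   = gethostbyaddr($packed, AF_INET) or return 0;

    # The host name must belong to Google's crawl infrastructure
    return 0 unless $host =~ /\.(?:googlebot|google)\.com$/;

    # Forward lookup: host(a) -> IP(b); it must match the original IP(a)
    my ($name, $aliases, $type, $len, @addrs) = gethostbyname($host);
    return 0 unless @addrs;
    return (grep { inet_ntoa($_) eq $ip } @addrs) ? 1 : 0;
}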
This is the best solution I could distill from what I read on the forums. Unfortunately, there is a problem.
Doing this for every single hit or new IP address to your site is not a good idea; the server would be brought to its knees. The proposed solution recommends you identify the bots by user agent. Unfortunately, that information is very easy for the hijackers to fake:
# The hijacker's proxy simply rewrites the user agent before forwarding the request
$USER_AGENT =~ s/Googlebot/Mozilla/ if $USER_AGENT =~ /Googlebot/;
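The intended flow, then, is to run the expensive check only for visitors whose user agent claims to be Googlebot, something along these lines (reusing the is_real_googlebot function above; the 403 response is just one possible reaction):

my $ua = $ENV{HTTP_USER_AGENT} || '';
my $ip = $ENV{REMOTE_ADDR}     || '';

if ($ua =~ /Googlebot/i && !is_real_googlebot($ip)) {
    # Claims to be Googlebot but does not resolve back to Google: refuse the page
    print "Status: 403 Forbidden\n\n";
    exit;
}

# A hijacking proxy that strips "Googlebot" from the user agent, as shown above,
# sails right past this check, which is why it is not enough on its own.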
The solution needs to be strengthened with IP address detection too. It is a little complicated, but here is the main idea:
- Collect the IPs of the hijacking proxy servers by setting up honeypots or similar traps. I will talk about this in more detail in a later post.
- Verify each visitor IP against that database. For efficiency it is better to use a DNS server to maintain the IP database, the way email blacklists (DNSBLs) do (see the sketch below).
This is probably better done on a public server with volunteers, similar to email anti-spam efforts.
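If the collected proxy IPs are published through a DNS zone, each check becomes a single, cacheable DNS query instead of a database hit. A minimal sketch, assuming a hypothetical zone name that any such volunteer effort would have to define:

use strict;
use warnings;
use Socket;

# Query a DNS-based blacklist of known hijacking proxies.
# "proxy-hijackers.example.org" is a placeholder zone, not a real service.
sub is_listed_proxy {
    my ($ip, $zone) = @_;
    # DNSBL convention: reverse the octets and append the zone,
    # e.g. 203.0.113.7 becomes 7.113.0.203.proxy-hijackers.example.org
    my $query = join('.', reverse split /\./, $ip) . ".$zone";
    # Any A record under the zone means the IP is listed
    return defined gethostbyname($query);
}

if (is_listed_proxy($ENV{REMOTE_ADDR} || '', 'proxy-hijackers.example.org')) {
    print "Status: 403 Forbidden\n\n";
    exit;
}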
Happy anti-hijacking, as we wait for the ideal solution, which is for Google to fix it.
Jez
July 3, 2007 at 11:50 pm
Does Google not consider the age of a page also though? If my content is 6 months old can someone still hijack it using this method? Furthermore, if it is based on PR, then they would need to get the PR of their proxy URLs up, which implies they also need a spam network to point at these URLs. In fact, as they are not aware of the links, would they not need to spider their own proxy to find the links to report back to their spam sites? Also, I assume they also have to cloak to avoid DMCA complaints for these pages? If they were left accessible via a URL then you would still be able to view and report the scraped content... I am just guessing here but it seems to me that setting up the proxy is the easy part!!
Hamlet Batista
July 4, 2007 at 8:13 am
<blockquote>Does Google not consider the age of a page also though? If my content is 6 months old can someone still hijack it using this method?</blockquote>
Jez, yes, Google does, but hijacking is still taking place, as you can read on the forums. This is not as trivial as it looks and requires a longer explanation. Excellent idea for another post :-)
<blockquote>Furthermore, if it is based on PR, then they would need to get the PR of their proxy URLs up, which implies they also need a spam network to point at these URLs. In fact, as they are not aware of the links, would they not need to spider their own proxy to find the links to report back to their spam sites?</blockquote>
They can rent pages on high-PageRank websites to achieve this. It is a very well known technique among black hatters.
<blockquote>Also, I assume they also have to cloak to avoid DMCA complaints for these pages? If they were left accessible via a URL then you would still be able to view and report the scraped content…</blockquote>
The problem is that no content is actually being copied.
<blockquote>I am just guessing here but it seems to me that setting up the proxy is the easy part!!</blockquote>
It definitely is.
Heather Paquinas
July 4, 2007 at 7:46 am
Project Honey Pot's http:BL is trying to accomplish this same thing: http://www.google.com/search?&q=site:digg.com web honeypot
Hamlet Batista
July 4, 2007 at 8:27 am
Heather, exactly! It is the same idea, but they would need to detect CGI proxy hijackers instead of comment spammers. Great find!
Web Design Newcastle
September 25, 2007 at 1:52 am
I read about this technique somewhere else yesterday (can't remember where) and to be honest my first thought was the same as Jez's above: surely Google can tell which version it found first? However, after looking into it some more, it does appear to be happening a lot at the moment, and I've read accounts of it being used to wipe out a competing site by simply waiting until the site is hijacked and then removing the content delivered by the proxy. Obviously Google will eventually reinstate the original site, but this could take a while. With the team Google has, this type of problem should be easily wiped out, and I'm confident it will be. The problem is there is always a new threat around the corner.
Web Design Newcastle
September 25, 2007 at 1:56 am
<blockquote>One solution that I've seen in the forums, and that Google recommends, is to verify that an IP claiming to be Googlebot actually is Googlebot.</blockquote>
Oh, and this is a bit of a cop-out by Google. It's their problem to fix, not ours.