CGI proxy hijacking appears to be getting worse. I am pretty sure that Google is well aware of it by now, but it seems they have other things higher on their priority list. If you are not familiar with the problem, take a look at these for some background information:
Basically, negative SEOs cause good pages to drop from the search results by pointing CGI proxy servers’ URLs at a victim’s domain and then linking to those URLs. Search engine bots crawl the proxy copies, the duplicate content filters kick in, and the page with the lowest PageRank, inevitably the victim’s, is the one that gets dropped.
As I mentioned in a previous post, this is likely to be an ongoing battle, but that doesn’t mean we have to lie down and do nothing. Existing solutions inject a meta robots noindex tag into every page whenever the visitor is not a search engine, so that search engines won’t index the proxy-hijacked copy. Unfortunately, some proxies alter the content, stripping the tag, before passing it on to the search engine. I am going to present a solution I think can drastically reduce the effectiveness of such attacks.
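To make the existing mitigation concrete, here is a rough sketch of the conditional-noindex idea. The crawler token list and tag text are my own illustrative assumptions, and a real deployment should verify crawlers by DNS rather than trusting the user agent (more on that below):

```python
# Sketch of the existing mitigation: serve a meta robots noindex tag
# to every visitor that does not look like a recognized search engine
# crawler. Token list and tag text are illustrative assumptions.

KNOWN_CRAWLER_TOKENS = ("googlebot", "bingbot", "slurp")

def robots_meta_for(user_agent):
    """Return the meta robots tag to embed in the page for this visitor."""
    ua = (user_agent or "").lower()
    is_crawler = any(token in ua for token in KNOWN_CRAWLER_TOKENS)
    # Crawlers get a normal, indexable page; everyone else (including a
    # CGI proxy fetching the page on a crawler's behalf) gets noindex.
    return "" if is_crawler else '<meta name="robots" content="noindex,nofollow">'

print(robots_meta_for("Mozilla/5.0 (compatible; Googlebot/2.1)"))  # empty string
print(robots_meta_for("Mozilla/5.0 (Windows NT 10.0)"))            # noindex tag
```

The catch, as described above, is that a proxy can simply strip that tag out of the page before the search engine ever sees it.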
Case in point: I got an e-mail last week from an old friend I hadn’t heard from in a while. Richard operates the popular traffic stats web site GoStats and has been having some serious problems fighting CGI hijackers:
…In related Search Engine talk, I did post some information on DP about some anti-proxy hijacking techniques:

Perhaps this can be of interest to your anti-proxy research.

One thing I’m interested in learning more about is dealing with the bad proxies who strip the meta noindex tags and cache the content (or otherwise ignore the 403 Forbidden message). Sending a DMCA fax to Google for all of the offenders seems like a very time-consuming way to deal with a straightforward web-spam problem. What is your thought on that? Do you know of any alternatives to expedite the proxy-hijack problem when proxies are not behaving?
It is interesting that the flaws I mentioned in my comments to Dan Thies’s post are already being exploited. I think this is a good time for me to resume my research and see how I can contribute something useful back to the community.
A solution built on experience
My idea is not 100% bulletproof, but it is definitely a good start. It builds upon the experience that the e-mail anti-spam community has accumulated through years of fighting spam.
First, a little bit of background. Most responsible ISPs implement strong anti-spam filters on their mail servers; if they didn’t, you would receive far more spam than you currently do. One of the techniques used is to query public databases of blacklisted IP addresses. Users who receive spam can report the e-mail to a service such as SpamCop, which extracts the sending IP and submits it for inclusion in a shared blacklist. Anti-spam researchers also place decoy e-mail addresses in public places (known as honeypots) to trap e-mail harvesters; any spam sent to those addresses gets flagged and the source is blacklisted. For efficiency, the blacklist databases are implemented as DNS servers, also known as DNSBLs (DNS Black Lists). A couple of well-known ones used for anti-spam purposes are spamhaus.org and SURBL.
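The DNSBL lookup mechanism itself is very simple: reverse the octets of the IP address, append the list’s DNS zone, and do an ordinary hostname lookup; if an A record comes back, the IP is listed. A minimal Python sketch (the zone name is just an example, and 127.0.0.2 is the conventional test address that most lists keep permanently listed):

```python
import socket

def dnsbl_query_name(ip, zone="zen.spamhaus.org"):
    """Build the hostname used to look up an IPv4 address in a DNSBL:
    the address octets are reversed and prepended to the list's zone."""
    octets = ip.split(".")
    return ".".join(reversed(octets)) + "." + zone

def is_listed(ip, zone="zen.spamhaus.org"):
    """True if the DNSBL returns an A record for the IP (i.e. it is listed)."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False

print(dnsbl_query_name("127.0.0.2"))  # 2.0.0.127.zen.spamhaus.org
```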
Enter the Honeypot
I am glad to report that there is already a project that applies this concept to malicious web traffic: Project Honey Pot’s Http:BL (thanks to my reader Heather Paquinas for the tip). They are not currently targeting CGI proxy spammers, but they have pretty much everything we need to defend ourselves:
Http:BL (DNSBL database). An existing, actively-maintained global database with the IP addresses of malicious web users.
Detection modules and API (Apache module, WordPress plugin, etc.). Code that verifies each visitor against the blacklist database and blocks unwanted traffic to your site.
Honeypot (PHP, ASP, and other scripts). Code that sets up traps to catch e-mail harvesters, comment spammers, etc.
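For reference, an Http:BL answer packs everything into the four octets of the returned address: the first octet is 127 for a valid response, the second is days since the IP was last seen, the third is a threat score, and the fourth is a visitor-type bitmask (0 for search engines; 1, 2, and 4 for suspicious robots, harvesters, and comment spammers). A small decoding sketch based on my reading of the published API; double-check the field layout against their documentation:

```python
def parse_httpbl(response_ip):
    """Decode an Http:BL answer of the form 127.<days>.<threat>.<type>."""
    first, days, threat, visitor_type = (int(o) for o in response_ip.split("."))
    if first != 127:
        raise ValueError("not a valid Http:BL response")
    types = []
    if visitor_type == 0:
        types.append("search engine")   # octet 3 then identifies which engine
    if visitor_type & 1:
        types.append("suspicious")
    if visitor_type & 2:
        types.append("harvester")
    if visitor_type & 4:
        types.append("comment spammer")
    return {"days_since_last_seen": days, "threat_score": threat, "types": types}

print(parse_httpbl("127.3.42.6"))
# {'days_since_last_seen': 3, 'threat_score': 42, 'types': ['harvester', 'comment spammer']}
```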
I have already installed a WordPress plugin on this blog that does the detection, and I added a honeypot as well. To do the same and start blocking malicious traffic, you just need to register a free account, request an access key, and set up the WordPress plugin (or Apache module, etc.). Just blocking the comment spammers is a great help in reducing the spam in the Akismet queue. 🙂
Next, I recommend you install their honeypot on your server. In my case, I only had to copy a PHP script to the root folder of my blog, access it from the Web, and follow a link on the displayed page to activate it. I then placed links to the honeypot page that human visitors cannot see or click. The honeypot page already includes the relevant meta robots directive to prevent indexing, so a real search engine robot would not bother with it. I’d also recommend blocking access to the honeypot page via robots.txt, so that any IP that reaches it anyway can be flagged as suspicious.
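If your honeypot script lives at, say, /honeypot.php (a placeholder name; yours will differ), the robots.txt rule is just:

```
User-agent: *
Disallow: /honeypot.php
```

Well-behaved crawlers will never request the page, so anything that does request it has ignored both the hidden-link convention and robots.txt.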
Using the honeypot to lure CGI proxy hijackers
According to the API, Project Honey Pot currently detects search engines, suspicious robots, e-mail harvesters, and comment spammers. We need to work with them to set up traps that catch CGI proxy hijackers too. My proposal for identifying CGI proxy hijackers is to generate an encoded string in the body of the honeypot page, and later search for that string in the major search engines. Because the honeypot page is not supposed to be indexed (per the meta robots tag and robots.txt instructions), the presence of the string in an index means a CGI proxy relayed the page. Finally, the encoded IP addresses and other information can be recorded in the blacklist and labeled as CGI proxy hijackers using one of the reserved slots shown in the diagram above.
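One possible shape for that encoded string, as a sketch (the token format and helper names here are my own assumptions, not part of any existing API): base32-encode the visitor’s IP and a timestamp, so the text looks like harmless gibberish on the page but is fully decodable if it ever surfaces in a search index.

```python
import base64
import time

def honeypot_token(visitor_ip, now=None):
    """Encode the visitor's IP and a timestamp into an innocuous-looking
    string to embed in the honeypot page body. If the string later shows
    up in a search engine's index, the page was relayed by a proxy, and
    decoding the token recovers the proxy's IP and the time of the fetch."""
    stamp = int(now if now is not None else time.time())
    raw = "{}|{}".format(visitor_ip, stamp).encode()
    return base64.b32encode(raw).decode().rstrip("=").lower()

def decode_token(token):
    """Recover the IP and timestamp from a token found in a search index."""
    padded = token.upper() + "=" * (-len(token) % 8)
    ip, stamp = base64.b32decode(padded).decode().split("|")
    return ip, int(stamp)

token = honeypot_token("203.0.113.9", now=1200000000)
print(token, decode_token(token))
```

Searching the engines for these tokens can be automated, and each hit both proves a hijack and identifies the proxy that performed it.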
If the CGI proxies respect robots.txt and don’t alter the meta robots noindex tag, we can use the solution recommended by Thies. Otherwise, I suggest modifying the detection code (the WordPress plugin in my case) to serve a meta robots noindex tag to any visitor it identifies as not being a search engine.
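Reliably identifying a real search engine is the key step, since a proxy can forge any user agent. The usual technique is a reverse DNS lookup on the visitor’s IP followed by a forward lookup to confirm, so a spoofed PTR record doesn’t pass. A sketch (the crawler domain list is my own assumption; check each engine’s documentation for its official hostnames):

```python
import socket

# Domains that real crawler hostnames resolve under -- an illustrative
# assumption; verify against each search engine's own documentation.
CRAWLER_DOMAINS = (".googlebot.com", ".google.com", ".search.msn.com")

def hostname_is_crawler(host, suffixes=CRAWLER_DOMAINS):
    """Pure check: does a reverse-DNS hostname belong to a crawler domain?"""
    return host.endswith(suffixes)

def is_verified_crawler(ip):
    """Reverse-DNS the IP, check the domain, then confirm with a forward
    lookup so a forged PTR record still fails the test."""
    try:
        host = socket.gethostbyaddr(ip)[0]
        if not hostname_is_crawler(host):
            return False
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False

print(hostname_is_crawler("crawl-66-249-66-1.googlebot.com"))  # True
print(hostname_is_crawler("open-proxy.example.com"))           # False
```

Any visitor that fails this check, whatever its user agent claims, gets the noindex version of the page.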
I wanted to write this post first in as much detail as I could, but I plan to contact the Project Honey Pot team to see if they are interested in adding CGI proxy hijacker detection to their already robust and comprehensive solution. Hopefully this will give the hijackers a really complex obstacle to deal with. 🙂