Like Flies to Project Honeypot: Revisiting the CGI proxy hijack problem

by Hamlet Batista | October 11, 2007 | 16 Comments

CGI proxy hijacking appears to be getting worse. I am pretty sure that Google is well aware of it by now, but it seems they have other things higher on their priority list. If you are not familiar with the problem, take a look at these for some background information:

  1. Dan Thies’s take and proposed solutions

  2. My take and proposed solutions

Basically, negative SEOs are causing good pages to drop from the search engine results by pointing CGI proxy servers’ URLs at a victim’s domain and then linking to those URLs so that search engine bots find them. The duplicate content filters then drop one of the pages, inevitably the one with the lowest PageRank: the victim’s page.

As I mentioned in a previous post, it is very likely that this will be an ongoing battle, but that doesn’t mean we have to lie down and do nothing. Existing solutions inject a meta robots noindex tag into every page served to a visitor that is not a search engine, so that search engines won’t index the proxy-hijacked copy. Unfortunately, some proxies are already altering the content before passing it to the search engine. I am going to present a solution that I think can drastically reduce the effectiveness of such attacks.

Case in point: I got an e-mail last week from an old friend I hadn’t heard from in a while. Richard operates the popular traffic stats web site GoStats and has been having some serious problems fighting CGI hijackers:

…In related Search Engine talk, I did post some information on DP about
some anti-proxy hijacking techniques:
http://forums.digitalpoint.com/showthread.php?t=499858

Perhaps this can be of interest to your anti-proxy research.
One thing I’m interested in learning more about is dealing with the bad
proxies who strip the meta noindex tags and cache the content (or
otherwise ignore the 403 forbidden message). Sending a DMCA fax to
Google for all of the offenders seems like a very time consuming way to
deal with a straight forward web-spam problem. What is your though[t] on
that? Do you know of any alternatives to expedite the proxy-hijack
problem when proxies are not behaving?
Regards,
Richard

It is interesting that the flaws I mentioned in my comments to Dan Thies’s post are already being exploited. I think this is a good time for me to resume my research and see how I can contribute something useful back to the community.

A solution built on experience

My idea is not 100% bulletproof, but it is definitely a good start. It builds upon the experience that the e-mail anti-spam community has accumulated through years of fighting spam.

First, a little bit of background. Most responsible ISPs implement strong anti-spam filters on their mail servers; if they didn’t, you would receive far more spam than you currently do. One of the techniques used is to query public databases of blacklisted IP addresses. Users who receive spam can report the e-mail to a service such as SpamCop, which identifies the sending IP address and submits it for inclusion in a shared blacklist. Anti-spam researchers also place decoy e-mail addresses in public places (known as honeypots) to trap e-mail harvesters; any spam that arrives at those addresses gets flagged and its source is blacklisted. For efficiency, the blacklist databases are served over DNS and are known as DNSBLs (DNS blacklists). A couple of well-known ones used for anti-spam purposes are spamhaus.org and SURBL.
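To make the DNSBL idea concrete, here is a minimal Python sketch of how a client typically asks such a list about an IPv4 address: the octets are reversed, the list's zone is appended, and the resulting hostname is resolved. The zone `dnsbl.example.org` is a placeholder I made up for illustration, and treating "any answer" as "listed" is a simplification; real lists document their own zones and return codes.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the hostname a DNSBL client resolves for an IPv4 address.

    The octets are reversed and the blacklist zone is appended, so
    192.0.2.10 checked against zone Z becomes 10.2.0.192.Z
    """
    octets = str(ipaddress.IPv4Address(ip)).split(".")  # also validates the IP
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the zone returns any A record for the address.

    Simplification: a lookup failure of any kind (including a resolver
    problem) is treated as "not listed".
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False


if __name__ == "__main__":
    # "dnsbl.example.org" is a hypothetical zone, used only to show the format.
    print(dnsbl_query_name("192.0.2.10", "dnsbl.example.org"))
```

Because the lookup is an ordinary DNS query, answers are cached by resolvers along the way, which is a large part of why this design is efficient.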

Enter the Honeypot

I am glad to report that there is already a project that uses this concept to protect against malicious web traffic: Project Honey Pot’s Http:BL (thanks to my reader Heather Paquinas for the tip). They are not currently targeting CGI proxy spammers, but they have pretty much everything we need to defend ourselves:

  1. Http:BL (DNSBL database). An existing, actively-maintained global database with the IP addresses of malicious web users.

  2. Detection modules and API (Apache module, WordPress plugin, etc.). Code that verifies each visitor against the blacklist database and blocks unwanted traffic to your site.

  3. Honeypot (PHP, ASP, and other scripts). Code that sets up traps to catch e-mail harvesters, comment spammers, etc.

I already installed a WordPress plugin on this blog that does the detection; I also added a honeypot. To do so yourself and start blocking malicious traffic, you just need to register a free account, request an access key and set up your WordPress plugin (or Apache module, etc.). Just blocking the comment spammers is a great help in reducing the spam in the Akismet queue. 🙂
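For readers who want to see what such a detection module does under the hood, here is a rough Python sketch based on my reading of the Http:BL API documentation. It assumes that the query name is your access key, the reversed IP octets and the zone `dnsbl.httpbl.org`, and that the answer is a pseudo-address whose four octets are 127, days since last activity, threat score and visitor type. The access key shown in the usage note and all function names are hypothetical; please check the official documentation before relying on these details.

```python
from typing import List, NamedTuple, Optional


class HttpBLResult(NamedTuple):
    days_since_last_activity: int
    threat_score: int
    visitor_type: int


def httpbl_query_name(access_key: str, ip: str) -> str:
    """Build the DNS name to resolve: <key>.<reversed octets>.dnsbl.httpbl.org"""
    reversed_ip = ".".join(reversed(ip.split(".")))
    return f"{access_key}.{reversed_ip}.dnsbl.httpbl.org"


def parse_httpbl_answer(answer: str) -> Optional[HttpBLResult]:
    """Split an answer such as '127.3.25.5' into its fields.

    Returns None if the answer does not look like an Http:BL response.
    """
    try:
        parts = [int(p) for p in answer.split(".")]
    except ValueError:
        return None
    if len(parts) != 4 or parts[0] != 127:
        return None
    return HttpBLResult(parts[1], parts[2], parts[3])


def visitor_labels(result: HttpBLResult) -> List[str]:
    """Translate the visitor-type octet (a bitmask, by assumption) into labels."""
    if result.visitor_type == 0:
        return ["search engine"]
    labels = []
    if result.visitor_type & 1:
        labels.append("suspicious")
    if result.visitor_type & 2:
        labels.append("harvester")
    if result.visitor_type & 4:
        labels.append("comment spammer")
    return labels
```

With a made-up key, `httpbl_query_name("abcdefghijkl", "192.0.2.10")` gives `abcdefghijkl.10.2.0.192.dnsbl.httpbl.org`, and an answer of `127.3.25.5` would be read as a visitor last seen 3 days ago with threat score 25 that is both suspicious and a comment spammer.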

Next, I recommend you install their honeypot on your server. In my case, I only had to copy a PHP script to the root folder of my blog, access it from the Web and follow a link on the displayed page to activate it. Then I placed links to the page that are not clickable (or visible) for my users. I checked the honeypot page and it includes the relevant meta robots directive to prevent crawling or indexing, so a real search engine robot would not bother with such a page. I’d also recommend you block access to your honeypot page via robots.txt, so that any IP that gets to the page can be flagged as suspicious.
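As an illustration of the robots.txt step, the short Python sketch below builds a rule that disallows the honeypot page and then uses the standard library's robots.txt parser to confirm that a well-behaved crawler would stay away from it while still being allowed on the rest of the site. The path `/honeypot.php` and the domain `example.com` are placeholders; use whatever name your own honeypot script has.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical location of the honeypot script on the site.
HONEYPOT_PATH = "/honeypot.php"

# The rule to add to robots.txt: every compliant robot must skip the honeypot.
ROBOTS_TXT = f"User-agent: *\nDisallow: {HONEYPOT_PATH}\n"


def allowed(user_agent: str, url: str) -> bool:
    """Check a URL against the rules above, as a compliant crawler would."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(ROBOTS_TXT)
    print(allowed("Googlebot", "https://example.com/honeypot.php"))
    print(allowed("Googlebot", "https://example.com/"))
```

The point of the rule is the one made above: since compliant robots never request the page, any client that does request it is either ignoring robots.txt or fetching on behalf of someone else, and is worth flagging.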

Using the honeypot to lure CGI proxy hijackers

According to the API, Project Honey Pot currently detects search engines, suspicious robots, e-mail harvesters and comment spammers. We need to work with them to set up traps that catch CGI proxy hijackers too. My proposal to identify the CGI proxy hijackers is to generate a unique piece of encoded text in the body of the honeypot page on each request, encoding the requesting IP address and other information, and later search for this text in the major search engines to see if any results come up. As the honeypot page is not supposed to be indexed (per the meta robots tag and robots.txt instructions), the presence of the text in the index means a CGI proxy is responsible. Finally, the IP addresses and other information decoded from the text can be recorded in the blacklist and labeled as CGI proxy hijackers using one of the reserved slots shown in the diagram above.
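Here is one way the encoded text could work, as a Python sketch rather than anything Project Honey Pot actually implements. It packs the requesting IPv4 address and a Unix timestamp into a single lowercase alphanumeric word (so a search engine treats it as one searchable term) and adds a short HMAC so that tokens cannot be forged to frame innocent IP addresses. The secret, the `hpx` prefix and the function names are all my own assumptions.

```python
import base64
import hashlib
import hmac
import ipaddress
import struct
from typing import Optional, Tuple

SECRET = b"change-me"  # hypothetical signing key, kept private on the server
PREFIX = "hpx"         # hypothetical marker that makes tokens easy to spot
SIG_LEN = 7            # 8 payload bytes + 7 signature bytes = 15 bytes,
                       # which base32-encodes to 24 characters with no padding


def encode_token(ip: str, timestamp: int) -> str:
    """Encode the visitor's IPv4 address and request time into one word."""
    payload = ipaddress.IPv4Address(ip).packed + struct.pack(">I", timestamp)
    sig = hmac.new(SECRET, payload, hashlib.sha256).digest()[:SIG_LEN]
    return PREFIX + base64.b32encode(payload + sig).decode("ascii").lower()


def decode_token(token: str) -> Optional[Tuple[str, int]]:
    """Recover (ip, timestamp) from a token found in a search result.

    Returns None if the token is malformed or its signature does not match.
    """
    if not token.startswith(PREFIX):
        return None
    try:
        raw = base64.b32decode(token[len(PREFIX):], casefold=True)
    except ValueError:  # binascii.Error is a subclass of ValueError
        return None
    if len(raw) != 8 + SIG_LEN:
        return None
    payload, sig = raw[:8], raw[8:]
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()[:SIG_LEN]
    if not hmac.compare_digest(sig, expected):
        return None
    (timestamp,) = struct.unpack(">I", payload[4:])
    return str(ipaddress.IPv4Address(payload[:4])), timestamp
```

The honeypot page would print `encode_token(visitor_ip, now)` somewhere in its body. If that word later shows up in a search engine's index, `decode_token` tells you which IP address fetched the page and when, and that address is the proxy.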

If the CGI proxies respect robots.txt and don’t alter the meta robots noindex tag, we use the solution recommended by Thies. Otherwise, I suggest the detection code (the WordPress plugin in my case) be modified to add a meta robots noindex tag to the pages served to any visitor that the code does not identify as a search engine.
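The modification I have in mind is small. The Python sketch below shows the core of it under simple assumptions: the detection code has already decided whether the visitor is a search engine (for example, from the blacklist lookup), and the page has an ordinary closing `</head>` tag. The function name is hypothetical, and a real WordPress plugin would do this in PHP through the theme's head hook rather than by editing the HTML string.

```python
NOINDEX_TAG = '<meta name="robots" content="noindex">'


def add_noindex_unless_search_engine(html: str, is_search_engine: bool) -> str:
    """Insert a robots noindex tag before </head> for non-search-engine visitors.

    Pages served to visitors identified as search engines are left untouched,
    so the original URL can still be indexed normally.
    """
    if is_search_engine:
        return html
    head_end = html.lower().find("</head>")
    if head_end == -1:
        return html  # no head section found; leave the page as it is
    return html[:head_end] + NOINDEX_TAG + html[head_end:]
```

A proxy fetching the page is served the copy with the tag, so whatever it relays to a search engine should be ignored, unless the proxy strips the tag, which is exactly the case the honeypot is meant to catch.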

I wanted to write this post first in as much detail as I could, but I plan to contact Project Honey Pot to see if they are interested in adding CGI proxy hijacker detection to their already robust and comprehensive solution. Hopefully this will give the hijackers a really complex obstacle to deal with. 🙂

Hamlet Batista

Chief Executive Officer

Hamlet Batista is CEO and founder of RankSense, an agile SEO platform for online retailers and manufacturers. He holds US patents on innovative SEO technologies, started doing SEO as a successful affiliate marketer back in 2002, and believes great SEO results should not take 6 months
