Like Flies to Project Honeypot: Revisiting the CGI proxy hijack problem

Hamlet Batista

Chief Executive Officer

Hamlet Batista is CEO and founder of RankSense, an agile SEO platform for online retailers and manufacturers. He holds US patents on innovative SEO technologies, started doing SEO as a successful affiliate marketer back in 2002, and believes great SEO results should not take 6 months

Try our SEO automation tool for free!

RankSense automatically creates search snippets using advanced natural language generation. Get your free trial today.

Getting Started with NLP and Python for SEO [Webinar]

Custom Python scripts are much more customizable than Excel spreadsheets. This is good news for SEOs — this can lead to optimization opportunities and low-hanging fruit. One way you can use Python to uncover these opportunities is by pairing it with natural language processing. This way, you can match how your audience searches with your...

READ POST

Making it easier to implement SEO changes on your website

Changes to the RankSense SEO rules interface

As we continue to improve the RankSense app for Cloudflare, we are always working to make the app more intuitive and easy to use. I'm pleased to share that we have made significant changes to our SEO rules interface in the settings tab of our app. It is now easier to publish multiple rules sheets and to see which changes have not yet been published to production.

READ POST

How to Find Content Gaps at Scale: Atrapalo vs Skyscanner

For the following Ranksense Webinar, we were joined by Antoine Eripret, who works at Liligo as an SEO lead. Liligo.com is a travel search engine which instantly searches all available flight, bus and train prices on an exhaustive number of travel sites such as online travel agencies, major and low-cost airlines and tour-operators. In this...

READ POST

Dan Thies

October 11, 2007 at 3:03 pm

Thanks yet again, Hamlet... I'm gonna throw that HTTPBL plugin up and see how much stuff it catches.
Hamlet Batista

October 11, 2007 at 3:13 pm

Good, Dan. Please let me know your results.
egorych

October 12, 2007 at 4:33 am

Nice idea. But may be it's too late? Webmasters report that hijacking proxies were banned in Google. May be Google already has a right solution? Or you think hijacking will never die, it can be only implemented in different ways? Anyhow thanks for information. It's very interesting.
Hamlet Batista

October 12, 2007 at 5:21 am

It seems it is not a complete solution. See IncrediBILL comments <a href="http://www.webmasterworld.com/google/3473585.htm" rel="nofollow">http://www.webmasterworld.com/google/3473585.htm</a> I am glad they finally decided to address the problem, though.
Dan Thies

October 12, 2007 at 5:33 am

I haven't seen any real evidence that Google did anything. A couple folks reported that they no longer see proxy copies of their own sites. But it's easy enough to check the # of php-proxy,cgi-proxy,nph-proxy etc. duplicates in the index, and those are just the obvious EASY ones to remove.
David Hopkins

October 12, 2007 at 8:05 am

Is it possible to use a htaccess solution rather than a PHP one for this problem? A lot of people have low grade coded websites that don't have any ability to set anything globally. In these cases it would be better to use a the server to block.
Hamlet Batista

October 12, 2007 at 9:05 am

David - the project honeypot http:BL has something they call quicklinks for people that don't have access to scripting functionality on their websites. They would need at least the Apache module installed for detection, though.
Richard Chmura

October 13, 2007 at 2:40 am

From my end I've seen a brief period of what seemed to be "testing" at google where the proxies were gone. I think "G" is up to something here - but not ready for a full production solution. Even still - despite some overall imminent victories from G against proxy-hijackers, it is always important to protect your content from other scrapers - and even future revisions of proxy-hijacking. I propose an additional step in dealing with this scourge: content stamping and source stamping. This stamp would be inserted into the content as a unique string of characters unique to each of your documents. A second part of the string would be an IP address encoding (either in decimal notation, in hex, or in any encrypted form for added effectiveness.) This would allow quick discovery of which content in the index is being copied and also give you an idea about what IP address was used to copy the content (or even tell you where this site got your content from) Here's an example of the content stamp: (first is the unique content ID, the second is the hex format of the IP address) 3P1I4E159 7F000001 -ensure that the content id tag is unique for each page or resource for your site. -of course you can make the unique content id longer to decrease the effectiveness when searching. -This is also very effective when copying sites insert random words or parts of words to break up your normal sentences. (I've seen "th" used as a random injection into various sentences to avert manual duplicate content detection) -You can also prepend the content stamp with a sitewide content stamp. This will allow you to search google for [your-global-content-stamp -site <a href="http://:http://yoursite.com]" rel="nofollow">:http://yoursite.com]</a> Making a google alert (or alert for any other service) regarding copies of your content will be swift and easy. Another thing I should mention: -You can use .htaccess blocking for sites without php. Just use a cron job (or scheduled task) to rebuild your .htaccess file regularly from your block list source. This is assuming that you can download the BL & get regular updates. -You can also use a captcha for forbidden requests so that real humans can access your site in the event of false positives. (You may want to "allow all" to the captcha page.
Richard Chmura

October 13, 2007 at 2:50 am

Small typo oops: "-of course you can make the unique content id longer to decrease the effectiveness when searching." should read: "-of course you can make the unique content id longer to increase the effectiveness when searching." Also, here's an example of an extended content stamp. (or content-id tag) 3243F6A8885A308D31319 3P1I4E159 7F000001 1 site wide content tag, 2 unique resource tag, 3 hex code for IP address of client requesting your resource. 1 is constant across your domain 2 is unique to each page/resource 3 is unique to each client who requests your content Find copies of your site by searching: -site:http:/yourdomain.com 3243F6A8885A308D31319 Find copies of your resource by searching: -site:http:/yourdomain.com 3243F6A8885A308D31319 3P1I4E159 (or remove the global content tag portion to broaden your results) Then WHOIS the associated IP address to determine where the content was picked up from. Blacklist if necessary.
Hamlet Batista

October 13, 2007 at 6:17 am

Hi Richard, Thanks for stopping by. <blockquote>I propose an additional step in dealing with this scourge: content stamping and source stamping. </blockquote> Sorry, I didn't make it clearer, but what you describe is the purpose of the encoded text I mentioned in my recommendation. Thanks for making it clearer, though. The honeypot does exactly this. The honeypot generates email addresses with encoded IP, timestamp, etc. The harvesters collect the addresses and when the spam hits, the email address has all the identifying information. They use base64 encoding, which makes the strings a little bit shorter than simply hexing the characters. <blockquote> My proposal to identify the CGI proxy hijackers is to generate an encoded text in the body of the honeypot page, and later perform searches for this text in major search engines to see if results come up. As the honeypot page is not supposed to be indexed (per the meta robots tag and robots.txt instructions), the presence of the text in the index means a CGI proxy is responsible. Finally, the encoded IP addresses and other information can be recorded in the blacklist and labeled as CGI proxy hijackers using one of the reserved slots shown in the diagram above. </blockquote> Your content stamping idea is really interesting, but the encoded text needs to be included in all pages. Many site owners would not like to display such text in their copy. I'd personally prefer to have that encoded text on the honeypot page that is not accessible by regular readers. I believe the CGI proxy hijackers 'copy' all the pages, so we can do the detection in one page to catch them. I guess you could hide the encoded text too, but there will be the concern of being penalized. <blockquote> Another thing I should mention: -You can use .htaccess blocking for sites without php. Just use a cron job (or scheduled task) to rebuild your .htaccess file regularly from your block list source. This is assuming that you can download the BL & get regular updates. -You can also use a captcha for forbidden requests so that real humans can access your site in the event of false positives. (You may want to “allow all” to the captcha page. </blockquote> These are excellent ideas. Thanks for sharing! Alternatively, they could rebuild the .htaccess on their PC and upload it to the server if they don't have access to scripting (the cron job needs to run a script).
Richard Chmura

October 13, 2007 at 11:46 am

Hi Hamlet, Yes, sorry I got a little carried away when detailing my explanation. ;) But here is the reason why I think content stamping on legitimate pages (non-honey pot) is also important: proxy-hijacking techniques will evolve - so will web scraping. What may be simple today, will be a totally different beast down the line. Honey pots will make a great positive identification of the unsophisticated proxies. However, as they change their tactics, locating the source of scraped content may become not so simple (or not so easy to black list). It is even possible that embedding scraping or proxy collecting software in malware could open up content to legitimate clients facilitating the theft. At that point the best defense is an aggressive offense of locating all content that has been copied. (And having a stamp in the public content too will help greatly) - Perhaps it could be worked into a copyright statement like: "copyright 2007 example.com 3243F6A8885A308D31319 3P1I4E159 7F000001" - or similar.
5ubliminal

October 16, 2007 at 1:49 am

My solution is for them coders out there with access to their site code ;) Cheers!
Richard Chmura

October 16, 2007 at 9:15 am

I see some ways in which Google is fixing this problem. However, it's the false positives which exascerbate this problem.
Jakki Degg

October 19, 2007 at 6:02 pm

Hey!...Thanks for the nice read, keep up the interesting posts..what a nice Friday
Hot Trends for Black Hat SEO at busin3ss’s black hat seo blog

March 10, 2008 at 11:48 pm

[...] but I’ve seen this technique being used more and more each day. Read more about it here and here. Basically you use proxy sites to generate duplicated content of a site… And Google is not [...]

Hamlet Batista

Dan Thies

Hamlet Batista

egorych

Hamlet Batista

Dan Thies

David Hopkins

Hamlet Batista

Richard Chmura

Richard Chmura

Hamlet Batista

Richard Chmura

5ubliminal

Richard Chmura

Jakki Degg

Hot Trends for Black Hat SEO at busin3ss’s black hat seo blog

Try our SEO automation tool for free!

OUR BLOG

Latest news and tactics

Getting Started with NLP and Python for SEO [Webinar]

Making it easier to implement SEO changes on your website

Changes to the RankSense SEO rules interface

How to Find Content Gaps at Scale: Atrapalo vs Skyscanner

Request your SEO Monitoring Invitation

Like Flies to Project Honeypot: Revisiting the CGI proxy hijack problem

Hamlet Batista

Dan Thies

Hamlet Batista

egorych

Hamlet Batista

Dan Thies

David Hopkins

Hamlet Batista

Richard Chmura

Richard Chmura

Hamlet Batista

Richard Chmura

5ubliminal

Richard Chmura

Jakki Degg

Hot Trends for Black Hat SEO at busin3ss’s black hat seo blog

Try our SEO automation tool for free!

OUR BLOG

Latest news and tactics

Getting Started with NLP and Python for SEO [Webinar]

Making it easier to implement SEO changes on your website

Changes to the RankSense SEO rules interface

How to Find Content Gaps at Scale: Atrapalo vs Skyscanner