In computer security we have several ongoing battles: the virus/spyware writers vs. the antivirus vendors, the spammers vs. the anti-spam vendors, the hackers vs. the security experts. Add to that list the search engine marketers vs. the CGI hijackers.
Dan Thies, the undisputed keyword research master, used his influence in the search engine marketing industry to bring a problem we have blogged about in the past to a wider audience: CGI proxy hijacking. He mentioned a couple of solutions, but as I pointed out in my comment, both have weaknesses. I recommended a stronger countermeasure, similar to what the anti-spam industry uses at the moment. After reflecting on my proposed solution and others', though, it is clear to me that this is a never-ending battle: we can create defenses against current techniques, and attackers will adapt and make their attacks smarter.
Why? All the content and headers must pass through the proxy, and the proxy can alter them at will, so a determined hijacker will be able to circumvent any defense we put in place. If we check the HTTP_USER_AGENT, the proxy can provide a fake one to avoid detection. If we alter the content of the page to include a meta robots "noindex" tag, the proxy can remove it. The same happens if we send an X-Robots-Tag header. Every page passes through the proxy, and the proxy controls what comes out the other side.
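To make the point concrete, here is a minimal, hypothetical sketch of the proxy side of this arms race. The function name, the regular expression, and the header handling are all my own illustration, not code from any actual hijacking script; it just shows how trivially a proxy that already has the response in hand can strip each of these defenses before relaying the page:

```python
import re

# Hypothetical sketch: how a hijacking proxy could neutralize the
# defenses described above before relaying a page. The function and
# variable names are mine, for illustration only.
def relay_page(headers: dict, body: str):
    # Checking HTTP_USER_AGENT fails because the proxy fetched the page
    # with a fake, browser-like User-Agent in the first place.

    # An X-Robots-Tag header is simply dropped before relaying.
    headers.pop("X-Robots-Tag", None)

    # A meta robots "noindex" tag is stripped out of the HTML.
    body = re.sub(r'<meta\s+name=["\']robots["\'][^>]*>', "",
                  body, flags=re.IGNORECASE)

    return headers, body
```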
The solution I proposed requires more work from the hijacker to beat, but it can definitely be broken too. It relies on altering the content, and the proxy can identify that altered content and remove it, which would make collecting the proxies' IPs impossible. For example, to tell what has changed, the hijacker's code can compare the content to the copy cached by the search engine or, even better, serve the content directly from the search engine's cache.
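To see why, consider this hypothetical sketch (the function name and the crude line-level diff are my assumptions, not a description of any real tool): the proxy diffs the live page against the cached copy and drops whatever does not appear in the cache, which is exactly where any injected token would live.

```python
import difflib

# Hypothetical sketch: strip injected markup by diffing the live page
# against the search engine's cached copy of it.
def strip_injected_content(live_html: str, cached_html: str) -> str:
    live_lines = live_html.splitlines()
    cached_lines = cached_html.splitlines()
    matcher = difflib.SequenceMatcher(None, cached_lines, live_lines)

    kept = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # Keep only live lines that also appear in the cached copy;
        # anything added since the cache was taken (e.g., a unique
        # token meant to log the proxy's IP) is presumed injected
        # and dropped.
        if tag == "equal":
            kept.extend(live_lines[j1:j2])
    return "\n".join(kept)
```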
Dan is confident that most attacks will come not from modified proxies, but from hijackers using other people's unmodified proxies; they would avoid installing the proxies themselves so they cannot be identified. The problem is that serious hackers rarely use their own systems; they use compromised ones. They first break into servers where the administrator has not installed the latest security patches, or where web applications have exploitable holes.
In principle, we need to apply the same concept used in security in general: make it hard enough for the attacker that the reward isn't worth the effort involved. Sometimes this is easier said than done.
The bottom line is that we can go back and forth battling CGI hijackers, but it is ultimately Google that needs to fix this problem. They need to change the method they use to determine the original source of a piece of content. I proposed a solution to them in another post, and I'd appreciate your feedback.