To protect some of the inventions in our software, I’ve been working with a law firm that specializes in IP protection. I’ve learned a lot from them, but I’ve learned far more from reviewing the patent applications they sent back to me as possible ‘prior art.’ Let me share one of the most interesting ones I’ve seen so far, Patent Application 20070143283. Here is the abstract:
A system and method for optimizing the rankings of web pages of a commercial website within search engine keyword search results. A proxy website is created based on the content on the commercial website. When a search engine spider reaches the commercial website, the commercial website directs the search engine spider to the proxy website. The proxy website includes a series of proxy web pages that correspond to web pages on the commercial website along with modifications that enhance the rankings of the pages by the search engines. However, hyperlinks containing complex, dynamic URLs are replaced with spider-friendly versions. When a human visitor selects a proxy web page listing on the search engine results page, that visitor is directed to the proxy web page. The proxy server delivers the same content to the human visitor as to the search engine spider, only with simplified URLs for the latter.
Basically, they use a reverse proxy (I wrote about this before) to automatically replace dynamic URLs with search engine–friendly ones. In addition, they make ‘enhancements’ to the proxy version of the pages so that those pages rank higher in the search engines. They claim this is not cloaking:
 The content contained on the proxy web pages is the same when the proxy web page is accessed either by the search engine spider or by the human visitor. The presentation of the same web page content to both the search engine spider and the human visitor allows the proxy website to stay within the ‘no cloaking’ guidelines set by most commonly used search engines.
If they rewrote only the dynamic URLs, I would agree that they present the same content to both users and search engines. As users, I don’t think we care much about URLs unless we have to type them manually into the address bar. However, I do think it is cloaking, because they say changes are made to the pages in order to optimize them for higher search engine rankings, and they only present these optimized pages to the search engine crawlers. From the patent application:
 Since the proxy web pages are contained on a proxy website separate from the commercial website, additional content and HTML optimization can be added to the proxy web pages that are not included on the corresponding web pages on the commercial site, via a web-based interface. The addition of this content and HTML optimization on the proxy web pages can be utilized to enhance the ranking of the proxy web pages on the search engine results pages. The effect of the addition of these optimizations on ranking can be analyzed and the content can then be revised to further enhance the ranking of the proxy web page. By utilizing the proxy web pages rather than the web pages contained on the commercial website, the rankings and functionality of the proxy web pages can be enhanced without altering the commercial web pages.
That being said, I think this is a very useful and clever technique. Rewriting dynamic URLs to make them search engine friendly via a reverse proxy is extremely useful, particularly for large e-commerce sites where the CMS or shopping cart software is not flexible enough.
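To make the URL-rewriting part concrete, here is a minimal sketch in Python of what such a proxy might do to each page it passes through. It is not taken from the patent application. The function names (`friendly_path`, `rewrite_links`), the `/name-value` URL scheme, and the in-memory `reverse_map` are all assumptions of mine for illustration, and the sketch only handles site-relative links in double-quoted `href` attributes.

```python
import re
from html import unescape
from urllib.parse import urlsplit, parse_qsl, quote

# Hypothetical: only matches double-quoted href attributes.
HREF_RE = re.compile(r'href="([^"]+)"')


def friendly_path(url):
    """Turn /product.php?cat=12&id=345 into /product/cat-12/id-345.

    Absolute URLs and URLs without a query string are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.netloc or not parts.query:
        return url
    stem = parts.path.rsplit(".", 1)[0]  # drop the script extension, if any
    pairs = parse_qsl(parts.query)
    return stem + "".join(
        f"/{quote(key, safe='')}-{quote(value, safe='')}" for key, value in pairs
    )


def rewrite_links(page, reverse_map):
    """Replace dynamic hrefs in `page` with friendly ones.

    `reverse_map` is filled with friendly -> original entries so the proxy
    can translate an incoming friendly URL back to the real one.
    """
    def swap(match):
        original = unescape(match.group(1))  # "&amp;" -> "&"
        friendly = friendly_path(original)
        if friendly == original:
            return match.group(0)  # leave untouched links exactly as they were
        reverse_map[friendly] = original
        return f'href="{friendly}"'

    return HREF_RE.sub(swap, page)


reverse_map = {}
sample = '<a href="/product.php?cat=12&amp;id=345">Shoes</a> <a href="/about.html">About</a>'
rewritten = rewrite_links(sample, reverse_map)
```

When a request for a friendly URL arrives, the proxy would look it up in `reverse_map` and fetch the original dynamic URL from the commercial site. A real deployment would need persistent storage for that mapping and a proper HTML parser instead of a regular expression.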
Here is another interesting use that came to mind and is not mentioned in the patent application. (Maybe I should file a patent for this.) 😉
If only the world were flat
Picture a tiered site architecture where you have a home page and tiered internal pages. Tier 1 includes pages that the search engine robots can reach in one click from the home page; tier 2 pages are reachable in two clicks, tier 3 pages in three clicks, and so on. Search engine spiders visit a limited number of pages per site and follow a limited number of clicks from the entrance page (usually the home page). The more clicks it takes to arrive at a page, the less likely that page is to be crawled or indexed. Ideally you would like to have a flat site architecture where all the pages are in tier 1. Unfortunately, while this is good for search engines, it is not very appealing for your site’s visitors. Imagine how crowded your home page would look with so many links!
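Put another way, a page’s tier is its shortest click distance from the home page, which is exactly what a breadth-first search computes. The sketch below assumes the site’s link graph is already available as a dictionary mapping each page to the pages it links to. The `assign_tiers` name and the sample `site` data are hypothetical.

```python
from collections import deque


def assign_tiers(links, home):
    """Return {page: tier}, where tier is the minimum number of clicks from home.

    `links` maps each page URL to the list of URLs it links to.
    The home page itself is tier 0.
    """
    tiers = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in tiers:  # first visit is the shortest path in BFS
                tiers[target] = tiers[page] + 1
                queue.append(target)
    return tiers


# Made-up link graph for illustration only.
site = {
    "/": ["/shoes", "/hats"],
    "/shoes": ["/shoes/running", "/"],
    "/shoes/running": ["/shoes/running/model-x"],
    "/hats": [],
}
tiers = assign_tiers(site, "/")
```

Here `/shoes` and `/hats` land in tier 1, `/shoes/running` in tier 2, and `/shoes/running/model-x` in tier 3, even though the link back to the home page creates a cycle.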
An automatic solution
In the initial step, a simple crawler script visits the whole site and tags each page with its corresponding tier: tier 1, tier 2, tier 3, etc. The script records this information in a database. When a search engine requests a tier 1 page via the reverse proxy, the proxy can inject the URLs of the pages in the next non-direct tier (tier 3, since tier 2 pages are already directly accessible when the robot parses the tier 1 page), and likewise for requests at deeper tiers. This provides a flatter structure for the search engine robot, allowing more pages to be indexed and saving bandwidth and CPU cycles for the search engines’ crawlers. Alternatively, the proxy can inject links to all internal pages beyond the next tier (tier 3, tier 4, etc.) when the search engine robot requests pages on tier 1. This would make the site completely flat.
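Here is a rough sketch of the injection step, assuming the tier data has already been collected and loaded into a plain dictionary instead of a database. The function names, the `flatten_all` switch, and the choice to append the links just before `</body>` are my own assumptions. The sketch deliberately leaves out how the proxy decides that a request comes from a search engine robot.

```python
from html import escape


def links_to_inject(tiers, requested, flatten_all=False):
    """Pick the URLs to add when a robot requests `requested`.

    The page already links to the next tier, so the first tier worth
    injecting is two levels down. With flatten_all=True, every deeper
    tier is included as well.
    """
    first_hidden = tiers[requested] + 2
    if flatten_all:
        chosen = [page for page, tier in tiers.items() if tier >= first_hidden]
    else:
        chosen = [page for page, tier in tiers.items() if tier == first_hidden]
    return sorted(chosen)


def inject(page_html, urls):
    """Insert plain links to `urls` just before </body>, or at the end."""
    if not urls:
        return page_html
    block = "".join(
        f'<a href="{escape(url, quote=True)}">{escape(url)}</a>\n' for url in urls
    )
    if "</body>" in page_html:
        return page_html.replace("</body>", block + "</body>", 1)
    return page_html + block


# Made-up tier data for illustration only.
tiers = {
    "/": 0,
    "/shoes": 1,
    "/hats": 1,
    "/shoes/running": 2,
    "/shoes/running/model-x": 3,
    "/shoes/running/model-x/reviews": 4,
}
next_tier_only = links_to_inject(tiers, "/shoes")
everything_deeper = links_to_inject(tiers, "/shoes", flatten_all=True)
page = inject("<html><body><p>Shoes</p></body></html>", next_tier_only)
```

With `flatten_all=False` the proxy adds only the tier 3 page to the tier 1 page; with `flatten_all=True` it adds tier 3 and tier 4, which corresponds to the completely flat variant.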
This is definitely very useful, but as I explained above, this is cloaking. In my last post about cloaking, Jill Whalen and others expressed concern that Google’s view of this is still negative. It is my personal opinion that Google needs to draw a line between legitimate uses of cloaking and cloaking meant to take advantage of search engines. To stay on the safe side, it is not a bad idea to ask Google if they are OK with this.
After reading an insightful comment from