The Fascinating Role of AI in Accelerating SEO Success

In this post I’m going to officially introduce RankSense: its rebirth as an SEO Artificial Intelligence platform, why I built it, and the problems it aims to address.

I believe that SEO success should take a few weeks, not six months or more. Unfortunately, six months or more is the reality we live in now. Developing new content, building an audience around it, and waiting for Google to reward your efforts does take many months. But is that the only way to see SEO success?

When you operate a large site, such as an online retail or enterprise site with thousands or millions of pages, you have several factors that greatly delay SEO results. On the flip side, you also have other key SEO levers to pull besides content marketing. I covered one of those in detail in the popular Practical Ecommerce article “How to Detect and Correct Duplicate Content Pages.” Technical SEO remains a massive untapped opportunity for large sites. 

In order to understand the current challenges, I’ll go through the best-case scenario I saw while running my SEO agency. The best-case scenario is an SEO-savvy company that understands that pages need to be indexed before they rank. This was not typically the case because most companies’ understanding of SEO is limited to ranking #1 for competitive head terms.

Here is the scenario:

An SEO consultant or in-house SEO starts with a Screaming Frog, Moz or Deep Crawl crawl of the site, which can take several days or weeks, depending on how big the site is.

Timeline - crawl site

Our SEO specialist takes the reports from those fine tools and writes a detailed list of action items, which are converted to JIRA tickets. The technical tickets get assigned to developers, and the content/writing tickets get assigned to marketing people or are performed directly by the SEO specialist.

Timeline - action items

Developers are generally busy with other projects that get higher priority because their ROI is clear and certain. SEO projects often don’t get prioritized. I believe most companies only invest “gambling” money on SEO: money that could have a big payoff but that they aren’t afraid to lose.

Developers publish changes according to their software development lifecycles and the complexity of the fixes. In some cases, the fixes or recommendations are impossible to implement because of limitations in the content management or e-commerce system.

SEO timeline - developers stage site

Later, the changes get published to a staging environment, and the SEO consultant or in-house person reviews them before they are released to production.

SEO timeline - review and release

The changes are now live, and we need to wait weeks for Google to pick up the changes.

So, in this best-case scenario, where the company is SEO-savvy, the developers are available, the teams execute with precision, and the recommendations work, we have to wait around two months to see results.

Now, imagine the improvements added $5,000 per day in organic revenue. If we could accelerate the work by one month, we would save $150,000 in opportunity cost, money that could be invested in other marketing efforts.
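The arithmetic behind that figure is simple enough to write down. This is only a worked example using the numbers above; the function name is made up for illustration, and a month is assumed to be 30 days.

```python
def opportunity_cost(daily_revenue_gain, days_saved):
    """Revenue forgone while an improvement waits to go live."""
    return daily_revenue_gain * days_saved


# Figures from the example above: $5,000 per day, released 30 days sooner.
print(opportunity_cost(5_000, 30))  # 150000
```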

Accelerated results would also allow us to determine the effectiveness and ROI of the SEO strategy faster and more accurately.

There has to be a better way. Right?

That is what I thought.

I designed our artificial intelligence SEO software with the main goal of compressing time in the SEO cycle. This goal was anchored on three core principles:

  1. Tasks should be done in parallel.
  2. Repetitive, high-value tasks should be automated.
  3. Processes should be improved by a data-driven feedback loop.

Parallelizing Tasks

In software engineering, we have the well-known concepts of batch and stream processing. Batch processing is essentially how SEO work is done now. All the tasks are completed in large chunks which must be performed sequentially because they depend on one another. This is the obvious choice when you have people performing the work. Each person specializes in a different type of task. One specialist, say a web developer, needs to wait on the completed work from another, such as an in-house SEO.

However, when you have machines doing the work, you can tell them to complete small portions of the work and collaborate to do independent tasks in parallel.

Let me explain how RankSense completes the workflow I described above in parallel.

The first breakthrough idea is that we don’t need to simulate a spider crawl to audit sites like all other SEO tools do. We piggyback on the crawls from search engine crawlers that take place all the time. Think about that for a minute.

Google, Bing and other search engines are actively crawling sites all the time, and if we could tap into those crawls to perform our audits, we get at least 3 benefits:

  1. We would get an accurate picture of SEO issues as seen by search engines.
  2. We would not slow down sites with extra crawls or consume bandwidth.
  3. We would detect SEO issues in real time as they are discovered by search engines. (This is my favorite because it aligns with my goal of compressing time.)
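To make the idea concrete, here is a minimal Python sketch of auditing from crawler traffic instead of running a separate crawl. It is not RankSense's implementation. It assumes the web server writes access logs in the common "combined" format, and it identifies bots by a user-agent token only, which can be spoofed, so a real system would also need to verify the crawler. The function and constant names are invented for the example.

```python
import re

# Assumed log layout: the widely used "combined" access log format.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)
BOT_TOKENS = ("Googlebot", "bingbot")


def crawler_issues(lines):
    """Yield (bot, path, status) for crawler requests that did not return 200."""
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # not a combined-format line; skip it
        agent = match["agent"].lower()
        bot = next((t for t in BOT_TOKENS if t.lower() in agent), None)
        if bot is not None and match["status"] != "200":
            yield bot, match["path"], int(match["status"])


sample = [
    '66.249.66.1 - - [04/Jan/2017:10:00:00 +0000] "GET /old-page HTTP/1.1" '
    '404 0 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '203.0.113.9 - - [04/Jan/2017:10:00:01 +0000] "GET /old-page HTTP/1.1" '
    '404 0 "-" "Mozilla/5.0"',
]
for issue in crawler_issues(sample):
    print(issue)  # ('Googlebot', '/old-page', 404)
```

Because each log line is processed as it arrives, a 404 or redirect served to a crawler shows up as soon as the crawler hits it, rather than at the next scheduled audit.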

That is exactly what RankSense does, and it works incredibly well. 

When you are manually addressing SEO issues, getting real-time reports is not very helpful because there is still the bottleneck of the manual processes.

Now, I’ll need to introduce the next breakthrough that enabled our automated implementation.

AI Workflows

I wrote my first AI system during my first programming class back in college many years ago. We were tasked with writing a Connect 4 game where the computer would play against the human player. I wrote about this a while ago in this post. My classmates designed their games by writing down a set of rules the machine would follow, but I wrote mine so the computer would look ahead to find the best possible moves. Computer chess programs work in a similar way. The advantage of a game is that you have complete information, but in many domains, like SEO, that is not the case. For our first generation, we are focusing on a rules-based approach, and using machine learning to rank which workflows (sets of rules) work best over time.

A large number of high-value technical SEO fixes are repetitive. For example, many duplicate content issues can be addressed the same way using canonicals or 301 redirects. The big challenge is generalizing, since most sites are completely unique.

I borrowed the solution and our next breakthrough from two industries: marketing automation and cyber security. In most marketing automation tools, custom workflows are heavily used to personalize e-mail marketing campaigns. In the security industry, antivirus and firewall vendors use small “fingerprints” or signatures to detect viruses or hacking attacks. I combined both ideas: we treat SEO issues like viruses with “signatures” that are consistent across sites, and we use customizable workflows to guide precise recommended actions. As my goal is to compress time, we now have a growing list of predefined workflows to address the most common SEO problems and opportunities.
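Here is a small Python sketch of the "signature plus workflow" idea. It is an illustration under my own assumptions, not our production code: the Page fields, the two signatures, and the action strings are all hypothetical, and a real workflow would execute a fix rather than return a description of one.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Page:
    url: str
    status: int
    canonical: Optional[str]  # href of the canonical tag, if any


@dataclass
class Signature:
    name: str
    matches: Callable[[Page], bool]  # the "fingerprint" of the issue
    action: str                      # what the workflow would do about it


# Hypothetical signatures; a real list would be much longer.
SIGNATURES = [
    Signature(
        "missing-canonical",
        lambda p: p.status == 200 and p.canonical is None,
        "add a self-referencing canonical",
    ),
    Signature(
        "tracking-parameter-duplicate",
        lambda p: "?utm_" in p.url or "&utm_" in p.url,
        "canonicalize to the URL without tracking parameters",
    ),
]


def diagnose(page):
    """Return (signature name, recommended action) for every match."""
    return [(s.name, s.action) for s in SIGNATURES if s.matches(page)]


page = Page("https://example.com/shoes?utm_source=mail", 200, None)
for name, action in diagnose(page):
    print(name, "->", action)
```

Keeping the detection rule and the action as separate fields is what lets a specialist swap in their own action without touching the detection side.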

If our client’s SEO specialist doesn’t agree with how we are prescribing solutions, they can create their own workflow with a visual drag and drop interface.

Here is our automation designer. We are still working to make it more intuitive to use, but it works really well.

RankSense Automation Designer

Pause for a moment to think about the implications of these two key components of our system. We can audit sites in real time, and as the issues are identified, we can execute predetermined fixes in an instant. Just think about how much valuable time this saves. Massive. Right?

Let me explain the third, and most important breakthrough in our system. While saving time is valuable, it is hard to sell a solution based on opportunity cost alone. Nobody wakes up thinking they want to save opportunity cost. The reason people invest in SEO or SEO tools is because they’re trying to increase business with more traffic. So, a second key goal for me is that our recommendations actually work and increase revenue – every time. But, how do you do that when Google is a moving target, constantly changing its algorithm?

Change Validation

In order to appreciate our third breakthrough idea, imagine that you could:

  1. Make quick changes to your site, say title changes,
  2. Learn when those changes are picked up by Google and other search engines, and
  3. See the positive or negative impact of those changes directly inside your analytics package.

This would allow you to see the traffic before and after the change and the revenue impact. You’d be able to determine the effectiveness of the change pretty easily. Now, imagine if information about effective changes came not only from your own site, but from all the sites in the network. The crowd-sourced information would help prioritize changes with a proven track record of success. Plus, it would help you avoid changes that have a track record of failures. That would be super powerful. Right?
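The before-and-after comparison can be sketched in a few lines of Python. This is a deliberately naive version under stated assumptions: daily organic visits are available as a date-to-count mapping, the date the search engine picked up the change is known, and the window length of 14 days is an arbitrary default. It ignores seasonality and other changes made in the same period, so treat the output as a first signal, not proof.

```python
from datetime import date
from statistics import mean


def before_after(daily_visits, picked_up, window=14):
    """Compare average daily visits in the `window` days before and after
    the day a change was picked up. Returns None without data on both sides."""
    before = [v for d, v in daily_visits.items()
              if 0 < (picked_up - d).days <= window]
    after = [v for d, v in daily_visits.items()
             if 0 <= (d - picked_up).days < window]
    if not before or not after:
        return None
    b, a = mean(before), mean(after)
    change = (a - b) / b * 100 if b else None
    return {"before": b, "after": a, "change_pct": change}


visits = {
    date(2017, 1, 1): 100, date(2017, 1, 2): 100, date(2017, 1, 3): 100,
    date(2017, 1, 4): 150, date(2017, 1, 5): 150, date(2017, 1, 6): 150,
}
print(before_after(visits, picked_up=date(2017, 1, 4), window=3))
```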

Well, this is our third and most valuable breakthrough. 

RankSense sequences

Now, let me show you the kind of amazing results that we’ve seen from combining these three ideas. This client implemented RankSense on January 4. Their organic rankings improved for significantly more keywords with RankSense than with manual efforts.

Ranking improvements

We dramatically increased the number of pages receiving traffic (blue line, below), while maintaining and growing the number of visitors per page. This has resulted in more than $400k in additional revenue in less than 4 months. I’ll cover the details of what worked so well for them in a separate post.

Traffic increase with RankSense

Recently, SEO teams such as Etsy’s have popularized an interesting approach to validate the SEO impact of site changes. It consists of treating SEO changes as scientific experiments. We started playing with this idea early this year. We had some great success, so we added the capability to run SEO experiments to our platform. We are specifically focusing on search snippet experiments because we see the biggest promise there. You never know what messaging will resonate best with search visitors, so it’s beneficial to be able to test different messaging. We have found that there’s massive opportunity in treating organic search snippets like ads.
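One building block of such experiments is splitting similar pages into a control group and a variant group in a stable way. The Python sketch below shows one common way to do that, by hashing the URL together with an experiment name. It is an assumption about how one could do it, not a description of how Etsy or our platform does it, and the function and experiment names are invented.

```python
import hashlib


def bucket(url, experiment, variant_share=0.5):
    """Deterministically assign a URL to 'variant' or 'control'.

    The same URL always lands in the same group for a given experiment,
    and different experiments shuffle the pages differently.
    """
    digest = hashlib.sha256(f"{experiment}:{url}".encode("utf-8")).digest()
    fraction = int.from_bytes(digest[:8], "big") / 2**64  # in [0, 1)
    return "variant" if fraction < variant_share else "control"


for path in ("/shoes/red", "/shoes/blue", "/shoes/green"):
    print(path, bucket("https://example.com" + path, "title-test-1"))
```

Only the pages in the variant group get the new snippet, and the control group provides the baseline to compare against.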

Traffic increase

SEO A/B testing is the main focus of similar tools from our friends at Distilled and from a Y Combinator startup that recently launched. We don’t make it our main focus because our mission is to accelerate results; change validation is important, but it is only one piece of the puzzle.

It looks like we have good timing because many companies now know all the SEO work they need to do, but they are having difficulty getting it done fast.

Technical Integration

I had a fun sales conversation a few weeks back where the prospect didn’t believe one bit of what our platform does, even while looking at a live demo. He told me that 301 redirects take a week or more to get implemented by their SI (systems integrator), and forget about URL rewrites – they are not possible. How can we address SEO issues or add new SEO features if the e-commerce platform doesn’t support them?

We came up with a solution for this in 2010 while working at Altruik. We simply reverse proxied the whole site, and made the changes on the fly on the reverse proxy. Before that, to my knowledge, most reverse proxies improving SEO were limited to a directory or subdomain. The adoption challenge we faced back then was that we added around 500ms to the page load time, which is too much.
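To give a feel for what "making the changes on the fly" means, here is a minimal Python sketch of a single rewrite a proxy could apply to an HTML response before passing it on: replacing the canonical tag. It is not the Altruik or RankSense code. It uses regular expressions for brevity, which is fragile on unusual markup, and the function name is hypothetical.

```python
import re
from html import escape

HEAD_OPEN = re.compile(r"<head\b[^>]*>", re.IGNORECASE)
CANONICAL = re.compile(
    r"<link\b[^>]*\brel=[\"']?canonical[\"']?[^>]*>\s*", re.IGNORECASE
)


def inject_canonical(html, canonical_url):
    """Drop any existing canonical tag and add one right after <head>."""
    tag = f'<link rel="canonical" href="{escape(canonical_url, quote=True)}">'
    without_old = CANONICAL.sub("", html)
    head = HEAD_OPEN.search(without_old)
    if head is None:
        return html  # no <head> found: pass the response through untouched
    return without_old[:head.end()] + tag + without_old[head.end():]


page = ('<html><head><title>Shoes</title>'
        '<link rel="canonical" href="https://example.com/old"></head>'
        '<body></body></html>')
print(inject_canonical(page, "https://example.com/shoes"))
```

The origin server never changes; only the response that passes through the proxy does, which is why the limitations of the e-commerce platform stop mattering.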

Content delivery networks like Cloudflare that make on the fly security changes solve this problem by focusing on site speed. They offload page resources (which take more of the page load time) across a distributed network of caches, which leads to faster sites. The challenge is that building a CDN from scratch takes serious capital investment.

Google solved this problem for us beautifully with the introduction of Google Cloud CDN, which is directly integrated into HTTP(S) load balancers and has high-speed connections to the major CDN providers (Akamai, Cloudflare, etc.). You can learn more about the advantages of Google Cloud CDN here.

The combination of Kubernetes and Google Cloud CDN allows us to deliver our SEO optimizations while speeding up our clients’ sites instead of slowing them down. Our on-the-fly changes take around 10 milliseconds, but the speed gains more than make up for that. We cache all page resources on the CDN by default, and our infrastructure scales automatically thanks to Kubernetes. It also helps that most clients and potential clients are moving to the cloud.

This is a third-party CDN performance report from Cedexis, where Google Cloud CDN is outperforming the leaders.

Using this system, we can implement RankSense for new clients very fast, with only two DNS changes on their end:

  1. (Optional) One record to generate the Comodo SSL certificate, and
  2. A DNS change to activate our software.

Some of our clients that already have CDNs like Akamai, Cloudflare, etc., enjoy the fast interconnect partnership those CDNs have with Google Cloud.

We cache all page resources on their CDNs, and optimize their sites. The integration remains very simple.

Now, while our clients get a CDN with our service, we don’t see ourselves as a CDN provider. The CDN is one of the delivery mechanisms. For example, we have direct API integration to Akamai to automate provisioning.

Because we run our infrastructure in Kubernetes, we can deliver our software on most cloud providers and on premises.

This novel approach of improving websites on the edge of the CDNs will open up all sorts of interesting applications outside of SEO and cyber security. We actually have a licensing partner that will leverage our technology to address the quality of content at very large enterprises (a partnership we’re very excited about).

Who will benefit from automated SEO?

Sites with many pages will see the most benefit from the platform – the more pages, the better. We’ve focused primarily on e-commerce retailers (and needed to obtain PCI compliance to do so). If you operate an e-commerce or enterprise site, and you’d like to learn more about how your company could benefit from our platform, please feel free to sign up for a trial.

We also offer a white label version of the platform to agencies serving enterprise and e-commerce clients, but most agencies with or without in-house SEO expertise could benefit from our software. Please feel free to reach out to me directly for agency partnerships.

Additional Ways to Use Chrome Developer Tools for SEO

I recently read Aleyda’s excellent post about using Chrome Developer Tools for SEO, and as I’m also a big fan of DevTools, I am going to share my own use cases.

Misplaced SEO tags

If you are reviewing correct SEO tag implementations by only using View Source, or running an SEO spider, you might be overlooking an important and interesting issue.

I call this issue misplaced DOM tags, and here is one example

If you check this page using View Source in Chrome or any other browser, you would see the canonical tag correctly placed inside the <HEAD> HTML element.

Similarly, if you check this page using your favorite SEO spider, you’d arrive at the same conclusion. The canonical tag is inside the <HEAD> HTML element, where it should be.

Now, let’s check again using the Chrome Developer Tools Elements tab.

Wait! What?? Surprisingly, the canonical tag appears inside the <BODY> HTML element. This is incorrect, and if this is what Googlebot sees, the canonical tag on this page is effectively useless. Then we go blaming the poor tag saying that it doesn’t work.

Is this a bug in Google Chrome Developer Tools? Let’s review the same page with Firefox and Safari Developer Tools.

You can see the same issue is visible in Firefox and Safari too, so we can safely conclude that it is not a problem with Developer Tools. It is very unlikely all of them would have the same bug. So why is this happening? Does The Home Depot need to fix this?

Let’s first look at how to fix this to understand why it happens.

We are going to save a local copy of this page using the popular command line tool curl. I will explain why it is better to use this tool than to save directly from Chrome.

Once we download the web page, we open it in any of the browsers to confirm the problem is still visible in DevTools. In my case, I didn’t see the issue in Chrome, but I saw it in Safari. I’ll revisit the reason for the discrepancy when we discuss why this happens.

Next, in order to correct the issue we will move the SEO meta tags so they are the first tags right after the opening <HEAD> HTML tag.

Now, let’s reload the page in Safari to see if the canonical still shows up inside the <BODY> HTML tag.

Bingo! We have the canonical correctly placed, and visible inside the HTML <HEAD>.

In order to understand why this addresses the issue, we need to understand a key difference between checking pages with View Source, and inside the Elements tab in the web browsers’ DevTools.

The Elements tab has a handy feature that allows you to expand and collapse parent and child elements in the DOM tree of the page. In order for this feature to work, the web browser needs to parse the page and build the tree that will represent the DOM. A common issue with HTML is that it often contains markup errors or tags placed in the wrong places.

For example, if we check the page using

You can see this page “only” has 61 HTML coding errors and warnings.

Fortunately, web browsers expect errors and automatically compensate for them using a process called HTML linting or tidying. A popular tool that does this is HTML Tidy, by Dave Raggett at W3C.

The tidying process works by adding missing closing tags, reordering tags, etc. This works flawlessly most of the time, but it can often fail and tags end up in the wrong places. This is precisely what is happening here.

Understanding this allowed me to come up with the lazy trick to move the SEO tags to the beginning of the head, because this essentially bypasses any problems introduced by other tags. 🙂

A more “professional” solution is to at least fix all the errors reported between the HTML <HEAD> tags.
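If you want to check many pages for this problem without opening DevTools each time, a rough approximation is possible with Python's standard library. The sketch below is a heuristic, not a browser-grade parser: it assumes that the first tag that is not allowed inside the head is what pushes everything after it into the body, and it reports a canonical tag that comes after such a tag. The class and function names are made up, and markup inside a noscript element can cause false alarms.

```python
from html.parser import HTMLParser

# Elements that may appear before or inside the head without ending it.
HEAD_ONLY = {"html", "head", "title", "base", "link", "meta",
             "style", "script", "noscript", "template"}


class HeadAudit(HTMLParser):
    def __init__(self):
        super().__init__()
        self.breaker = None                   # first tag not allowed in head
        self.canonical_after_breaker = False  # canonical found after it

    def handle_starttag(self, tag, attrs):
        rel = (dict(attrs).get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel:
            if self.breaker is not None:
                self.canonical_after_breaker = True
        elif tag not in HEAD_ONLY and self.breaker is None:
            self.breaker = tag


def canonical_pushed_out_by(html):
    """Return the tag that likely pushes the canonical into the body, or None."""
    audit = HeadAudit()
    audit.feed(html)
    audit.close()
    return audit.breaker if audit.canonical_after_breaker else None


bad = ('<html><head><title>T</title><div class="promo"></div>'
       '<link rel="canonical" href="https://example.com/"></head>'
       '<body></body></html>')
print(canonical_pushed_out_by(bad))  # div
```

Moving the canonical to the top of the head makes the function return None for the same page, which matches the lazy fix described above.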

Can we tell if this is affecting Googlebot or not?

It is fair to assume that, as Google is now able to execute JavaScript, Google’s indexing systems need to build DOM trees just like the main browsers do. So, I would not ignore or overlook this issue.

A simple litmus test to see if the misplaced canonicals are being ignored is to check whether the target page is reporting duplicate titles and/or duplicate meta descriptions in Google Search Console, or not. If it is reporting duplicates, correct the issue as I explained here, use Fetch as Googlebot, and re-submit the page to the index. Then wait and see if the duplicates clear.


Following redirect chains

Another useful use case is reviewing automatic redirects from desktop to mobile-optimized websites, or from http to https or vice versa, directly in your browser.

In order to complete the next steps, you need to customize DevTools a little bit.

  1. Tick the checkbox that says “Preserve Log” in the Network tab so the log entries don’t get cleared up by the redirects
  2. Right-click on the headers of the Network tab, and select these additional headers: Scheme, Vary, and optionally Protocol to see if the resources are using the newer HTTP/2 protocol

In this example, we opened, and you can see we are 301 redirected to, from secure to non-secure, and we can also see that the page provides a Vary header with the value User-Agent. Google recommends the use of this header with this value to tell Googlebot to try refetching the page but with a mobile user agent. We are going to do just that, but within Chrome using the mobile emulation feature.

Before we do that, it is a good idea to clear the site cookies because some sites set “desktop sticky” cookies that prevent the mobile emulation from working after you have opened the site as a desktop user.

Let’s clear the network activity log and get ready to refresh as a mobile user. Remember that we will open the desktop URL to see the redirection.

In this case you can see that Macys correctly 302 redirects to the mobile site at, which is consistent with Google’s recommendation.
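The same check can be scripted when you need to review more than a handful of URLs. The Python sketch below follows a redirect chain one hop at a time and records the status code, URL, and Vary header of each hop. It uses only the standard library; the function names are my own, the user-agent string is whatever you pass in, and the fetch function is replaceable so the logic can be tried without touching the network.

```python
import urllib.error
import urllib.request
from urllib.parse import urljoin


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # do not follow; urllib then raises HTTPError for 3xx


def fetch(url, user_agent):
    """Request one URL without following redirects: (status, headers)."""
    opener = urllib.request.build_opener(_NoRedirect)
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with opener.open(request, timeout=10) as response:
            return response.status, response.headers
    except urllib.error.HTTPError as err:
        return err.code, err.headers  # 3xx, 4xx and 5xx all end up here


def redirect_chain(url, user_agent, fetch=fetch, max_hops=10):
    """Return a list of (status, url, vary_header) for each hop."""
    hops = []
    for _ in range(max_hops):
        status, headers = fetch(url, user_agent)
        hops.append((status, url, headers.get("Vary")))
        location = headers.get("Location")
        if status not in (301, 302, 303, 307, 308) or not location:
            break
        url = urljoin(url, location)  # Location may be relative
    return hops


# Usage (makes real requests, so it is left commented out):
# for hop in redirect_chain("http://example.com/", "my-mobile-user-agent"):
#     print(hop)
```

Running it once with a desktop user agent and once with a mobile one shows whether the two chains differ, which is the script equivalent of the mobile emulation step.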


Sneaky affiliate backlinks

As Aleyda mentioned in her post, we can use DevTools to find hidden text, and some really sneaky spam. Let me share with you a super clever link building trick I discovered a while ago while auditing the links of a client’s competitor. I used our free Chrome DevTools extension as it eliminates most of the manual checks. You can get it from here.  

To most of you, and to most Googlers, this looks like a regular backlink and it doesn’t raise any red flags. The anchor text is “here”, and it is directly in the editorial content like most editorial links. However, coming from an affiliate marketing background, I see the extra tracking parameters can be effectively used to track any sales that come from that link.

I’m not saying they are doing this, but it is relatively easy to convince many unsophisticated bloggers to write about your product, and place affiliate links like this back to your site to get compensated for sales they generated. Sales you would track directly in Google Analytics, and maybe even provide reporting by pulling stats via the GA API.

Now, the clever part is this one: they are likely setting up these tracking parameters in Google Search Console so Googlebot ignores them completely, and it is normal to expect utm_ parameters to be ignored. This trick effectively turns these affiliate links into SEO endorsement links. This is one of the stealthiest affiliate + SEO backlink tricks I’ve seen in many years reviewing backlink profiles!
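To see why this matters, consider what happens to such a link once the tracking parameters are dropped. The short Python sketch below strips utm_ parameters from a URL. It only illustrates the normalization; it makes no claim about how Google processes these parameters internally, and the function name and example URL are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_tracking(url, prefixes=("utm_",)):
    """Return the URL without query parameters that start with a tracking prefix."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if not key.lower().startswith(prefixes)]
    return urlunsplit(parts._replace(query=urlencode(kept)))


link = "https://example.com/product?id=7&utm_source=blog&utm_medium=affiliate"
print(strip_tracking(link))  # https://example.com/product?id=7
```

Once the parameters are ignored, the affiliate link and a plain editorial link point to exactly the same URL.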


Troubleshooting page speed issues

Let’s switch gears a bit and discuss page speed from an implementation review perspective.

Let’s review another example to learn how well the website server software or CDN handles caching page resources. Caching page resources in the client browser or CDN layer offers an obvious way to improve page load time. However, web server software needs to be properly configured to handle this correctly.

If a page has been visited before, and the page resources are cached, Chrome sends conditional web requests to avoid refetching them each time.

You can see page resources already cached by looking for ones with the status code 304, which means that they haven’t changed on the server. The web server only sends headers in this case, saving valuable bandwidth and page load time.

The conditional requests are controlled by the If-Modified-Since request header (or by If-None-Match when the server uses ETags). When you tick the option in DevTools to disable the cache, Chrome doesn’t send these extra headers, and you won’t see any 304 status code in the responses.
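The server side of this exchange is easy to sketch. The Python function below decides between a 200 and a 304 based on the If-Modified-Since header. It is a simplified illustration, not the logic of any particular web server: it handles only date-based validation, expects the resource timestamp as a timezone-aware UTC datetime, and the function name is invented.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def respond(request_headers, last_modified):
    """Return (status, headers) for a resource last changed at `last_modified`."""
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    since = request_headers.get("If-Modified-Since")
    if since:
        try:
            cached_at = parsedate_to_datetime(since)
        except (TypeError, ValueError):
            cached_at = None  # unparseable date: fall back to a full response
        if cached_at is not None:
            if cached_at.tzinfo is None:
                cached_at = cached_at.replace(tzinfo=timezone.utc)
            # HTTP dates have one-second resolution.
            if last_modified.replace(microsecond=0) <= cached_at:
                return 304, headers  # headers only, no body
    return 200, headers


changed = datetime(2017, 1, 4, 10, 0, 0, tzinfo=timezone.utc)
print(respond({}, changed)[0])
print(respond({"If-Modified-Since": "Wed, 04 Jan 2017 10:00:00 GMT"}, changed)[0])
```

The first call has no conditional header and gets the full response; the second presents a copy that is still current and gets the 304.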

This is particularly handy to help troubleshoot page resource changes that users report are not visible.

Finally, it is generally hard to reproduce individual users’ performance problems because there are way too many factors that impact page load time outside of just the coding of the web page.

One way to easily reproduce performance problems is to have users preserve the network log and export the entries in the log as a HAR file. You can learn more about this in this video from Google Developers.

Google provides a web tool you can use to review HAR files you receive from users here. Make sure to warn users about saving potentially sensitive information in this file.
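A HAR file is plain JSON, so you can also inspect it with a few lines of Python. This sketch lists the slowest requests in a HAR export. It relies only on fields defined by the HAR format (log.entries, and each entry's time, request.url and response.status); the function name and the sample data are made up.

```python
import json


def slowest_requests(har_text, top=5):
    """Return (time_ms, status, url) for the slowest entries in a HAR export."""
    entries = json.loads(har_text)["log"]["entries"]
    rows = [(e["time"], e["response"]["status"], e["request"]["url"])
            for e in entries]
    return sorted(rows, key=lambda row: row[0], reverse=True)[:top]


# A tiny hand-written HAR fragment standing in for a user's export.
sample = json.dumps({"log": {"entries": [
    {"time": 120.5, "request": {"url": "https://example.com/app.js"},
     "response": {"status": 200}},
    {"time": 950.0, "request": {"url": "https://example.com/hero.jpg"},
     "response": {"status": 200}},
    {"time": 30.0, "request": {"url": "https://example.com/app.css"},
     "response": {"status": 304}},
]}})
for row in slowest_requests(sample, top=2):
    print(row)
```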

Bonus: Find mixed http and https content quickly

Aleyda mentioned using DevTools to check for mixed http and https content raising warnings in your browser. Here is a shortcut to identify the problematic resources quickly.

You can type “mixed-content:displayed” in the filter to get the resources without https.
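If you need to run a similar check outside the browser, for example across a list of saved pages, a rough static version can be written in Python. This sketch is an approximation of what the DevTools filter shows, under my own assumptions: it looks only at src attributes and stylesheet links in the HTML, so it misses resources loaded by scripts or CSS, and the class and function names are invented.

```python
from html.parser import HTMLParser


class MixedContentFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.insecure = []  # (tag, url) pairs loaded over plain http

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        url = attributes.get("src")
        rel = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "stylesheet" in rel:
            url = attributes.get("href")
        if url and url.lower().startswith("http://"):
            self.insecure.append((tag, url))


def find_mixed_content(html):
    """Return subresources an https page would load over http."""
    finder = MixedContentFinder()
    finder.feed(html)
    finder.close()
    return finder.insecure


page = ('<html><head><link rel="stylesheet" href="http://cdn.example.com/a.css">'
        '</head><body><img src="https://example.com/ok.png">'
        '<img src="http://example.com/old.png"></body></html>')
print(find_mixed_content(page))
```

Ordinary links to http pages are not reported, because only resources the page itself loads count as mixed content.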

If you are not actively using DevTools in your SEO audits, I hope these extra tips encourage you to get started. And, if you are, please feel free to share any cool tips you might have discovered yourself.

How to Get Googlebot to “Teach You” Advanced SEO

I recently worked on an enterprise-level client’s non-SEO related project where the goal was to confirm or deny that their new product:

1)  Was not doing anything that could be considered black hat.

2)  Was providing any SEO benefit for their clients.

The problem you face with projects like this is that Google doesn’t provide enough information, and you cannot post corner-case questions like this in public Webmaster forums. To do so would violate your NDA, and potentially reveal your client’s intellectual property. So, what option do you have left? Well, you set up a honeypot!

A honeypot is a term that comes from the information security industry. Honeypots are a set of files that, to an automated program, appear like regular files, but they allow for the monitoring and “capturing” of specific viruses, e-mail harvesters, etc. In our case, we set up a honeypot with the purpose of detecting and tracking search engine bot behavior in specific circumstances. We also wanted to track the outcome (positive, neutral or negative) in the search engine results pages (SERPs).

Let me walk you through a few ways you can learn advanced SEO by using a honeypot.

How to Think Like an SEO Expert

If you want to become an expert, you need to start thinking like one. People perceive you as an authority in your field not because you claim to be one, but because of what you say and what you write. From my personal experience, the key seems to be the originality, usefulness and depth of what you have to share. Recently I was very honored to contribute to a link-building project. I want to share my idea with you, but more than that, in this blog I like to take extra time to explain the original thought process that helped me come up with the idea in the first place.

The Challenge

Toolbar PageRank was a very important factor in measuring the quality of a link for a long while. But Google has played so much with it that it can hardly be considered reliable these days. I like to see problems like these as challenges and opportunities, so I decided to look hard for alternatives. I know there are several other methods (like using the Yahoo backlink count, number of indexed pages, etc.) but I did not feel these directly reflected how the link was important to Google, or to any other specific search engine. Each search engine has its own evaluation criteria when it comes to links, so using metrics from one to measure another is not a reliable gauge in my opinion.

I knew the answer was out there, and I knew just where to look.

PageRank: Caught in the paid-link crossfire

Last week the blogosphere was abuzz when Google decided to ‘update’ the PageRank numbers they display on the toolbar. It seems Google has made good on its threat to demote sites engaged in buying and selling links for search rankings. The problem is that they caught some innocent ones in the crossfire. A couple of days later, they corrected their mistake, and those sites are now back to where they were supposed to be.

The incident reveals that there is a lot of misunderstanding about PageRank, both inside and outside the SEO community. For example, Forbes reporter Andy Greenberg writes:

On Thursday, Web site administrators for major sites including the, Techcrunch, and Engadget (as well as found that their “pagerank”–a number that typically reflects the ranking of a site in Google

He also quotes Barry Schwartz saying:

But Schwartz says he knows better. “Typically what Google shows in the toolbar is not what they use behind the scenes,” he says. “For about two and a half years now this number has had very little to do with search results.”

There are two mistakes in these assertions:

  • The toolbar PageRank does not reflect the ranking of a site in Google. It reflects Google’s perceived ‘importance’ of the site.

  • The toolbar PageRank is an approximation of the real PageRank Google uses behind the scenes. Google doesn’t update the toolbar PageRank as often as they update the real thing, but saying that it has little to do with search results is a little farfetched.

Several sites lost PageRank, but they did not experience a drop in search referrals. Link buyers and sellers use toolbar PageRank as a measure of the value of a site’s links. By reducing this perceived value, Google is clearly sending a message about paid links. The drop is clearly intended to discourage such deals.

Some ask why Google doesn’t simply remove the toolbar PageRank altogether so that buyers and sellers won’t have a currency to trade with. At first glance it seems like a good idea, but here is the catch—the toolbar PageRank is just a means of enticing users to activate the surveillance component that Google uses to study online behavior. Google probably has several reasons for doing so, but at minimum it helps measure the quality of search results and improve its algorithms. If Google were to remove the toolbar PageRank users would have no incentive to let Google ‘spy’ on their online activities.

Like Flies to Project Honeypot: Revisiting the CGI proxy hijack problem

CGI proxy hijacking appears to be getting worse. I am pretty sure that Google is well aware of it by now, but it seems they have other things higher on their priority list. If you are not familiar with the problem, take a look at these for some background information:

  1. Dan Thies’s take and proposed solutions

  2. My take and proposed solutions

Basically negative SEOs are causing good pages to drop from the search engine results by pointing CGI proxy servers’ URLs to a victim’s domain, and then linking to those URLs so that search engine bots find them and the duplicate content filters drop one of the pages—inevitably the one with the lowest PageRank, the victim’s page.

As I mentioned in a previous post, it is very likely that this would be an ongoing battle, but that doesn’t mean we have to lie down and do nothing. Existing solutions require the injection of a meta robots noindex tag on all web pages if the visitor is not a search engine. In this way search engines won’t index the proxy-hijacked page. Unfortunately, the proxies are already altering the content before passing it to the search engine. I am going to present a solution I think can drastically reduce the effectiveness of such attacks.

Google's Architectural Overview (Illustrated) — Google's inner workings part 2

For this installment of my Google's inner workings series, I decided to revisit my previous explanation. However, this time I am including some nice illustrations so that both technical and non-technical readers can benefit from the information. At the end of the post, I will provide some practical tips to help you improve your site rankings based on the information given here.

To start the high level overview, let's see how the crawling (downloading of pages) was described originally.

Google uses a distributed crawling system. That is, several computers/servers with clever software that download pages from the entire web

What is the practical benefit of learning Google's internals?

I forgot to start my Google inner workings series with WIIFM. My plan is to write one post each week.

No matter how well I try to explain it, it is a complex subject. I should have started the first post by explaining why you would want to learn it. There are a lot of easier things to read. With some people questioning the usefulness of SEO, this is a good time to make my views clear. Please note that I believe in a solid marketing mix that includes SEO, PPC, SMO, affiliate marketing, viral marketing, etc. Do not put all your eggs in one basket.

If you have been blogging for a while, you have probably noticed that you are getting hits from the search engines for words that you did not try to optimize. For example, the day after I started this blog, I received a comment from a reader that found my blog through a blog search! How was this possible?

Heather Paquinas May 26th, 2007 at 1:24 am

I found your blog in google blogsearch. Needless to say I subscribed right away after reading this. I always suspected what you said, especially after Mike Levin from hittail blogged about using hittail for ppc, but you really hit the nail on the head with this post.

This is possible because that is the job of the search engines! If every page you search had to be optimized, there wouldn't be billions of pages in Google's index. It would take a lot of people to do the SEO work :-).

Why do we need SEO then?

Google's architectural overview — an introduction to Google's inner workings

Google keeps tweaking its search engine, and now it is more important than ever to better understand its inner workings.

Google lured Mr. Manber from Amazon last year. When he arrived and began to look inside the company’s black boxes, he says, that he was surprised that Google’s methods were so far ahead of those of academic researchers and corporate rivals.

While Google closely guards its secret sauce, for many obvious reasons, it is possible to build a pretty solid picture of Google's engine. In order to do this we are going to start by carefully dissecting Google's original engine: How Google was conceived back in 1998. Although a newborn baby, it had all the basic elements it needed to survive in the web world.

Determining searcher intent automatically

Here is an example of how useful it is to learn SEO from research papers.

If you’ve read some of my previous posts, you will know that I am a big fan of finding out what exactly search visitors want. I posted about classifying both visitors and landing pages, so that search visitors looking for information find information articles, searchers looking to take action land on transaction pages, etc.

I really like the research tools MSN Labs has. One of my favorites is this

You can use it to detect commercial intent. Try it. It is really nice.

I’ve been wanting to do something like that, but I didn’t have enough clues as to how to do it. Until now.

Search engine patent expert Bill Slawski uncovered a gem: a research paper that details how a team of researchers achieved exactly this.

I still need to dig deep into the document and the reference material, but it is definitely an excellent find.

I will try to make a new tool for this. I will also try to make this and other scripts I write, more accessible to non-technical readers. I guess most readers don’t care much about the programming details. They just want to be able to use my tools easily 🙂