Google's Architectural Overview (Illustrated): Google's inner workings part 2

by Hamlet Batista | June 22, 2007

For this installment of my Google's inner workings series, I decided to revisit my previous explanation. However, this time I am including some nice illustrations so that both technical and non-technical readers can benefit from the information. At the end of the post, I will provide some practical tips to help you improve your site rankings based on the information given here.

To start the high-level overview, let's see how the crawling (downloading of pages) was described originally.

Google uses a distributed crawling system. That is, several computers/servers running clever software that downloads pages from the entire web.

Step 1

There is a URLserver that sends lists of URLs to be fetched by the Crawlers. These Crawlers download each individual page and send it to a Store Server.

step11.PNG
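To make Step 1 concrete, here is a minimal sketch of a URL server handing batches of URLs to crawlers that deliver pages to a store. The function names, the round-robin split, and `fake_fetch` (a stand-in for a real HTTP download) are my own assumptions for illustration, not Google's actual design.

```python
def url_server(urls, num_crawlers):
    """Split one URL list into round-robin batches, one batch per crawler."""
    batches = [[] for _ in range(num_crawlers)]
    for i, url in enumerate(urls):
        batches[i % num_crawlers].append(url)
    return batches

def crawl(batch, fetch, store):
    """Download every URL in the batch and hand the page to the store."""
    for url in batch:
        store.append((url, fetch(url)))

def fake_fetch(url):
    # Hypothetical stand-in for an HTTP download so the sketch runs offline.
    return "<html>page at " + url + "</html>"

urls = ["http://site.com/a", "http://site.com/b", "http://site.com/c"]
store_server = []  # plays the role of the Store Server
for batch in url_server(urls, num_crawlers=2):
    crawl(batch, fake_fetch, store_server)
```

In a real distributed crawler each batch would go to a different machine; here the batches are simply processed one after another.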

Step 2

The Store Server compresses all web pages and stores them in a Repository, where every page is assigned an ID, the docID.

step21.PNG
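Step 2 can be sketched in a few lines. The `Repository` class and its sequential docIDs are assumptions made for this example, and `zlib` is used here simply because it ships with Python.

```python
import zlib

class Repository:
    """Toy repository: compresses pages and assigns each one a docID."""

    def __init__(self):
        self.pages = {}       # docID -> compressed page bytes
        self.next_doc_id = 1  # assumed: docIDs are handed out sequentially

    def add(self, html):
        doc_id = self.next_doc_id
        self.pages[doc_id] = zlib.compress(html.encode("utf-8"))
        self.next_doc_id += 1
        return doc_id

    def read(self, doc_id):
        # The Indexer would call this to get the uncompressed page back.
        return zlib.decompress(self.pages[doc_id]).decode("utf-8")

repo = Repository()
first = repo.add("<html><body>Hello web</body></html>")
second = repo.add("<html><body>Another page</body></html>")
```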

Step 3

This is the complex, interesting part: the indexing.

The Indexer reads the pages from the Repository, uncompresses them, and extracts key information.

Each document is broken down into a set of word occurrences called word hits.

The word hits record the word, its position in the document, an approximate font size, and capitalization. They are then distributed into a set of "barrels". This is the first step in creating the index. At this point, it is a partially sorted forward index: you can find the word hits by document identifier, but to be useful for search you need the opposite (finding documents by the words they contain). That is why an inverted index is needed.

step31.PNG
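Here is a rough sketch of what a word hit and a forward index could look like. The `Hit` fields follow the description above; the tokenizer, the fixed `font_size` value, and the dictionary layout are simplifying assumptions of mine.

```python
import re
from collections import namedtuple

# One word occurrence: the word, where it appears, and how it is displayed.
Hit = namedtuple("Hit", ["word", "position", "font_size", "capitalized"])

def extract_hits(text, font_size=3):
    """Break plain text into word hits (font size is a fixed placeholder)."""
    hits = []
    for position, token in enumerate(re.findall(r"[A-Za-z0-9]+", text)):
        hits.append(Hit(token.lower(), position, font_size, token[0].isupper()))
    return hits

# Forward index: docID -> hits found in that document.
forward_index = {
    1: extract_hits("Google crawls the web"),
    2: extract_hits("the Web is big"),
}
```

Looking up `forward_index[1]` gives every hit in document 1, but answering "which documents contain web?" would require scanning every document.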

The indexer extracts all the links in every web page and stores important information about them in an Anchors File. This file contains enough information to determine where each link points to and from, and the text of the link.

step41.PNG
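The link extraction can be illustrated with Python's standard `html.parser` module. The `AnchorExtractor` class and the tuple layout (source URL, link target, link text) are assumptions for this sketch; the real Anchors File format is not described here.

```python
from html.parser import HTMLParser

class AnchorExtractor(HTMLParser):
    """Collects (from_url, href, link text) for every link in a page."""

    def __init__(self, from_url):
        super().__init__()
        self.from_url = from_url
        self.anchors = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            self.anchors.append((self.from_url, self._href, text))
            self._href = None

parser = AnchorExtractor("http://site.com/index.html")
parser.feed('<p>See <a href="/page.html">our best page</a> and '
            '<a href="http://other.com/">a friend</a>.</p>')
anchors_file = parser.anchors
```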

Step 4

The Sorter takes the barrels, which were previously sorted by docID, and re-sorts them by word identifier to generate the Inverted Index. Each possible word in the index has a unique identifier, the wordID. The Sorter then produces a list of wordIDs and their locations in the Inverted Index.

step4b.PNG
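The inversion itself is easy to sketch. The tiny `forward_barrel` below and its wordIDs are made-up values; the point is only to show docID-ordered data being regrouped by wordID.

```python
from collections import defaultdict

# Forward barrel: docID -> list of (wordID, position) hits. Values are made up.
forward_barrel = {
    1: [(10, 0), (20, 1)],
    2: [(20, 0), (30, 1), (10, 2)],
}

def invert(forward):
    """Regroup hits by wordID so documents can be found by the words they contain."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward):
        for word_id, position in forward[doc_id]:
            inverted[word_id].append((doc_id, position))
    # Return the entries ordered by wordID.
    return dict(sorted(inverted.items()))

inverted_index = invert(forward_barrel)
```

After this step, `inverted_index[20]` lists every document (and position) where word 20 occurs, which is the lookup a search needs.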

Step 5

The URLresolver reads the Anchors File, converts relative URLs (/page.html) into absolute URLs (http://site.com/page.html), and assigns docIDs. The link text is included in the forward index, associated with the docID that the link points to. The URLresolver also creates a database of links (pairs of docIDs) that is used to compute the PageRank of each document.

step52.PNG
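As a small illustration of Step 5, the sketch below resolves relative URLs with the standard `urllib.parse.urljoin` function, assigns docIDs, and builds the link pairs. The sample anchors, the `doc_id_for` helper, and the sequential docIDs are assumptions for this example.

```python
from urllib.parse import urljoin

# (page the link is on, href as written in the HTML, link text)
anchors = [
    ("http://site.com/index.html", "/page.html", "our best page"),
    ("http://site.com/page.html", "index.html", "home"),
]

doc_ids = {}  # absolute URL -> docID

def doc_id_for(url):
    """Return the docID for a URL, assigning a new one on first sight."""
    if url not in doc_ids:
        doc_ids[url] = len(doc_ids) + 1
    return doc_ids[url]

links = []        # (source docID, target docID) pairs, input for PageRank
anchor_text = {}  # target docID -> texts of the links pointing to it
for from_url, href, text in anchors:
    to_url = urljoin(from_url, href)  # relative -> absolute
    source, target = doc_id_for(from_url), doc_id_for(to_url)
    links.append((source, target))
    anchor_text.setdefault(target, []).append(text)
```

Note that the link text is stored under the docID of the page being linked to, not the page that contains the link.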

 

Step 6

A program called DumpLexicon takes the wordID list produced by the Sorter as well as the Lexicon generated by the Indexer and converts this information into a new lexicon for the Searcher. The Searcher is run by a web server and uses the lexicon built by DumpLexicon together with the Inverted Index and the PageRanks to answer queries.

step61.PNG
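To show how these three pieces fit together at query time, here is a deliberately simplified sketch. The sample data, the `search` function, and the choice to rank only by PageRank are assumptions for illustration; a real ranking combines many more signals.

```python
lexicon = {"google": 10, "web": 20, "crawler": 30}     # word -> wordID
inverted_index = {10: {1, 2}, 20: {1, 2, 3}, 30: {3}}  # wordID -> docIDs
pagerank = {1: 0.2, 2: 0.5, 3: 0.3}                    # docID -> made-up score

def search(query):
    """Return docIDs containing every query word, best PageRank first."""
    doc_sets = []
    for word in query.lower().split():
        word_id = lexicon.get(word)
        if word_id is None:
            return []  # a word that is not in the lexicon matches nothing
        doc_sets.append(inverted_index.get(word_id, set()))
    if not doc_sets:
        return []
    matches = set.intersection(*doc_sets)
    return sorted(matches, key=lambda doc_id: pagerank.get(doc_id, 0.0), reverse=True)
```

With this data, `search("Google web")` returns `[2, 1]`: both documents contain both words, and document 2 has the higher score.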

I really hope that this time it was far easier to understand these concepts; now to the practical applications.

Search engine friendliness tips

The ability of the Indexer to decompose your web pages and find all of the links is paramount if you want your content to be indexed. Make sure your pages' HTML is not broken and that your site isn't inaccessible for long periods of time.

Expect robot hits coming from multiple IPs (thanks to distributed crawling).
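If you want to see this in your own server logs, a quick count of robot hits per IP is enough. The log lines below are invented (using documentation-only IP ranges) and the simplified line format is an assumption; adapt the parsing to your server's real log format.

```python
from collections import Counter

# Invented sample lines: IP, request, user agent.
log_lines = [
    '192.0.2.10 "GET /a.html" "Googlebot/2.1"',
    '192.0.2.11 "GET /b.html" "Googlebot/2.1"',
    '198.51.100.7 "GET /a.html" "Mozilla/5.0"',
    '192.0.2.10 "GET /c.html" "Googlebot/2.1"',
]

def robot_hits_by_ip(lines, agent="Googlebot"):
    """Count requests per IP for lines that mention the given user agent."""
    hits = Counter()
    for line in lines:
        ip, _, rest = line.partition(" ")
        if agent in rest:
            hits[ip] += 1
    return hits

hits = robot_hits_by_ip(log_lines)
```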

On-page optimization tips

I highlighted an important section in the explanation. It is clear from Google's original research paper that they use the position of the word in the document, the font size of the word, and the capitalization information. I am sure a lot of people don't pay attention to the value of putting words in capitals to emphasize them. This is confirmation from the source that it is useful to engage in this practice.

When we go more in depth in the next installments, we will see just how important the presence of the words in the title, URL, etc. actually is.

For now, make sure you use your most important keywords at the top of the page. Use large fonts or headings (h1, h2, etc.) and, if possible, use capitals. However, don't abuse it by being overly aggressive.

Off-page optimization tips

As you can see here, Google was conceived to make heavy use of the text in links as a qualifier for what the linked page is about. Nowadays, Google has very potent filters to avoid unscrupulous manipulation of this key feature. My recommendation is that you try to get links with several different anchor texts; the more mixed the better. Your link profile needs to look really natural to avoid filtering.
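One simple way to check how mixed your anchor texts are is to compute the share of each one among your inbound links. The list below is made-up sample data and `anchor_text_shares` is a hypothetical helper; this only describes your link profile and says nothing about where any filter would kick in.

```python
from collections import Counter

# Made-up anchor texts of links pointing to one page.
inbound_anchor_texts = [
    "blue widgets", "blue widgets", "Acme Widgets", "click here",
    "http://site.com/", "blue widgets",
]

def anchor_text_shares(texts):
    """Return each anchor text's share of all inbound links, most common first."""
    counts = Counter(text.lower() for text in texts)
    total = sum(counts.values())
    return {text: count / total for text, count in counts.most_common()}

shares = anchor_text_shares(inbound_anchor_texts)
```

Here half of the links use the same text ("blue widgets"), which is the kind of concentration worth a second look.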

In our next post, we are going to briefly look at the main software components and take a closer look at the Crawling part.

Hamlet Batista

Chief Executive Officer

Hamlet Batista is CEO and founder of RankSense, an agile SEO platform for online retailers and manufacturers. He holds US patents on innovative SEO technologies, started doing SEO as a successful affiliate marketer back in 2002, and believes great SEO results should not take 6 months
