Anatomy of a Distributed Web Spider — Google's inner workings part 3

by Hamlet Batista | July 06, 2007

What can you do to make life easier for those search engine crawlers? Let's pick up where we left off in our inner workings of Google series. I am going to give a brief overview of how distributed crawling works. This topic is useful but can be a bit geeky, so I'm going to offer a prize at the end of the post. Keep reading; I am sure you will like it. (Spoiler: it's a very useful script.)

 

4.3 Crawling the Web (excerpted from Brin and Page's original paper, "The Anatomy of a Large-Scale Hypertextual Web Search Engine")

Running a web crawler is a challenging task. There are tricky performance and reliability issues and even more importantly, there are social issues. Crawling is the most fragile application since it involves interacting with hundreds of thousands of web servers and various name servers which are all beyond the control of the system.

In order to scale to hundreds of millions of web pages, Google has a fast distributed crawling system. A single URLserver serves lists of URLs to a number of crawlers (we typically ran about 3). Both the URLserver and the crawlers are implemented in Python. Each crawler keeps roughly 300 connections open at once. This is necessary to retrieve web pages at a fast enough pace. At peak speeds, the system can crawl over 100 web pages per second using four crawlers. This amounts to roughly 600K per second of data. A major performance stress is DNS lookup. Each crawler maintains its own DNS cache so it does not need to do a DNS lookup before crawling each document. Each of the hundreds of connections can be in a number of different states: looking up DNS, connecting to host, sending request, and receiving response. These factors make the crawler a complex component of the system. It uses asynchronous IO to manage events, and a number of queues to move page fetches from state to state.

If that was when they started, can you imagine how massive it is now?
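The details have surely changed since then, but the two ideas that matter in that excerpt, many simultaneous connections and a local DNS cache, are easy to picture. Here is a minimal, illustrative Python sketch of them. To be clear, this is my own toy code, not anything from Google: a thread pool stands in for the paper's asynchronous IO, and the cache is just a dictionary.

import socket
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

dns_cache = {}  # hostname -> IP address, looked up only once per host

def resolve(host):
    # A per-crawler DNS cache: repeat visits to the same host skip the lookup.
    if host not in dns_cache:
        dns_cache[host] = socket.gethostbyname(host)
    return dns_cache[host]

def fetch(url):
    host = urlparse(url).hostname
    try:
        resolve(host)  # illustrates the cache idea; urlopen still does its own lookup
        with urllib.request.urlopen(url, timeout=10) as response:
            return url, response.status, len(response.read())
    except Exception as error:
        return url, None, str(error)

def crawl(urls, workers=30):
    # The paper describes ~300 open connections per crawler driven by
    # asynchronous IO; a thread pool is a much simpler stand-in here.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for url, status, info in pool.map(fetch, urls):
            print(url, status, info)

if __name__ == "__main__":
    crawl(["http://example.com/", "http://example.org/"])

A real crawler would also respect robots.txt, throttle per host, and feed the downloaded pages into an indexing pipeline; the point here is only to show the connection pooling and DNS caching the paper describes.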

As I explained in the first installment, crawling is the process of downloading the documents that are going to be indexed. If Google is unable to download your pages, you won't be listed. That is why it is so important to use reliable hosting and to keep your pages up at all times. For my most profitable sites I use a redundant setup I came up with a few years ago. I will share it in a future post.

A distributed crawling system is Google's answer to dividing up the crawling task: several components (the URL server, the crawlers, and so on) running across multiple servers. Let me explain this with an analogy.

Imagine a librarian who has been assigned the task of preparing books for a new library. The library is empty. He has several assistants to help with the task. The assistants don't know anything about the books, where they can be found or where they go. The librarian does know, and he gives each assistant a list of books and exactly where to find them. "Please bring them here," he says. He does the same for every assistant until he has all the books he needs.

The librarian is the URL server in this case, and the assistants are the crawlers. The crawlers receive lists of URLs that they need to download, and they need to do it efficiently. (As a consequence, you will see Googlebot hitting your site from different IP addresses.)
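To make that division of labor concrete, here is a toy Python sketch of the analogy. This is my own illustration, not Google's design: a "URL server" function puts batches of URLs on a shared queue, and a few "crawler" workers pull batches off and would fetch them.

import queue
import threading

def url_server(batches, work_queue, num_crawlers):
    # The "librarian": hands each assistant a list of books to fetch.
    for batch in batches:
        work_queue.put(batch)
    for _ in range(num_crawlers):
        work_queue.put(None)  # sentinel: no more work

def crawler(name, work_queue):
    # An "assistant": takes a list and works through everything on it.
    while True:
        batch = work_queue.get()
        if batch is None:
            break
        for url in batch:
            print(f"{name} would fetch {url}")

if __name__ == "__main__":
    batches = [["http://example.com/a", "http://example.com/b"],
               ["http://example.org/c"]]
    work_queue = queue.Queue()
    crawlers = [threading.Thread(target=crawler, args=(f"crawler-{i}", work_queue))
                for i in range(3)]
    for t in crawlers:
        t.start()
    url_server(batches, work_queue, num_crawlers=3)
    for t in crawlers:
        t.join()

In Google's case the "assistants" are separate machines rather than threads, which is why the requests arrive at your site from different IP addresses.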

There are also some people who do not know about the robots exclusion protocol and think their page should be protected from indexing by a statement like "This page is copyrighted and should not be indexed." Needless to say, web crawlers don't read pages the way humans do, and they won't understand that request.

Another important piece of information is how to tell search engines not to crawl sensitive pages. There are pages on your site that, for one reason or another, you don't want included in the Google index. One way to keep them out is the robots exclusion protocol (a robots.txt file); another is a robots meta tag with a noindex value.
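For example, a robots.txt file at the root of your site tells compliant crawlers not to fetch certain paths (the /private/ directory here is just a placeholder):

User-agent: *
Disallow: /private/

And to keep an individual page out of the index, you can put the robots meta tag in its head section (the crawler has to be able to fetch the page to see this tag):

<meta name="robots" content="noindex">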

If you read this far (or jumped to the end of the post ;-)), here is your prize: chkng.zip

Before I fell in love with Python, I coded this program in Perl a few years ago. It is an actual crawler for your own use.

What can you use it for?

1. To check your site for broken links.
2. To find out whether a site that is supposed to be linking to you (because you paid for or exchanged a link) still is.

There are some other uses I will not comment on, as I do not want to encourage spamming.

How do you use it? It is multi-threaded code, which means it can work on multiple sites simultaneously. You pass it the number of sites you want to check in parallel and the URL to search for (your site's URL in most cases):

1. Download it to your Linux box (I have not tested it on Windows or Mac).
2. Run chmod +x chkng.pl
3. Create a file with the list of sites you want to check, one per line.
4. Run ./chkng.pl 5 http://yoursite.com < sitestochecklist.txt

This example will check 5 sites in parallel and search for your URL on each of the pages.
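If you would rather use Python, here is a rough, illustrative equivalent of what the script does. This is my own sketch, not the contents of chkng.zip, and the check_links.py file name is just an example.

#!/usr/bin/env python3
# Rough Python sketch of the same idea as chkng.pl: read a list of sites
# from standard input and check each one, in parallel, for a given URL.
import sys
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def check_site(site, target_url):
    try:
        with urllib.request.urlopen(site, timeout=15) as response:
            page = response.read().decode("utf-8", errors="replace")
        return site, "OK" if target_url in page else "LINK MISSING"
    except Exception as error:
        return site, "ERROR: " + str(error)

if __name__ == "__main__":
    # Usage mirrors the Perl version:
    #   ./check_links.py 5 http://yoursite.com < sitestochecklist.txt
    workers = int(sys.argv[1])
    target_url = sys.argv[2]
    sites = [line.strip() for line in sys.stdin if line.strip()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for site, status in pool.map(lambda s: check_site(s, target_url), sites):
            print(site, status)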

Hamlet Batista

Chief Executive Officer

Hamlet Batista is CEO and founder of RankSense, an agile SEO platform for online retailers and manufacturers. He holds US patents on innovative SEO technologies, started doing SEO as a successful affiliate marketer back in 2002, and believes great SEO results should not take 6 months.
