The architecture of Google search engine:
You may like to read Introduction to search engines before we begin with this post.
In Google Search engine, the web crawling is done by several distributed crawlers. There is a URL server that sends lists of URLs to be fetched to the crawlers. The web pages that are fetched are then sent to the storeserver, which then compresses and stores the web pages into a repository. Every web page has an associated ID number called a docID, which is assigned whenever a new URL is parsed out of a web page. The indexing function is performed by the indexer and the sorter. Indexer reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits. The hits record the word, position in the document, an approximation of font size, and capitalization. The indexer distributes these hits into a set of barrels, creating a partially sorted forward index. The indexer performs another important function. It parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link.
The URLresolver reads the anchors file and converts relative URLs into absolute URLs and in turn into docIDs. It puts the anchor text into the forward index, associated with the docID that the anchor points to.
It also generates a database of links, which are pairs of docIDs. The links database is used to compute PageRank for all the documents.
The sorter takes the barrels, which are sorted by docID, and resorts them by wordID to generate the inverted index. This is done in place so that little temporary space is needed for this operation. The sorter also produces a list of wordIDs and offsets into the inverted index. A program called Dump Lexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher. The searcher is run by a web server and uses the lexicon built by Dump Lexicon together with the inverted index and the Page Ranks to answer queries.
How does Google crawl the web content?
Running a web crawler is a challenging task. Crawling is the most fragile application since it involves interacting with hundreds of thousands of web servers and various name servers which are all beyond the control of the system.
To scale to millions of web pages, Google has a fast distributed crawling system. A single URLserver serves lists of URLs to a number of crawlers. Each crawler keeps roughly 300 connections open at once. It works on a simple iterative algorithm. This algorithm differs with the search engines as well as the kind of query. This is necessary to retrieve web pages at a fast enough pace. This factor makes the crawler a complex component of the system.(Refer fig)
This means running a crawler which connects to more than half a million servers and generates tens of millions of log entries. Because of the immense variation in web pages and servers, it is virtually impossible to test a crawler without running it on a large part of the Internet.
The Google Ranking System:
Google maintains much more information about web documents than typical search engines. Every hit list includes position, font, and capitalization information. Combining all of this information into a rank is difficult.
First, consider the simplest case — a single word query. In order to rank a document with a single word query, Google looks at that document’s hit list for that word. Google counts the number of hits of each type on the hit list. Then it computes an IR score for the document. Finally, the IR score is combined with PageRank to give a final rank to the document.
For a multi-word search, the situation is more complicated. Now multiple hit lists must be scanned through at once so that hits occurring close together in a document are weighted higher than hits occurring far apart. For every matched set of hits, proximity is computed. Counts are computed not only for every type of hit but for every type and proximity. Thus it computes an IR score. (Refer fig.)
Fig. Google ranking mechanism