Google: World’s Best Search Engine?

In this article, Atul Davare explains how Googlebot fetches a page and hands its full text to the Google indexer. All the elements of a Google results page are explained, as are the ins and outs of cached pages and how you can specify your preferences.

Introduction

In the world of the World Wide Web, hardly anyone is unaware of Google, widely regarded as the best and fastest search engine. It’s incredibly fast, showing search results with detailed information in a fraction of a second. There are many search engines on the web, but Google has created a distinct reputation for itself with its easy search facilities and comprehensive results that provide lots of related information at an awesome speed. Let’s learn more about Google, how it works, and its different features.

How Google Works

Google consists of three distinct parts, each of which is run on a distributed network of thousands of low-cost computers and can therefore carry out fast parallel processing. Parallel processing is a method of computation in which many calculations can be performed in parallel, or at the same time, significantly speeding up data processing. The three parts of Google are: 

  • The web crawler, known as Googlebot, finds and fetches web pages.
  • The indexer, as its name implies, indexes every word on every page and stores the resulting index of words in a huge database.
  • The query processor compares your search query to the index and recommends the documents that it considers most relevant.

Let’s take a closer look at each part.

Googlebot is Google’s web crawling robot, which finds and retrieves pages on the web and hands them off to the Google indexer. It’s easy to imagine Googlebot as a little spider scurrying across the strands of cyberspace, but in reality Googlebot doesn’t traverse the web at all. It functions much like your web browser, by sending a request to a web server for a web page, downloading the entire page, then handing it off to Google’s indexer.

Googlebot consists of many computers requesting and fetching pages much more quickly than you can with your web browser. In fact, Googlebot can request thousands of different pages simultaneously. To avoid overwhelming web servers, or crowding out requests from human users, Googlebot deliberately makes requests of each individual web server more slowly than it’s capable of doing.
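
The per-server slowdown described above can be illustrated with a small sketch. This is not Googlebot’s actual mechanism; the class name `PolitenessGate` and the delay value are assumptions made for illustration. The idea is simply to remember, for each host, the earliest time at which it may be contacted again.

```python
import time
from urllib.parse import urlparse


class PolitenessGate:
    """Tracks the earliest time each host may be contacted again."""

    def __init__(self, min_delay_seconds=1.0, clock=time.monotonic):
        self.min_delay = min_delay_seconds  # hypothetical per-host delay
        self.clock = clock
        self.next_allowed = {}  # host -> earliest next request time

    def wait_time(self, url):
        """Return how many seconds to wait before fetching `url` (0 if ready)."""
        host = urlparse(url).netloc
        return max(0.0, self.next_allowed.get(host, 0.0) - self.clock())

    def record_fetch(self, url):
        """Note that `url` was just requested, delaying its host's next turn."""
        host = urlparse(url).netloc
        self.next_allowed[host] = self.clock() + self.min_delay
```

A crawler built this way can still fetch from thousands of different hosts at once, because the delay applies to each host separately.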

Googlebot finds pages in two ways: through an add URL form, www.google.com/addurl.html, and through finding links by crawling the web.

Unfortunately, spammers figured out how to create automated bots that bombarded the add URL form with millions of URLs pointing to commercial propaganda. Google rejects URLs submitted through its add URL form that it suspects of trying to deceive users with tactics such as hidden text or links on a page, pages stuffed with irrelevant words, cloaking (aka bait and switch), sneaky redirects, doorways, domains or subdomains with substantially similar content, automated queries to Google, and links to bad neighbors.

When Googlebot fetches a page, it culls all the links appearing on the page and adds them to a queue for subsequent crawling. Googlebot tends to encounter little spam because most web authors link only to what they believe are high-quality pages. By harvesting links from every page it encounters, Googlebot can quickly build a list of links that can cover broad reaches of the web. This technique, known as deep crawling, also allows Googlebot to probe deep within individual sites. Because of their massive scale, deep crawls can reach almost every page on the web. Because the web is vast, this can take some time, so some pages may be crawled only once a month.
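
Link harvesting of this kind can be sketched with Python’s standard library. The helper names `LinkCollector` and `harvest_links` and the sample page are invented for illustration; this shows the general technique, not Google’s code.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in an HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def harvest_links(base_url, html):
    """Return every link found in `html` as an absolute URL."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical page content; each harvested link joins the crawl queue.
crawl_queue = deque()
page = '<a href="/about.html">About</a> <a href="http://example.org/">Other</a>'
crawl_queue.extend(harvest_links("http://example.com/index.html", page))
```

Repeating this step for every page taken from the queue is what lets a crawl spread from a few starting pages to broad reaches of the web.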

Although its function is simple, Googlebot must be programmed to handle several challenges. First, since Googlebot sends out simultaneous requests for thousands of pages, the queue of “visit soon” URLs must be constantly examined and compared with URLs already in Google’s index. Duplicates in the queue must be eliminated to prevent Googlebot from fetching the same page again. Second, Googlebot must determine how often to revisit a page. On the one hand, it’s a waste of resources to re-index an unchanged page. On the other hand, Google wants to re-index changed pages to deliver up-to-date results.
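
The duplicate check on the “visit soon” queue can be sketched as a queue paired with a set of URLs already seen. The `Frontier` class below is a hypothetical illustration of that bookkeeping; deciding when to revisit a known page would be handled separately.

```python
from collections import deque


class Frontier:
    """A 'visit soon' queue that refuses URLs already queued or indexed."""

    def __init__(self, already_indexed=()):
        self.seen = set(already_indexed)  # URLs in the index or the queue
        self.queue = deque()

    def add(self, url):
        """Queue `url` unless it was seen before. Returns True if queued."""
        if url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def next_url(self):
        """Return the next URL to fetch, or None when the queue is empty."""
        return self.queue.popleft() if self.queue else None
```

Using a set keeps each membership check fast even when the number of known URLs is very large.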

To keep the index current, Google continuously recrawls popular, frequently changing web pages at a rate roughly proportional to how often the pages change. Such crawls keep an index current and are known as fresh crawls. Newspaper pages are downloaded daily; pages with stock quotes are downloaded much more frequently. Of course, fresh crawls return fewer pages than the deep crawl. The combination of the two types of crawls allows Google to both make efficient use of its resources and keep its index reasonably current.
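
One simple way to make the revisit rate follow the change rate is to shorten a page’s revisit interval whenever it has changed and lengthen it whenever it has not. The sketch below assumes that approach; the halving and doubling factors and the bounds in hours are illustrative choices, not values Google has published.

```python
import hashlib


def content_fingerprint(text):
    """Hash the page text so two crawls can be compared cheaply."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def next_interval_hours(current_hours, old_fingerprint, new_fingerprint,
                        min_hours=1.0, max_hours=720.0):
    """Shorten the revisit interval if the page changed, lengthen it if not."""
    if old_fingerprint != new_fingerprint:
        proposed = current_hours / 2  # page changed: come back sooner
    else:
        proposed = current_hours * 2  # page unchanged: back off
    # Keep the interval within the assumed bounds.
    return min(max_hours, max(min_hours, proposed))
```

Under this scheme a page of stock quotes drifts toward the minimum interval, while a page that never changes drifts toward a monthly visit.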

The Googlebot gives the indexer the full text of the pages it finds. These pages are stored in Google’s database, usually in an inverted-index data structure. This index is sorted alphabetically by search term, with each index entry storing a list of documents in which the term appears and the location within the text where it occurs. This data structure allows rapid access to documents that contain user query terms.
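
A toy version of an inverted index makes the structure concrete. The function name, the sample documents, and the whitespace-based word splitting are simplifications for illustration only.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to {doc_id: [word positions]}.

    `documents` is a dict of doc_id -> text. Splitting on whitespace is a
    simplification for illustration.
    """
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index


# Hypothetical two-page collection.
docs = {
    "page1": "blue book value of a used car",
    "page2": "used car prices and blue paint",
}
index = build_inverted_index(docs)
```

Looking up `index["blue"]` returns both pages along with the word position in each, which is the rapid term-to-document access the paragraph above describes.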

To improve search performance, Google eliminates common words called stop words (such as the, is, on, or, of, how, and why, as well as certain single digits and single letters). Stop words are so common that they do little to narrow a search, and therefore they can safely be discarded. The indexer also eliminates some punctuation and multiple spaces and converts all letters to lowercase to improve Google’s performance.
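
These normalization steps can be sketched in a few lines. The stop-word list below contains only the examples named in the text, and the handling of single digits and single letters is left out; a real list would be longer.

```python
import re

# A tiny illustrative stop-word list taken from the examples in the text.
STOP_WORDS = {"the", "is", "on", "or", "of", "how", "why"}


def normalize(text):
    """Lowercase, strip punctuation, collapse spaces, and drop stop words."""
    lowered = text.lower()
    no_punctuation = re.sub(r"[^\w\s]", " ", lowered)
    tokens = no_punctuation.split()  # split() also collapses repeated spaces
    return [t for t in tokens if t not in STOP_WORDS]
```

For example, `normalize("How is  the Weather, on Mars?")` keeps only `weather` and `mars`.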
 
Google’s Query Processor

The query processor has several parts, including the user interface (search box), the “engine” that evaluates queries and matches them to relevant documents, and the results formatter.

Google considers over a hundred factors in determining which documents are most relevant to a query, including the popularity of the page, the position and size of the search terms within the page, and the proximity of the search terms to one another on the page. Google also applies machine-learning techniques to improve its performance automatically by learning relationships and associations within the stored data. For example, the spelling-correcting system uses such techniques to figure out likely alternative spellings. Google closely guards the formulas it uses to calculate relevance, and tweaks them to improve quality and performance, and to outwit the latest devious techniques used by spammers.

Indexing the full text of the web allows Google to go beyond simply matching single search terms. Google gives more priority to pages that have search terms near each other and in the same order as the query. Google can also match multi-word phrases and sentences. Since Google indexes HTML code in addition to the text on the page, users can restrict searches on the basis of where query words appear, e.g., in the title, in the URL, in the body, and in links to the page, options offered by the Advanced-Search page and search operators.
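
How proximity and word order might feed into a score can be sketched for a two-term query. Google does not publish its formulas, so the scoring rule below (the inverse of the gap between the terms, doubled when they appear in query order) is purely an assumption for illustration. The inputs are lists of word positions for each term within one document.

```python
def proximity_score(positions_a, positions_b):
    """Score a document for the two-term query "a b".

    Returns a larger value when the terms sit close together, with a
    bonus when `a` appears before `b`, as in the query.
    """
    best = 0.0
    for pa in positions_a:
        for pb in positions_b:
            gap = abs(pb - pa)
            if gap == 0:
                continue  # one position cannot hold two different terms
            score = 1.0 / gap
            if pb > pa:
                score *= 2  # hypothetical bonus for matching query order
            best = max(best, score)
    return best
```

With this rule, a page containing the exact phrase scores 2.0, the same two words in reverse order score 1.0, and a page missing either word scores 0.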

Let’s see how Google processes a query.

The results page is filled with information and links, most of which relate to your query.

  • Google Logo: Click on the Google logo to go to Google’s home page.
  • Statistics Bar: Describes your search, includes the number of results on the current results page and an estimate of the total number of results, as well as the time your search took. For the sake of efficiency, Google estimates the number of results; it would take considerably longer to compute the exact number. Every underlined term in the statistics bar is linked to its dictionary definition.
  • Tips: Sometimes Google displays a tip in a box just below the statistics bar.

  • Search Results: Ordered by relevance to your query, with the result that Google considers the most relevant listed first. Consequently you are likely to find what you’re seeking quickly by looking at the results in the order in which they appear. Google assesses relevance by considering over a hundred factors, including how many other pages link to the page, the positions of the search terms within the page, and the proximity of the search terms to one another.

Below are descriptions of some search-result components. These components appear in fonts of different colors on the result page to make it easier to distinguish them from one another.

  • Page Title: (blue) the web page’s title, if the page has one, or its URL if the page has no title or if Google has not indexed all of the page’s content. Click on the page title, e.g., Brassiere History, to display the corresponding page.
  • Snippets: (black) Each search result usually includes one or more short excerpts of the text that matches your query with your search terms in boldface type. These snippets, which appear in a black font, may provide you with:

    - The information you are seeking 
    - What you might find on the linked page 
    - Ideas of terms to use in your subsequent searches

When Google hasn’t crawled a page, it doesn’t include a snippet. A page might not be crawled because its publisher requested no crawling, or because the page was written in such a way that it was too difficult to crawl.

  • URL of Result: (green) Web address of the search result. In the screen shot, the URL of the first result is www.porvo.com/fashionbra.htm.
  • Size: (green) The size of the text portion of the web page. It is omitted for sites not yet indexed. In the screen shot, “5k” means that the text portion of the web page is 5 kilobytes. One kilobyte is 1,024 (2^10) bytes. One byte typically holds one character. In general, the average size of a word is six characters. So each 1k of text is about 170 words. A page containing 5k characters thus is about 850 words long.

    Large web pages are far less likely to be relevant to your query than smaller pages. For the sake of efficiency, Google searches only the first 101 kilobytes (approximately 17,000 words) of a web page. Assuming 15 words per line and 50 lines per page, Google searches the first 22 pages of a document. If a page is larger, Google will list the page as being 101 kilobytes. This means that Google’s results won’t reference any part of a document beyond its first 101 kilobytes.

  • Date: (green) Sometimes the date Google indexed a page appears just after the size of the page. Dates are included when Google runs a fresh crawl.

  • Indented Result: When Google finds multiple results from the same website, it lists the most relevant result first with the second most relevant page from that same site indented below it. In the screen shot, the indented result and the one above it are both from the site www.porvo.com.

    Limiting the number of results from a given site to two ensures that pages from one site will not dominate your search results and that Google provides pages from a variety of sites.

  • More Results: When there are more than two results from the same site, access the remaining results from the “More results from…” link.

    When Google returns more than one page of results, you can view subsequent pages by clicking either a page number or one of the “o”s in the whimsical “Gooooogle” that appears below the last search result on the page.

If you find yourself scrolling through pages of results, consider increasing the number of results Google displays on each results page by changing your global preferences (see the section Changing Your Global Preferences).

In practice, however, if pages of interest to you aren’t within the first 10 results, consider refining your query instead of sifting through pages of irrelevant results. To simplify such refinements, Google includes a search box at the bottom of the page that you can use to enter your refined query.

  • Sponsored Links: Your results may include some clearly identified sponsored links (advertisements) relevant to your search. Google displays your search terms that appear in the ads in boldface type, e.g., Brassiere on the top ad on the right.
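
The two-results-per-site limit described under Indented Result and More Results can be sketched as a pass over the ranked list. The function name and the example URLs other than the porvo.com one are hypothetical, and this sketch keeps results in rank order rather than reproducing the indented layout.

```python
from urllib.parse import urlparse


def limit_per_site(ranked_urls, max_per_site=2):
    """Keep at most `max_per_site` results per host, preserving rank order.

    Returns (shown, more), where `more` maps each host to the results
    held back behind a "More results from..." style link.
    """
    shown, more, counts = [], {}, {}
    for url in ranked_urls:
        host = urlparse(url).netloc
        counts[host] = counts.get(host, 0) + 1
        if counts[host] <= max_per_site:
            shown.append(url)
        else:
            more.setdefault(host, []).append(url)
    return shown, more
```

Given four ranked results of which three come from www.porvo.com, the first two from that site are shown and the third is held back for the “More results from…” link.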

Google engineer Noam Shazeer developed a spelling correction (suggestion) system based on what other users have entered. The system automatically checks whether you are using the most common spelling of each word in your query.

Want to know the approximate value of a used car? Check out its “Blue Book” value.

Notice that Google suggests the correct spelling if you fail to type the final “e” in “blue.”

Regardless of whether it suggests an alternative spelling, Google returns results that match your query if there are any. If there aren’t any that match your query, Google may offer an alternative spelling, search tips, and a link to Google Answers. The last is a service that provides assistance from expert online researchers for a fee.

Google figures out possible misspellings and their likely correct spellings by using words it finds while searching the web and processing user queries. So, unlike many spelling correctors, Google can suggest common spellings for:

  • Proper nouns (names and places)
  • Words that may not appear in a dictionary
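
A frequency-based suggester in this spirit can be sketched as follows. This is a well-known simple technique (candidates one edit away, ranked by how often they have been seen), not Google’s actual system, and the counts below are invented for illustration.

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one delete, transpose, replace, or insert away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right
                for c in ALPHABET]
    inserts = [left + c + right for left, right in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def suggest(word, term_counts):
    """Return a more common near-spelling of `word`, or None if there is none."""
    candidates = [w for w in edits1(word) if w in term_counts]
    if not candidates:
        return None
    best = max(candidates, key=lambda w: term_counts[w])
    # Only suggest when the alternative is seen more often than the input.
    return best if term_counts[best] > term_counts.get(word, 0) else None


# Hypothetical counts standing in for terms seen on the web and in queries.
term_counts = Counter({"blue": 900, "blu": 3, "glue": 40, "book": 500})
```

Because the counts come from observed usage rather than a dictionary, the same approach works for proper nouns and newly coined words, provided they have been seen often enough.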

Want a definition for your search terms? It’s just a click away.

Google looks for dictionary definitions for your search terms. If it finds any definitions, it shows those words as underlined links in the statistics bar section of the results page (located below the search box showing your query). Google is able to find definitions for acronyms, colloquialisms, and slang, as well as words that you would expect to find in a dictionary.

Click on the underlined terms in the statistics bar to link to their dictionary definition, which also may include information on pronunciation, part of speech, etymology, and usage.

Phrases with idiomatic meanings that aren’t necessarily implied by the definitions of the individual words will be linked to their dictionary definitions, e.g., “to get wind,” “happy hour,” “put off,” “greasy spoon,” and “raise the roof.”

If Google doesn’t find a definition for a term, try using Google Glossary.

The online dictionary page includes a link to an online thesaurus. Use an online thesaurus to find suggestions for expressing yourself, whether for a document, a speech, a book, or a query.

Google takes a snapshot of each page it examines and caches (stores) that version as a backup. The cached version is what Google uses to judge if a page is a good match for your query.

Practically every search result includes a Cached link. Clicking on that link takes you to the Google cached version of that web page, instead of the current version of the page. This is useful if the original page is unavailable because of:

  • Internet congestion
  • A down, overloaded, or just slow website
  • The owner having recently removed the page from the web

Sometimes you can access the cached version from a site that otherwise requires registration or a subscription.

Note: Since Google’s servers are typically faster than many web servers, you can often access a page’s cached version faster than the page itself.

If Google returns a link to a page that appears to have little to do with your query, or if you can’t find the information you’re seeking on the current version of the page, take a look at the cached version.

Let’s search for pages on Google’s help for basic search operators.

Click on the Cached link to view Google’s cached version of the page with the query terms highlighted. The cached version also indicates terms that appear only on links pointing to the page and not on the page itself.

Do you like a result Google found and want more like it? For example, if you’re interested in finding sites similar to that of Consumer Reports, first search for their site.

Click on the Similar pages link that appears on the bottom line for the Consumer Reports result.

The link may be useful for finding more consumer resources, or information on Consumer Reports’ competitors.

As the web has spread across the world, more and more web pages are available in languages other than English. Google provides a translation link and language tools to enable you to read pages written in unfamiliar languages.

Google translates pages by computer. Machine translation is difficult to do well and tends not to be as clear as human translation. But it can give you the gist of what’s written or suggestions for translating something into another language.

Your results may include a “Translate this page” link when a results page is written in a language different from your interface language (as specified by your Google Preferences, which are described in the next section). Your interface language is the language in which Google displays messages and labels, buttons, and tips on Google’s home page and results page. You can translate pages written in English, French, German, Italian, Portuguese, and Spanish into another language from that set.

Google’s Language Tools overcome language barriers. Click on the “Language Tools” link to the right of the search box on Google’s home page to:

  • Search for pages written in specific languages
  • Search for pages located in specific countries
  • Use the Google interface in another language, e.g., set Google’s home page, messages and labels, and buttons to display in a specific language
  • Visit Google’s site in a specific country
  • Translate any text or web page from a limited set of languages (English, French, German, Italian, Portuguese, and Spanish) into another language in that set

Changing Your Global Preferences

You can customize the way your search results appear by configuring your Google global preferences, options that apply across most Google search services. To change these options, click on the Preferences link, which is to the right of Google’s search box, or visit: www.google.com/preferences.

From the Preferences page, specify your global preferences, including:

  • Interface Language: the language in which Google will display tips, messages, and buttons for you
  • Search Language: the language of the pages Google should search for you
  • SafeSearch: automatic filtering and blocking of web pages with explicit sexual content
  • Number of results: how many search results are to be displayed per page
  • Results window: when enabled, clicking on the main link (typically the page title) for a result will open the corresponding page in a new window

When you set your preferences, Google stores your settings in a “cookie” on the computer you are using. Google doesn’t associate that cookie with any other computer you use. So, if you want Google to work similarly on all the computers you use, you will need to set these preferences on each one of them.
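
Why the settings stay on one computer becomes clearer with a sketch of how preferences might be packed into a cookie. The cookie name `PREFS` and the keys `lang`, `num`, and `newwindow` are assumptions made for this example, not Google’s actual cookie format.

```python
from http.cookies import SimpleCookie
from urllib.parse import parse_qs, urlencode

COOKIE_NAME = "PREFS"  # hypothetical cookie name, chosen for this sketch


def preferences_to_cookie(prefs):
    """Pack a dict of preferences into a single Cookie header fragment."""
    cookie = SimpleCookie()
    cookie[COOKIE_NAME] = urlencode(prefs)
    return cookie[COOKIE_NAME].OutputString()


def cookie_to_preferences(header_value):
    """Recover the preferences dict from a Cookie header fragment."""
    cookie = SimpleCookie()
    cookie.load(header_value)
    if COOKIE_NAME not in cookie:
        return {}  # this browser has never saved preferences
    packed = cookie[COOKIE_NAME].value
    return {key: values[0] for key, values in parse_qs(packed).items()}


# Hypothetical settings: interface language, results per page, new window.
saved = preferences_to_cookie({"lang": "en", "num": "20", "newwindow": "1"})
```

The cookie lives in the browser that received it, so a second computer sends no such cookie and gets the defaults until its own preferences are saved.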

Evaluating Results

Google’s web-page-ranking system, PageRank, tends to give priority to better-respected and trusted information. Well-respected sites link to other well-respected sites. This linking boosts the PageRank of high-quality sites. Consequently, more accurate pages are typically listed before sites that include unreliable and erroneous material. Nevertheless, evaluate carefully whatever you find on the web since anyone can:

  • Create pages
  • Exchange ideas
  • Copy, falsify, or omit information intentionally or accidentally
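
The way links boost a page’s standing, as described at the start of this section, can be illustrated with a simplified version of the PageRank calculation. The damping value of 0.85, the iteration count, and the four-page example web are assumptions for illustration; Google’s production ranking combines many more factors.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of scores.

    `links` maps each page to the list of pages it links to. Every linked
    page must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no links spreads its score evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web: three pages link to "hub".
web = {
    "hub": ["a"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub"],
}
ranks = pagerank(web)
```

In this example `hub` ends up with the highest score because three pages link to it, and `a` outranks `b` and `c` because its only inbound link comes from the well-linked `hub`.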

Many people publish pages to get you to buy something or accept a point of view. Google makes no effort to discover or eliminate unreliable and erroneous material. It’s up to you to cultivate the habit of healthy skepticism. When evaluating the credibility of a page, consider the following AAOCC (Authority, Accuracy, Objectivity, Currency, and Coverage) criteria and questions, which are adapted from www.lib.berkeley.edu/ENGI/eval-criteria1001.html.

Authority

  • Who are the authors? Are they qualified? Are they credible?
  • With whom are they affiliated? Do their affiliations affect their credibility?
  • Who is the publisher? What is the publisher’s reputation?

Accuracy

  • Is the information accurate? Is it reliable and error-free?
  • Are the interpretations and implications reasonable?
  • Is there evidence to support conclusions? Is the evidence verifiable?
  • Do the authors properly list their sources, references or citations with dates, page numbers or web addresses, etc.?

Objectivity

  • What is the purpose? What do the authors want to accomplish?
  • Does this purpose affect the presentation?
  • Is there an implicit or explicit bias?
  • Is the information fact, opinion, spoof, or satirical?

Currency

  • Is the information current? Is it still valid?
  • When was the site last updated?
  • Is the site well maintained? Are there any broken links?

Coverage

  • Is the information relevant to your topic and assignment?
  • What is the intended audience?
  • Is the material presented at an appropriate level?
  • Is the information complete? Is it unique?
