Parts of the Search Engine
There are three main parts to every search engine:
A spider crawls the web. It follows links and scans web pages. All search engines have periods of deep crawl and quick crawl. During a deep crawl, the spider follows all links it can find and scans web pages in their entirety. During a quick crawl, the spider does not follow all links and may not scan pages in their entirety.
The job of the spider is to discover new pages and to collect copies of those pages, which are then analyzed in the index.
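The link-following behavior described above can be sketched in a few lines of Python. The miniature in-memory "web" below is entirely made up for illustration; a real spider fetches pages over HTTP and operates at vastly larger scale:

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outbound links).
# A real spider downloads pages over HTTP; this sketch only models
# the link-following and page-collecting logic.
WEB = {
    "a.com": ("home page", ["a.com/about", "b.com"]),
    "a.com/about": ("about page", ["a.com"]),
    "b.com": ("another site", ["c.com"]),
    "c.com": ("a third site", []),
}

def crawl(start, max_depth):
    """Follow links breadth-first up to max_depth and return
    {url: page_text} for every page discovered.

    A large max_depth approximates a deep crawl; a small one, a
    quick crawl that does not follow every link."""
    seen = {start}
    queue = deque([(start, 0)])
    collected = {}
    while queue:
        url, depth = queue.popleft()
        text, links = WEB.get(url, ("", []))
        collected[url] = text  # hand the page copy to the indexer
        if depth < max_depth:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return collected
```

Calling `crawl("a.com", 0)` collects only the start page (a quick crawl), while a larger depth discovers every linked page.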
Pages that are considered important get crawled frequently. For example, the New York Times may be crawled every hour or so to get new stories into the index. Less authoritative sites with lower PageRank are crawled less frequently, sometimes as rarely as once a month. The crawl rate depends largely on link popularity and domain authority.
If many links point to a website, it is probably an important site, so it makes sense to crawl it more often than a site with fewer links. This is also a matter of cost: if search engines crawled all sites at an equal rate, crawling would take more time overall and cost more as a result.
More Spider Features
Spiders may check for duplicate content before passing page copy to the index, in order to keep the index clean (or at least cleaner).
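One common way to detect duplicate and near-duplicate pages is word "shingling" with a Jaccard similarity test. Whether any given engine uses exactly this technique is an assumption; the sketch below just illustrates the idea:

```python
def shingles(text, k=3):
    """Break text into overlapping k-word 'shingles'."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def similarity(a, b, k=3):
    """Jaccard similarity of the two pages' shingle sets."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def is_duplicate(a, b, threshold=0.9):
    # The 0.9 threshold is an illustrative number, not a known value.
    return similarity(a, b) >= threshold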
The index is the place where search engines keep basic copies of web pages and sort search results. When you a do a search, search engines do not search the web; they show results from their index. The number of pages in the index does not represent the entire web, but the number of pages that the spider has discovered, scanned and saved.
The search results count (i.e. Results 1 – 10 of about 160,500) on Google, Yahoo, Live and MSN tells the number of documents in the index that have a search term somewhere in the content or inbound links.
The index is the place where search engineers apply algorithms, and it is the place where rankings are partially determined. Search engineers may choose to apply an algorithm to the entire index, or only to a portion of it.
Datacenters and Different Indexes
Search engines have multiple datacenters around the world. When you enter a search term, your query is directed to the closest datacenter.
Different datacenters may have slightly different indexes, especially during an update. As a result, search results may differ depending on your location. For example, during an index update, Bob in LA will see different results from Tim in New York. This difference in indexes is called the Google Dance, and it’s used by SEOs to spot an update.
A Brief Search Engine History
Web did not have good search engines for a long time. The first search engines did not even analyze page copy; they only looked at titles and had no ranking criteria. As the convenience and commercial potential of search engine became more obvious, more advanced systems were developed.
Excite was the first serious commercial search engine. It was developed in Stanford and was purchased for $6.5 billion by @Home. In 2001 Excite and @Home went bankrupt and InfoSpace bought Excite for $10 million.
At the time the first search engines were rolling out, web directories were still strong competitors, primarily because of poor search results, and later on, because of spam and abuse.
Meta tags were designed to help search engines sort web pages. Pages included keywords in meta tags telling search engines about the contents of each page. For a short time meta tags worked and helped search engines serve relevant results, but over time marketers learned they could easily rank by stuffing those tags with keywords.
As a result, search engine optimization in those days became about cramming "loans, loans, loans, loans, loans" into the meta tag. Search engines got spammed beyond being of any use, and many faced an exodus of users as a result.
Yahoo started as web directory in 1994 and outsourced their search until 2004. Google launched in 1996 and did not have a successful business model until 2001. Microsoft did not come on the search engine scene until 2003.
Some of the important search engines at this time included:
For more information on search engine history, you may want to investigate Search Engine History, a site entirely devoted to this topic. It also touches on the history of search engine optimization. Additionally, Web Master World has an excellent thread that covers the history of SEO.
When you search using a web interface (like Google.com), in many cases results are already presorted to a certain extent. The degree to which results are presorted depends on the complexity of the algorithm. If the time to apply an algorithm to the index is considerable, then that algorithm is applied in advance. On the other hand, some algorithms are applied at the time when the search query is requested.
Search queries go through analysis to determine the possible intent behind the query. Google is currently leading in this area.
"Stop words" are words that are frequently used in the English language. Those words include a, the, all, also, but, down, full, much etc. They are words that are used by everyone regardless of the topic. Generally, search engines ignore "stop" words and will usually correct your search to exclude them. For example, when you search for "cat and dog" search engines will exclude "and" and only search for "cat" "dog."
Google does use stop words to an extent.
Keyword density is a measure of how often a word appears on the page in relation to other words. It is an over-hyped measurement that doesn’t help in search rankings. Search engines use far more than keyword density for on-page analysis. Their technology includes the location of terms on the page, word proximity and natural language processing.
Google has purchased Applied Semantics for its AdSense Network, but may also be using this technology for on-page analysis. Additionally, please keep in mind that one of Google’s current projects involves scanning thousands of books, from which it may learn more about natural language patterns.
Location of Terms on The Page
By analyzing how terms are located in relation to each other on the page, search engines can determine partial relevancy of the page. The closer terms are to each other, the more relevant a page is.
In many cases, keywords appear separately from each other throughout the page. This is considered normal in most cases, but be sure to include a term together at least once in the title, heading or paragraph.
Link analysis is at the core of all search engine relevancy. Apart from Page Rank and general link popularity, Google looks at: link anchor text, the page from which the link comes, age of the link, location of the link, title of the page from which the link comes, authority of the linking page and more.
Links are the biggest quality indicators that search engines have at the moment. Before search engines existed, and before the web was commercialized it was much harder to find information. All you had to rely on was links. There were few if any spammers, and people who found interesting sites shared those sites with others by placing a link. Also, the first web pages and servers were universities and colleges; this is why Google is biased toward .edu domains – they were the first on the scene, and usually contain quality content and resources.
As the web became commercial and Google’s Page Rank well known, links became a form of advertising, where a link could be bought or artificially made by spammers. This is the reason for Google’s bias toward older links and links from trusted domains.
Yahoo put less weight on link analysis than Google, while Ask.com is more about "authoritative hubs." Ask.com generally has a harder time ranking documents unless there’s a community around a topic.
Size and Length of the Page
There’s no "best" page copy length for ranking on search results. Search engines have specifically addressed this issue, and both long content and short content have equal chances to rank.
All major search engines such as Google, Yahoo, Live and Ask collect user feedback about web pages. They look at search queries, prior search queries, time interval between those queries and semantic relationships in order to learn more about intent. They also track click through rates for different listings. If, for example, users click on a listing and then go back right away, search engines may remove that listing and artificially lower its position for one or more keywords.
This brings up the fact that user experience is becoming an important part of SEO. As search engines collect more data, they are constantly learning to interpret it. As they get better at it, retaining users on your pages for a certain time period (maybe a benchmark for an industry) may become an important factor in the SEO game.
Behavior feedback is currently used in personalized search.