Google Algorithms

As SEOs, we spend much of our time trying to figure out how Google’s indexing algorithms work. This article will take a look at the techniques used by the search engine. We’ll cover latent semantic indexing, local inter-connectivity, link analysis, and more. We’ll discuss the implications of these algorithms, and what you should do to give them what they want to see.

Latent Semantic Indexing or LSI

Latent semantic indexing is a natural language processing technique. LSI analyzes relationships between words and can be used to distinguish naturally written text from keyword-stuffed documents. It considers both synonyms and other naturally related words.

For example, if you’re examining an article about an airplane, LSI will look for synonyms: aircraft, plane, aeroplane. It will also look for related words, which are NOT synonyms of the word but are often mentioned when discussing airplanes: ailerons, turbulence, fuel, clouds, sky, roll, pitch, etc.

The point of LSI is to detect natural writing and distinguish it from robotic copy written to manipulate search results. Google purchased the company Applied Semantics, which developed advanced LSI technology. Their know-how was incorporated into AdSense, and possibly into search algorithms.

As a webmaster, write naturally and forget about measurements such as keyword density. Also mix related and synonymous phrases into your anchor text.

Ranking Search Results by Reranking the Results Based on Local Inter-Connectivity

This is the name of a patent issued to Google on February 25, 2003. Its abstract describes the idea as follows:

“A search engine for searching a corpus improves the relevancy of the results by refining a standard relevancy score based on the inter-connectivity of the initially returned set of documents.”

The search engine referred to above finds a good set of documents using other algorithms (such as PageRank and Trust Rank), and then re-ranks search results based on the inter-connectivity of those documents.
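
As a rough illustration of the re-ranking idea, the Python sketch below boosts each document's initial score by the scores of other documents in the same result set that link to it. The function name, the `weight` parameter, the example domains and the combining formula are all assumptions made for illustration; this is not the formula from the patent.

```python
def rerank_by_local_links(initial_scores, links, weight=1.0):
    """Re-rank an initial result set using links inside that set.

    initial_scores: {doc: score from the first-pass ranking}
    links: {doc: set of docs it links to}
    Both the scoring formula and `weight` are illustrative assumptions.
    """
    if not initial_scores:
        return []
    docs = set(initial_scores)

    # Local score: sum of the initial scores of other documents
    # in the result set that link to this document.
    local = {doc: 0.0 for doc in docs}
    for source, targets in links.items():
        if source not in docs:
            continue  # links from outside the result set are ignored
        for target in targets:
            if target in docs and target != source:
                local[target] += initial_scores[source]

    # Normalize both scores so neither dominates by scale alone.
    max_local = max(local.values()) or 1.0
    max_initial = max(initial_scores.values()) or 1.0

    new_scores = {
        doc: (initial_scores[doc] / max_initial)
        * (1.0 + weight * local[doc] / max_local)
        for doc in docs
    }
    # Highest new score first; ties broken alphabetically.
    return sorted(docs, key=lambda d: (-new_scores[d], d))


# Hypothetical result set: one site has authority, another has "community" links.
scores = {"authority.example": 1.0, "community.example": 0.8, "loner.example": 0.9}
links = {
    "authority.example": {"community.example"},
    "loner.example": {"community.example"},
    "community.example": {"authority.example"},
}
print(rerank_by_local_links(scores, links))
```

In this made-up example the site with the most links from the rest of the set moves to the top even though its initial score was the lowest, which is the effect described above.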

If you have many links from authoritative domains, but still have a hard time ranking on search results, you might need to get links from several websites included in a “set” determined by this algorithm. In other words, you need links from sites that rank at the top for your terms, or links that are better than those of your competitors. If you lack “community” exposure, your rankings may be recalculated and lowered in favor of sites that have more “community” links, even at the cost of some authority.

Google Site and Link Analysis refers to the analysis of information about websites, pages, links and patterns. In a patent from 2005, Google describes looking at factors such as the following:

  • The length of domain registration
  • Domain ownership changes
  • WHOIS data and physical address information
  • C-Class IP information
  • Keyword and non-keyword domains
  • The discovery date of new domains/pages
  • Document change frequency and the amount of change
  • The number of linked internal documents
  • Link anchor text
  • Link discovery date
  • Link changes and deletions
  • External link growth patterns
  • The authority of external links
  • Link quality ratios
  • The distribution of links
  • The lifespan of links
  • Link patterns (new vs old and old vs new)
  • Anchor text variety

This is not a complete list. You can get a detailed analysis of this document by GrayWolf at Thread Watch.

It’s not clear whether this has been incorporated into Google over the years, but it’s a good idea to keep this information in mind when you do link building.

Link spikes can be measured against the search volume in a specific topic. If link spikes are not supported by search volume trends, Google can discount or penalize websites. If the linking pattern exceeds the natural or usual pattern, Google will pay more attention to your website, triggering specialized algorithms to determine whether links are spammy or natural. Remember, Google also owns Google News and can correlate events to link patterns.
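
To make the comparison concrete, here is a minimal Python sketch that flags periods where new links spike without a matching rise in search volume. The function name, the median baseline and the `spike_factor` threshold are assumptions chosen for illustration; how Google actually measures this is not public.

```python
from statistics import median


def unsupported_link_spikes(new_links, search_volume, spike_factor=3.0):
    """Return the indexes of periods with a link spike but no search spike.

    new_links and search_volume are per-period counts of equal length.
    A "spike" here means exceeding spike_factor times the series median,
    which is an assumed definition, not a known Google threshold.
    """
    link_baseline = median(new_links)
    volume_baseline = median(search_volume)

    flagged = []
    for period, (links, volume) in enumerate(zip(new_links, search_volume)):
        link_spike = links > spike_factor * link_baseline
        volume_spike = volume > spike_factor * volume_baseline
        if link_spike and not volume_spike:
            flagged.append(period)
    return flagged


# Made-up weekly data: week 4 has a link spike with flat search interest,
# week 6 has a link spike that coincides with a surge in searches.
weekly_links = [10, 12, 9, 11, 200, 10, 180, 12]
weekly_searches = [100, 110, 95, 105, 100, 98, 900, 100]
print(unsupported_link_spikes(weekly_links, weekly_searches))  # [4]
```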

Topic Sensitive PageRank TSPR

Topic Sensitive PageRank (TSPR) bases results on the topical relationship of the query to the documents. This relationship is determined by user input, search history and the Open Directory Project (ODP). By matching the topic according to ODP listings, TSPR can return more refined results.

Topic Sensitive PageRank also takes into account “corrupt” ODP editors who sell listing access. TSPR calculates relevancy based on topical communities rather than pure link power.
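
One common way to think about a topic-biased PageRank is as ordinary PageRank in which the "random jump" only lands on pages known to belong to a topic (for instance, pages listed under an ODP category). The Python sketch below shows that idea for a single topic. The function name, page names, damping factor and iteration count are assumptions for illustration, and the sketch leaves out the query-time step of choosing which topic applies.

```python
def topic_sensitive_pagerank(links, topic_pages, damping=0.85, iterations=50):
    """Compute PageRank with jumps restricted to `topic_pages`.

    links: {page: list of pages it links to}
    topic_pages: pages assumed to be listed under the topic (e.g. in a directory)
    """
    topic_pages = set(topic_pages)
    pages = set(links) | topic_pages
    for targets in links.values():
        pages.update(targets)

    # Jump probabilities: spread evenly over topic pages, zero elsewhere.
    teleport = {
        p: (1.0 / len(topic_pages) if p in topic_pages else 0.0) for p in pages
    }
    rank = dict(teleport)

    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) * teleport[p] for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly across its outbound links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outbound links hands its rank back to the topic pages.
                for p in topic_pages:
                    new_rank[p] += damping * rank[page] * teleport[p]
        rank = new_rank
    return rank


# Hypothetical graph with an SEO cluster and an unrelated cooking cluster.
graph = {
    "dmoz_seo": ["seo_blog"],
    "seo_blog": ["seo_tools"],
    "seo_tools": ["seo_blog"],
    "dmoz_cooking": ["recipes"],
    "recipes": ["dmoz_cooking"],
}
seo_rank = topic_sensitive_pagerank(graph, topic_pages={"dmoz_seo"})
for page, score in sorted(seo_rank.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

With the jump restricted to the SEO directory page, the cooking pages receive no rank at all for this topic, however many links they trade with each other.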

Try getting a link from the Open Directory Project. If you’re having a hard time, there are editors who sell those listings for a few hundred bucks. Topic Sensitive PageRank also shows that Google places a lot of emphasis on links from topically related websites. The algorithm itself is very old and has probably undergone changes, or even been replaced by better successors created in-house at Google. Keep in mind the importance of links from related websites.

Topic Sensitive Trust Rank

To understand Topic Sensitive Trust Rank, you need to understand Trust Rank.

Trust Rank is an algorithm that requires human input. Humans analyze pages and determine different “seed pages.” “Seed pages” are high content, high quality web pages with independent, authoritative, un-affiliated links to other websites. Links from “seed pages” pass Trust Rank to websites. If page X is a seed page linking to pages A, B and C, then Trust Rank is equally distributed between those pages. If A, B and C link to other pages, then Trust Rank continues to flow, but has less power. The further a page is from the “seed,” the less Trust Rank is passed.

This algorithm requires humans to identify seed pages.
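
The flow described above (a seed splits its trust equally among the pages it links to, and trust weakens with every hop) can be sketched in a few lines of Python. This is a simplified, hop-limited illustration, not the published Trust Rank algorithm; the function name, the `decay` value and the `max_hops` limit are assumptions.

```python
def propagate_trust(links, seeds, decay=0.85, max_hops=3):
    """Spread trust outward from hand-picked seed pages.

    links: {page: list of pages it links to}
    seeds: pages a human reviewer has marked as trustworthy
    Each hop splits a page's trust equally among its outbound links
    and multiplies it by `decay`, so distant pages receive less.
    """
    trust = {seed: 1.0 for seed in seeds}
    frontier = dict(trust)

    for _ in range(max_hops):
        next_frontier = {}
        for page, amount in frontier.items():
            targets = links.get(page, [])
            if not targets:
                continue
            share = decay * amount / len(targets)
            for target in targets:
                next_frontier[target] = next_frontier.get(target, 0.0) + share
        # Add what arrived this hop to each page's running total.
        for page, amount in next_frontier.items():
            trust[page] = trust.get(page, 0.0) + amount
        frontier = next_frontier
    return trust


# Seed page X links to A, B and C; A links on to D.
example_links = {"X": ["A", "B", "C"], "A": ["D"]}
for page, score in sorted(propagate_trust(example_links, seeds=["X"]).items()):
    print(f"{page}: {score:.3f}")
```

Here A, B and C each receive the same share from the seed, and D, one hop further out, receives less than A.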

Topic Sensitive Trust Rank works on the same principle, but relies on seed pages which help determine the topics of websites. DMOZ, Yahoo Directory and other high quality directories can help with this.

The supplemental index is where Google puts websites and pages it doesn’t trust. The regular index features the search results that you usually see. Results from the supplemental index will show when there are not enough documents that match the query in the main index.

If your website or pages are in the supplemental index, it means that Google doesn’t trust your website. There are several ways to get into the supplemental index:

  • Too many low quality links
  • Not enough links
  • Too many low quality outbound links
  • Duplicate content
  • Too many pages for your PageRank (low PR, many pages)
  • Your site is new and has a low link profile. To fix this, get more authoritative links.

There’s no way of knowing if your website is in the supplemental index apart from a good guess. If you have a new website, Google may place you in the supplemental index by default, until the site ages and gains some authority. If you’re in the supplemental index, Google is not likely to crawl all your URLs. Also, be sure that you do not have broken links and 404s, since those can get you into the supplemental index.

Spam Detection

As a quality website owner, you will get plenty of spam links. There’s no way to prevent this, so accept the fact and focus your efforts on something more productive.

As you get spam links, Google will compare inbound and outbound links from your site. If you link to bad neighborhoods, you will be considered part of the spam network, so be careful to whom you link. Assuming you get spam inbound links by default, and don’t link to spam from your site, you can offset inbound spam link effects by getting more quality and authoritative links. Aaron Wall states that every single quality link is equal to 40 – 60 spam links. Thus, if you have two quality links and 60 spam links, in Google’s eyes you will only have one quality link (this is a very rough example).
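
The rough arithmetic in that example can be written out as a tiny Python function. The subtraction model and the function name are assumptions that simply mirror the example above; the 40 to 60 ratio is the figure attributed to Aaron Wall, not a known Google formula.

```python
def effective_quality_links(quality_links, spam_links, spam_per_quality=60):
    """Estimate how many quality links remain after spam links offset some.

    spam_per_quality is the assumed number of spam links that cancel
    one quality link (the article quotes a range of 40 to 60).
    """
    offset = spam_links / spam_per_quality
    return max(0.0, quality_links - offset)


# Two quality links and 60 spam links, at both ends of the quoted range.
print(effective_quality_links(2, 60, spam_per_quality=60))  # 1.0
print(effective_quality_links(2, 60, spam_per_quality=40))  # 0.5
```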

Human Reviewers and Behavioral Data

Some SEOs claim Google has up to 10,000 human reviewers who check search results for integrity and review websites flagged by algorithms. In his column Search 4.0, Danny Sullivan claims that search engines are putting humans back into the task and taking behavioral aspects into account in algorithms.

Someone at Google leaked the document titled Quality Rater Guidelines (go read it). The document details how quality raters review search results and also details what Google considers spam. We will not go into detail about the document in this post, but it shows that raters are very real. 

Unlike algorithms, you cannot trick human review, so make sure that your content reads as naturally as it possibly can, because once you’re flagged by a human, it’s very hard to get your position back. 

Behavioral Data and Personalized Search

Behavioral data analysis includes click-through rate, time spent on each site, pages visited and more. Google and other search engines monitor and analyze this data. On top of the tracking listed above, Google is likely to be monitoring users with a long history of specific searches.

Personalized Search

If you have a Google account and use Google search on a regular basis, it adjusts search results based on the sites you visit the most, search queries you frequently use, blogs you track with Google Reader and other content you find/read using Google services. It uses this data to bring you partially customized results that are different from its regular results.

With this in mind, we can hypothesize that once Google has enough data from a pool of users, it can cross-compare and determine if a number of users with specific topics of interest share the same site-visiting patterns. If 1000 users interested in SEO read “SEO Chat” via Google Reader, then it’s a good indicator that SEO Chat is an important website.
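
To picture that hypothesis, the Python sketch below takes a pool of users, picks out those interested in a topic, and lists the sites that a large share of them read. It is purely illustrative of the speculation above: the function name, the data layout and the `min_share` threshold are assumptions, and nothing here reflects how Google actually processes this data.

```python
from collections import Counter


def sites_shared_by_topic_audience(interests, reading_lists, topic, min_share=0.5):
    """List sites read by at least `min_share` of users interested in `topic`.

    interests: {user: set of topics the user searches for}
    reading_lists: {user: list of sites the user follows}
    """
    audience = [user for user, topics in interests.items() if topic in topics]
    if not audience:
        return []

    # Count each site once per user, even if it appears twice in a list.
    counts = Counter(
        site for user in audience for site in set(reading_lists.get(user, []))
    )
    threshold = min_share * len(audience)
    return sorted(site for site, readers in counts.items() if readers >= threshold)


# Made-up users: three are interested in SEO and all three read SEO Chat.
interests = {
    "u1": {"seo"},
    "u2": {"seo", "cooking"},
    "u3": {"cooking"},
    "u4": {"seo"},
}
reading_lists = {
    "u1": ["SEO Chat", "News"],
    "u2": ["SEO Chat"],
    "u3": ["Recipes"],
    "u4": ["SEO Chat", "Tools"],
}
print(sites_shared_by_topic_audience(interests, reading_lists, "seo"))  # ['SEO Chat']
```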

Based on collected behavior information, Google can adjust its search results. I would say that they would not go so far as to determine search results based on behavior, but they definitely use it to adjust and improve SERPs.

Microsoft’s BrowseRank is another example of how search engines use behavioral data. With the focus on behavioral data, now more than ever you need to build your site with your human visitors in mind. After all, that’s truly the point of Google’s algorithms: to return sites that human searchers will find relevant.
