Search Engines and Algorithms: Optimizing for MSN’s RankNet Technology

The latest major change at search engine giant MSN Search has been the introduction of “neural networks” into its search algorithm, a technology internal researchers call “RankNet.” This change took place in late June of this year. The algorithm is new, and it is becoming a major consideration for many search optimizers. In this article, Jennifer Sullivan continues her reviews of search engines and their algorithms, this time focusing on MSN’s RankNet.

MSN RankNet: What Is It?

RankNet is, in essence, a “learning machine” that takes the patterns of human searches into account and learns from them in order to provide more relevant results the next time around. It starts from a baseline of predictions that are fed into its neural net. Chris Burgess of MSN says, “We take a bunch of data, ‘propagate’ it through the network (basically, take a bunch of weighted sums of the inputs and munch them together), and get values out of the network.”
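To make the “weighted sums” idea concrete, here is a minimal sketch of propagating one set of inputs through a tiny two-layer network. The feature values, the weights, and the function name propagate are all made up for illustration; nothing here reflects MSN's actual inputs or network size.

```python
import math

def propagate(inputs, hidden_weights, output_weights):
    """Push one feature vector through a tiny two-layer network."""
    # Each hidden unit takes a weighted sum of the inputs, then squashes it.
    hidden = [
        math.tanh(sum(w * x for w, x in zip(row, inputs)))
        for row in hidden_weights
    ]
    # The output is a weighted sum of the hidden activations: a single score.
    return sum(w * h for w, h in zip(output_weights, hidden))

# Hypothetical document features and weights (assumptions, not MSN data).
features = [1.0, 0.5, 0.0]
hidden_weights = [[0.2, -0.4, 0.1], [0.7, 0.3, -0.5]]
output_weights = [1.0, -1.0]

score = propagate(features, hidden_weights, output_weights)
print(score)
```

In a ranking setting, a score like this would be computed for each candidate document, and the documents would then be sorted by score. Training is the process of adjusting the weights so that the scores agree with known-good orderings.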

They make their predictions with supervised learning, which means, “…a machine learning technique for creating a function from training data. The training data consist of pairs of input objects (typically vectors), and desired outputs. The output of the function can be a continuous value (called regression), or can predict a class label of the input object (called classification). The task of the supervised learner is to predict the value of the function for any valid input object after having seen only a small number of training examples (i.e. pairs of input and target output). To achieve this, the learner has to generalize from the presented data to unseen situations in a ‘reasonable’ way.”
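As a rough illustration of that definition, the sketch below fits a function to a handful of (input, target) pairs by gradient descent and then predicts a value for an input it has never seen. The straight-line model, the training pairs, and the name train_linear are assumptions chosen to keep the example small; a real ranking model is far more elaborate.

```python
def train_linear(pairs, learning_rate=0.1, epochs=2000):
    """Fit y = w * x + b to (input, target) pairs by gradient descent."""
    w, b = 0.0, 0.0
    n = len(pairs)
    for _ in range(epochs):
        # Average gradient of the squared error over the training pairs.
        grad_w = sum((w * x + b - y) * x for x, y in pairs) / n
        grad_b = sum((w * x + b - y) for x, y in pairs) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Hypothetical training data that follows y = 2x + 1.
training_pairs = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = train_linear(training_pairs)

# Generalize to an input that was not in the training data.
prediction = w * 10.0 + b
print(round(prediction, 3))
```

This is the regression case from the quote: the output is a continuous value. The learner only ever saw inputs between 0 and 3, yet it produces a sensible answer for 10 because it has generalized the pattern in the pairs.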

MSN uses 569 different generalized properties to predict the relevancy of a document; these form part of the input objects of the network during supervised learning, or training. This is NOT the same as saying there are 569 different factors weighed when determining a specific document’s relevancy to a particular query. Rather, the properties describe how certain features of a document might render it relevant, and the network builds upon that data.

The clues we get about this technology come from the patent filings related to RankNet. The first patent identified is “Method for scanning, analyzing and handling various kinds of digital information content,” which mentions the neural net concept in its abstract:

Computer-implemented methods are described for, first, characterizing a specific category of information content–pornography, for example–and then accurately identifying instances of that category of content within a real-time media stream, such as a web page, e-mail or other digital dataset. This content-recognition technology enables a new class of highly scalable applications to manage such content, including filtering, classifying, prioritizing, tracking, etc. An illustrative application of the invention is a software product for use in conjunction with web-browser client software for screening access to web pages that contain pornography or other potentially harmful or offensive content. A target attribute set of regular expressions, such as natural language words and/or phrases, is formed by statistical analysis of a number of samples of datasets characterized as “containing,” and another set of samples characterized as “not containing,” the selected category of information content. This list of expressions is refined by applying correlation analysis to the samples or “training data.” Neural-network feed-forward techniques are then applied, again using a substantial training dataset, for adaptively assigning relative weights to each of the expressions in the target attribute set, thereby forming a weighted list that is highly predictive of the information content category of interest.
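The abstract above describes learning a weight for each expression from labeled “containing” and “not containing” samples. A minimal sketch of that general idea follows, using a single logistic unit rather than whatever network the patent actually specifies. The category (promotional text), the three expressions, the sample texts, and the function names are all hypothetical.

```python
import math
import re

# Hypothetical target attribute set: one regular expression per attribute.
PATTERNS = [re.compile(p, re.IGNORECASE)
            for p in (r"\bfree\b", r"\bwinner\b", r"\bmeeting\b")]

def match_features(text):
    """1.0 for each expression found in the text, else 0.0."""
    return [1.0 if p.search(text) else 0.0 for p in PATTERNS]

def train_weights(samples, learning_rate=0.5, epochs=200):
    """Adaptively assign one weight per expression, plus a bias."""
    weights, bias = [0.0] * len(PATTERNS), 0.0
    for _ in range(epochs):
        for text, label in samples:
            x = match_features(text)
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            predicted = 1.0 / (1.0 + math.exp(-z))
            error = label - predicted  # positive if we under-predicted
            weights = [w + learning_rate * error * xi
                       for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias

def contains_category(text, weights, bias):
    """True if the weighted list says the text belongs to the category."""
    x = match_features(text)
    return bias + sum(w * xi for w, xi in zip(weights, x)) > 0.0

# Labeled samples: 1 = "containing", 0 = "not containing".
SAMPLES = [
    ("You are a winner, claim your free prize", 1),
    ("Free offer for every winner", 1),
    ("Agenda for the team meeting", 0),
    ("Free parking at the meeting venue", 0),
]
weights, bias = train_weights(SAMPLES)
print(contains_category("A free gift for the lucky winner", weights, bias))
print(contains_category("Notes from the budget meeting", weights, bias))
```

Note that “free” appears in both kinds of sample, so the training has to lean on the other expressions to tell the two groups apart. That is the point of learning the weights rather than simply counting matches.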

Chris Burgess, mentioned in the MSN Search Blog post and lead author of the “Learning to Rank using Gradient Descent” paper (one of the RankNet white papers), was also one of the co-authors of this patent application, which describes a neural network: “System and method for identifying content and managing information corresponding to objects in a signal.” The patent abstract states:

An “interactive signal analyzer” provides a framework for sampling one or more signals, such as, for example, one or more channels across the entire FM radio spectrum in one or more geographic regions, to identify objects of interest within the signal content and associate attributes with that content. The interactive signal analyzer uses a signal fingerprint extraction algorithm, i.e., a “fingerprint engine,” for deriving traces from segments of one or more signals. These traces are referred to as “fingerprints” since they are used to uniquely identify the signal segments from which they are derived. These fingerprints are then used for comparison to a database of fingerprints of known objects of interest. Information describing the identified content and associated object attributes is then provided in an interactive user database for viewing and interacting with information resulting from the comparison of the fingerprints to the database.

What we have seen as important to the way MSN ranks a website can be grouped into a few basic concepts. MSN relies heavily upon anchor text in links, and content is still king. Because of this, the MSN algorithm rewards high keyword density as well, even more so than Yahoo. The presence of a robots.txt file has lately been seen to be highly important to the MSN robot’s crawls, and MSN has been known to completely disregard sites that don’t include one. Whether this is part of the new neural net or simply a filter, we can’t be certain.
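If you want to check how a robots.txt file will be read before you upload it, Python's standard urllib.robotparser module can parse the rules locally. The sketch below uses a made-up robots.txt and example.com URLs; the user-agent name msnbot is the one MSN's crawler identifies itself with.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: keep msnbot out of /private/, allow the rest.
ROBOTS_TXT = """\
User-agent: msnbot
Disallow: /private/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

can_crawl_home = parser.can_fetch(
    "msnbot", "http://www.example.com/index.html")
can_crawl_private = parser.can_fetch(
    "msnbot", "http://www.example.com/private/notes.html")

print(can_crawl_home, can_crawl_private)
```

This only tells you how a standards-following parser reads the file. It says nothing about how much weight MSN gives to the file's presence, which, as noted above, is an observation rather than something MSN has documented.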

Keyword meta tags no longer hold any importance. MSN does not index data from the meta keywords tag because its contents are not visible to the user in a standard browser and because the tag has a long history of being spammed. You should always include a title tag and a description meta tag, however.

Keywords in the URL are extremely important, appearing in a full 85% of the top ten ranked sites. Other elements that MSNBot looks at are header tags, alt attributes on images, and the title attribute on links. MSN does not differentiate between the <b> and <strong> tags; either one is fine.
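A quick way to audit a page against the elements mentioned so far (URL, title, description, header tags, image alt text, and link titles) is to parse it and see where your keyword phrase actually appears. The sketch below does this with Python's standard html.parser. The page, the URL, the keyword, and the names OnPageAudit and keyword_report are hypothetical, and the check is a plain substring match, not a model of how MSNBot scores anything.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

TEXT_TAGS = ("title", "h1", "h2", "h3")

class OnPageAudit(HTMLParser):
    """Collects the on-page elements discussed in the article."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.headers = []
        self.alts = []
        self.link_titles = []
        self._current = None  # text-bearing tag we are inside, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in TEXT_TAGS:
            self._current = tag
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content") or ""
        elif tag == "img" and attrs.get("alt"):
            self.alts.append(attrs["alt"])
        elif tag == "a" and attrs.get("title"):
            self.link_titles.append(attrs["title"])

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current is not None:
            self.headers.append(data)

def keyword_report(url, html, keyword):
    """Report where the keyword phrase appears (case-insensitive)."""
    audit = OnPageAudit()
    audit.feed(html)
    audit.close()
    phrase = keyword.lower()

    def has(texts):
        return any(phrase in t.lower() for t in texts)

    return {
        # URLs usually hyphenate phrases, so compare against a slug.
        "url": phrase.replace(" ", "-") in urlparse(url).path.lower(),
        "title": has([audit.title]),
        "description": has([audit.description]),
        "headers": has(audit.headers),
        "image_alts": has(audit.alts),
        "link_titles": has(audit.link_titles),
    }

PAGE = """<html><head><title>Blue Widgets for Sale</title>
<meta name="description" content="Hand-made blue widgets.">
</head><body><h1>Blue Widgets</h1>
<img src="w.jpg" alt="A blue widget">
<a href="/red.html" title="Red widgets">Red</a></body></html>"""

report = keyword_report(
    "http://www.example.com/blue-widgets.html", PAGE, "blue widgets")
print(report)
```

In this made-up page the phrase is present in the URL, title, description, and header, but missing from the image alt text and the link title, which is the kind of gap such a report is meant to surface.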

MSN receives about 15% of all search engine queries on the internet. It does not currently index Flash, but this is on its to-do list based on customer feedback. MSN also doesn’t have any issues with 302 temporary redirects. When a page is redirected, however, MSN indexes the page the visitor ends up on, not the temporary page.

MSN regards static pages as more important than dynamic pages, even though it does rank dynamic pages. It is also difficult to sabotage a website in MSN through such techniques as washes (attempting to turn a duplicate content penalty against a site by flooding the SERPs with similar content) or linking to a site from spam pages.

It’s not clear whether the age of a website factors into MSN’s RankNet algorithm.

Filters

Many people would have you believe that MSN doesn’t have a duplicate content filter or penalize sites for duplicate content. This isn’t true. In fact, MSN’s duplicate content filter got my number one vote after an experiment I performed not long ago with an original article I had written. Over a few weeks, Yahoo did fairly well at weeding out duplicate content and settling on my original as the Real McCoy, and Google failed miserably. It was MSN that not only filtered out the duplicates, but also gave me “bonus points” for having the original. This tells me that MSN not only employs a filter, but also has the technological ability to determine content origin and source.

There is a rumor, as yet unsubstantiated, that MSN Search gives a slight advantage to sites running on Microsoft servers with IIS.

Another potential problem with MSN search results is manipulation via blog comments and forum posts, but MSN says it is aware of the problem, and it is on the to-do list as well.

In The Works

Like Google and Yahoo, MSN is constantly expanding its horizons. It’s not just about search anymore; it’s now about what other features are offered alongside search.

AOL Search

MSN has been in talks with AOL and Time Warner since the beginning of this year regarding a possible acquisition of the one-time dial-up internet giant. There has been talk of acquisition, partnering, or a simple co-op in equal measure, with nothing decided as of yet. Joining with AOL would shake up the search engine industry for sure.

Online Book Search

MSN is getting into the business of offering online searches of books and other writings, and says its approach aims to avoid the legal tussles met by rival Google Inc.

The Redmond-based software giant said that it will avoid copyright issues for now by initially focusing mainly on books, academic materials and other publications that are in the public domain. MSN plans to initially work with an industry organization called the Open Content Alliance to let users search about 150,000 published documents. A test version of the product is promised for next year.

Search Clustering

According to MSN’s sandbox site, sandbox.msn.com, which gives an inside look at what is going on in the world of MSN, Search Clustering is in development. “MSRA SRC is a tool for searching web with the Search Result Clustering (SRC) technique that was developed at Web Search and Mining Group in MSR, Asia. On-the-fly it clusters a search engine’s search results into different groups, and provides meaningful and readable names for these groups. SRC changes the traditional representation of search results into a non-linear way, so as to facilitate the user’s browsing.

Traditional clustering techniques don’t work for this problem because the documents are short, the cluster names should be readable and the algorithm should be efficient for on-the-fly calculation. The method takes on the whole problem in a different way and overcomes the difficulties in traditional clustering methods. It tries to first identify salient topics by identifying distinct and independent keywords, and then classifies the search results into these topics…”
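The two steps in that description (find salient keywords, then sort results into those topics) can be sketched in a few lines. What follows is a deliberately crude stand-in, not the MSR Asia method: it treats any non-query word that shows up in more than one snippet as a topic. The stopword list, the snippets, and the function name cluster_results are all made up for illustration.

```python
import re
from collections import Counter, defaultdict

STOPWORDS = {"the", "a", "an", "of", "and", "for", "in", "to", "is", "on"}

def cluster_results(query, snippets, max_clusters=2):
    """Group snippet indexes under keyword labels found on the fly."""
    query_terms = set(query.lower().split())
    doc_terms = [
        set(re.findall(r"[a-z]+", s.lower())) - STOPWORDS - query_terms
        for s in snippets
    ]
    # Step 1: pick salient topics, here words shared by several snippets.
    doc_freq = Counter(t for terms in doc_terms for t in terms)
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    topics = [t for t, n in ranked if n > 1][:max_clusters]
    # Step 2: classify each result into the first topic it mentions.
    clusters = defaultdict(list)
    for index, terms in enumerate(doc_terms):
        label = next((t for t in topics if t in terms), "other")
        clusters[label].append(index)
    return dict(clusters)

SNIPPETS = [
    "Jaguar car dealers and new car prices",
    "The jaguar is a big cat of the Americas",
    "Used Jaguar car listings",
    "Jaguar cat habitat and diet",
    "Jaguar official site",
]
clusters = cluster_results("jaguar", SNIPPETS)
print(clusters)
```

Even this toy version shows why the quote stresses short documents and readable names: with only a handful of words per snippet there is very little to cluster on, and the label has to be something a person would recognize at a glance.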

Exclusive MSN Features

A while back, the community got together to compile a list of questions to present to MSN. One of the questions asked was how MSN differentiates itself from the other search engines, now and in the future. MSN had this to say:

Instant Answers – MSN Search displays answers instantly from Encarta and our Music service so you spend less time navigating to the information you need

Features and Tools – Most up-to-date index with a competitive index size and tools like Search Builder and Near Me to help personalize consumers’ search experience

Multiple access points – Whether it’s MSN.com redesigned and optimized for Web search, or the MSN Toolbar Suite with Windows Desktop Search, with MSN we make it easy to get started searching no matter what the entry point

“We have our eye on becoming number one in the search space, but we realize this will take some time. Today we believe we have the best search engine for MSN users and are continuing to enhance and grow our feature set based on consumer feedback.”

MSN Keywords

Like Google with AdWords, MSN has recently launched an ad network of its own, called MSN Keywords. Similar to AdWords, MSN Keywords is an auction-type program in which advertisers “bid” on certain keywords in a control-panel-based adCenter. This is a pilot program right now, testing in France, Hong Kong, and Singapore. Depending on the popularity of the program, we may see MSN Keywords roll out in the UK and US as early as 2006.

MSN Search has many features and services available. Optimizing for this search engine is actually relatively easy, as long as your site features lots of original content.

There are many other search engines out there, including some that fall into specialized categories, like metasearch engines, which compile an overview of results from multiple search engines instead of pulling results from their own databases. I’ll be covering metasearch engines in depth in my next article, along with how they can be important in your SEO efforts.
