Search, like many things in life, is all about Rank. And right now Microsoft is a distant third, a dark and unfamiliar place for a company so used to dominating the computer industry. Indeed, Microsoft is struggling to remain relevant in the search world, peering over Yahoo’s shoulder while trying to catch a glimpse of the real superstar. Of course, I’m talking about Google, the undisputed Internet Search King. As of June 2008, Google accounted for a whopping 70% of all U.S. searches, up 6% from June 2007. Yahoo came in a distant second with a 20% share, while MSN had just 5.5% and barely held off Ask.com for third place.
It’s not like Microsoft hasn’t tried, though. A few months back, I wrote about Microsoft’s attempt to regain a share of the search advertising market by offering cash rebates to consumers who searched for and purchased products through its search engine, Live Search. Many "experts" took this as Microsoft thumbing its nose at the notion that the search engine with the best results will have the most success. They assumed Microsoft thought it could buy users away from Google. It seems, so far at least, that even though money talks, users still want the best search results. Microsoft would have to do more.
A month later, Microsoft answered with the purchase of the semantic search engine Powerset, which attempts to comprehend the full meaning of phrases typed into a search box. Google still bases its results on individual words, while doing little to understand their meaning. However, Powerset is limited to searches within Wikipedia, and experts wonder whether the technology will ever be applied within a major search engine (although its iPhone application has garnered significant praise). Nevertheless, Microsoft seems to be doing all its talking with its wallet (see its failed acquisition of Yahoo) with little innovation coming from within the company itself.
Enter BrowseRank, a collaborative effort from Microsoft’s own researchers and scientists from various Asian universities. Seeing as PageRank is at the heart of Google’s success, it was obvious that Microsoft had to tackle this algorithm head on if it ever wanted to seriously compete with the reigning Search King. In the sections to come, I will detail both PageRank and BrowseRank in an effort to determine which comes out on top.
As a disclaimer, let me just point out that PageRank is not the only way Google determines site importance. “It’s important to keep in mind that PageRank is just one of more than 200 signals we use to determine the ranking of a website,” Google said. “Search remains at the core of everything Google does, and we are always working to improve it.” Having said that, Google still relies heavily upon it, so let’s begin the analysis.
PageRank is a trademark of Google and one of several link analysis algorithms, which use the link graph of the web to determine page importance (HITS is another popular example). Basically, PageRank counts the links pointing to a specific page and weighs the importance of each linking page. Here is Google’s description:
In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves “important” weigh more heavily and help to make other pages “important.”
PageRank also relies on other factors, such as the relevance of keywords on the page and the number of visits to the page obtained from the Google Toolbar. Factors like these have been known to be prime targets for manipulation, which is why Google does not divulge the details of the many other signals that influence its rankings.
The algorithm itself assesses the probability that a person who is randomly clicking on links will end up at a specific page. This calculation is done iteratively, with the values converging toward their final results. The initial approximation is divided evenly among the pages being examined (the total must equal 1). So, for example, if there are 10 web pages, each starts with an initial PageRank of 0.1. When you add links to the equation, things start to get hairy.
First of all, the PageRank of one page (call it X) is determined by the pages linking to it. Each page linking to X passes along a share of its own PageRank equal to its current value divided by its number of outbound links (in our example, 0.1 divided by the outbound link count). Summing these shares from every page linking to X gives X’s new PageRank. Finally, we adjust for the damping factor, the probability that a person randomly clicking on links will continue to do so rather than jump to a random page, and handle pages that have no links to other pages.
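The iterative calculation described above can be sketched in a few lines of Python. This is a textbook simplification, not Google’s actual implementation; the three-page link structure and the commonly cited damping factor of 0.85 are illustrative choices.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # even initial split, summing to 1
    for _ in range(iterations):
        new_rank = {}
        for page in pages:
            # Each page linking here passes along its rank divided by
            # its number of outbound links.
            incoming = sum(rank[q] / len(links[q])
                           for q in pages if page in links[q])
            # The damping factor models the chance that the random
            # surfer keeps clicking instead of jumping somewhere random.
            new_rank[page] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank

# Three pages: A links to B and C, B links to C, C links back to A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

On this tiny example, C ends up ranked highest, since it collects votes from both A and B. (Note that this sketch ignores dangling pages with no outbound links, which a real implementation must handle.)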
If you would like to further explore the intricacies of PageRank, then I urge you to check out its Wikipedia page. There are plenty of links and references to help you along your educational journey. That being said, I hope you like math.
The people at Microsoft naturally think that PageRank has a number of issues that need to be solved. In their proposal, they mention that webmasters can manipulate the system by adding a large number of hyperlinks and creating what’s called a link farm. These pages are solely designed to inflate the importance of the sites to which they link so that they appear higher in the SERPs. This method is a favorite of web spammers because it greatly distorts a link analysis algorithm’s ability to calculate page importance.
Another problem with PageRank is that it does not take into account the amount of time a person spends on a web page while they are randomly clicking links. Microsoft feels this is a crucial factor in determining page importance. “The more visits of the page made by the users and the longer time periods spent by the users on the page, the more likely the page is important,” the researchers said. They propose using a more dependable data source, called the user browsing graph, and a more powerful mathematical model.
The user browsing graph is built from user behavior data obtained from web servers, which collect it from Internet browsers. The data includes the URL, the method of arriving at the URL (clicking a hyperlink, typing it into the browser, or using a bookmark), and the time spent on the page. A graph is then built where “vertices represent web pages and directed edges represent real transitions between web pages by users.” Please see the figure below for clarification. Note that the staying time isn’t shown in the graph portion of the image, but it is included in the actual processing.
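To make the data flow concrete, here is a small Python sketch of how such a browsing graph might be assembled. The log format, site names, and dwell times are invented for illustration; the proposal only specifies that each record carries the URL, the arrival method, and the staying time.

```python
from collections import defaultdict

# Each session is an ordered list of (url, arrival_method, seconds_on_page).
sessions = [
    [("a.com", "INPUT", 40), ("b.com", "CLICK", 5), ("c.com", "CLICK", 120)],
    [("b.com", "BOOKMARK", 10), ("c.com", "CLICK", 90)],
]

edges = defaultdict(int)          # (from_url, to_url) -> transition count
staying_time = defaultdict(list)  # url -> observed dwell times in seconds

for session in sessions:
    for (url, _, seconds) in session:
        staying_time[url].append(seconds)
    # Consecutive pairs within a session are candidate graph edges.
    for (src, _, _), (dst, method, _) in zip(session, session[1:]):
        if method == "CLICK":  # only real link transitions form edges
            edges[(src, dst)] += 1

# Average staying time per page feeds the time component of BrowseRank.
mean_time = {url: sum(t) / len(t) for url, t in staying_time.items()}
```

The vertices are the URLs, the `edges` counts are the directed transitions, and `mean_time` carries the staying-time information that the figure omits.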
As for the algorithm itself, the main difference between BrowseRank and PageRank is that BrowseRank implements a continuous-time Markov process, while PageRank uses a discrete-time Markov process (a.k.a. a Markov chain). I suggest reading the proposal I linked to at the beginning of this section for a more thorough discussion. I’m afraid the full mathematics is not only beyond the scope of this article, but beyond my capacity to comprehend.
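That said, the core intuition can at least be sketched: compute the stationary distribution of the page-to-page jump process, then weight each page by the average time users stay on it. The Python below illustrates that principle with invented page names and numbers; it is a simplified sketch of the idea, not the paper’s actual model.

```python
def continuous_time_rank(transitions, mean_stay, iterations=100):
    """transitions: {page: {next_page: probability}}; mean_stay: {page: seconds}."""
    pages = list(transitions)
    pi = {p: 1.0 / len(pages) for p in pages}
    # Power iteration for the stationary distribution of the jump chain,
    # i.e. how often users transition into each page.
    for _ in range(iterations):
        new_pi = {p: 0.0 for p in pages}
        for p in pages:
            for q, prob in transitions[p].items():
                new_pi[q] += pi[p] * prob
        pi = new_pi
    # Weight each page by how long users linger on it, then renormalize.
    weighted = {p: pi[p] * mean_stay[p] for p in pages}
    total = sum(weighted.values())
    return {p: w / total for p, w in weighted.items()}

# Invented example: users leave "news" quickly but linger on "social".
ranks = continuous_time_rank(
    {"news": {"social": 1.0}, "social": {"news": 0.5, "social": 0.5}},
    {"news": 10.0, "social": 60.0},
)
```

Even with comparable transition frequencies, the long staying time on "social" pulls it well ahead, which is exactly the behavior Microsoft argues PageRank misses.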
So far we’ve examined the shortcomings of PageRank and how BrowseRank intends to solve them, but experts have a few concerns with BrowseRank as well. Here is an excerpt from the proposal I mentioned in the last section:
Some websites like adobe.com are ranked very high by PageRank. One reason is that adobe.com has millions of inlinks for Acrobat Reader and Flash Player downloads. However, web users do not really visit such websites very frequently and they should not be regarded more important than the websites on which users spend much more time (like myspace.com and facebook.com).
On the surface, this seems to be very encouraging for BrowseRank. It would definitely give more control to web users and more credence to their concept of a more democratic web ranking system when juxtaposed with PageRank’s “links as votes” notion. But are social media sites really the most important sources for relevant information? Most of their content is irrelevant to the majority of web users.
And unless Microsoft blends its search results somehow (consider how Google mixes Google News with its organic results), sites like Digg could be manipulated to make temporary information more important than it should be. One mitigation would be to wait and see whether spikes in traffic are sustained before promoting a page. There’s also the fact that some low quality pages are good at answering really common questions. Until search engines get better at answering these questions and previewing pages in their results, this will remain an issue.
Aaron Wall of Seobook.com says PageRank does have an advantage in the way people tend to link to informational resources. Since Google’s search results are geared toward informational sites, searchers are then more likely to click on paid ads during a commercial search. Wall says, “Google also has the ability to arbitrarily police links and/or strip PageRank scores to 0 with the intent to fearmonger and add opportunity cost to anyone who gathers enough links pointing at a (non-corporate owned) commercial domain.”
Of course, with issues looming over both PageRank and BrowseRank, there’s always the possibility that Microsoft and Google put their rivalry aside and combine both algorithms into a larger formula, right? “It is also possible to combine link graph and user behavior data to compute page importance," the researchers said in their proposal. "We will not discuss more about this possibility in this paper, and simply leave it as future work.” Oh well, it looks like the competition will go on until someone cries mercy.