I’ve linked to the United States Patent and Trademark Office in case you want to browse; I’ve also linked to each individual patent below.
Identifying a web page as belonging to a blog
This patent covers the methods Live Search uses to differentiate blogs from static HTML pages.
Ranking Method Using Hyperlinks in Blogs
This method permits a bias towards links from blogs in order to modify PageRank and deliver higher quality search results.
Vision-based document segmentation
This patent explains how Microsoft differentiates unrelated content on one page.
In the conclusion we will draw parallels between Google five years ago and Live Search today.
Identifying a Web Page as Belonging to a Blog Patent
On June 19, 2006 Microsoft filed a patent titled "Identifying a web page as belonging to a blog." The patent lists Dennis Craig Fetterly and Steve Shaw-Tang Chien as the inventors. It was published on December 20, 2007, and received United States patent application number 20070294252.
SEO by the Sea published some interesting coverage of it.
Microsoft’s patented technology is used to identify whether a page is part of a blog, based on a number of characteristics. Once pages are identified as belonging to a blog, they are further classified by importance based on link data.
Microsoft states this algorithm can be trained by a human.
"Blogging has grown rapidly on the internet over the last few years. Weblogs, referred to as blogs, span a wide range, from personal journals read by a few people, to niche sites for small communities, to widely popular blogs frequented by millions of visitors, for example. Collectively, these blogs form a distinct subset of the internet known as blogspace, which is increasingly valuable as a source of information for everyday users."
The web host of the page; if the site is hosted on Blogger.com, Blogdrive, Blogstyles, LiveJournal Blogs, Myspace, Typepad, Yahoo 360, Wunderblogs or other blog hosting domains, then it probably belongs to a blog.
Common words or phrases used on blogs, such as permalink, comment, subscribe, archives, blogroll, responds, related posts, powered by, trackback, posted at, posted on and other common terms (including those in foreign languages), indicate that the page probably belongs to a blog.
Another sign of a page belonging to a blog is the presence of the URL target that identifies it as a blog: www.site.com/blog.
Outgoing links that point to wordpress.com, drupal.org, blogger.com and other blog platforms are also indicative.
Feeds such as RSS or ATOM raise the possibility of the page belonging to a blog.
The patented method uses other predetermined models (which might involve training by a human) to identify pages as blogs.
The method analyzes internal link patterns to spot common patterns used by blogs.
Microsoft may also place a lot of emphasis on links and link patterns to identify blogs: "parsing HTML documents to find the content based-features is relatively expensive in terms of processing and time, while link-based information does not require parsing of documents."
Once Microsoft spiders a web page and puts it through this algorithm, it assigns a score based on the factors listed above. If a web page reaches a specific score, it goes through more analysis to be classified as blog or non-blog page.
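The scoring step described above can be sketched roughly as follows. This is only an illustration of the general idea, not Microsoft’s implementation: the host and term lists are shortened, and every weight, the threshold and the function names (blog_score, needs_further_analysis) are hypothetical, since the patent coverage does not give concrete values.

```python
from urllib.parse import urlparse

# Hypothetical, shortened lists and weights; no concrete values are published.
BLOG_HOSTS = ("blogger.com", "livejournal.com", "typepad.com")
BLOG_TERMS = ("permalink", "trackback", "blogroll", "posted on", "powered by")
BLOG_PLATFORMS = ("wordpress.com", "drupal.org", "blogger.com")
WEIGHTS = {"host": 3.0, "terms": 0.5, "url": 2.0, "platform_link": 1.0, "feed": 1.5}
THRESHOLD = 3.0  # assumed cut-off for sending a page on to further analysis


def _host_matches(host, domains):
    """True if host equals one of the domains or is a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in domains)


def blog_score(url, text, outlinks, has_feed):
    """Add up weighted blog signals for one crawled page."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    score = 0.0
    # Signal: the page is hosted on a known blog host.
    if _host_matches(host, BLOG_HOSTS):
        score += WEIGHTS["host"]
    # Signal: each common blog phrase found in the page text.
    lowered = text.lower()
    score += WEIGHTS["terms"] * sum(term in lowered for term in BLOG_TERMS)
    # Signal: the URL path contains /blog.
    if "/blog" in parsed.path.lower():
        score += WEIGHTS["url"]
    # Signal: at least one outgoing link points to a blog platform.
    link_hosts = ((urlparse(link).hostname or "").lower() for link in outlinks)
    if any(_host_matches(h, BLOG_PLATFORMS) for h in link_hosts):
        score += WEIGHTS["platform_link"]
    # Signal: the page advertises an RSS or Atom feed.
    if has_feed:
        score += WEIGHTS["feed"]
    return score


def needs_further_analysis(score):
    """Pages at or above the assumed threshold go on to the classifier."""
    return score >= THRESHOLD
```

A real system would presumably learn the weights from labeled pages rather than hard-code them, which fits the statement that the algorithm can be trained by a human.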
The patent also makes a controversial statement: "Search engines are increasingly implementing features that restrict the results for queries to be from blog pages."
The reason for this statement may be explained by Jakob Nielsen of useit.com in his article "Write Articles, Not Blog Postings." He argues that short summary news articles have little long term value.
If a searcher wants to learn about past news, but only gets a short blog post that talks about specifics without a summary, an explanation of events and other content expected of a detailed news article, then the user’s experience of search goes down in value. On the other hand, a detailed article that takes the reader by the hand from A to Z offers much more long term value (assuming it’s linked to). Therefore, news articles are better than "news blog posts."
This is a good argument in support of Microsoft’s statement, but what about the thousands of useful blog posts that don’t focus on news? I refer here to blogs like Copyblogger, Seobook, Wolfhowl, seomoz, grokdotcom and many others in different industries. Ignoring their content only because they are blogs doesn’t make sense. User experience will deteriorate if Microsoft ignores them.
On March 30, 2007 Microsoft filed another patent titled "Ranking method using hyperlinks in blogs." It listed Steve Chien and Dennis Fetterly as the inventors, and assigned the patent to Microsoft. It was published October 2, 2008, and received US patent application number 20080243812.
A method for static ranking of web documents is disclosed. Search engines are typically configured such that search results having a higher PageRank score are listed first. A modified scoring technique is provided whereby the score includes a reset vector that is biased toward web pages linked to blogs. This requires identifying web pages as either blogs or non-blogs.
This algorithm identifies blogs and then recalculates PageRank based on links coming from those blogs.
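To make the idea of a biased reset vector concrete, here is a minimal sketch of PageRank by power iteration in which pages linked to from blogs receive a larger share of the reset (teleport) probability. It rests on assumptions: the abstract does not say how strong the bias is, so blog_bias is a made-up parameter, and reading "linked to blogs" as "linked to from blogs" is my interpretation. The function name and the toy graph format are also hypothetical.

```python
def biased_pagerank(links, blog_pages, damping=0.85, blog_bias=3.0, iterations=50):
    """links maps each page to the list of pages it links to.

    blog_pages is the set of pages already classified as blogs.
    blog_bias (assumed value) is how much more reset weight a page
    gets when at least one blog links to it.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})

    # Build the reset vector: pages linked to from blogs get extra weight.
    linked_from_blogs = {t for src in blog_pages for t in links.get(src, [])}
    raw = {p: (blog_bias if p in linked_from_blogs else 1.0) for p in pages}
    total = sum(raw.values())
    reset = {p: w / total for p, w in raw.items()}

    rank = dict(reset)
    for _ in range(iterations):
        # Every page first receives its share of the reset probability.
        new_rank = {p: (1 - damping) * reset[p] for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # Split this page's rank evenly among its outgoing links.
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links hands its rank back
                # according to the reset vector.
                for q in pages:
                    new_rank[q] += damping * rank[p] * reset[q]
        rank = new_rank
    return rank
```

Setting blog_bias to 1.0 gives a uniform reset vector, which makes it easy to compare the biased scores against plain PageRank on the same graph.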
The problem that Microsoft set out to fix with this algorithm is the one that many SEOs rely on, namely, getting high PR links for the sole purpose of manipulating search results.
Microsoft states that most blogs are run by humans, thus most links from blogs are editorial and can be trusted more than regular links. This is not correct, since many blogs are auto-generated and there’s a small industry called "pay-per-post." No doubt many bloggers are genuine in nature, but many are not. Link buying is still a part of the economy that drives the SEO industry, and blog posts offer a perfect opportunity to camouflage bought links as natural.
SEO by the Sea reports that the authors of this patent performed experiments with 472 million pages and found the results to be cleaner than with PageRank alone. The authors also state they may put weight on blog subscribers.
Google has also modified PR with Hilltop, LocalRank, TrustRank, Topic Sensitive Trust Rank and other measures to combat the problem Microsoft is facing. The PageRank system can be easily manipulated with links from high PR pages, thus more emphasis must go towards other indicators.
Microsoft’s Vision Based Document Segmentation
On September 23, 2008 Microsoft was granted a patent titled "Vision-based document segmentation." The patent was filed on July 28, 2003, and listed Ji-Rong Wen, Shipeng Yu, Deng Cai, and Wei-Ying Ma as the inventors (it was assigned to Microsoft, of course). It was awarded US patent number 7,428,700.
There are many web pages that contain useful but unrelated information all together on one page. This information may be featured in blocks of text, located on different parts of the page (top/bottom/left/right) or presented in other forms. Microsoft aims to differentiate and then possibly rank content by identifying it as unrelated. It uses the following cues:
Font size and font types
Color of fonts
Other unique identifiers
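As a rough illustration of how such cues could separate content, the sketch below groups consecutive pieces of text that share the same font size, font family and color into one block. It is a deliberate simplification and not the patented algorithm, which is far more involved; the TextNode structure and the segment_by_style function are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class TextNode:
    """One piece of rendered text with its visual style (hypothetical structure)."""
    text: str
    font_size: int
    font_family: str
    color: str


def segment_by_style(nodes):
    """Group consecutive nodes that share font size, font family and color.

    Returns a list of blocks, each block being a list of text strings
    in their original page order.
    """
    def style(node):
        return (node.font_size, node.font_family, node.color)

    return [[n.text for n in group] for _, group in groupby(nodes, key=style)]
```

Fed a navigation bar, a headline and a footer that each use a different style, this would return three separate blocks, which could then be judged and ranked independently.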
The patent does not mention analysis of CSS, which is the dominant styling language. Many websites formatted with CSS have the simplest layouts in pure HTML, which makes it impossible to identify background information by its font size and position on the page from the markup alone (other than by its being contained within div tags). I do not know if Microsoft has technology that can take into account style sheet information.
Search engines are stylesheet blind and view pages in their simplest form. You can get a sense of how search engines view pages by turning off CSS support in your browser.
You can download the Vision-based document segmentation white paper for more information on this algorithm. It is full of mathematical equations, so it is best suited to algebra whizzes.
Microsoft hopes to catch up with big G by patching holes in Live Search. But it faces a time hurdle; it entered the search game five years later than Google (Google 1998, MS 2003). Live Search is going through stages that are similar to the ones that Google went through. For example, Webmaster World forum members reported a large shakedown of search results that brought completely irrelevant spam pages to the surface. It was compared to Google’s results a few years ago. Partway down the first page of this thread, one poster commented:
The results are similar to Google but back into year 2003! before the Florida updated.
Live Search is going through development stages that are similar to Google’s, but without positive widespread press coverage. Instead, Microsoft is getting a lot of bad-mouthing from all sides (including SEO Chat and other search engine focused publications).
There’s another barrier for Microsoft.
When Google started, instead of emphasizing the commercial aspects, it focused on the technology, which proved to be a winning approach. I believe that’s something Microsoft misses. In Ballmer’s (CEO) speech to the troops he mentioned something about "winning more search advertisers" (paraphrasing).
That is exactly why Microsoft is having a hard time. They view search as a cash cow, not as a useful application that makes people’s lives easier. While Google spoils users, Microsoft buys distribution deals with tech giants. At one point Microsoft bought corporate home pages in thousands of corporations in hopes of getting more usage. Instead of spending dough on "bribery," Microsoft could hire 200 more people and task them with "creation of killer features that make users come back." That I believe is their biggest problem. Google is better; why should I use Live Search? Name me one good reason.
Until Microsoft figures out how to lure users (or until it buys Yahoo), its share of the search engine market will remain low.