Microsoft's Live Search Patents and Algorithms Related to Blogs - How Microsoft's patented method identifies blogs
The web host of the page: if the site is hosted on Blogger.com, Blogdrive, Blogstyles, LiveJournal Blogs, Myspace, Typepad, Yahoo 360, Wunderblogs or another blog hosting domain, the page probably belongs to a blog.
Words and phrases common on blogs, such as permalink, comment, subscribe, archives, blogroll, responds, related posts, powered by, trackback, posted at and posted on, along with their equivalents in other languages, indicate that the page probably belongs to a blog.
Another sign is a URL path that identifies the page as part of a blog, such as www.site.com/blog.
Outgoing links that point to wordpress.com, drupal.org, blogger.com and other blog platforms are also indicative.
Feeds such as RSS or Atom raise the likelihood that the page belongs to a blog.
The patented method uses other predetermined models (which might involve training by a human) to identify pages as blogs.
The method analyzes internal link patterns to spot structures commonly used by blogs.
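The page-level signals listed above can be sketched in a few lines of Python. This is only an illustration, not the patent's implementation: the function name blog_signals, the short host and term lists (taken from the examples in this article), and the crude regular expression used to pull out links are all assumptions.

```python
import re
from urllib.parse import urlparse

# Hypothetical, abbreviated lists based on the examples named in the article.
BLOG_HOSTS = ("blogger.com", "livejournal.com", "typepad.com")
BLOG_PLATFORMS = ("wordpress.com", "drupal.org", "blogger.com")
BLOG_TERMS = ("permalink", "trackback", "blogroll", "posted on",
              "posted at", "powered by", "archives", "subscribe")


def host_matches(host, domains):
    """True if host equals one of the domains or is a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in domains)


def blog_signals(url, html):
    """Collect simple blog indicators for one page (illustrative only)."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    segments = parsed.path.lower().split("/")
    text = html.lower()

    # Crude href extraction; a real crawler would use an HTML parser.
    hrefs = re.findall(r'href=["\']([^"\']+)["\']', html, flags=re.IGNORECASE)
    out_hosts = [(urlparse(h).hostname or "").lower() for h in hrefs]

    return {
        "blog_host": host_matches(host, BLOG_HOSTS),
        "blog_in_url": "blog" in segments,
        "blog_terms": sum(term in text for term in BLOG_TERMS),
        "platform_links": any(host_matches(h, BLOG_PLATFORMS) for h in out_hosts),
        "has_feed": "application/rss+xml" in text or "application/atom+xml" in text,
    }


if __name__ == "__main__":
    page = (
        '<html><head><link rel="alternate" type="application/rss+xml" '
        'href="/feed"></head><body>'
        '<a href="http://wordpress.com/">Powered by WordPress</a> '
        '<a href="/2008/01/post">Permalink</a> Posted on Jan 1'
        '</body></html>'
    )
    print(blog_signals("http://www.site.com/blog/2008/01/post", page))
```

Each value is deliberately kept as a separate signal rather than a verdict, since no single indicator proves that a page is a blog.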
Microsoft may also place a lot of emphasis on links and link patterns to identify blogs: "parsing HTML documents to find the content-based features is relatively expensive in terms of processing and time, while link-based information does not require parsing of documents."
Once Microsoft spiders a web page and puts it through this algorithm, it assigns a score based on the factors listed above. If a web page reaches a specific score, it goes through more analysis to be classified as a blog or non-blog page.
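To illustrate why link-based information is cheap, here is a minimal sketch that works only from a list of outgoing URLs (as a crawler's link graph might already store them) and never touches the page's HTML. The function name link_features and the date-based path pattern are assumptions made for this example; the patent's actual link features are not spelled out in this article.

```python
import re
from urllib.parse import urlparse

# Assumed pattern: date-based paths such as /2008/01/ or /2008/01/some-post,
# a layout many blogs use for permalinks and archives. Not from the patent.
DATE_PATH = re.compile(r"/(19|20)\d{2}/(0[1-9]|1[0-2])(/|$)")


def link_features(page_url, out_links):
    """Summarize a page's outgoing links without parsing any document."""
    page_host = urlparse(page_url).hostname
    internal = [u for u in out_links if urlparse(u).hostname == page_host]
    dated = [u for u in internal if DATE_PATH.search(urlparse(u).path)]
    return {
        "internal_links": len(internal),
        "external_links": len(out_links) - len(internal),
        # Share of same-site links that look like dated posts or archives.
        "dated_internal_ratio": len(dated) / len(internal) if internal else 0.0,
    }


if __name__ == "__main__":
    links = [
        "http://site.com/2008/01/first-post",
        "http://site.com/2008/02/",
        "http://site.com/about",
        "http://other.com/2008/01/x",
    ]
    print(link_features("http://site.com/", links))
```

A high share of dated internal links is only a hint, but it can be computed for millions of pages from the link graph alone.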
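The score-then-analyze flow might look something like the sketch below. The weights, the threshold and the signal names are invented for illustration, since the article gives no numbers; only the two-stage idea (a cheap score first, deeper analysis for pages that reach the cutoff) comes from the description above.

```python
# Hypothetical weights and threshold; the article does not give real values.
WEIGHTS = {
    "blog_host": 3.0,
    "blog_in_url": 2.0,
    "has_feed": 1.0,
    "platform_links": 1.0,
    "blog_terms": 0.5,  # applied per matching term
}
THRESHOLD = 3.0


def score(signals):
    """Weighted sum of the signals; booleans count as 0 or 1."""
    return sum(WEIGHTS[name] * float(signals.get(name, 0)) for name in WEIGHTS)


def triage(signals):
    """Decide whether a page deserves the more expensive second stage."""
    return "analyze_further" if score(signals) >= THRESHOLD else "skip"


if __name__ == "__main__":
    likely_blog = {"blog_in_url": True, "blog_terms": 3,
                   "platform_links": True, "has_feed": True}
    feed_only = {"has_feed": True}
    print(score(likely_blog), triage(likely_blog))
    print(score(feed_only), triage(feed_only))
```

Keeping the first stage this simple means the costly blog or non-blog classification runs only on pages that already look promising.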
The patent also makes a controversial statement: "Search engines are increasingly implementing features that restrict the results for queries to be from blog pages."
Jakob Nielsen of useit.com may explain the reasoning behind this statement in his article "Write Articles, Not Blog Postings." He argues that short summary news articles have little long-term value.
If a searcher wants to learn about past news but only gets a short blog post that talks about specifics without a summary, an explanation of events and the other content expected of a detailed news article, then the search experience suffers. On the other hand, a detailed article that takes the reader by the hand from A to Z offers much more long-term value (assuming it's linked to). Therefore, news articles are better than "news blog posts."
This is a good argument in support of Microsoft's statement, but what about the thousands of useful blog posts that don't focus on news? I refer here to blogs like Copyblogger, Seobook, Wolfhowl, seomoz, grokdotcom and many others in different industries. Ignoring their content just because they are blogs doesn't make sense. User experience will deteriorate if Microsoft ignores them.
More By Ivan Strouchliak