Topic-Sensitive PageRank

There has been speculation in the search engine optimization community about the potential implementation of Topic-Sensitive PageRank by Google. Topic-Sensitive PageRank (TSPR) is considered by some to be a further refinement of Google’s well-known PageRank system.

The idea of TSPR originated with Taher H. Haveliwala of the Stanford University Department of Computer Science. He first introduced the concept in a paper entitled Topic-Sensitive PageRank, published in the Proceedings of the Eleventh International World Wide Web Conference in 2002. The paper, in its entirety, can be found here.

Keep in mind that the paper presented at the conference described only one method of implementation. The calculation actually used might be quite different. What is important is the overall concept of TSPR, and its implications for the search engine optimization community.

How is Topic-Sensitive PageRank different from PageRank?

To fully understand the implications of Topic-Sensitive PageRank, we need to first briefly examine the current Google PageRank (PR) system.

PageRank (spelled as one word) is a Google-trademarked technology. It was designed as a numerical system for ranking the relative importance of web pages, and was created at Stanford University in California by Google founders Larry Page and Sergey Brin.

The concept they used was, in Google’s own words, to calculate the “uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page’s value.”

If Google’s definition is taken literally, the entire system rests on the incoming and outgoing links among the billions of pages that form the Web. On the surface, the system seems simple enough. If web page A links to web page B, Google treats web page A as casting a vote for the importance of web page B.

While the system of PageRank is far more complicated than that simple definition, it serves our purposes for this article.
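
To make the voting idea concrete, here is a minimal sketch of the originally published PageRank iteration in Python. The function name, the tiny four-page graph, and the parameter defaults are assumptions chosen for illustration; this is not Google's production calculation.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Basic PageRank by repeated voting.

    `links` maps each page to the list of pages it links to.
    Illustrative sketch only; names and defaults are assumptions.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal importance

    for _ in range(iterations):
        # Every page gets a small baseline, independent of its links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its vote over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical graph: A, C and D all link to (vote for) B; B links to C.
links = {"A": ["B"], "B": ["C"], "C": ["B"], "D": ["B"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph, page B collects the most votes and ends up with the highest score, while A and D, which receive no links at all, keep only the baseline.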

Topic-Sensitive PageRank is a theoretical attempt to make the PageRank system more accurate. Instead of a single PR ranking, the TSPR calculation would create several different PageRanks for each page, one for each topic. The topics used for the tabulations would be representative of the theme of each specific webpage.

The idea would be to use a number of pre-calculated, topic-biased vectors to create a number of PageRanks for each web document. In that sense, the PR would be theme sensitive.

Instead of the PageRank being based equally on all incoming links, the Topic-Sensitive PageRank would be heavily weighted toward links related to the page’s main subject area. Links from sites that are not on the same topic as the specific webpage would be assigned much less weight in the calculation.
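
A topic-biased calculation can be sketched by changing one thing in the standard iteration: the baseline score is handed only to pages known to belong to the topic, rather than to every page equally. The sketch below assumes a hand-made six-page graph and two hypothetical topic lists ("pets" and "travel"); the function name and data are illustrative and are not taken from the paper or from Google.

```python
def topic_pagerank(links, topic_pages, damping=0.85, iterations=100):
    """PageRank with the baseline biased toward a set of on-topic pages.

    `links` maps each page to the pages it links to; `topic_pages` is the
    list of pages known to belong to the topic. Illustrative sketch only.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    seeds = [p for p in pages if p in set(topic_pages)]
    if not seeds:
        raise ValueError("topic_pages must contain at least one page in the graph")

    # The topic-biased vector: only on-topic pages receive the baseline.
    bias = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    rank = dict(bias)

    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) * bias[p] for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Pages without outgoing links return their score to the topic pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] * bias[p]
        rank = new_rank
    return rank


# Hypothetical web: three cat pages, three travel pages, one cross-topic link.
links = {
    "cat_food": ["cat_care"],
    "cat_toys": ["cat_care"],
    "cat_care": ["cat_food"],
    "flights": ["hotels"],
    "hotels": ["flights"],
    "travel_blog": ["hotels", "cat_care"],
}
# Hypothetical directory listings for two topics.
pets_rank = topic_pagerank(links, ["cat_food", "cat_toys"])
travel_rank = topic_pagerank(links, ["flights", "travel_blog"])

print("cat_care under pets:  ", round(pets_rank["cat_care"], 3))
print("cat_care under travel:", round(travel_rank["cat_care"], 3))
```

The same page ends up with a different score under each topic. Here the cat care page scores much higher in the pets calculation than in the travel one, even though the link graph is identical in both runs.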

How does Topic-Sensitive PageRank Work?

The currently used PageRank calculation is conducted once, and is completely independent of the search query. Because the PageRank does not depend on the search term, each page keeps a single, constant PR.

In the Topic-Sensitive PageRank system, heavily linked pages bearing no informational relationship to the search term will be given less weight for that topic. On the other hand, pages receiving only a few incoming links, but from closely related sites, will be given much more consideration for that term. The result will be a higher TSPR for that page, for that specific search query, despite a much lower PR under the current system.

The representative basis topics would be limited in scope, so not every possible query term would get its own topic. Specifically, the topic areas proposed by the TSPR system designers would draw on the Open Directory Project (DMOZ) database. Only a limited number of themes would be used to pre-calculate the TSPRs.

Like the currently used Google PageRank system, the Topic-Sensitive PageRank would also be pre-computed, to save time in the search query processing. Since there are multiple themes to calculate, each page would be scored against multiple topics. Instead of one PR number, each page would have as many numbers as there are themes used in the computation.

At the time of the search query, all of the pre-calculated PR numbers are combined into a composite number for that specific query. Because each topic is weighted differently, depending on how closely it matches the query, the composite PR of a web page could vary widely from search to search.
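
The query-time step can be sketched as a weighted sum: estimate how strongly the query belongs to each topic, then blend the pre-computed per-topic scores using those weights. The sketch below assumes a simple word-count model with a uniform prior, in the spirit of the approach described in the paper; the word counts, the per-topic scores, and the function names are all made up for illustration.

```python
def topic_weights(query, topic_terms):
    """Estimate how likely each topic is, given the query words.

    Uses per-topic word counts with add-one smoothing and a uniform prior.
    Illustrative assumption, not a confirmed formula.
    """
    scores = {}
    for topic, counts in topic_terms.items():
        denominator = sum(counts.values()) + len(counts) + 1
        likelihood = 1.0
        for word in query.lower().split():
            # Smoothing keeps an unseen word from zeroing out the topic.
            likelihood *= (counts.get(word, 0) + 1) / denominator
        scores[topic] = likelihood
    total = sum(scores.values())
    return {topic: score / total for topic, score in scores.items()}


def composite_score(page, weights, topic_ranks):
    """Blend the pre-computed per-topic scores using the query's topic weights."""
    return sum(weights[t] * topic_ranks[t].get(page, 0.0) for t in weights)


# Hypothetical word counts taken from pages listed under each topic.
topic_terms = {
    "pets": {"cat": 8, "food": 4, "vet": 3},
    "travel": {"flight": 7, "hotel": 6, "food": 2},
}
# Hypothetical pre-computed topic-sensitive scores for two pages.
topic_ranks = {
    "pets": {"cat_care": 0.46, "hotels": 0.0},
    "travel": {"cat_care": 0.11, "hotels": 0.45},
}

for query in ("cat food", "hotel"):
    weights = topic_weights(query, topic_terms)
    for page in ("cat_care", "hotels"):
        print(query, "|", page, round(composite_score(page, weights, topic_ranks), 3))
```

With these made-up numbers, the cat care page outscores the hotels page for the query "cat food", and the order reverses for "hotel". The pre-computed scores never change; only the query's topic weights do.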

The first step in the calculation process is to generate the weighted topic vectors. These would be calculated offline, prior to updates. The exact mathematical formula is very complex, and may not even be the one considered for use. As always, Google won’t be discussing its formula in public. Because a completely different formula from the one presented in the Stanford paper might be used, there is little point in studying it in exact detail.

What is important to know, however, is that the system as proposed in the Stanford paper is heavily based on the categories found in the Open Directory Project (DMOZ). The confidence placed in DMOZ rests on the assumption that the data in that directory is free of bias, because the editors are volunteers. Many observers might question that assumption. Despite the best efforts of the volunteer editors to provide the best possible directory, intentional or unintentional errors can still be made.

The initial bias toward themes, as found in the Open Directory Project, is only the first part of the calculation process. That first part creates the weighted PageRanks. The second part is computed for each individual search engine query.

What is also important to know is that each web page will end up with multiple PageRanks, and the one that applies will depend upon the keywords being searched.

The calculation for each individual search query could be performed in one of two ways. Again, keep in mind that we can’t be certain which way Google would choose to make the tabulation.

The first way, and the example used in the research paper, is to make the calculation a uniform one. All users searching a particular keyword, or combination of keywords, would receive similar results. The system, based on uniformity, would be easier to implement.

The second way would be to make the results individualized to the search engine user. By taking into consideration a user’s prior searches and surfing habits, that person’s query could be personalized. The resulting returns would be based on that user’s individual interests. Such a system would presuppose the use of surfer tracking techniques.
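
One hypothetical way to picture the personalized variant is to let a user's past activity tilt the topic weights before the per-topic scores are combined. The sketch below is an assumption made purely for illustration: the function names, the smoothing value, and the sample history are invented, and nothing here is a confirmed Google technique.

```python
from collections import Counter


def user_topic_prior(history_topics, all_topics, smoothing=1.0):
    """Turn the topics of a user's past clicks into a probability per topic.

    Smoothing keeps topics the user has never visited from dropping to zero.
    """
    counts = Counter(history_topics)
    total = sum(counts[t] for t in all_topics) + smoothing * len(all_topics)
    return {t: (counts[t] + smoothing) / total for t in all_topics}


def personalize(query_weights, prior):
    """Tilt the query's topic weights toward the user's usual interests."""
    combined = {t: query_weights[t] * prior[t] for t in query_weights}
    total = sum(combined.values())
    return {t: value / total for t, value in combined.items()}


# Hypothetical ambiguous query that fits both topics equally well.
query_weights = {"pets": 0.5, "travel": 0.5}
# Hypothetical history: this user mostly visits travel pages.
history = ["travel", "travel", "travel", "pets"]

prior = user_topic_prior(history, ["pets", "travel"])
personal_weights = personalize(query_weights, prior)
print(personal_weights)
```

For this user, the travel topic receives the larger share of the weight, so travel-oriented pages would be favored for the same query that a uniform system would treat as evenly split.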

In the currently used PageRank system, incoming links have a fixed PR value. For example, a PageRank 6 page passes along a share of that PR value to the receiving page, regardless of the themes of the pages. A PR6 page about cats can pass strong PageRank along to a page on travel, whether related to the subject or not.

In a Topic-Sensitive PageRank environment, the same PR6 page about cats would be weighted in many ways, against the various categories used in the Open Directory Project. The amount of TSPR passed along would vary from one receiving web page to another. The travel page would not receive as much PR from the cats page as a page on animal care would.

Incoming links would require an examination of the relationship of the sending page’s theme to that of the receiving page. In other words, the value of an incoming link can vary widely from page to page. A PR6 page unrelated to your page’s theme would carry far less TSPR value than a PR3 page more similar to your theme.

For webmasters seeking reciprocal links, the value of a page would depend heavily on the type of site under consideration. A page on cats would not get very much TSPR value from a link exchange with a travel page.

In order to build a strong PageRank, and become an “authority site,” incoming links would need to come from similarly weighted pages. Finding links from pages in your website’s area of interest becomes very important under the TSPR calculation. Not just any old links would work anymore.

If the current system of Google PageRank were replaced by the Topic-Sensitive PageRank system, a new approach would have to be taken to search engine optimization. There is some speculation that the change may be taking place now, or in the near future. Should that indeed be the case, SEOs would have to learn the new system.

The current PageRank calculation is relatively familiar to search engine optimization specialists. Google has published the original formula used for the calculations, and the theory behind it. While that original equation may have been modified, the overall theory has remained fairly constant. The single PR number is universally recognized and understood.

If a Topic-Sensitive PageRank calculation is fully introduced, the task of optimizing PageRank will become more difficult. Simply adding incoming links, regardless of source, will no longer be as effective. In fact, it may even be counterproductive. Link exchanges will lose their former level of effectiveness.

Webmasters will need to seek out incoming links from similarly themed websites. Constantly updating a website with fresh and relevant content may become even more important than it is at present, in order to attract natural linking.

Searches could become more personally tailored, as the PageRank will be more specific to a given page. On the other hand, the relative importance of PageRank to the overall search engine algorithm would have to be considered. That issue is a matter of debate within the SEO community. If PR is given only a small share of importance in the algorithm, then TSPR won’t make a large difference to the results.

Without some idea of the formula that might be used in the Topic-Sensitive PageRank calculation, specifics are difficult to ascertain. All that can be discovered are the potential implications of the system.

Whichever way the system is implemented, TSPR would have a profound effect on how PageRank is assigned by Google.
