The Google Freshness Factor

There is a patent application on file with the US Patent and Trademark Office from Monika Henzinger, published in July 2005, that describes a method of determining a document’s “freshness.” In an attempt to associate this new term with Google’s other patented terminology (namely PageRank and TrustRank), forum posters are now referring to the concept as “FreshRank.”

The abstract of this patent application states that one problem with determining the freshness of a document indexed in a search engine is that the “last-modified-since” attribute isn’t always correct. Some webmasters figured out that they could change the modification date at will, and a pattern of abuse developed. This doesn’t fool Google, however, because what Google looks for is actual modified content. Exactly how Google determines how old or “fresh” a document is remains something of a secret. Lately, in the estimation of many, Google has done a very poor job of determining which websites present the freshest content in relation to relevancy.

This brings to mind a pertinent question: how does the freshness factor rank in determining relevancy? Some have concluded that it doesn’t necessarily matter to Google how fresh a document is, especially if that document has many inbound links pointing to it. Henzinger is attempting to patent a more explicit measure of freshness, noting that not all search engines use the “last-modified-since” attribute anyway, and that search engines need a more reliable way of detecting updated content overall.

Unfortunately, with the implementation of the duplicate content penalty, we’ve been seeing problems with the freshness attribute of documents. With Google in particular, the filter employed to whittle out duplicate content doesn’t appear to take into consideration the actual origin of the content. For many, this is becoming a great point of frustration. Given the technological advances that Google has brought into the public realm within the last decade, it seems impractical, almost ridiculous, that it would leave out the very capability of determining the source of fresh content. Yahoo and MSN do not appear to have this particular problem, so why does Google?

{mospagebreak title=Google’s Removal Tool and Duplicate Content}

Another problem recently presented to the freshness factor is Google’s own Removal Tool. Experiences with this tool have largely been unpleasant, when it has been useful at all. For some, the Google removal tool has often been mentioned “as a cure against many diseases”: diseases such as duplicate content or temporary redirects, for example. While I have used it from time to time, I have done so with caution, and never on a commercial website; only on my personal website or blog. Some of the side effects observed are definitely worth mentioning here, and I know I’m not alone.

If you’ve ever used the removal tool, you’ll notice that the page count of the website in question does not change; the pages simply fail to show up. Why is this? Because the URL Removal Tool does not delete these URLs; it only filters them out. So even though these pages appear to have been removed, they are certainly still in the database somewhere.

The period of time Google takes to remove these URLs from its index is anywhere between three and six months. I say three to six months even though Google’s documentation tells us 180 days; in my personal experience, it has been more like 90 days. Regardless of the time period, rest assured, they are actually still there. How do I know this? Two reasons: one, as I mentioned before, the number of pages is still listed at the same amount as before the pages were removed; two, after the removal period, they show right back up in the index, as if they’d never left.

Further, consider a site on which a set of pages was tagged with <meta name="robots" content="noindex,follow">. After natural crawl cycles, those pages were phased out by the spider because of the robots meta tag, yet it was shown that they still held their PageRank, displaying a PageRank of 3 or 4. Pages removed with the Removal Tool, by contrast, show absolutely no PageRank. So although none of these pages are indexed, the pages tagged <meta name="robots" content="noindex,follow"> still carry PR and are probably capable of transferring PR to other pages, where the tool-removed pages are not.

Consider, also, that links from removed pages are actually dead links. On the example site mentioned above, there are rows of pages linked as page1->page2->page3->… After page2 was removed with the removal tool, page3 and all subsequent pages were no longer spidered.
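The crawl-and-rank effect described above can be pictured with a toy model. The sketch below runs the textbook PageRank iteration over a tiny link graph; it is purely illustrative (the page names, graph, and iteration count are my own assumptions, not Google's implementation), but it shows how removing page2 leaves page3 with nothing beyond the baseline teleport score.

```python
# Toy PageRank over an adjacency dict (node -> list of outbound links).
# Textbook power iteration; NOT Google's actual implementation.
def pagerank(graph, iters=50, d=0.85):
    n = len(graph)
    pr = {page: 1.0 / n for page in graph}
    for _ in range(iters):
        # Each page gets the teleport share plus damped rank from its in-links.
        pr = {
            page: (1 - d) / n
            + d * sum(pr[src] / len(out) for src, out in graph.items() if page in out)
            for page in graph
        }
    return pr

# Intact chain: page1 -> page2 -> page3.
chain = {"page1": ["page2"], "page2": ["page3"], "page3": []}

# After the removal tool kills page2, page1's link is dead and
# page3 has no inbound links at all.
broken = {"page1": [], "page3": []}
```

With the chain intact, `pagerank(chain)["page3"]` sits above the teleport floor of `(1 - d) / n`; in the broken graph, page3 receives exactly that floor and nothing more, mirroring the "page3 was no longer spidered" observation.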

{mospagebreak title=D’oh! Google Bungles the Content}

Well, this is all very interesting, you say, but what does all of this have to do with the Freshness Factor? That’s a good question, and one I’ll answer now. Consider those pages that have been filtered out, but then reappear after the removal waiting period with obsolete, stale content. If the webmaster felt that the pages needed to be removed in the first place, chances are good that the content isn’t very fresh. Throw in the dead links, and you truly have stale content.

The problem here is that even though these are obviously dead pages, with dead navigation, they still show up in the index, and many times they rank higher than pages with current and fresh content. This strikes me as a major problem. With outdated content mixed in with what could be considered fresh and relevant results, is the Google Freshness Factor disappearing? Add to these pages the content that has been whittled out by the duplicate content filter, where the results that remain are not even the original source of the content, and surely we aren’t imagining things here?

But consider this: some websites don’t need to be updated. Does this mean they aren’t fresh? Not likely. Think for a moment about that government bill that was put into action 10 years ago. No new fresh content there. What about the scientific formula for penicillin? I don’t think that changes much. And what about the historical account of the Trojan War? Pretty much stays the same. Most businesses adopt terms and conditions or a privacy policy that is designed to stay the same.

There is no mention in the patent application of gauging whether the page continues to be a currently relevant citation, even if it is not all that important whether the page itself updates regularly. Google even cites the freshness of a website’s content as a problem in one of its own recent patent applications. What does freshness actually mean? Is it really a question of a document’s freshness, or is it more about its relevancy?

{mospagebreak title=What’s Google Up To Here?}

So how does Google determine the freshness of a document in the first place? Let’s look at it, shall we? We’ve already pointed out that the “last-modified-since” attribute can be fudged, which means it cannot, on its own merit, be used to accurately gauge the freshness of a document.
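The fudging point can be made concrete: a crawler that trusts the header alone is easily misled, but pairing the header with a hash of the page body is not. The sketch below is my own illustration of that idea, not Google's documented method; the function names and the SHA-256 fingerprint are assumptions for the example.

```python
import hashlib
from email.utils import parsedate_to_datetime

def content_fingerprint(html: str) -> str:
    """Hash the page body so a bumped Last-Modified date with
    identical content can be detected."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

def is_really_modified(old_fingerprint: str, new_html: str,
                       old_header: str, new_header: str) -> bool:
    """Trust the Last-Modified header only when the content hash
    actually changed as well."""
    header_changed = (parsedate_to_datetime(new_header)
                      > parsedate_to_datetime(old_header))
    content_changed = content_fingerprint(new_html) != old_fingerprint
    return header_changed and content_changed
```

A webmaster who bumps the date but leaves the HTML untouched produces a newer header with an identical fingerprint, so `is_really_modified` returns `False`, which matches the article's point that Google looks for actual modified content.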

Document freshness can be defined as a combination of elements, such as:

  • The frequency of changes to a web page (last-modified-since)

  • The actual amount of change to the page itself; whether it is a structural change or a simple but irrelevant one

  • Changes in keyword distribution or density

  • The actual number of new inbound links

  • The change or update of anchor text

  • The number of other pages in the database that relate to the same keywords

  • The amount of duplicate content out there

  • The number of new links to low-trust websites (for example, a domain may be considered low trust for having too many outbound links on one web page, or for linking to link farms or free-for-all pages)
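One way to picture how such signals might be combined is a simple weighted blend. Everything below is hypothetical: the signal names, the weights, and the idea that the score is a linear sum are my own illustration, since Google has never published its formula.

```python
# Illustrative weights only; Google's real weighting is not public.
# Each signal is assumed to be normalized to the range 0.0 - 1.0.
WEIGHTS = {
    "change_frequency":   0.25,   # how often the page changes
    "change_magnitude":   0.20,   # structural vs. trivial edits
    "keyword_shift":      0.10,   # keyword distribution/density changes
    "new_inbound_links":  0.20,   # fresh inbound links
    "anchor_text_updates": 0.10,  # updated anchor text
    "duplicate_penalty":  -0.10,  # duplicate content lowers the score
    "low_trust_links":    -0.05,  # links to low-trust sites lower it too
}

def freshness_score(signals: dict) -> float:
    """Hypothetical weighted blend of the freshness signals listed above.
    Missing signals default to 0.0."""
    return sum(w * signals.get(name, 0.0) for name, w in WEIGHTS.items())
```

Note that two of the weights are negative: in this sketch, duplication and low-trust linking actively subtract from freshness rather than merely failing to add to it.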

There could also be many other factors involved, and it’s not always beneficial or advisable to change the content of your web pages regularly. It is very important, however, to keep your pages fresh on a regular basis, and that may not necessarily mean a content change. In fact, if you change your content too drastically, or change too many pages at once, you could be subject to the sandbox phenomenon that affects new sites in the index.

{mospagebreak title=What Happens When Documents Are Too New}

In a section of one of their patent filings, Google states, “A significant change over time in the set of topics associated with a document may indicate that the document has changed owners and previous document indicators, such as score, anchor text, etc., are no longer reliable.

“Similarly, a spike in the number of topics could indicate spam. For example, if a particular document is associated with a set of one or more topics over what may be considered a ‘stable’ period of time and then a (sudden) spike occurs in the number of topics associated with the document, this may be an indication that the document has been taken over as a ‘doorway’ document.

“Another indication may include the sudden disappearance of the original topics associated with the document. If one or more of these situations are detected, then [Google] may reduce the relative score of such documents and/or the links, anchor text, or other data associated [with] the document.”

I think what Google is attempting to establish here is reliability and trustworthiness. While freshness may play a part in it, there is far more at stake here than simply how recent the content is, or even where the content originated. In my opinion, it’s not really about freshness; it’s always been about inbound links.
