What in the WORLD is going on at Google? Since early last year, there have been widespread reports of Google dropping its cache of current pages while keeping pages two or more years old that no longer even exist on the site. There have also been widespread reports of Google indexing some pages on a site but ignoring others, with no clear reason for the spider's behavior. What's the deal?
In this article I'll shed some light on what happened, why it happened, what you can do about it … and take a look at what I believe will be the most significant update in Google history coming up early this year.
In January of last year (2006) Google went through "The Big Daddy" update.
Unfortunately, since then things at Google have been … unstable, for lack of a better way of putting it. The reason is simple: Google's servers ran out of space.
I know that sounds crazy, even bizarre, but it's true, and it was admitted by none other than Google's CEO in April (the full story is here). His exact words were, "Those machines are full. We have a huge machine crisis."
For the CEO of a search engine company to admit that his servers are so full that they've got a crisis is huge. If history is any indicator at all, he was probably UNDERSTATING the true extent of the problem.
This raises the question: what did Google do about it? Obviously they didn't just let their servers fill up until they crashed; we know that didn't happen. So what did they do to at least hide the problem from search users?
They started by making changes to the spider. The spider would no longer even attempt to index every page of a site. Instead, it would index only "entry pages": pages that could be reached from another source (links from other sites) or that had a "high likelihood" of being clicked on if they came up in a search (how that was determined I don't know).
By drastically reducing the number of pages for which the spider sent indexing data back to the Google servers, they drastically cut the rate of growth of their index database.
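To make the idea concrete, here is a rough sketch of what an "entry page" filter like the one described above might look like. This is purely my illustration, not Google's code: the `Page` fields, the function names, and the 0.5 click threshold are all made-up assumptions, since nobody outside Google knows how the "high likelihood" of a click was actually scored.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    external_inbound_links: int  # hypothetical: count of links from other sites
    click_likelihood: float      # hypothetical score from 0.0 to 1.0


def is_entry_page(page: Page, click_threshold: float = 0.5) -> bool:
    """Treat a page as an 'entry page' if other sites link to it,
    or if its (assumed) click-likelihood score clears the threshold."""
    return page.external_inbound_links > 0 or page.click_likelihood >= click_threshold


def select_pages_to_index(pages: list[Page], click_threshold: float = 0.5) -> list[Page]:
    """Keep only the pages the spider would report back for indexing."""
    return [p for p in pages if is_entry_page(p, click_threshold)]


# Illustrative site: the numbers are invented for the example.
site = [
    Page("/", external_inbound_links=12, click_likelihood=0.9),
    Page("/products", external_inbound_links=0, click_likelihood=0.7),
    Page("/archive/2004/page-37", external_inbound_links=0, click_likelihood=0.05),
]

indexed = select_pages_to_index(site)
print([p.url for p in indexed])  # ['/', '/products']
```

Notice that the deep archive page never makes it into the index at all. Under a rule like this, a site with lots of interior pages and few inbound links would see only a fraction of its pages indexed, which matches the reports.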
The problem, however, is that I have reason to believe those changes had some rather significant bugs. This was then compounded by an application that the Google engineers wrote to go through the database of cached pages and remove "no longer needed cached images."
Unfortunately, it would appear that the application had some rather severe bugs that caused current and useful pages to be dropped from the cache while some older and non-useful pages were kept.