Google's Latest Moves in Information Indexing (Page 1 of 4)
Sometimes Google does something with very little fanfare that stirs considerable interest. In this article, I’m going to discuss several of their recent moves. If you’re curious about their attempts to index more of the web or make their indexing more useful for searchers, keep reading; you’ve come to the right place.
SEOs have long known that HTML forms can be problematic. Any content that a user can reach only by filling out a form will trip up search engine spiders and remain unindexed. That's perfectly fine if that's what you want to happen. Not all online content is for sharing. If your content is valuable enough that subscribers pay good money for it, as happens with certain medical and legal indexes, you may not want general search engines rooting around in your index and turning it up free for the asking.
Google wants to change that. In a recent post to the Google Webmaster Central Blog, the search engine revealed that "we have been exploring some HTML forms to try to discover new web pages and URLs that we otherwise couldn't find and index for users who search on Google." They make certain automated entries into the form based in part on content from the site, and "If we ascertain that the web page resulting from our query is valid, interesting, and includes content not in our index, we may include it in our index much as we would include any other web page."
Googlebot's new abilities stem from Google's purchase of Transformic in 2005. Transformic was working on exactly this problem. Anand Rajaraman, writing for Datawocky, mentioned working with one of Transformic's major researchers (Alon Halevy, who also made the recent Google blog post) back in 1995. He noted that Transformic was attempting to solve two problems with its technology. First, it needed to determine which web forms were worth penetrating. Then, "If we decide to crawl behind a form, how do we fill in values in the form to get at the data behind it?" Rajaraman asked. Check boxes and radio buttons were no big deal, but with "free-text inputs, the problem is quite challenging - we need to understand the semantics of the input box to guess possible valid inputs."
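To make the "easy" half of that problem concrete, here is a minimal sketch of how a crawler might enumerate the finite choices in a form and turn them into candidate URLs. This is not Google's or Transformic's code. The names `FormChoiceParser` and `candidate_urls` are hypothetical, the sketch assumes a single GET form per page as a simplification, and it only looks at drop-down menus and radio buttons. It skips check boxes and free-text inputs entirely, the latter being the part Rajaraman describes as hard.

```python
from html.parser import HTMLParser
from itertools import product
from urllib.parse import urlencode, urljoin


class FormChoiceParser(HTMLParser):
    """Hypothetical helper: collect select options and radio values from a form."""

    def __init__(self):
        super().__init__()
        self.action = ""
        self.method = "get"
        self.choices = {}     # field name -> list of allowed values
        self._select = None   # name of the <select> currently open, if any

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form":
            self.action = a.get("action") or ""
            self.method = (a.get("method") or "get").lower()
        elif tag == "select" and a.get("name"):
            self._select = a["name"]
            self.choices.setdefault(self._select, [])
        elif tag == "option" and self._select and a.get("value"):
            # Options with an empty value are usually placeholders; skip them.
            self.choices[self._select].append(a["value"])
        elif tag == "input" and a.get("name"):
            if (a.get("type") or "").lower() == "radio" and a.get("value"):
                self.choices.setdefault(a["name"], []).append(a["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None


def candidate_urls(page_url, html, limit=50):
    """Return up to `limit` URLs, one per combination of finite form choices."""
    parser = FormChoiceParser()
    parser.feed(html)
    # Assumption: only GET forms are tried, since submitting one is just a URL fetch.
    if parser.method != "get":
        return []
    names = sorted(n for n, values in parser.choices.items() if values)
    if not names:
        return []
    base = urljoin(page_url, parser.action)
    urls = []
    for combo in product(*(parser.choices[n] for n in names)):
        urls.append(base + "?" + urlencode(list(zip(names, combo))))
        if len(urls) >= limit:
            break
    return urls
```

The `limit` parameter is there because the number of combinations grows multiplicatively with each added field, so a real crawler would need some cap or sampling strategy. Deciding whether the resulting pages are worth indexing is a separate problem that this sketch does not attempt.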
This latest move is Google's way of crawling what has often been referred to as the Hidden, Deep, or Invisible Web. Google insists that it will continue to respect robots.txt files. But the move is not without its problems, and a number of observers have expressed concerns. I'll be covering those issues in the next section.
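On the robots.txt point, it may help to see what "respecting" the file means for form-generated URLs. The sketch below uses Python's standard `urllib.robotparser` module to filter a list of candidate URLs against a site's rules. The function name `allowed_urls`, the `ExampleBot` user agent, and the sample rules are all assumptions for illustration, and this is not a description of how Googlebot itself is implemented.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt, user_agent, urls):
    """Keep only the URLs that the given robots.txt text permits for user_agent."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return [u for u in urls if rules.can_fetch(user_agent, u)]


# Hypothetical rules: keep every crawler out of the members-only area.
robots_txt = "User-agent: *\nDisallow: /members/\n"
candidates = [
    "http://example.com/search?state=CA",
    "http://example.com/members/search?state=CA",
]
print(allowed_urls(robots_txt, "ExampleBot", candidates))
```

Under these sample rules, a form whose results live under `/members/` would be filtered out before any request is made, while the public search form would remain eligible.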
More By Terri Wells