SEOs have known for a long time that HTML forms are potentially problematic. Any content that requires a user to fill out a form to reach it will trip up search engine spiders and remain unindexed. That’s perfectly fine if that’s what you want to happen. Not all online content is for sharing, and if your content is valuable enough that subscribers will pay good money for it, as happens with certain medical and legal indexes, you may not want general search engines rooting around in your index and turning it up free for the asking.
Google wants to change that. In a recent post to the Google Webmaster Central Blog, the search engine revealed that “we have been exploring some HTML forms to try to discover new web pages and URLs that we otherwise couldn’t find and index for users who search on Google.” They make certain automated entries into the form based in part on content from the site, and “If we ascertain that the web page resulting from our query is valid, interesting, and includes content not in our index, we may include it in our index much as we would include any other web page.”
The googlebot’s new abilities stem from Google’s purchase of Transformic in 2005. Transformic was working on exactly this problem. Anand Rajaraman, writing for Datawocky, mentioned working with one of Transformic’s major researchers (Alon Halevy, who also wrote the recent Google blog post) back in 1995. He noted that Transformic was attempting to solve two problems with its technology. First, it needed to determine which web forms were worth penetrating. Second, as Rajaraman put it, “If we decide to crawl behind a form, how do we fill in values in the form to get at the data behind it?” Checkboxes and radio buttons were no big deal, but with “free-text inputs, the problem is quite challenging – we need to understand the semantics of the input box to guess possible valid inputs.”
This latest move is Google’s way of crawling what has often been referred to as the Hidden, Deep, or Invisible Web. Google insists that it will continue to respect robots.txt files. But the move is not without its problems, and a number of observers have expressed concerns. I’ll be covering those issues in the next section.
Several people commenting on the Google blog post itself pointed out a variety of issues. One big one is that webmasters may not be prepared for the googlebot’s changed behavior. SuperJason noted that “I may have to exclude googlebot from my error list. I’m guessing it’s pretty easy for it to put in some bogus content.” He also wondered whether Google didn’t already have enough to index.
Another poster noted that it wouldn’t entirely solve the problem of missing a lot of content, since so much of it is now being published using Ajax programming, which Google can’t crawl yet. Olaf Lederer, yet another commenter, said that he feared the opposite problem: Google would find tons of “duplicate content” that isn’t really duplicated, but merely appears to be because of this technique. That’s a major concern, since Google penalizes duplicate content.
But Olly raised one of the strongest objections to the new crawling technique: “Inserting random text into form fields is just plain wrong. That’s what spammers do.” Since many webmasters block that kind of behavior for just that reason, some might end up blocking the googlebot altogether, which will hamper Google in its continuing mission to completely index the world’s knowledge.
Danny Sullivan, however, lauded Google’s new technology. “The move is potentially good for searchers, in that it will open up material often referred to [as] being part of the ‘deep web’ or ‘invisible web’ as it was hidden behind forms…It should be noted that Google’s not the first to do something like this. Companies like Quigo, BrightPlanet and WhizBang Labs were doing this type of work years ago.” Google is the first major search engine to do this kind of exploration, though.
Webmasters need to be aware that the googlebot just might explore past HTML forms, and act accordingly. If you have content that you do not want indexed, you need to take the appropriate measures to specifically block it; you can’t use the shortcut of putting it behind an HTML form. And you should probably watch your logs a little more closely for unusual activity if this concerns you.
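Since Google says it will continue to honor robots.txt, that file remains the most direct way to keep the crawler away from your forms and the content behind them. Here’s a minimal sketch; the paths are hypothetical, so substitute the directories your own forms post to:

```
# Hypothetical robots.txt sketch: keep compliant crawlers away from
# a form-handling script and from the content it gates.
User-agent: *
Disallow: /cgi-bin/search-form/
Disallow: /subscribers-only/
```

Note that robots.txt only stops well-behaved crawlers; genuinely sensitive content still needs real access control, such as a login.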
In this section I’m going to mention something that’s raised a bit of controversy among some content providers and online retailers, and justifiably so. You may have heard it referred to as Google’s second search box. You may or may not have actually seen it yet, though, since in my experience it comes up for relatively few searches. Here’s a screen shot of it; I searched for Wikipedia to turn it up:
You can use the second search box to perform a search limited to the site. I tested it with several other content providers; the second search box also came up for The Economist and The New York Times, but not for The Wall Street Journal. The logic behind this is that someone searching for a major, well-known content provider (or retailer, since it also comes up for Best Buy) will want to conduct a keyword search on that site. So Google makes it easier to do that by turning up a search box right in the results. There is a command you can use on Google to limit a search to a particular site, but this is more user-friendly; you don’t need a black belt in Google Fu to use this.
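For reference, the command I alluded to is Google’s `site:` operator; typing a query like the one below into the regular search box restricts results to a single domain, which is roughly what the second search box does for you automatically:

```
site:nytimes.com job hunting
```

The second search box just spares users from having to know this syntax.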
So why are site owners of various stripes unhappy about this? I’ll use a search on The New York Times as an example. I put “job hunting” into the second search box. In addition to links to articles in the NYT on job hunting, Google displayed relevant text ads on the right hand side. These ads led to job hunting sites that compete with the Times’ own classified listings. This could lead, at least indirectly, to fewer people using the Times for job hunting, fewer employers advertising with the Times, and a smaller bottom line for the company.
Does that sound like a stretch? Then let me show you something that’s a little less of a stretch. As I mentioned parenthetically, this second search box also turns up for Best Buy. Say I’m searching for Best Buy because I’m in the market for a cell phone. When I put “cell phone” into the second search box, here’s the list of sponsored links that show up to the right of the main results:
Every single one of these companies is competing with Best Buy for the money I’m going to spend on a cell phone. The second and fourth listings seem to be accurately targeted by geography, referring to Miami-Ft. Lauderdale and Florida. And the last one in this screen shot even shows a Google Checkout button, to make things easier for me. Someone feeling truly cynical – or paranoid – might say that Google is helping their advertisers steal customers away from where they originally intended to go, and Google Checkout is providing the search engine with a little extra financial incentive to do so.
Our company CTO told me about this one. It seems to be a lot less controversial than the other topics I’ve discussed in this article. It could certainly help students with their history homework, to say nothing of re-enactors and other serious amateur historians. It’s a new command, “view:timeline,” that you can add to the end of a query for a historical perspective.
It looks as if this is actually still in Google Labs. It seems to be one of the different ways that Google sorts and arranges data in order to give you a deeper perspective. Here’s a screen shot:
Near the top you can see options for several different views (list view, info view, timeline view, and map view). It’s the timeline view that we’re focusing on. As you can see in the image, you have a timeline with sections you can click on, and a search box that lets you set a filter to a particular time period. You can use the arrows if you want to move back or forward in time from the default.
For example, say I want to find out more about the US in the early 1800s. I click on that link in the timeline, and the results change. The first result leads to the US Constitution, with explanatory notes. The second link, from FullBooks.com, seems to go to a book chapter focused on that era. The fourth link, from kansasheritage.org, details treaties between the Potawatomi and the US. On the second page, I find information on the Louisiana Purchase, a link from the Thomas Jefferson Papers, and more.
The view:timeline command works with more than just countries. I tried it for IBM and Sears Roebuck; sure enough, I got timelines and relevant results. It also works with certain concepts that have “history” behind them. I turned up timelines for automobiles, computers, and space travel; for you sports fans, view:timeline also works with baseball. The view:timeline command did not work with Google itself, however. I guess the company doesn’t have quite enough history yet. Indeed, it doesn’t, by certain strict definitions; my college history professors told me that the dividing line between history and current events is 25 years before the current date.
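To try this yourself, just append the command to an ordinary query, as in the searches described above:

```
IBM view:timeline
baseball view:timeline
```

Keep in mind this is still a Google Labs experiment, so the syntax and results may change.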
Whatever you may think of Google’s latest moves, it’s a fair bet that the company will be around in some form long enough to develop a respectable timeline and history. It has changed the way we look for information, and continues to evolve.