The Yahoo SLURP Crawler

As SEOs and webmasters, we’re always looking for ways to get the search engine spiders to crawl our sites, and the deeper, the better. This article shows you how to target Yahoo’s crawler and convince it to stop by regularly.

The search engine wars are fought with strategies, alliances, and robots. As Yahoo! positions itself as the number one contender to Google for market share, websites that want to rank well on Yahoo must study how it indexes and ranks pages. That study starts with Yahoo's web crawler, SLURP: your server logs should already show visits from various robots, including SLURP. If they don't, this article will give you tips on how to get SLURP to crawl (hopefully deep crawl) your site.

The Preamble

Yahoo SLURP is an upgraded version of Inktomi's SLURP robot. Yahoo replaced Google, which had been powering its search results, with Inktomi's search technology, and this officially triggered the second search engine war (the first was won by Google without its ever declaring hostilities).

Yahoo has at least 130 million registered users on its network. Granted, Google is the definitive search engine, but Yahoo is large enough that it should not be ignored.

SLURP crawls websites, scans their contents and meta tags, and travels down the links contained on each page. It then brings back information for the search engine to index. Yahoo SLURP 2.0 stores the full text of each page it crawls and returns it to Yahoo's searchable database. This sets SLURP somewhat apart; not all search engine crawlers store the entire text of the pages they crawl.

While SLURP has some features unique to it, it also obeys the robots.txt exclusion protocol. This is very important, since it gives you control over which pages the crawler reads and indexes. It lets you protect sensitive pages that you need to keep secure, pages containing information you would rather not have fall into the hands of hackers (who regularly try to infiltrate search engine databases), and pages you simply don't want indexed (for whatever reason).

Another good thing about the robots.txt file is that it lets you exclude specific robots, so you can shut out Googlebot but let SLURP crawl a particular page. This is useful if you have optimized different pages for different search engines. That flexibility is handy, but a search engine that finds what look like duplicate pages may penalize you, so careful use of the robots.txt file should definitely be on your list of ways to make your website more search engine friendly. So how do you use the robots.txt file? Open Notepad and type in the following lines:

  User-agent: Slurp
  Disallow: /whatsisname.html
  Disallow: /page_optimized_for_google.html
  Disallow: /credit_card_list.html
  Disallow: /whatnot.html

Save it as robots.txt and upload it to your root directory. Note that each Disallow path should begin with a slash, since the rules are matched against URL paths. You can disallow as many pages for each crawler as you want, but to disallow certain pages for a different crawler, you start a new record, separated from the previous one by a blank line:

  User-agent: Slurp
  Disallow: /whatsisname.html
  Disallow: /page_optimized_for_google.html
  Disallow: /credit_card_list.html
  Disallow: /whatnot.html

  User-agent: Googlebot
  Disallow: /page_optimized_for_yahoo.html
  Disallow: /credit_card_list.html
  Disallow: /whatnot.html

If you want to disallow all crawlers, replace the name of the user agent with the wildcard character (*).
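
For example, this record tells every compliant crawler to stay out of the entire site (Disallow values are path prefixes, so / matches every page):

  User-agent: *
  Disallow: /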

Robots.txt is useful for staying out of trouble with search engines, and it can also be used to spot crawlers when they come calling: only crawlers request robots.txt, and those requests show up in your server logs.
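
To give you an idea of what to look for, here is a hypothetical log line in Apache's combined format (the IP address, timestamp, and byte count are purely illustrative, and your server's log format may differ):

  66.196.90.27 - - [12/Oct/2004:06:25:43 -0500] "GET /robots.txt HTTP/1.0" 200 421 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"

The user-agent string at the end of the line is what identifies the visitor as SLURP.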

Another way of shutting out SLURP is the noindex meta tag, which SLURP obeys. The code, inserted between the head tags of your document, is

  <META NAME="robots" CONTENT="noindex">

This snippet will ensure that Yahoo SLURP does not add the document to the search engine's database. Another useful directive is the nofollow meta tag. The code inserted is

  <META NAME="robots" CONTENT="nofollow">

This snippet ensures that the links on the page are not followed.
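
The two directives can also be combined in a single tag, which keeps the page out of the index and stops the crawler from following its links:

  <META NAME="robots" CONTENT="noindex,nofollow">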

Dynamic Page Indexing

This is the real charm of SLURP. Most search engine crawlers don't bother crawling and indexing dynamic pages (.php, .asp, .jsp), since their content changes so quickly that an index entry soon goes stale. Yahoo SLURP, however, does daily crawls to refresh the content of the dynamic pages it has indexed. It also does bi-weekly crawls, which enable the search engine to discover new content and add it to its index incrementally. This allows a complex site's URLs, generated by forms and content management software, to be indexed.

These frequent crawls show up in your server logs as frequent download requests as the crawler moves, stops, and restarts. Yahoo says that these frequent download requests should not be a cause for alarm.

SLURP's ability to index dynamic pages and constantly refresh their content is a great relief to web designers (like me) who like using dynamic pages for fast loading and rapid updating. Websites that were never search engine friendly are suddenly in contention to be ranked number one.

However, the downside is that SLURP may never crawl your dynamic pages on its own; you may have to trigger the crawler via techniques that Yahoo encourages (to the benefit of its bottom line).

Getting Framed

Yahoo SLURP also supports frames, although it will not follow SRC attribute links to standalone framesets; it only follows HREF links (as all good crawlers do).
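
Since SLURP ignores SRC attributes but follows HREF links, a common workaround for framed sites is to repeat the frame targets as ordinary links inside a NOFRAMES block. A minimal sketch (menu.html and content.html are placeholder filenames):

  <FRAMESET COLS="20%,80%">
    <FRAME SRC="menu.html">    <!-- SRC: not followed by SLURP -->
    <FRAME SRC="content.html">
    <NOFRAMES>
      <BODY>
        <!-- Plain HREF links that SLURP can follow -->
        <A HREF="menu.html">Menu</A>
        <A HREF="content.html">Content</A>
      </BODY>
    </NOFRAMES>
  </FRAMESET>

The NOFRAMES content does double duty, since it also serves visitors whose browsers cannot display frames.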

Having said all this about Yahoo SLURP, there remains the little issue of getting your site crawled by this particular search engine spider. There are several ways to go about this task, and here we begin to see the inklings of what would be the order of the day in a search engine market dominated by Yahoo! (who seems to be very, very concerned about its bottom line).

Linking

The first strategy is good old linking: get a link on a site that Yahoo! crawls regularly, and voila, you have SLURP knocking on your door. You can do this by corresponding with a site that ranks well on Yahoo, or by submitting your website to directories that SLURP crawls regularly (you can find these by searching for "directories" on Yahoo). If SLURP deep crawls your site regularly (crawling lots of pages instead of just one or two), you have a good chance of ranking well for the keyword or topic for which you have optimized your site.

Yahoo Companion Toolbar

Installing the Yahoo Companion toolbar is supposed to trigger the SLURP robot to crawl your site. It also enables searchers to search within your site, offering value for your audience and attracting Yahoo SLURP as well.

Sitematch

This involves paying Yahoo's fees and submitting your site. It guarantees that you will be added to the index (at a price), but it is no guarantee of your website's ranking in the SERPs.

This is a scary service, and some reviewers speculate that it is a foretaste of what site owners would face in a market dominated by Yahoo. It is carried over from Overture (which Yahoo purchased) and involves an annual fee for submitted pages. The submitted URLs go into Yahoo's index and are then crawled by SLURP every 48 hours.

However, apart from the one-off fee, there is a cost-per-click fee charged for each lead driven to your site (so you had better have deep pockets).

Apart from SLURP visiting every two days, you also get listed in searches done on About.com, Excite, Overture, and other Yahoo partners. However, there is no guarantee of a high ranking, and frankly I do not like this method (because I absolutely love free stuff).

There is a way to submit your site for free; however, Yahoo does not guarantee that websites submitted through such means will ever be crawled by SLURP.

By now you should know enough about SLURP to spot it, track it, attract it, and prevent it from crawling specific pages of your site.
