Spider Guts

What’s inside the spiders? To rank well in search engines, you need a solid understanding of the fundamentals of SEO and of how search robots crawl web pages. This article covers those fundamentals, including a list of the core elements a typical search engine considers when calculating page relevance.

In the quest for that elusive nirvana of search engine friendliness, we frequently find ourselves searching for “instant fix” ways to improve a page’s ranking without considering the big picture; that is, we treat a few isolated optimization steps, tacked onto routine markup development or copywriting, as a substitute for optimizing the web page as a whole. While SEO experts do not tend to fit this mold, the average web developer certainly does. How many web pages have you “optimized” by simply adding keyword and description meta tags, and stopped right there? I imagine a hand count at this point would yield a fairly substantial number. An even better question might be, “How many of you have tried to provide SEO for a web page without even a basic understanding of search robot logic or what it expects to see in your pages?” Once again, I suspect we would get a healthy hand count.

The steps to optimize a page are well known to the SEO community, and many articles by authors far more knowledgeable than myself on the subject are available to web developers. So with all this knowledge out there, why do so many developers lack a big-picture understanding of the subject? One word: fundamentals. Knowing how the technology behind the scenes works is crucial, but as with any other skill, most people attempting to learn it do not start at the bottom. They start wherever makes sense for solving a particular problem, and then they build up from that point.

A developer with a better grasp of the fundamentals of SEO, and of how search robots crawl web pages, would in turn better understand how to populate those alt attributes and meta tags. The objective of this article is to provide a general overview of how search robots (also called spiders) crawl and index web pages.

There are a number of things that a spider expects to see when it looks at a web page, many of which are optional but still important in the big picture. The following list describes the core elements considered by a typical search engine when calculating page relevance; a brief markup sketch showing many of them working together follows the list.

  1. Title Tag – The title tag should contain a title relevant to the page, not just “Home Page” or “Contact Us”. Use it to carry up to five keywords.

  2. Headings – Search engines view <h> tags as terms of emphasis, meaning additional weight is given to terms that appear inside them. Keywords should appear in <h> tags.

  3. Bold – Also viewed as terms of emphasis, but with less weight than headings.

  4. Alt Text – Brief descriptive sentences should be used in image alt attributes. At least one keyword should appear in each alt attribute.

  5. Keyword Meta Tag – Some engines use the keyword meta tags directly, some use them as part of a validation process ensuring that the keywords closely match the page content. The latter is the more typical scenario for modern engines. Keywords should be chosen carefully and be specific to the page they appear on.

  6. Description Meta Tag – Most search engines use this tag in a similar fashion as the keyword tags. Each page should have a unique description. The description should contain a few keywords and briefly summarize the content that appears on the page with a high degree of accuracy.

  7. Keyword Placement – Terms that are higher up on a page are more heavily weighted.

  8. Keyword Proximity – Terms that appear close together are assumed to be related, so the page is more likely to show up in searches that combine those terms.

  9. Comment Tags – Some search engines use comment tags for content, particularly on graphics-rich, text-poor sites.

  10. Page Structure Validation – Markup that validates suggests better overall page quality, and engines are likely to reward it.

  11. Traffic/Visitors – Search engines keep track of how many people follow their links. The more a link is followed for a given search, the more relevant the link is assumed to be.

  12. Link Popularity – Known in Google’s case as PageRank, this is a measure of how many web pages on the Internet link to your site and how relevant those pages are to the pages they link to. The popularity of the linking site is also evaluated.

  13. Anchor Text for Inbound Links – This is a measure of the relevance of the anchor text from the referring site.

  14. Page Last Modified – Newer content is regarded as “fresh” and is treated as more relevant.

  15. Page Size – Engines tend to weigh content at the start of a document more heavily than content further down. If a page is too long, typically more than 50 KB of markup alone, it should be broken up into multiple pages.

  16. Keywords in URL – URLs are considered important by engines. Using hyphens rather than underscores in filenames, and using keywords in filenames and directories, improves a page’s potential relevance.
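
To make these on-page items concrete, here is a minimal sketch of a page that applies most of them. The filename, keywords, and copy are all invented for illustration; treat it as a pattern, not as markup to copy verbatim.

    <!-- Hypothetical file: /guides/organic-coffee-roasting.html
         (keywords in the URL, hyphens rather than underscores) -->
    <html>
    <head>
      <!-- A title relevant to the page, carrying a few keywords -->
      <title>Organic Coffee Roasting Guide for Home Roasters</title>
      <!-- Keywords specific to this page's actual content -->
      <meta name="keywords"
            content="organic coffee, coffee roasting, home roasting">
      <!-- A unique, accurate summary of this particular page -->
      <meta name="description"
            content="A beginner's guide to roasting organic coffee at
                     home, covering equipment, beans, and roast levels.">
    </head>
    <body>
      <!-- Keywords placed high on the page, inside a heading -->
      <h1>Organic Coffee Roasting at Home</h1>
      <p>Roasting <b>organic coffee</b> yourself is easier than it
         sounds, and freshly roasted beans make a real difference.</p>
      <!-- A brief descriptive alt attribute containing a keyword -->
      <img src="drum-roaster.jpg"
           alt="A small drum roaster used for home coffee roasting">
    </body>
    </html>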

The search engine pours all of these elements into an algorithm that produces a very specific result: a relevance score for a page for a given set of keywords. Evaluating page relevance is an ongoing, iterative process that involves crawling all the pages indexed by a particular engine and evaluating both the relevance of their content and the relevance of references to that content. The items listed above include things search engines expect to find in a page as well as factors that are not necessarily expected, but are considered when available (such as inbound links).
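
No engine publishes its actual formula, but conceptually you can picture the relevance score as a weighted sum over signals like the ones listed above. The signal names and the idea of simple fixed weights here are an illustrative simplification, not any engine’s real algorithm:

    relevance(page, keywords) ≈ w1·title_match + w2·heading_match
                              + w3·keyword_placement + ...
                              + wn·inbound_link_relevance

Each term stands for one signal from the list, and each weight (w1 through wn) reflects how heavily a particular engine values that signal; the weights are exactly what the engines keep secret and tune over time.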

The next set of variables weighed by search engines are negatives; these will hurt the performance of a page on a search engine without exception. Avoiding them is crucial to reaching and, more importantly, maintaining a high rank on a search engine. A short markup sketch of the worst offenders follows the list.

  1. Broken Links – Whether internal or outbound, broken links suggest to search engines that a page’s content is not fresh, and such pages are going to be scored as less relevant for their keywords.

  2. Spam – This refers to any attempt to trick a search engine, such as using irrelevant keywords to draw extra hits, placing invisible content on the page to boost keyword density, or using meta refreshes (often in combination with irrelevant keywords) to draw a user in with an irrelevant search and then redirect them to the page you want them to see. These techniques can result in a ban if the search engine catches them.

  3. Excessive Search Engine Submittal – Over-submitting a site to a search engine will likely result in a ban. According to Google, submit no more than once every three months.

  4. Empty Alt Attributes – Empty alt attributes are a major accessibility issue as well as simply poor coding, and they will affect a page negatively.

  5. Excessive Punctuation – Excess punctuation in the Title and Description tags wastes valuable space and may cause problems with some engines.
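
As a contrast to the earlier sketch, here is a hypothetical snippet that commits several of these offenses at once. Everything in it is invented for illustration, and every line shows something to avoid:

    <head>
      <!-- Spam: a meta refresh that baits the visitor in on one
           search and redirects them somewhere unrelated -->
      <meta http-equiv="refresh" content="0;url=/unrelated-page.html">
      <!-- Spam: keywords with no relation to the page content -->
      <meta name="keywords" content="free, celebrity, lottery, prizes">
      <!-- Excessive punctuation wasting valuable title space -->
      <title>!!!*** WELCOME!!! *** CLICK HERE NOW!!! ***!!!</title>
    </head>
    <body>
      <!-- An empty alt attribute: bad for accessibility and SEO -->
      <img src="logo.gif" alt="">
      <!-- Spam: invisible text stuffed with keywords to inflate
           keyword density -->
      <p style="color: white; background: white;">
        coffee coffee coffee coffee coffee coffee
      </p>
      <!-- A broken link pointing at a page that no longer exists -->
      <a href="/old/removed-article.html">Read more</a>
    </body>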

These negative factors can greatly affect an otherwise relevant page; of course, some of them, particularly spam, preclude the page from actually being relevant. The biggest pitfalls for an otherwise optimized page are simple typographical errors, broken links (usually due to stale content), and oversights in markup. Simple mistakes can mean the difference between top ten and top fifty for a search on an engine, a difference that could mean thousands of dollars per day in lost revenue for many websites.

Imagine if a site like Amazon.com failed to use alt attributes and stopped using <h> tags (replacing them with images, for example). Searches that would typically show the site as the number one result could start bringing it up as the number fifty result.

Conclusion

Approaching SEO as a holistic process, rather than simply a combination of steps, is critical. It is simply not enough to use an effective Title tag on every page and stop there, or to use keyword-relevant URLs and go no further. To achieve and maintain a top ranking on all pages for the appropriate sets of keywords, a page must be optimized completely for the way search engines weigh content relevance, and that means taking everything discussed earlier in this article seriously.
