Contrary to what some people may believe, search engine spiders, web bots, crawlers and Web robots are merely metaphors for computer code used to analyze and parse a Web site into components that can be analyzed by more computer code. No creature actually visits a Web site nor does it physically crawl through a site. The real spider is an algorithm that exists on a Web server that requests Web page code pretty much the same as a page is requested by a user’s browser when the user clicks on a hyperlink.
Most search engine companies use a spider algorithm to gather information. Google calls its spider GoogleBot, MSN’s spider is called MSNbot and Yahoo’s is named Slurp. Some search engines do not use a spider of their own. AOL and Netscape both use data from Google. Meta search engines, such as Dogpile and Metacrawler, gather information from several sources and sometimes filter it using their own algorithms.
A spider gathers information about a Web site by requesting a Web page at a URL it has found elsewhere on the Web. The algorithm never actually sees a rendered Web page. It can only read the HTML code, which looks similar to what you see when you use the View Source tool in a browser. Once the code has been pulled into the spider’s server, additional algorithms parse, dissect and analyze it to identify the content, hyperlinks and HTML page elements that are used to classify the type of information found on the page.
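As a rough illustration of that parsing step, here is a minimal Python sketch that pulls hyperlinks and visible text out of raw HTML source. It uses only the standard library; the class name PageParser, the function parse_page and the sample markup are made up for this example, and a real spider is certainly far more elaborate.

```python
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collects hyperlinks and visible text from raw HTML source."""

    def __init__(self):
        super().__init__()
        self.links = []        # href values found in <a> tags
        self.text_parts = []   # visible text fragments
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not script or style code.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def parse_page(html_source):
    """Return (links, text) for one page of HTML source."""
    parser = PageParser()
    parser.feed(html_source)
    parser.close()
    return parser.links, " ".join(parser.text_parts)


# Hypothetical page source, as a spider would receive it.
sample = """<html><head><title>Widgets</title></head>
<body><h1>Blue widgets</h1>
<p>We sell <a href="/catalog.html">widgets</a>.</p>
</body></html>"""

links, text = parse_page(sample)
```

The links list gives the spider more URLs to request, and the text is what later algorithms would analyze for content and keywords.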
Over the past several years spiders have taken on a second major function. The first is to find Web pages with desirable content. The second is to identify techniques used by Web site owners who intentionally attempt to deceive search engine spiders in the hope of achieving a higher ranking. These methods are known as black hat SEO techniques, and while they can give a Web site a short-term boost in rankings, they can also cause a site to be penalized or banned when detected. Google searches aggressively for black hat techniques, and Yahoo also looks for questionable techniques that can lead to a Web site being penalized or banned. The list of methods that the search engines do not like is always growing, and whenever a new detection method is added to their algorithms, many site owners find their sites penalized, frequently for innocent design or coding techniques that resemble or duplicate black hat techniques. A method that is legitimate one year may draw a penalty the next.
Search Engine Spiders and Hyperlinks
Hyperlinks and a good linking structure are critically important for a spider to find all the pages in a Web site. Search engines collect every hyperlink found on every Web page they crawl. They use them not only to find other pages within a Web site, but also to find other Web sites that a site owner links to. A link to another site is a popularity vote for that site. It adds value to the receiving site and helps to boost its rankings, and it also provides a path for search engines to find that site.
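The "popularity vote" idea can be sketched as a simple count of links arriving from other sites. This is only an illustration: the function name count_inbound_links and the example.com-style hosts are invented, and the search engines' actual weighting of links is not described here and is surely more involved than a raw count.

```python
from collections import Counter
from urllib.parse import urlparse


def count_inbound_links(outbound):
    """Count, per site, how many pages on OTHER sites link to it.

    outbound maps a page URL to the list of URLs found on that page.
    Links within the same site are ignored because they are not votes
    from someone else.
    """
    votes = Counter()
    for source, targets in outbound.items():
        source_host = urlparse(source).netloc
        # A linking page counts once per target site, however many links it has.
        for target_host in {urlparse(t).netloc for t in targets}:
            if target_host and target_host != source_host:
                votes[target_host] += 1
    return votes


# Hypothetical crawl data: page URL -> hyperlinks collected from it.
crawl = {
    "http://a.example/index.html": [
        "http://b.example/",
        "http://c.example/page.html",
        "http://a.example/about.html",
    ],
    "http://c.example/page.html": [
        "http://b.example/",
        "http://b.example/news.html",
    ],
}

votes = count_inbound_links(crawl)
```

In this made-up data, b.example is linked from pages on two other sites and c.example from one, so b.example collects the most votes.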
Search Engine Spiders and JavaScript and Flash
Links within a site need to be either image or text hyperlinks. Most spiders cannot follow the JavaScript links typically found in drop-down menus, nor can they follow Flash menus. In general, search engines cannot read or execute JavaScript, although Google can find URLs in JavaScript code if they are fully-formed URLs. JavaScript is designed to be a client-side interactive scripting technology, which means it executes (runs) in a user’s browser. A spider is not designed to execute code, so any interactive feature built into a web page is not seen by a spider. I’ve seen designers who thought that they could customize HTML title tags and other page elements using JavaScript, but this will never work with search engine spiders and typically creates problems.
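To make the point concrete, the Python sketch below reads a page the way a non-executing spider would. The script in the sample page tries to change the title, but because nothing runs the script, the title stays exactly as written in the HTML. The only thing recovered from the script is a fully formed URL, found by plain text matching. The class name StaticView, the FULL_URL pattern and the sample page are assumptions made for illustration, not a description of how any particular search engine works.

```python
import re
from html.parser import HTMLParser

# Matches only fully formed URLs (scheme and host), not relative paths.
FULL_URL = re.compile(r"https?://[^\s\"'<>)]+")


class StaticView(HTMLParser):
    """Reads a page without executing any of its JavaScript."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.script_urls = []
        self._current = None  # "title" or "script" while inside that element

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "script"):
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current == "script":
            # The script is treated as plain text and scanned for URLs.
            self.script_urls.extend(FULL_URL.findall(data))


# Hypothetical page that tries to set its title and links with JavaScript.
page = """<html><head><title>Untitled</title>
<script>
document.title = "Blue Widgets for Sale";
function go() { window.location = "http://www.example.com/catalog.html"; }
var rel = "/specials.html";
</script></head><body></body></html>"""

view = StaticView()
view.feed(page)
view.close()
```

After parsing, view.title is still the static title from the markup, and the relative path "/specials.html" is not picked up because it is not a fully formed URL.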
Although search engine spiders cannot execute JavaScript, there is a caveat: JavaScript redirects that automatically send users to another page should never be used. A notorious black hat SEO company located in Las Vegas used a JavaScript redirect trick to boost Web site rankings artificially, and for a time very successfully. Google learned how to identify this technique and in the summer of 2004 banned almost all of the SEO company’s clients’ Web sites. Since that event, it has been dangerous to use JavaScript redirects because Google may mistake a legitimate redirect for one designed to deceive search engines.
Search Engine Spiders and Content
"Content is king" has been the mantra of the search engine industry ever since its inception. While there are other techniques that can boost a Web site’s rankings on search engine results pages, the best long term results are usually obtained by providing search engine spiders with good informational content. Search engines do not care who you are, how large your company is, or what you offer. They primarily crave informational content. Developing good content should be part of any Web site design, but it is also important to make sure that the content is easy for the spiders to find and parse out of a page. Keep your code simple and do not nest HTML tables more than three deep (a table within a table within a table). The simpler the web page code, the easier it will be for a search engine algorithm to find the content and the keywords that are critical to good rankings.
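One way to check the table-nesting guideline on your own pages is to measure the deepest level of nested tables in the HTML. The following is a minimal Python sketch using the standard library; the names TableDepthChecker and max_table_depth are invented for this example, and it assumes every table has a closing tag.

```python
from html.parser import HTMLParser


class TableDepthChecker(HTMLParser):
    """Tracks how deeply <table> elements are nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # tables currently open
        self.max_depth = 0  # deepest nesting seen so far

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1


def max_table_depth(html_source):
    """Return the deepest table nesting level found in the page."""
    checker = TableDepthChecker()
    checker.feed(html_source)
    checker.close()
    return checker.max_depth


# Hypothetical pages: one flat table, and one nested four deep.
flat_page = "<table><tr><td>content</td></tr></table>"
deep_page = "<table><tr><td>" * 4 + "content" + "</td></tr></table>" * 4

flat_depth = max_table_depth(flat_page)
deep_depth = max_table_depth(deep_page)
```

A page whose depth comes back greater than three is a candidate for simplifying.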
Search Engine Spiders and Cookies
Cookies are small blocks of text stored by a browser and used to identify a user as that person moves from page to page in a web site. Not all Web sites use cookies. A cookie frequently contains a randomly generated number called a session ID that is stored when a user enters a site. Cookies can be critically important for the functionality of an e-commerce site. Every Web page request stands alone and without a session ID stored in a cookie there may be no way to identify a user who places an item in a shopping cart or moves throughout a Web site. If you cannot identify the user, you cannot associate the user with items in a shopping cart.
It’s important to understand that spiders cannot store cookies, so a spider is treated much like a user who has cookies turned off in the browser. Many shopping carts and e-commerce sites handle users with cookies turned off by appending the randomly generated session ID to the end of every page URL. This causes a problem with search engines, because the site generates a new session ID each time a spider visits, and the spider stores the page URLs as it indexes the site. The same Web page can end up duplicated dozens of times in a search engine database because each URL becomes unique with the addition of the session ID. Duplicate URLs for the same page mean that the site can receive duplicate content penalties, so techniques must be put in place to avoid the issue.
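The duplication is easy to see with a few sample URLs. In the Python sketch below, three visits to the same product page produce three different URLs, and removing the session ID collapses them back to one. The parameter names in SESSION_PARAMS, the function strip_session_id and the shop.example URLs are all assumptions for illustration; your shopping cart may name its session parameter differently.

```python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

# Assumed names for the session parameter; adjust to match your site.
SESSION_PARAMS = {"sessionid", "sid", "phpsessid"}


def strip_session_id(url):
    """Return the URL with any session ID query parameter removed."""
    parts = urlparse(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunparse(parts._replace(query=urlencode(kept)))


# The same page as stored on three separate spider visits (hypothetical).
crawled = [
    "http://shop.example/product.html?id=7&sessionid=8f3a",
    "http://shop.example/product.html?id=7&sessionid=c21d",
    "http://shop.example/product.html?id=7&sessionid=77b0",
]

unique_as_crawled = set(crawled)
unique_after_strip = {strip_session_id(url) for url in crawled}
```

As crawled, the three URLs are all distinct. With the session ID removed, they reduce to the single address http://shop.example/product.html?id=7.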
The important part to remember is that spiders cannot store cookies, so if your Web site’s design depends upon cookies, and search engine rankings matter to you, you will likely have to accommodate search engine spiders. The most common technique condoned by Google is to detect when a search engine spider is indexing a site and deliver page URLs without an attached session ID. This is not very difficult to do with most e-commerce sites, but it is often overlooked by Web site developers.
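A minimal Python sketch of that technique follows. It assumes the spider is recognized by looking for its name in the User-Agent request header, which is one common approach but not the only one. The function names is_spider and page_url, the sessionid parameter and the token list are invented for this example; the tokens are simply the spider names mentioned earlier in this article.

```python
# Spider names from this article, matched case-insensitively (an assumption).
SPIDER_TOKENS = ("googlebot", "msnbot", "slurp")


def is_spider(user_agent):
    """Guess whether a request comes from a search engine spider."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in SPIDER_TOKENS)


def page_url(path, session_id, user_agent, cookies_enabled):
    """Build a link, adding the session ID only when it is really needed."""
    # Spiders and cookie-enabled browsers both get the clean URL.
    if cookies_enabled or is_spider(user_agent):
        return path
    separator = "&" if "?" in path else "?"
    return f"{path}{separator}sessionid={session_id}"


spider_ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
browser_ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

spider_link = page_url("/cart.html", "8f3a", spider_ua, cookies_enabled=False)
browser_link = page_url("/cart.html", "8f3a", browser_ua, cookies_enabled=False)
```

The spider receives the plain /cart.html link, while a visitor with cookies turned off still gets the session ID appended so the shopping cart keeps working.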
Search Engine Spiders and Images
The important part to remember about images is that spiders cannot read text in images. Content placed in images is therefore invisible to search engine spiders. If all of the content on a page is contained in images, there is no detectable content on the page.
Search Engine Spiders and the Need for Clean Code
Both MSN and Google stress the need for good coding techniques on their Webmaster Guidelines pages. Good, clean HTML and XHTML code helps a search engine algorithm parse a page much more easily. Many Web pages that render properly in browsers actually contain errors or sloppy coding techniques. Browsers are very forgiving and will correct many Web page errors. Spiders do not try to make sense out of bad code. If they cannot parse the code, they typically abandon the Web page. A simple way to help ensure the use of industry-standard coding techniques is to run each page through the W3C Markup Validation Service. The W3C sets the coding standards for most client-side code on the Internet. Its online tool is simple to use and helps teach good code development practices.
We hope this helps to clarify your understanding of search engine spiders. Remember, search engine spiders do not care who you are, what you do or what you offer. Adhering to some simple guidelines and giving spiders what they crave usually results in better ranking positions and more traffic to your Web site.