It seems like every month I notice a new search engine bot crawling my websites and aggressively using up my bandwidth. Google’s own bot can easily consume a gigabyte of bandwidth a month on a decent-sized website with at least 300 pages of content. AWStats does not identify every bot, though, so you also have to look at the Hosts section and see how much bandwidth your top hosts are using; an aggressive spider will appear at the top of the list, which gives you the IP address of the host. Most spiders use multiple IP addresses, however, so what you really want is the actual agent name. An easy way to track it down is to search the raw web server logs for the IP address listed in AWStats. In cPanel, the Latest Visitors script (under Web/FTP Stats) shows the last 300 visitors to your site. Once you find the agent name, search Google for it.
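If you would rather pull the agent names out of the raw logs yourself, a short script does the job. Here is a minimal sketch in Python, assuming the common Apache “combined” log format; the log path and IP address are placeholders you would replace with your own:

    from collections import Counter

    # Minimal sketch: scan an Apache "combined" format access log and
    # print the user-agent strings seen for a single IP address.
    # LOG_FILE and TARGET_IP are placeholders -- substitute your own.
    LOG_FILE = "/path/to/access_log"
    TARGET_IP = "192.0.2.15"

    agents = Counter()
    with open(LOG_FILE) as log:
        for line in log:
            if not line.startswith(TARGET_IP + " "):
                continue
            # In the combined format, the user agent is the last quoted field.
            parts = line.rsplit('"', 2)
            if len(parts) == 3:
                agents[parts[1]] += 1

    # Most frequent agent strings first.
    for agent, hits in agents.most_common():
        print(hits, agent)

Running this against each of the spider’s IP addresses quickly confirms whether they all report the same agent name.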

Most spiders will be documented by the sites that own them. In general it is a good idea to let spiders crawl your site, but if they consume too much bandwidth or are making your site slower than usual, you have to take action: slow them down, ban them from specific areas of your site, or ban them entirely. Depending on the spider’s documentation, you may be able to deter it or reduce its crawling rate with the robots.txt file. Some spiders obey only the meta tags in the HTML header, but it is best to use robots.txt where possible, since changing one file is easier than editing all your HTML files.
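For example, assuming the spider’s documentation gives you its user-agent name (the names below are made up for illustration), a robots.txt in your site root could look like this. Note that Crawl-delay is a nonstandard extension honored by some spiders and ignored by others, including Googlebot, so check each spider’s documentation:

    User-agent: ExampleBot
    Crawl-delay: 10
    Disallow: /images/
    Disallow: /cgi-bin/

    User-agent: BadBot
    Disallow: /

For spiders that only read meta tags, the rough equivalent is a tag like <meta name="robots" content="noindex,nofollow"> placed in the head of each page, which is exactly why robots.txt is the less painful option.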

If all else fails, you are left with blocking the entire IP range the bot uses. This is a last-resort option, and you should be extra careful in figuring out the exact IP addresses to block, since your site will become unreachable from any of them.
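On Apache, one way to do this is a Deny rule in an .htaccess file. This is a sketch, not a recipe: the address range below is purely illustrative, so substitute the range you actually traced to the bot, and double-check it before saving:

    # Block a misbehaving bot's address range (example range only).
    Order Allow,Deny
    Allow from all
    Deny from 192.0.2.0/24

With Order Allow,Deny, the Deny lines win for any matching address, so everyone else can still reach the site while the listed range gets a 403.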