Thursday, June 24, 2004

Blocking web slurping

I noticed in the web logs for one of my web sites that someone was running a program to slurp all of the content down to their computer. Here are some excerpts from the logs:

xx.xx.xx.xx - - [07/Jun/2004:03:50:08 -0700] "GET /indilist.php?PHPSESSID=823c1a9974e1972a0d9420d3291805df HTTP/1.0" 200 21221 "-" "BlackWidow-Spider1 (+http://www.xxxxxxx.com)"
xx.xx.xx.xx - - [07/Jun/2004:03:50:09 -0700] "GET /famlist.php?PHPSESSID=823c1a9974e1972a0d9420d3291805df HTTP/1.0" 200 21175 "-" "BlackWidow-Spider1 (+http://www.xxxxxxx.com)"
xx.xx.xx.xx - - [07/Jun/2004:03:50:11 -0700] "GET /sourcelist.php?PHPSESSID=823c1a9974e1972a0d9420d3291805df HTTP/1.0" 200 19183 "-" "BlackWidow-Spider1 (+http://www.xxxxxxx.com)"
xx.xx.xx.xx - - [07/Jun/2004:03:50:12 -0700] "GET /placelist.php?PHPSESSID=823c1a9974e1972a0d9420d3291805df HTTP/1.0" 200 34963 "-" "BlackWidow-Spider1 (+http://www.xxxxxx.com)"

I don't mind search engines crawling my web sites (that is why I made this information public), but I don't want people to slurp all of the data down.

I added the following mod_rewrite rule to block people using this software from downloading my content.


RewriteEngine On
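# Match the BlackWidow user agent; [NC] makes the match case-insensitive
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [NC]
# Answer every matching request with 403 Forbidden
RewriteRule .* - [F]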

This will only work as long as they don't change the User Agent (which I know they can do). If they change it, I will either add the new User Agent to the rule or start blocking IP addresses.
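
If it comes to that, the rule could be extended roughly as follows. This is only a sketch: "OtherSlurper" is a made-up agent name and 192.0.2.10 is a placeholder address from a documentation range, neither taken from my logs. The [OR] flag chains the conditions so that any one of them triggers the block.

RewriteEngine On
# Block by user agent; OtherSlurper is a hypothetical second agent
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^OtherSlurper [NC,OR]
# Or block by client address (placeholder address)
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.10$
RewriteRule .* - [F]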

There is one downside to this solution. When this rule fires, the web server does not return the error document, because the permissions are set up so that it does not have access to the error document. I will fix that if it is easy. In some ways I don't care about their user experience, since they shouldn't be doing this anyway.
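
One possible fix is sketched below. It assumes the error document is refused because the rule matches the request for the error document too, which is a guess that I have not confirmed. The path /errors/403.html is hypothetical and would need to be replaced with the real location of the error page.

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [NC]
# Let requests for the error document itself through (hypothetical path)
RewriteCond %{REQUEST_URI} !^/errors/403\.html$
RewriteRule .* - [F]

# Hypothetical location of the custom 403 page
ErrorDocument 403 /errors/403.html

If the cause really is file system or access permissions rather than the rule, the extra condition changes nothing and the permissions on the error page need to be fixed instead.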