Paul's Time Sink: Protecting Your Website's Content: Dealing with Content Scraping

Thursday, June 24, 2004

Protecting Your Website's Content: Dealing with Content Scraping

I noticed in the web logs for one of my web sites that someone was running a program to scrape all of the content to their computer. Here are some excerpts from the logs:

xx.xx.xx.xx - - [07/Jun/2004:03:50:08 -0700] "GET /indilist.php?PHPSESSID=823c1a9974e1972a0d9420d3291805df HTTP/1.0" 200 21221 "-" "BlackWidow-Spider1 (+http://www.xxxxxxx.com)"
xx.xx.xx.xx - - [07/Jun/2004:03:50:09 -0700] "GET /famlist.php?PHPSESSID=823c1a9974e1972a0d9420d3291805df HTTP/1.0" 200 21175 "-" "BlackWidow-Spider1 (+http://www.xxxxxxx.com)"
xx.xx.xx.xx - - [07/Jun/2004:03:50:11 -0700] "GET /sourcelist.php?PHPSESSID=823c1a9974e1972a0d9420d3291805df HTTP/1.0" 200 19183 "-" "BlackWidow-Spider1 (+http://www.xxxxxxx.com)"
xx.xx.xx.xx - - [07/Jun/2004:03:50:12 -0700] "GET /placelist.php?PHPSESSID=823c1a9974e1972a0d9420d3291805df HTTP/1.0" 200 34963 "-" "BlackWidow-Spider1 (+http://www.xxxxxx.com)"
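Spotting this sort of traffic by eye gets tedious, so here is a rough Python sketch of how the requests could be tallied per User Agent. It assumes the log lines are in the Apache combined format shown above; the names LOG_LINE and count_by_agent are made up for this example and are not part of any tool I actually run.

```python
import re
from collections import Counter

# Combined Log Format (assumed):
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def count_by_agent(lines):
    """Return a Counter mapping each User-Agent string to its request count."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:  # lines that do not fit the assumed format are skipped
            counts[match.group("agent")] += 1
    return counts

# Two of the log excerpts from this post, used as sample input.
sample = [
    'xx.xx.xx.xx - - [07/Jun/2004:03:50:08 -0700] "GET /indilist.php?PHPSESSID=823c1a9974e1972a0d9420d3291805df HTTP/1.0" 200 21221 "-" "BlackWidow-Spider1 (+http://www.xxxxxxx.com)"',
    'xx.xx.xx.xx - - [07/Jun/2004:03:50:09 -0700] "GET /famlist.php?PHPSESSID=823c1a9974e1972a0d9420d3291805df HTTP/1.0" 200 21175 "-" "BlackWidow-Spider1 (+http://www.xxxxxxx.com)"',
]

# Print the busiest User Agents first.
for agent, hits in count_by_agent(sample).most_common():
    print(hits, agent)
```

A User Agent that racks up hundreds of requests a second or two apart, like the ones above, is a good candidate for a closer look.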

I don't mind that search engines crawl my web sites (that is why I made this information public), but I don't want people to scrape all of the data down.

I added the following mod_rewrite rule to block people using this software from downloading my content.

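RewriteEngine On
# Match the User Agent case-insensitively; the logs show "BlackWidow-Spider1"
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [NC]
# Return 403 Forbidden for every matching request
RewriteRule ^ - [F]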

This will only work if they don't change the User Agent (which I know that they can do). If they change the User Agent, I will either add the new User Agent to the rule, or I will start blocking IP addresses.
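If the list of User Agents and IP addresses starts to grow, the rules could be generated instead of typed by hand. The following Python sketch is one way that might look. The function name build_block_rules, the agent name ExampleScraper, and the address 192.0.2.10 (from a range reserved for documentation) are all invented for illustration; only the mod_rewrite directives and the NC, OR, and F flags are real.

```python
import re

def build_block_rules(agent_prefixes, ip_addresses):
    """Build mod_rewrite directives that return 403 for the given agents and IPs."""
    # Each condition is (server variable, regex pattern, flags).
    conds = [("%{HTTP_USER_AGENT}", "^" + re.escape(prefix), ["NC"])
             for prefix in agent_prefixes]
    conds += [("%{REMOTE_ADDR}", "^" + re.escape(ip) + "$", [])
              for ip in ip_addresses]

    lines = ["RewriteEngine On"]
    for i, (var, pattern, flags) in enumerate(conds):
        # Conditions are ANDed by default, so every one except the last needs OR.
        if i < len(conds) - 1:
            flags = flags + ["OR"]
        suffix = " [" + ",".join(flags) + "]" if flags else ""
        lines.append(f"RewriteCond {var} {pattern}{suffix}")
    # Without any conditions the rule would forbid everything, so emit it
    # only when there is at least one condition.
    if conds:
        lines.append("RewriteRule ^ - [F]")
    return "\n".join(lines)

# Hypothetical block list: two agent prefixes and one IP address.
print(build_block_rules(["BlackWidow", "ExampleScraper"], ["192.0.2.10"]))
```

The printed directives could then be pasted into the server configuration. Agent names containing spaces would need extra care with quoting, which this sketch does not handle.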

There is one downside to this solution. When this rule fires, the web server does not return the error document, because the permissions are set up so that the server cannot access it. I will fix that if it is easy. In some ways I don't care about their user experience, since they shouldn't be doing this anyway.

1 comment:

Todd Kulick, June 25, 2004 at 5:48 PM
So I keep getting interesting traces of port scanning madness in my logs. I don't even watch my apache logs on a daily basis. I suppose I should. Anyway, I keep getting about one very thorough port scan per week or two. I've taken to making a point of firing off an email to abuse at the corresponding ISP. Doing my little part to save the world I suppose...
So which is worse: port scanning script kiddies or lifeless, blood-sucking spammers? ;)