Thursday, January 20, 2005

Google and comment spam


Google and Yahoo have agreed on a new mechanism to help combat blog comment spam: they are going to respect a rel="nofollow" attribute on links. When a link carries this attribute, these search engines will not follow it, so the link earns the target site no credit in their rankings.
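
To make this concrete, here is a rough sketch of what blog software might do to comment HTML before displaying it. The function name add_nofollow is my own invention, and a regular expression is only a stand-in for a real HTML parser, so treat it as an illustration rather than something to deploy.

```python
import re

# Matches the start of an opening <a ...> tag that has no rel attribute yet.
# The negative lookahead scans up to the closing ">" of the same tag.
_A_TAG = re.compile(r'<a\b(?![^>]*\brel\s*=)', re.IGNORECASE)


def add_nofollow(comment_html: str) -> str:
    """Add rel="nofollow" to every link in a comment that lacks a rel attribute."""
    return _A_TAG.sub('<a rel="nofollow"', comment_html)


if __name__ == "__main__":
    print(add_nofollow('Visit <a href="http://example.com/">my site</a>!'))
```

Links that already have a rel attribute are left alone in this sketch; real software would probably want to merge the values instead.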


I think this will help keep the rank of the spammers' websites down in the search results, but I don't think it really solves the problem. The spam comments, along with their URLs, will still be visible to anyone reading the blog. I see these comments as a sort of graffiti on the site. I wouldn't want to leave graffiti on a wall even if I put up a note next to it saying "don't read this" (which is effectively what the nofollow attribute says).




There are three solutions that I think are better:


  1. Have a Bayesian spam filter for the comments. pLog has a great implementation of this. It keeps all of the comment spam that I get from even appearing.

  2. Have Google or some other company keep track of all of the URLs that are posted in comments. If a comment contains URLs that have been posted too many times, the blog software could reject the comment.

  3. Use SURBL. This is something that I wrote about a while ago, and I have been using it as part of my email spam solution. It looks at the URIs in the body of a message and tells you whether any of them are listed as spam URIs, which allows you to block those messages. SURBL draws its list of URLs from several sources, such as SpamCop.
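
Options 2 and 3 can be sketched in a few lines of Python. Everything named here (hosts_in, UrlFrequencyFilter, surbl_listed, the max_posts threshold) is hypothetical. The SURBL part assumes the combined multi.surbl.org zone and the usual DNS blocklist convention that a listed name resolves to a 127.0.0.x address. It also queries the full host name, whereas a real client would reduce it to the registered domain first.

```python
import re
import socket
from collections import Counter
from urllib.parse import urlparse

URL_RE = re.compile(r'https?://[^\s"\'<>]+', re.IGNORECASE)


def hosts_in(text):
    """Return the lowercased host names of all http(s) URLs found in text."""
    hosts = []
    for url in URL_RE.findall(text):
        host = urlparse(url).hostname
        if host:
            hosts.append(host)
    return hosts


class UrlFrequencyFilter:
    """Option 2: reject comments whose hosts have been posted too many times."""

    def __init__(self, max_posts=3):
        self.max_posts = max_posts  # hypothetical threshold
        self.seen = Counter()       # host -> number of comments mentioning it

    def accept(self, comment):
        hosts = set(hosts_in(comment))
        for host in hosts:
            self.seen[host] += 1
        return all(self.seen[host] <= self.max_posts for host in hosts)


def surbl_listed(host, resolve=socket.gethostbyname, zone="multi.surbl.org"):
    """Option 3: treat host as listed if <host>.<zone> resolves to 127.x.x.x."""
    try:
        answer = resolve(f"{host}.{zone}")
    except OSError:  # includes socket.gaierror: no record, so not listed
        return False
    return answer.startswith("127.")
```

The resolve parameter is only there so the lookup can be swapped out for testing; by default it does a real DNS query. In the scenario from option 2, the counter would live with the central service rather than in each blog.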


Update:


I have had great luck with pLog's Bayesian spam filter. It has not miscategorized any valid comments. I think it helps that it trains on the body of the posts themselves, since this gives it a good corpus of ham content.
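
For anyone curious how this kind of filter works, below is a minimal naive Bayes sketch in Python. It is not pLog's code (pLog is written in PHP and I have not copied its approach); the class and method names are made up, and the tokenizer and smoothing are the simplest choices I could think of. The point is only to show why training on post bodies as ham helps: every word from a post pushes comments that use similar words toward the ham side.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesFilter:
    def __init__(self):
        self.counts = {"ham": Counter(), "spam": Counter()}
        self.docs = {"ham": 0, "spam": 0}

    def train(self, text, label):
        """Record the words of text under label ("ham" or "spam")."""
        self.counts[label].update(tokens(text))
        self.docs[label] += 1

    def spam_log_odds(self, text):
        """Return log P(spam | text) - log P(ham | text), add-one smoothed."""
        vocab = len(set(self.counts["ham"]) | set(self.counts["spam"]))
        score = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for label, sign in (("spam", 1), ("ham", -1)):
            total = sum(self.counts[label].values())
            for tok in tokens(text):
                likelihood = (self.counts[label][tok] + 1) / (total + vocab + 1)
                score += sign * math.log(likelihood)
        return score

    def is_spam(self, text):
        return self.spam_log_odds(text) > 0


if __name__ == "__main__":
    f = NaiveBayesFilter()
    # Post bodies are trained as ham, as described above.
    f.train("Google and Yahoo will respect the nofollow attribute "
            "on links in blog comments", "ham")
    f.train("cheap pills buy cheap pills online casino", "spam")
    print(f.is_spam("buy cheap pills"))
    print(f.is_spam("nofollow attribute on blog links"))
```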