Thursday, January 20, 2005

Google and comment spam

Google and Yahoo have agreed on a new mechanism to help combat blog comment spam. They are going to respect a rel="nofollow" attribute on links. When a link has this attribute, these search engines will not follow it.
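To make this concrete, here is a minimal sketch of how blog software might add the attribute to links in a submitted comment. The function name add_nofollow is made up for illustration, and the regular expression is a simplification: it assumes reasonably well-formed HTML and leaves alone any tag that already seems to carry a rel attribute.

```python
import re

# Matches an opening <a ...> tag and captures its attribute text.
# The \b keeps tags such as <abbr> from matching.
_A_TAG = re.compile(r"<a\b([^>]*)>", re.IGNORECASE)


def add_nofollow(comment_html: str) -> str:
    """Return comment_html with rel="nofollow" added to <a> tags lacking rel.

    Hypothetical helper, not taken from any particular blog package.
    """
    def fix(match):
        attrs = match.group(1)
        # Simplification: any "rel=" inside the tag counts as an existing
        # rel attribute, so the tag is returned unchanged.
        if re.search(r"\brel\s*=", attrs, re.IGNORECASE):
            return match.group(0)
        return f'<a{attrs} rel="nofollow">'

    return _A_TAG.sub(fix, comment_html)


if __name__ == "__main__":
    print(add_nofollow('Nice post! <a href="http://example.com">my site</a>'))
```

A real implementation would more likely use an HTML parser than a regular expression, but the idea is the same: the link still shows up in the comment, it just carries the extra attribute.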

I think that this will help to keep the rank of the spammers' websites down in the search results, but I don't think that it really solves the problem. The comments that contain these URLs will still be visible to users. I see these comments as a sort of graffiti on a website. I wouldn't want to leave graffiti on my wall even if it had a note next to it that says "don't read this" (which is what the nofollow attribute effectively states).

There are three solutions that I think are better:

  1. Have a Bayesian spam filter for the comments. pLog has a great implementation of this. It is keeping all of the comment spam that I get from even appearing.

  2. Have Google or some other company keep track of all of the URLs that are posted in comments. If a comment contains URLs that have been posted too many times, the blog software could reject the comment.

  3. Use SURBL. This is something that I wrote about a while ago, and I have been using it as part of my email spam solution. It looks at the URIs in the body of a message and allows you to block messages that contain URIs listed as spam. SURBL uses different sources, like SpamCop, to get its list of URLs.
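As a rough sketch of the third option applied to comments: SURBL lists are queried over DNS, by looking up the domain from a URI under the list's zone, and an answer in the 127.x.x.x range means the domain is listed. The code below assumes the combined zone multi.surbl.org. The helper names are made up for illustration, and base_domain is a deliberate simplification (it just keeps the last two labels of the host name, which is wrong for domains like example.co.uk).

```python
import re
import socket
from urllib.parse import urlparse

# Rough pattern for http/https URLs inside free text.
URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def extract_hosts(text):
    """Collect the host names of all URLs found in the comment text."""
    hosts = set()
    for url in URL_RE.findall(text):
        host = urlparse(url).hostname
        if host:
            hosts.add(host.lower())
    return hosts


def base_domain(host):
    """Simplification: keep only the last two labels of the host name."""
    return ".".join(host.split(".")[-2:])


def is_listed(domain, zone="multi.surbl.org", resolve=socket.gethostbyname):
    """Return True if the lookup of domain under zone answers with 127.x.x.x."""
    try:
        answer = resolve(f"{domain}.{zone}")
    except OSError:
        # No such record (or the lookup failed): treat as not listed.
        return False
    return answer.startswith("127.")


def comment_is_spam(text, resolve=socket.gethostbyname):
    """Flag the comment if any linked domain is listed."""
    return any(
        is_listed(base_domain(host), resolve=resolve)
        for host in extract_hosts(text)
    )
```

The resolve parameter is only there so the lookup can be swapped out for testing or caching. Blog software would call comment_is_spam(comment_body) before accepting a comment and reject or hold it when the result is True.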


I have had great luck with pLog's Bayesian spam filter. It has not miscategorized any valid comments. I think that it helps that it trains on the body of the posts themselves, since that gives it a good corpus of ham content.
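To illustrate the idea of training on the posts themselves, here is a minimal naive Bayes sketch. It is not pLog's actual code. The class name CommentFilter, the tokenizer, and the smoothing choices are all assumptions made for the example. Post bodies are fed in as ham, and known spam comments are fed in as spam.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class CommentFilter:
    """Toy naive Bayes classifier with add-one (Laplace) smoothing."""

    def __init__(self):
        self.counts = {"ham": Counter(), "spam": Counter()}
        self.docs = {"ham": 0, "spam": 0}

    def train(self, text, label):
        """Add one document to the 'ham' or 'spam' corpus."""
        self.counts[label].update(tokens(text))
        self.docs[label] += 1

    def spam_probability(self, text):
        """Estimate the probability that text is spam."""
        vocab = len(set(self.counts["ham"]) | set(self.counts["spam"])) or 1
        total_docs = sum(self.docs.values())
        scores = {}
        for label in ("ham", "spam"):
            total = sum(self.counts[label].values())
            # Smoothed class prior plus smoothed per-token log likelihoods.
            score = math.log((self.docs[label] + 1) / (total_docs + 2))
            for tok in tokens(text):
                score += math.log((self.counts[label][tok] + 1) / (total + vocab))
            scores[label] = score
        # Turn the log-score difference into a probability, clamped so
        # that math.exp cannot overflow.
        diff = max(min(scores["ham"] - scores["spam"], 700.0), -700.0)
        return 1.0 / (1.0 + math.exp(diff))


if __name__ == "__main__":
    f = CommentFilter()
    # Train on a post body as ham and on a known spam comment as spam.
    f.train("Google and Yahoo will respect the nofollow attribute on links in blog comments", "ham")
    f.train("cheap pills online buy cheap pills casino", "spam")
    print(f.spam_probability("buy cheap pills"))
    print(f.spam_probability("nofollow attribute on blog links"))
```

Since real comments tend to use the same vocabulary as the posts they respond to, every new post adds to the ham side without anyone having to label anything by hand.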

1 comment:

  1. I'm not a big fan of the second solution. It's tough to choose a number of comments that you allow with a URL.
    The first solution seems pretty good, and would probably solve my comment spam problems. Do you have any statistics about the percentage of comments that are miscategorized?