Monday, August 21, 2006

OCR for SpamAssassin


I have been getting pretty annoyed with the image spam that has been getting through my mail filters.  These are interesting emails: they generally include an image of some text at the top, followed by a bunch of random text (in one case, the text was related to resumé writing and English as a second language).


I saw this page that describes how to set up an optical character recognition (OCR) SpamAssassin plugin to help identify these spam messages.  The plugin uses gocr as the OCR engine.  I have set this up, and will see how well it stops these messages from getting through.
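
To make the idea concrete, here is a rough Python sketch of the general approach: pull the image parts out of a message, run each one through OCR, and count suspicious words in the recognized text. This is not the plugin's code (SpamAssassin plugins are written in Perl). The keyword list and all function names are made up for illustration, and it assumes the `gocr` command is installed and able to read the image format, which for GIF or PNG may require converter tools such as netpbm.

```python
import subprocess
import tempfile
from email.message import EmailMessage

# Hypothetical keyword list, for illustration only.
SPAM_WORDS = ("stock", "buy now", "viagra")


def image_parts(msg):
    """Return (subtype, raw bytes) for every image/* part of the message."""
    return [
        (part.get_content_subtype(), part.get_payload(decode=True))
        for part in msg.walk()
        if part.get_content_maintype() == "image"
    ]


def ocr_with_gocr(data, suffix):
    """Write the image to a temp file and return whatever text gocr prints.

    Assumes gocr is on PATH and can read this image format.
    """
    with tempfile.NamedTemporaryFile(suffix=suffix) as tmp:
        tmp.write(data)
        tmp.flush()
        result = subprocess.run(
            ["gocr", "-i", tmp.name],
            capture_output=True, text=True, check=False,
        )
    return result.stdout


def count_spam_words(text, words=SPAM_WORDS):
    """Count case-insensitive occurrences of the keywords in the text."""
    lowered = text.lower()
    return sum(lowered.count(word) for word in words)


def image_spam_hits(msg, ocr=ocr_with_gocr):
    """Total keyword hits across the OCR output of all images in the message."""
    return sum(
        count_spam_words(ocr(data, "." + subtype))
        for subtype, data in image_parts(msg)
    )
```

A real filter would turn the hit count into a score and combine it with the other rules rather than rejecting mail outright. The `ocr` parameter exists so that a different OCR engine, or a stub for testing, can be swapped in.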



3 comments:

  1. Do you really think you need to OCR the GIFs to figure this out?
    I'm actually really disappointed in SpamAssassin for not being able to deal with these messages.
    Seems like a rule for "Contains more than 5 GIF images" would catch 100% of these messages. Have you seen such a filter around anywhere?

  2. The image spams that I have seen only have one image. The image is usually meant to look like a regular email.
    There are some SpamAssassin rules that attempt to mark spam containing GIF images, but they don't do anything with the content of those images. For example, 70_sare_stocks has the SARE_GIF_ATTACH rule, which increases the score if the email has an attached GIF image.

  3. The messages that I have seen have a single GIF or PNG image, and both the surrounding content and the file name are obfuscated and randomized, so you need to extract the text from the image to deal with them.
