My comments on

http://pralab.diee.unica.it/en/ImageCerberus

IC is a futile effort, like digging a hole in the water, because the problem
of image spam with obfuscated text cannot be solved by OCR.

My approach is a "better safe than sorry" best practice that anyone can 
implement with existing software:

1. do not display the content of attachments and linked resources inline;
2. give a high spam score (>= 5) to any email with a very low text-to-image ratio.
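
Point 2 can be sketched in a few lines of Python with only the standard
library. This is a rough illustration, not a tested filter: measuring the
ratio in bytes, the MIN_RATIO threshold and the function names are my own
assumptions, and only the score of 5 comes from the rule above.

```python
from email import message_from_bytes
from email.message import Message

SPAM_SCORE = 5.0   # matches the ">= 5" suggested above
MIN_RATIO = 0.01   # hypothetical threshold: text bytes per image byte


def text_image_ratio(msg: Message) -> float:
    """Return decoded text bytes divided by decoded image bytes.

    Returns infinity when the message carries no image parts at all.
    """
    text_bytes = 0
    image_bytes = 0
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers hold no payload of their own
        payload = part.get_payload(decode=True) or b""
        maintype = part.get_content_maintype()
        if maintype == "text":
            text_bytes += len(payload.strip())
        elif maintype == "image":
            image_bytes += len(payload)
    if image_bytes == 0:
        return float("inf")
    return text_bytes / image_bytes


def image_spam_score(raw: bytes) -> float:
    """Score a raw RFC 5322 message: SPAM_SCORE if it is almost all image."""
    msg = message_from_bytes(raw)
    return SPAM_SCORE if text_image_ratio(msg) < MIN_RATIO else 0.0
```

Note that HTML markup counts as text here, so a real rule would probably
strip tags before counting.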

For PDF and similar attachments, reject anything with built-in macros or scripts.
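
For PDF, a deliberately naive check could look like the sketch below. The
list of name tokens is my assumption and is incomplete, and a plain byte scan
will miss markers hidden in compressed object streams or written with hex
escapes, so take it as an illustration of the idea rather than a real scanner.

```python
import re

# Name tokens that commonly mark active content in a PDF.
# Assumed, incomplete list chosen for illustration.
ACTIVE_MARKERS = re.compile(rb"/(JavaScript|JS|OpenAction|Launch)\b")


def pdf_has_active_content(data: bytes) -> bool:
    """Return True if data looks like a PDF and contains an active-content marker."""
    looks_like_pdf = b"%PDF-" in data[:1024]
    return looks_like_pdf and ACTIVE_MARKERS.search(data) is not None
```

False positives are possible, since the tokens can also occur by chance in
binary streams, but that is consistent with "better safe than sorry".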

R

On Tue, Oct 16, 2018 at 06:49, Olivier <olivier.nic...@cs.ait.ac.th> wrote:

> Brent,
>
> I have Fuzzy OCR installed and running, but the only rule that was
> triggered 22 times during the past 40 days was FUZZY_OCR_WRONG_CTYPE,
> meaning that the image type does not match the content-type set for
> MIME.
>
> That is still a valid catch, but not based on the OCR'ed text.
>
> One of my reservations about FuzzyOCR is that you have to provide an
> independent word list, while we have a very good tool to analyze text
> contents: SpamAssassin itself. So I would much prefer FuzzyOCR to feed
> the OCR'ed text back to SA for further analysis (the way pdfAssassin is
> working). But then, we need a way to detect that the OCR process has
> worked, that some more or less valid text, in a valid language has been
> extracted.
>
> Another approach I like is the one of Image Cerberus (dig in
> http://prag.diee.unica.it/amilab) which uses meta data of the image
> (size, histogram of colours, etc.) to classify the image as probable
> spam or probable ham and then applies a Bayes classifier.
>
> As for your question about the place for image scanning, if your MTA has
> the resources to do so, why not? And if FuzzyOCR is not yet the ultimate
> OCR solution, it is still improving, so why give up a tool that can
> help?
>
> Regards,
>
> Olivier
> --
