My comments on http://pralab.diee.unica.it/en/ImageCerberus
IC is an effort to dig a hole in the water, because the problem of image spam with obfuscated text cannot be solved by OCR. My approach is a "better safe than sorry" best practice that anyone can implement with existing software:

1. do not display the content of attachments and linked resources inline;
2. give a high spam score (>=5) to any email with a very low text-to-image ratio.

For PDF and similar attachments, reject anything with built-in macros or scripts.

R

On Tue, Oct 16, 2018 at 06:49, Olivier <olivier.nic...@cs.ait.ac.th> wrote:
> Brent,
>
> I have Fuzzy OCR installed and running, but the only rule that was
> triggered during the past 40 days (22 times) was FUZZY_OCR_WRONG_CTYPE,
> meaning that the image type does not match the content-type set for
> MIME.
>
> That is still a valid catch, but not based on the OCR'ed text.
>
> One of my reservations about FuzzyOCR is that you have to provide an
> independent word list, while we have a very good tool to analyze text
> contents: SpamAssassin itself. So I would much prefer FuzzyOCR to feed
> the OCR'ed text back to SA for further analysis (the way pdfAssassin
> works). But then we need a way to detect that the OCR process has
> worked, that some more or less valid text, in a valid language, has been
> extracted.
>
> Another approach I like is the one of Image Cerberus (dig in
> http://prag.diee.unica.it/amilab), which uses metadata of the image
> (size, histogram of colours, etc.) to classify the image as probable
> spam or probable ham and then implements a Bayes classifier.
>
> As for your question about the place for image scanning, if your MTA has
> the resources to do so, why not? And if FuzzyOCR is not yet the ultimate
> OCR solution, it is still improving, so why give up a tool that can
> help?
>
> Regards,
>
> Olivier
> --
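
P.S. As a rough illustration of rule 2 at the top of this message, here is a minimal Python sketch that uses only the standard library `email` package. The function names and the `min_ratio` threshold are assumptions made for illustration and are not taken from any existing SpamAssassin rule; the penalty of 5 is the score suggested above. It compares decoded bytes of text parts with decoded bytes of image parts, which is only one possible way to define the ratio.

```python
from email.message import EmailMessage, Message


def text_and_image_bytes(msg: Message) -> tuple[int, int]:
    """Return (text_bytes, image_bytes) summed over all leaf MIME parts."""
    text_bytes = 0
    image_bytes = 0
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no payload of their own
        payload = part.get_payload(decode=True) or b""
        maintype = part.get_content_maintype()
        if maintype == "text":
            # Note: HTML markup is counted as text here, a known weakness.
            text_bytes += len(payload)
        elif maintype == "image":
            image_bytes += len(payload)
    return text_bytes, image_bytes


def image_spam_score(msg: Message,
                     min_ratio: float = 0.01,  # hypothetical threshold
                     penalty: float = 5.0) -> float:
    """Return `penalty` when the text-to-image ratio is below `min_ratio`."""
    text_bytes, image_bytes = text_and_image_bytes(msg)
    if image_bytes == 0:
        return 0.0  # no images, so this rule does not apply
    if text_bytes / image_bytes < min_ratio:
        return penalty
    return 0.0


if __name__ == "__main__":
    demo = EmailMessage()
    demo.set_content("Buy now")
    demo.add_attachment(b"\x00" * 50_000, maintype="image", subtype="png")
    print(image_spam_score(demo))
```

In practice the returned value would be added to whatever score the rest of the filter produces, and `min_ratio` would need tuning against real ham, since newsletters with large logos can also have a low ratio.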