Hi,

While going through some old files, I noticed that I have been using a modified version of PDFassassin. I did not even know that I had ported it to my new mail server.

Basically, it extracts the text from a PDF attachment and feeds it back to SA for further analysis. The only difference from the original PDFassassin is that the text is added at the end of the original message, so that it does not launch a second instance of SA to check it. Similarly, images are attached at the end of the original message, for further scanning by whatever your image scanner is. That way, PDFassassin does not try to identify spam itself, but only extracts the various parts of a PDF document for SA to analyze.

I wonder if someone would be interested in reviewing what I have done?

In the same way, I am wondering if something similar exists for all the (open|libre|MS)office documents?

Finally, I am wondering whether fuzzyOCR is still of any interest? As above, I'd like to see it push the strings it can identify into the body of the message, for further analysis by SA, rather than having its own list of spam words.

Best regards,
Olivier
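P.S. To make the idea concrete, here is a rough Python sketch of the "extract the text and append it to the same message" approach. It is only an illustration and not the actual plugin code. The function names `append_pdf_text` and `pdftotext_extract` and the filename `pdf-text.txt` are made up for this example, and the helper assumes that Poppler's `pdftotext` command is installed. Image handling is left out.

```python
import os
import subprocess
import tempfile
from email import message_from_bytes, policy
from email.message import EmailMessage
from typing import Callable


def pdftotext_extract(pdf_bytes: bytes) -> str:
    """Hypothetical helper: run Poppler's pdftotext (assumed installed)."""
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(pdf_bytes)
        path = tmp.name
    try:
        # "pdftotext file.pdf -" writes the extracted text to stdout.
        result = subprocess.run(
            ["pdftotext", path, "-"], capture_output=True, check=True
        )
        return result.stdout.decode("utf-8", errors="replace")
    finally:
        os.unlink(path)


def append_pdf_text(
    raw: bytes, extract_text: Callable[[bytes], str] = pdftotext_extract
) -> EmailMessage:
    """Append the text of every PDF attachment as a new text/plain part.

    The original parts are left untouched; the extracted text simply
    becomes one more part at the end, so a single scanner pass sees it.
    """
    msg = message_from_bytes(raw, policy=policy.default)

    # Collect first, then modify, so we do not change the tree while walking it.
    texts = []
    for part in msg.walk():
        if part.get_content_type() == "application/pdf":
            pdf_bytes = part.get_payload(decode=True)
            texts.append(extract_text(pdf_bytes))

    for text in texts:
        if text.strip():
            msg.add_attachment(text, subtype="plain", filename="pdf-text.txt")
    return msg
```

The extractor is passed in as a parameter so that a different tool (or an extractor for another document type) can be swapped in without touching the message handling.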