>On Tue, 16 Oct 2018 11:49:54 +0700 Olivier wrote:
>> One of my reservations about FuzzyOCR is that you have to provide an
>> independent word list, while we have a very good tool to analyze
>> text contents: SpamAssassin itself. So I would much prefer
>> FuzzyOCR to feed the OCR'ed text back to SA for further analysis
>> (the way pdfAssassin is working).
On 16.10.18 13:34, RW wrote:
>That works as long as the OCR remains very accurate. What happened
>before was that the deployment of OCR led spammers to make their
>text much less readable.
On Tue, 16 Oct 2018 15:48:34 +0200 Matus UHLAR - fantomas wrote:
I think the original reason was that the available OCR programs were
not reliable enough.
I tested gocr, ocrad and tesseract more than 10 years ago, with not
very satisfying results; gocr was the best at that time.
Since then, Google took tesseract and made it much better.
I believe that currently it would be viable to push OCR output to
SpamAssassin for processing with Bayes and other rules.
On 16.10.18 18:42, RW wrote:
Bayes might work, but I wouldn't like to see it added to body text
because corrupted text could look like obfuscation.
It should be pushed back to body text just for filters like Bayes.
The same could/should be done for attached .doc, .pdf files etc.
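
To make the idea concrete, here is a rough sketch of keeping extracted
text separate from the real body, so that only statistical filters see
it. This is not SpamAssassin code (SA is written in Perl) and not how
FuzzyOCR works; ScanInput, add_ocr_text and run_tesseract are
hypothetical names made up for illustration. The only external
assumption is that the tesseract command is installed and prints the
recognised text when "stdout" is given as the output base.

```python
import subprocess
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


def run_tesseract(image_path: str) -> str:
    """Run the tesseract CLI and return the recognised text.

    Passing "stdout" as the output base makes tesseract print the
    text instead of writing it to a file.
    """
    result = subprocess.run(
        ["tesseract", image_path, "stdout"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


@dataclass
class ScanInput:
    """Hypothetical container; not a SpamAssassin data structure."""

    body_text: str
    extracted_text: List[str] = field(default_factory=list)

    def text_for_body_rules(self) -> str:
        # Body rules see only the original text, so OCR noise cannot
        # be mistaken for deliberate obfuscation.
        return self.body_text

    def text_for_bayes(self) -> str:
        # Statistical filters see the body plus everything extracted.
        return "\n".join([self.body_text, *self.extracted_text])


def add_ocr_text(
    scan: ScanInput,
    image_paths: Iterable[str],
    ocr: Callable[[str], str] = run_tesseract,
) -> ScanInput:
    """Append OCR output for each image to the extracted text."""
    for path in image_paths:
        try:
            text = ocr(path).strip()
        except (OSError, subprocess.CalledProcessError):
            continue  # OCR missing or failed: skip this image
        if text:
            scan.extracted_text.append(text)
    return scan
```

Text pulled out of .doc or .pdf attachments could be appended to the
same extracted_text list by passing a different extractor as the ocr
argument.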
--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
42.7 percent of all statistics are made up on the spot.