Re: spam mail with flagged style images

Melkhior Thu, 20 Aug 2009 08:54:57 -0700


Paul Houselander (SME) wrote:
> 
> However it seems to have evolved again and tesseract is not extracting 
> any useable words.
> 
> Obviously too early to tell how effective it is but I'll update the list 
> with my findings.
>


I have received several of those.

You can get tesseract to recognize many words in those messages. I have
documented effective preprocessing settings for those images at
<http://fuzzyocr.own-hero.net/wiki/GoodOcrSettings>. There is also a related
ticket, #2946: <http://fuzzyocr.own-hero.net/ticket/2946>.

Basically, tesseract can be very effective if it has enough image data to
work on, and the default behavior of FuzzyOcr is suboptimal for these images.
You need to avoid the conversion to PNM by preprocessing directly from the
JPG to a TIFF file suitable for Tesseract processing.
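
To illustrate the idea, here is a minimal Python sketch of going straight
from JPG to TIFF and then running tesseract. It is not the FuzzyOcr scanset
from the wiki page. The function names, the grayscale step and the 200%
scale factor are assumptions chosen for illustration; it only assumes that
ImageMagick's "convert" and the "tesseract" binary are on the PATH.

```python
import subprocess
from pathlib import Path


def build_commands(jpg_path, work_dir):
    """Return the (convert, tesseract) command lines for one JPG image.

    Hypothetical helper: the preprocessing options below are illustrative
    assumptions, not the settings documented on the FuzzyOcr wiki.
    """
    jpg = Path(jpg_path)
    tiff = Path(work_dir) / (jpg.stem + ".tif")
    out_base = Path(work_dir) / jpg.stem  # tesseract appends ".txt" itself

    convert_cmd = [
        "convert", str(jpg),
        "-colorspace", "Gray",  # drop colour information
        "-resize", "200%",      # assumed scale factor, gives OCR more pixels
        "-compress", "None",    # write an uncompressed TIFF
        str(tiff),
    ]
    tesseract_cmd = ["tesseract", str(tiff), str(out_base)]
    return convert_cmd, tesseract_cmd


def ocr_jpeg(jpg_path, work_dir):
    """Convert the JPG directly to TIFF (no PNM step) and return the OCR text."""
    convert_cmd, tesseract_cmd = build_commands(jpg_path, work_dir)
    subprocess.run(convert_cmd, check=True)
    subprocess.run(tesseract_cmd, check=True)
    return (Path(work_dir) / (Path(jpg_path).stem + ".txt")).read_text()
```

Calling ocr_jpeg("spam.jpg", "/tmp/work") would leave spam.tif and spam.txt
in the work directory. The point is only that the JPG is handed to the TIFF
conversion at full resolution, with no intermediate PNM file.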

If anyone is interested, please say so on the ticket. I'll try to produce a
clean patch for FuzzyOcr and attach it, along with my scansets and preps, to
the ticket.
-- 
View this message in context: 
http://www.nabble.com/spam-mail-with-flagged-style-images-tp25059821p25064803.html
Sent from the SpamAssassin - Users mailing list archive at Nabble.com.
