Paul Houselander (SME) wrote:
>
> However it seems to have evolved again and tesseract is not extracting
> any useable words.
>
> Obviously too early to tell how effective it is, but I'll update the list
> with my findings.
>
I have received several of those. You can get tesseract to recognize many words in those messages. I have documented an efficient preprocessing setup for those images at <http://fuzzyocr.own-hero.net/wiki/GoodOcrSettings>. There's also ticket #2946 on a related subject: <http://fuzzyocr.own-hero.net/ticket/2946>.

Basically, tesseract can be very effective if it has enough data to work on, and the default behavior of FuzzyOcr is suboptimal in that case. You need to avoid the conversion to PNM by preprocessing directly from the JPG to a TIFF file suitable for Tesseract processing.

If anyone is interested, please say so on the ticket. I'll try to produce a clean patch for FuzzyOcr and attach it, along with my scansets and preps, to the ticket.

-- 
View this message in context: http://www.nabble.com/spam-mail-with-flagged-style-images-tp25059821p25064803.html
Sent from the SpamAssassin - Users mailing list archive at Nabble.com.
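
For anyone who wants to try the JPG-to-TIFF idea by hand before a patch exists, here is a rough sketch of the two steps. It is not the configuration from the wiki page and not FuzzyOcr code. It assumes ImageMagick's `convert` and the `tesseract` command-line tool are installed; the function name `build_ocr_commands`, the file names, and the particular `convert` options (grayscale, 8-bit, uncompressed) are my own illustrative choices and will likely need tuning for real spam images. The function only builds the command lines, so you can inspect them before running anything.

```python
import shlex
from pathlib import Path


def build_ocr_commands(jpeg_path, work_dir="."):
    """Return two argv lists: JPEG -> TIFF (no PNM step), then OCR on the TIFF.

    Hypothetical helper for illustration only; the convert options are
    assumptions, not the settings documented on the FuzzyOcr wiki.
    """
    src = Path(jpeg_path)
    tiff = Path(work_dir) / (src.stem + ".tif")
    out_base = Path(work_dir) / src.stem  # tesseract appends ".txt" itself

    # Convert straight from JPEG to an uncompressed 8-bit grayscale TIFF.
    convert_cmd = [
        "convert", str(src),
        "-colorspace", "Gray",
        "-depth", "8",
        "-compress", "none",
        str(tiff),
    ]
    # Run tesseract on the TIFF; the recognized text lands in <out_base>.txt.
    tesseract_cmd = ["tesseract", str(tiff), str(out_base)]
    return convert_cmd, tesseract_cmd


if __name__ == "__main__":
    # Print the commands instead of executing them.
    for cmd in build_ocr_commands("spam.jpg", "work"):
        print(shlex.join(cmd))
```

To actually run the steps, pass each list to `subprocess.run(cmd, check=True)` in order, then read the resulting `.txt` file.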