Hi,

This ticket in FuzzyOcr, http://fuzzyocr.own-hero.net/ticket/15, proposes sending the text produced by the OCR process back to SA for analysis.
I fully second that idea, but I am wondering *what* text to push back: depending on the scanset being used, the same image decodes as:

[20834] dbg: FuzzyOcr: ocrdata=>>. Uíagra tl.7g
[20834] dbg: FuzzyOcr: . Cíalís t2.6g
[20834] dbg: FuzzyOcr:
[20834] dbg: FuzzyOcr: <<=end
[20834] dbg: FuzzyOcr: ocrdata=>><<=end
[20834] dbg: FuzzyOcr: ocrdata=>>. Uíagra tl.7g
[20834] dbg: FuzzyOcr: . Cíalís t2.6g
[20834] dbg: FuzzyOcr:
[20834] dbg: FuzzyOcr: <<=end
[20834] dbg: FuzzyOcr: ocrdata=>>' Viagra tl.79
[20834] dbg: FuzzyOcr: ' CiaIis t2.69
[20834] dbg: FuzzyOcr: <<=end

The last scanset is the one preferred by FuzzyOcr when we let it do the word analysis, but the first may even be enough for SA. So the question really is: when can we say that the OCR is giving results clean enough to be used by SA? We should not give SA the results of all scansets, or that would artificially raise the spam score.

On the other hand, for a photograph, the OCR text may look like the following. This should never be pushed to SA, so how do we decide?

[19120] dbg: FuzzyOcr: ocrdata=>>. ._ .
[19120] dbg: FuzzyOcr: _\
[19120] dbg: FuzzyOcr: | _
[19120] dbg: FuzzyOcr: _ |
[19120] dbg: FuzzyOcr:
[19120] dbg: FuzzyOcr: _? _4'|
[19120] dbg: FuzzyOcr: , _ ,. . .
[19120] dbg: FuzzyOcr:
[19120] dbg: FuzzyOcr: __ - . . _
[19120] dbg: FuzzyOcr: _ . . .
[19120] dbg: FuzzyOcr: .._ _ .
[19120] dbg: FuzzyOcr:
[19120] dbg: FuzzyOcr: <<=end

Best regards,
Olivier
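
P.S. To make the "clean enough" question concrete, here is one possible filter, written as a rough Python sketch. It is not what FuzzyOcr does today. The function name looks_like_text and both thresholds are made up for illustration and would need tuning on real mail. The idea is to push OCR output to SA only when most non-space characters are letters and at least a couple of word-like runs are present.

```python
import re

# Runs of three or more letters (Unicode-aware, so "Cíalís" counts as a word).
WORD_RE = re.compile(r"[^\W\d_]{3,}")


def looks_like_text(ocr_text, min_letter_ratio=0.5, min_words=2):
    """Guess whether OCR output is real text rather than image noise.

    Hypothetical heuristic: both thresholds are illustrative assumptions.
    """
    # Ignore whitespace so sparse layouts are not penalised for their gaps.
    chars = [c for c in ocr_text if not c.isspace()]
    if not chars:
        return False  # an empty scanset result carries nothing worth scoring

    # Share of non-space characters that are letters.
    letters = sum(1 for c in chars if c.isalpha())
    letter_ratio = letters / len(chars)

    # Number of word-like runs found in the text.
    words = WORD_RE.findall(ocr_text)

    return letter_ratio >= min_letter_ratio and len(words) >= min_words
```

With these thresholds, both the first and the last scanset outputs above would be accepted, while the empty result and the photograph output would be rejected. Choosing which single accepted scanset to forward, so that the score is not counted several times, remains a separate decision.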