FuzzyOcr: Pushing OCR'ed text back to SA

Olivier Nicole Thu, 15 Nov 2007 22:45:26 -0800

Hi,

This FuzzyOcr ticket, http://fuzzyocr.own-hero.net/ticket/15, proposes
sending the text produced by the OCR process back to SA for analysis.


I fully second that idea, but I am wondering *what* text to push back:
depending on the scanset being used, the same image decodes as:

[20834] dbg: FuzzyOcr: ocrdata=>>. Uíagra tl.7g
[20834] dbg: FuzzyOcr: . Cíalís t2.6g
[20834] dbg: FuzzyOcr: 
[20834] dbg: FuzzyOcr: <<=end

[20834] dbg: FuzzyOcr: ocrdata=>><<=end

[20834] dbg: FuzzyOcr: ocrdata=>>. Uíagra tl.7g
[20834] dbg: FuzzyOcr: . Cíalís t2.6g
[20834] dbg: FuzzyOcr: 
[20834] dbg: FuzzyOcr: <<=end

[20834] dbg: FuzzyOcr: ocrdata=>>' Viagra tl.79
[20834] dbg: FuzzyOcr: ' CiaIis t2.69
[20834] dbg: FuzzyOcr: <<=end

The last scanset is the one preferred by FuzzyOcr when we let it do the
word analysis, but the first may well be enough for SA.

So the question really is: when can we say that the OCR gives results
clean enough to be used by SA? We should not give SA the results of
all scansets, or that would artificially raise the spam score.
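
To make the "one result only" idea concrete, here is a rough Python
sketch (not FuzzyOcr's actual code). It drops empty and duplicate
scanset outputs, then keeps the single output with the most approximate
word-list hits. The names word_hits and pick_single_result, the tiny
word list and the 0.8 threshold are all assumptions made up for
illustration.

```python
from difflib import SequenceMatcher

# Two of the scanset outputs quoted above, plus the empty one.
OUT_A = ". Uíagra tl.7g\n. Cíalís t2.6g\n"
OUT_B = "' Viagra tl.79\n' CiaIis t2.69\n"
OUTPUTS = [OUT_A, "", OUT_A, OUT_B]

# Hypothetical word list; a real one would be much longer.
WORDS = ["viagra", "cialis"]


def word_hits(text, wordlist, threshold=0.8):
    """Count tokens that approximately match any word in the list."""
    hits = 0
    for token in text.lower().split():
        if any(SequenceMatcher(None, token, w).ratio() >= threshold
               for w in wordlist):
            hits += 1
    return hits


def pick_single_result(outputs, wordlist):
    """Return one output to hand to SA, or None if nothing is usable."""
    # dict.fromkeys removes duplicates while keeping the original order.
    candidates = list(dict.fromkeys(o for o in outputs if o.strip()))
    if not candidates:
        return None
    return max(candidates, key=lambda o: word_hits(o, wordlist))


best = pick_single_result(OUTPUTS, WORDS)
```

With these assumed settings the sketch keeps the last scanset's output,
so SA would see each word once instead of once per scanset.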

On the other hand, for a photograph, the OCR text may look like the
following. This should never be pushed to SA, so how do we decide?

[19120] dbg: FuzzyOcr: ocrdata=>>. ._ .
[19120] dbg: FuzzyOcr: _\
[19120] dbg: FuzzyOcr: | _
[19120] dbg: FuzzyOcr: _ |
[19120] dbg: FuzzyOcr: 
[19120] dbg: FuzzyOcr: _? _4'|
[19120] dbg: FuzzyOcr: , _ ,. . .
[19120] dbg: FuzzyOcr: 
[19120] dbg: FuzzyOcr: __ - . . _
[19120] dbg: FuzzyOcr: _ . . .
[19120] dbg: FuzzyOcr: .._ _ .
[19120] dbg: FuzzyOcr: 
[19120] dbg: FuzzyOcr: <<=end
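
One possible way to decide, sketched in Python under assumptions of my
own: measure what fraction of the non-whitespace characters are letters
or digits, and require at least one token with a few letters in it. The
function names and both thresholds (0.6 and 3) are hypothetical and
would need tuning against real mail.

```python
def alnum_ratio(text):
    """Fraction of non-whitespace characters that are letters or digits."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalnum() for c in chars) / len(chars)


def looks_like_text(text, min_ratio=0.6, min_letters=3):
    """Guess whether OCR output is readable enough to pass on to SA."""
    # At least one token must contain min_letters alphabetic characters.
    has_word = any(
        sum(c.isalpha() for c in token) >= min_letters
        for token in text.split()
    )
    return has_word and alnum_ratio(text) >= min_ratio


# OCR output for the spam image and for the photograph, as logged above.
SPAM = ". Uíagra tl.7g\n. Cíalís t2.6g\n"
PHOTO = (". ._ .\n_\\\n| _\n_ |\n\n_? _4'|\n, _ ,. . .\n\n"
         "__ - . . _\n_ . . .\n.._ _ .\n")

spam_ok = looks_like_text(SPAM)
photo_ok = looks_like_text(PHOTO)
```

On these two samples the spam output passes and the photograph output
is rejected, but that says nothing yet about false positives on other
images.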

Best regards,

Olivier
