Nix wrote:
On 31 May 2007, Graham Murray said:

Nix <[EMAIL PROTECTED]> writes:

(And, let's be blunt, the pure this-word-is-spammy recognition part of
FuzzyOCR is much less smart than the Bayesian system already present
in SA: FuzzyOCR should really use the Bayesian system to determine the
spamminess of words, I suppose...)
Or even just act as a MIME part 'decoding' system (like Base64) and feed
all words it finds in images into Bayes, along with all other text in
the mail, rather than generating a score itself.

Perhaps so, but if so those words should have a score-multiplier of some
sort applied, because the fact that those words originated in images is
itself an obfuscation technique that should be noted in the score.
This has been discussed here again and again and again.

First of all, these 10 words found in an image cannot stand against the Bayes poisoning found in all these messages, so they would be useless for Bayes filtering. Secondly, the recognition accuracy of the OCR is pretty bad, so we cannot use exact matches. That is exactly why this app is named FuzzyOCR, in contrast to the original version, which wasn't fuzzy. It is also why we get such high hit rates with it: it can still find these words even if one or two letters are wrong. Try to do that with regular expressions and it gets ugly and big quite fast.
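To make the "one or two letters wrong" point concrete, here is a minimal sketch of matching with an error tolerance, using plain Levenshtein edit distance. It is only an illustration and not FuzzyOCR's actual matching code; the function names, the word list and the sample OCR tokens are all made up for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def fuzzy_hits(ocr_words, wordlist, max_errors=2):
    """Return the wordlist entries that at least one OCR token matches
    with no more than max_errors edits (hypothetical helper)."""
    hits = []
    for target in wordlist:
        if any(edit_distance(tok.lower(), target) <= max_errors
               for tok in ocr_words):
            hits.append(target)
    return hits


# Made-up OCR output with typical misreads (i -> 1, o -> c, y -> v).
ocr_output = ["v1agra", "stcck", "hello", "pharmacv"]
print(fuzzy_hits(ocr_output, ["viagra", "stock", "pharmacy"], max_errors=1))
# prints ['viagra', 'stock', 'pharmacy']
```

An exact-match lookup would find none of these three words, while a regular expression would need an alternative for every possible misread of every letter.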

FuzzyOCR is perfect just the way it is. It might need some tweaking, yes, but then it can do exactly what you want. If you want an upper limit, just hack the source and add it; it's not too hard. I've added a few tweaks myself. For example, it no longer stops as soon as one scanset has found the minimum number of words, but continues until the minimum + 10 have been found. I don't want it to stop at 2 words if a later scanset could find 15.
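Roughly, the tweak behaves like the sketch below. This is Python for illustration only, not the Perl from the plugin; `scan_until_enough`, `run_scanset` and the fake per-scanset results are hypothetical names and data, and I am assuming the best count over all scansets tried is what gets kept.

```python
def scan_until_enough(scansets, run_scanset, min_words, extra=10):
    """Try scansets in order and keep the best hit count seen so far.
    Stop early only once min_words + extra words have been matched,
    rather than at the first scanset that reaches min_words."""
    best = 0
    for scanset in scansets:
        hits = run_scanset(scanset)  # number of words this scanset matched
        best = max(best, hits)
        if best >= min_words + extra:
            break
    return best


# Fake results: the first scanset barely reaches the minimum of 2 words,
# the second one finds 15.
fake_results = {"scanset_a": 2, "scanset_b": 15, "scanset_c": 4}
tried = []

def run_fake(name):
    tried.append(name)
    return fake_results[name]

print(scan_until_enough(list(fake_results), run_fake, min_words=2))  # prints 15
print(tried)  # prints ['scanset_a', 'scanset_b']
```

With the stock "stop at the minimum" behaviour, the run above would have ended after the first scanset with only 2 words.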

I agree, an upper bound would be quite interesting for a few folks (actually, I don't mind having a FuzzyOCR hit with 20+ matched words; that's just perfect, because the FP rate has been zero so far), and it shouldn't be too hard to add. So you might officially request this for a future version or, like I said, just do it yourself. If you can't do it, I might have a look and give you a hint in the right direction, even though I'm not really a good Perl programmer.
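As a hint at what such an upper bound could look like, here is a small sketch. It assumes a score made of a base value plus an increment per matched word beyond the minimum; that formula, the function `capped_score` and all parameter names are assumptions for illustration, not taken from the FuzzyOCR source.

```python
def capped_score(hits, min_words, base_score, add_score, max_score):
    """Score an image by the number of matched words, with an upper bound.
    Assumed formula: base_score once min_words is reached, plus add_score
    for every additional matched word, clamped to max_score."""
    if hits < min_words:
        return 0.0  # not enough matched words to fire at all
    raw = base_score + (hits - min_words) * add_score
    return min(raw, max_score)


print(capped_score(3, 2, 5.0, 1.0, 10.0))   # prints 6.0
print(capped_score(20, 2, 5.0, 1.0, 10.0))  # prints 10.0 (23.0 without the cap)
```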


Matt
