A rant about FUZZY_OCR

Adam Katz Sun, 26 Apr 2009 11:37:41 -0700

> On Fri, Apr 24, 2009 at 05:14:21PM -0400, Adam Katz wrote:
>> I wouldn't trust FUZZY_OCR with anything.  12 points is *WAY* too high
>> for any single thing.  I had to disable this plugin a year or three
>> ago because it assigned 20+ points to legit screenshots in ham (and
>> that was /after/ I trimmed its flagging words file down in size)!

Henrik K wrote:
> You do realize that it's configurable? Who to blame if you just run
> things blindly.

I expect the defaults to at least border on sane.  As noted before, I've
tried and failed to configure it.  Could you point me at where the
configuration options are documented, specifically focr_threshold?  All I
see is the installation manual and the .cf file, neither of which is
terribly informative (unlike, say, the perldoc pages for other plugins).

A search for it (http://google.com/search?q=FUZZY_OCR) finds an
OVERWHELMING MAJORITY of hits describing false positives and
configuration issues.  The official documentation didn't even make it into
the top 100 hits on Google, and after finding it on the SA wiki (Google
hit #59), I found it sparse at best (I had to dive into the svn repo!).

The FAQ, which features only two answered questions, includes an
unanswered question about how to cap the score, which IMHO is a
mission-critical feature.
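
For illustration only, here is a minimal Python sketch of what a score cap
could look like.  Every name in it (capped_score, counts_required,
base_score, add_score, cap) is hypothetical and not taken from the plugin's
configuration, and the linear "base plus a bit per extra hit" scoring is an
assumption, not a description of what FuzzyOcr actually does.

```python
def capped_score(hits, counts_required=2, base_score=5.0, add_score=1.0, cap=4.0):
    """Score that grows with the number of matched words but never exceeds `cap`.

    All parameter names and the linear formula are assumptions made for
    this sketch; they are not the plugin's real options.
    """
    if hits < counts_required:
        # Too few matched words: the rule does not fire at all.
        return 0.0
    # Uncapped score: a base value plus an increment per extra matched word.
    raw = base_score + add_score * (hits - counts_required)
    # The cap is what keeps one rule from deciding the verdict on its own.
    return min(raw, cap)


if __name__ == "__main__":
    for hits in (1, 2, 10, 30):
        print(hits, capped_score(hits))
```

With a cap in place, a screenshot that somehow matches dozens of "words"
contributes no more than the cap, rather than 20+ points.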

I don't know if I still have the example of the bad hit from those years
ago, but it made absolutely no sense, hitting dozens of "words found"
that did not exist ... and this was a PNG screen capture, not even a
photo or a JPEG-compressed image.  My company deals with screen captures
a LOT, and I just can't afford for such a poorly designed plugin to run
amok the way Fuzzy OCR does.

There are several tests, which is a good thing, but it's extremely
disturbing that none of them is designed to test for false positives, or
even to help you tweak the detection threshold.  You're left guessing
what reasonable levels are, especially when the config file (the best
docs I could find) points you at the manual (which I believe is the
install guide, which doesn't even include the string "thresh").

The last release was two years ago, and even on the svn trunk, the word
list hasn't been updated ... ever (excepting minor tweaks like a
threshold change from 0.1 to 0.01).  How is this fair?

The claim that FUZZY_OCR can't use the Bayesian database is a weak one,
too; just add a custom prefix to the tokens it creates.  (I don't know
SA's bayes token syntax, but other implementations use things like
"subject:foo" to indicate that the word "foo" in the subject differs
from the word "foo" elsewhere, so you could have "fuzzyocr:foo"
instead.  Implement the fuzziness by inserting a dozen tokens for each
possible parsing.)  This would solve the issue of stale or inappropriate
word lists.
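
The prefixed-token idea above can be sketched roughly as follows.  This is
a Python illustration, not SpamAssassin code: the function name ocr_tokens,
the "fuzzyocr:" prefix, and the input shape (a list of alternative OCR
readings per word) are all assumptions, and SA's real bayes token format
may differ.

```python
def ocr_tokens(candidate_readings, prefix="fuzzyocr:"):
    """Turn OCR output into prefixed tokens for a Bayesian classifier.

    `candidate_readings` is an iterable of lists; each inner list holds the
    alternative readings the OCR step produced for one word.  Emitting one
    token per alternative is the (assumed) way of expressing fuzziness.
    """
    tokens = set()
    for readings in candidate_readings:
        for reading in readings:
            cleaned = reading.strip().lower()
            if cleaned:
                # The prefix keeps image-derived words separate from the
                # same words seen in the subject or body text.
                tokens.add(prefix + cleaned)
    return sorted(tokens)


if __name__ == "__main__":
    # Two words, the first with two possible parsings.
    print(ocr_tokens([["Viagra", "v1agra"], ["cheap"]]))
```

The classifier would then learn which image-derived tokens show up in spam
versus ham from the site's own mail, instead of relying on a static list.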

Finally, I have no way of testing the thing live.  Since FUZZY_OCR is a
dynamically scored rule, I can't just push it to 0.001 and see the hits,
the way I can with the BAYES_XX thresholds for example.  (Sure, I can
make all score-changing values 0.001, but I'm not sure that would
properly test it, and given my past experiences, I wouldn't be surprised
if this still causes problems.)

It's a great idea, but I'd like to see it mature some first, especially
with respect to its documentation, test emails, word list, and live testing.
