R: two supposedly identical SA boxes, with slightly different report output -- help find the diff?

Giampaolo Tomassoni Wed, 29 Aug 2007 02:44:17 -0700

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On behalf of
> snowcrash+sa
> 
> hi andy,
> 
> > For what it's worth, the fuzzyocr hashing is of very limited value,
> and in
> > many cases is a severe performance hit. I found that scanning the
> hashes,
> > due to the "fuzzy" nature, is more costly than just rescanning the
> file
> > with OCR, as *each* *and* *every* hash must be checked iteratively.
> 
> now, *that's* an interesting point to consider.
> 
> i'd be interested in what, then, the 'goal' of the hashing/comparison
> *is*?
> 
> is it performance, and it just missed the mark for the reasons you
> state?  or is it something else?


The main purpose of FuzzyOcr's db was, of course, to avoid rerunning the
OCR passes needed to decode the image text for already-known images. The
problem is that the cache is not searched for an exact match of the key
values (image type, width, height, number of colors and color
frequencies): it looks for the best match of these values within a given
range. This has a number of drawbacks:

 a) range search defeats look-up indexing in the db,
    so the whole db has to be scanned for a match;

 b) range search also increases false positive matches
    on the db content;

 c) the db caches OCR results, so a match on it may return
    an unwanted/imprecise result if you have since tweaked the
    FuzzyOcr config and/or words files.

The first drawback may yield high processing times and even timeouts on a
medium-loaded mail server; the second is probably the worst problem for
most of us; and the last is, well, another problem.
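
To make drawbacks (a) and (b) concrete, here is a minimal Python sketch. It
is not FuzzyOcr's actual code: the key layout, the distance metric, the
tolerance value and all names (ImageKey, exact_lookup, fuzzy_lookup,
distance) are assumptions made up for illustration. It only shows why an
exact-match key can be answered by one indexed look-up, while a "best match
within a range" has to compare the probe against every cached entry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageKey:
    # Hypothetical key layout; the fields follow the list given above.
    image_type: str
    width: int
    height: int
    n_colors: int
    color_freqs: tuple  # assumed: counts of the most frequent colors


def distance(a, b):
    """Largest relative difference over the numeric fields (assumed metric)."""
    pairs = [(a.width, b.width), (a.height, b.height), (a.n_colors, b.n_colors)]
    pairs += list(zip(a.color_freqs, b.color_freqs))
    return max(abs(x - y) / max(x, y, 1) for x, y in pairs)


def exact_lookup(cache, key):
    """Exact match: one hash look-up, independent of the cache size."""
    return cache.get(key)


def fuzzy_lookup(cache, key, tolerance=0.05):
    """Best match within a tolerance: every cached entry must be visited.

    Returns (best_result_or_None, number_of_entries_compared).
    """
    best, best_dist, compared = None, None, 0
    for cached_key, result in cache.items():
        compared += 1
        if cached_key.image_type != key.image_type:
            continue
        dist = distance(cached_key, key)
        if dist <= tolerance and (best_dist is None or dist < best_dist):
            best, best_dist = result, dist
    return best, compared
```

With a cache of N entries, fuzzy_lookup always reports N comparisons, whereas
exact_lookup does not depend on N at all. The same tolerance that lets a
slightly altered copy of a known image hit the cache also lets a different
image with similar dimensions and colors hit it, which is the false-positive
case in (b).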

So, yes: FuzzyOcr's cache was meant to improve performance and, yes again,
it basically missed the mark.

The solution is simply to discard the cache db and run the OCR phases on
each and every image: on all but the most lightly loaded servers this is
the most effective way to deal with it. Most of us are used to turning
glitches off while keeping the good work... :)

Giampaolo


> dunno.
> 
> but, your point bears some benchmarking ...
> 
> thx!
