On Tue, 28 Aug 2007, snowcrash+sa wrote:

> aha!
> 
>  in FuzzyOcr.cf,
> 
>       -       focr_hashing_learn_scanned 1
>       +       focr_hashing_learn_scanned 0
> 
> then,
> 
>       rm Fuzzy*db*
> 

...

> 
> i did not realize that if the HASH pre-exists, then the images' total
> score hits -- and is reused from the hash db, but that none of the
> word-hit data is stored/reused.

For what it's worth, the fuzzyocr hashing is of very limited value, and in 
many cases is a severe performance hit. I found that scanning the hashes, 
due to the "fuzzy" nature, is more costly than just rescanning the file 
with OCR, as *each* *and* *every* hash must be checked iteratively.

Because of the "fuzzy" nature, you can't just check the db to "see if this 
hash exists." You have to go through and compare the generated hash to 
every hash in the db, and it considers it a match if it's "close enough".
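
To make the difference concrete, here is a minimal sketch of an exact lookup
versus a "close enough" scan. It is illustrative only: the integer
fingerprints, the Hamming-distance metric, the threshold, and the names
exact_lookup and fuzzy_lookup are all assumptions for the example, not
FuzzyOcr's actual hash format or code.

```python
# Illustrative only: integer fingerprints and a Hamming-distance threshold
# are assumptions, not FuzzyOcr's actual hash format or metric.

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def exact_lookup(db: dict, h: int):
    """Exact match: a single dictionary probe, independent of db size."""
    return db.get(h)

def fuzzy_lookup(db: dict, h: int, max_distance: int = 1):
    """Fuzzy match: compare against stored hashes one by one.

    Returns (score, comparisons). A miss costs one comparison per
    stored hash, so the work grows with the size of the db.
    """
    comparisons = 0
    for stored, score in db.items():
        comparisons += 1
        if hamming(stored, h) <= max_distance:
            return score, comparisons
    return None, comparisons

# Hypothetical db: 10,000 stored fingerprints mapped to a score.
db = {i * 1024: float(i) for i in range(1, 10001)}

# A fingerprint one bit away from a stored one: the exact lookup misses,
# while the fuzzy scan has to walk the db until it reaches the near match.
probe = 5000 * 1024 + 1
print(exact_lookup(db, probe))
print(fuzzy_lookup(db, probe))
```

The worst case is an image that has never been seen: the fuzzy scan then
touches every stored hash before giving up, and that cost is paid on top of
the OCR run that still has to happen.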

It's far less computationally expensive to just rescan the damn image. It
won't matter if you only get a couple hundred emails per day, but once the
number of stored hashes grows past a fairly small threshold, it becomes
faster to rescan the image than to go through every single stored hash to
see if you've already scanned a similar image.

Andy

---
Andy Dills
Xecunet, Inc.
www.xecu.net
301-682-9972
---
