On Tue, 28 Aug 2007, snowcrash+sa wrote:

> aha!
>
> in FuzzyOcr.cf,
>
> -   focr_hashing_learn_scanned 1
> +   focr_hashing_learn_scanned 0
>
> then,
>
>     rm Fuzzy*db*
>     ...
>
> i did not realize that if the HASH pre-exists, then the images' total
> score hits -- and is reused from the hash db, but that none of the
> word-hit data is stored/reused.

For what it's worth, the FuzzyOcr hashing is of very limited value, and in
many cases is a severe performance hit.

I found that scanning the hashes, due to the "fuzzy" nature, is more costly
than just rescanning the file with OCR, as *each* *and* *every* hash must be
checked iteratively.

Because of the "fuzzy" nature, you can't just check the db to "see if this
hash exists." You have to go through and compare the generated hash to every
hash in the db, and it considers it a match if it's "close enough".

It's far less computationally expensive to just rescan the damn image. It
won't matter if you only get a couple hundred emails per day, but once the
number of stored hashes grows past a fairly low threshold, it becomes faster
to rescan the image than to go through every single stored hash to see if
you've already scanned a similar image.

Andy

---
Andy Dills
Xecunet, Inc.
www.xecu.net
301-682-9972
---
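
To make the cost difference above concrete, here is a rough sketch in Python.
It is not FuzzyOcr's code (the plugin is written in Perl) and does not use its
real hash format. The fingerprint layout, the distance measure, and the
tolerance value are all assumptions made up for illustration. The only point
is that an exact lookup is a single probe, while a "close enough" lookup has
to walk the stored hashes one by one.

```python
from typing import Dict, Optional, Tuple

# Hypothetical fingerprint: a small tuple of integers describing an image.
# This is NOT FuzzyOcr's real hash format, just a stand-in for illustration.
Fingerprint = Tuple[int, ...]


def exact_lookup(db: Dict[Fingerprint, float], fp: Fingerprint) -> Optional[float]:
    """Exact match: one hash-table probe, regardless of how large db is."""
    return db.get(fp)


def fuzzy_lookup(
    db: Dict[Fingerprint, float], fp: Fingerprint, tolerance: int
) -> Tuple[Optional[float], int]:
    """'Close enough' match: compare fp against stored hashes one at a time.

    Returns (stored score or None, number of comparisons performed).
    The distance used here (sum of absolute differences) is an assumption.
    """
    comparisons = 0
    for stored, score in db.items():
        comparisons += 1
        distance = sum(abs(a - b) for a, b in zip(stored, fp))
        if distance <= tolerance:
            return score, comparisons
    return None, comparisons


if __name__ == "__main__":
    # 1000 stored fingerprints, each mapped to a made-up spam score.
    db = {(i * 100, i * 100 + 1, i * 100 + 2): float(i) for i in range(1000)}

    # A slightly altered copy of the last stored image.
    query = (99901, 99901, 99903)

    print(exact_lookup(db, query))                # no exact hit
    print(fuzzy_lookup(db, query, tolerance=3))   # hit, after scanning the db
```

In this sketch a miss always costs one comparison per stored hash, which is
the case that hurts as the db grows. Whether that scan ends up slower than an
OCR pass depends on the real comparison and on your hardware, so measure on
your own box before drawing conclusions from a toy like this.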