RE: [Devel-spam] FuzzyOcr 3.5.1 released

Giampaolo Tomassoni Mon, 08 Jan 2007 10:59:58 -0800

From: Andy Dills [mailto:[EMAIL PROTECTED]
> 
> ...snip...
>
> > I understand that the "order" keyword in select is potentially expensive,
> > but necessary because matches occur generally towards the most recent
> > entries, thus increasing the possibility of a match earlier on.  When your
> > hash count is in the thousands, earlier matches mean fewer queries to the
> > database, and potentially faster results.
> 
> It's not just the order directive, it's the iteration throughout the 
> entire database.
> 
> Consider when the database grows to >50k records. For a new image that 
> doesn't have a hash, that's 50k records that must be sorted, then sent
> from the DB server to the mail server, then all 50k records must be checked 
> against the hash before we decide that we haven't seen this image before. 
> That just isn't a workable algorithm. If iteration throughout the entire 
> database is a requirement, hashing is a performance hit rather than a 
> performance gain.
> 
> A better solution might be a separate daemon that holds the hashes in 
> memory, to which you submit the hash being considered.
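
To make the in-memory idea concrete, here is a rough sketch in Python. It is
not FuzzyOcr code. The HashStore class, the integer hash representation, and
the Hamming-distance threshold are all assumptions made for illustration; the
socket/daemon layer is left out. Hashes are kept newest first, so recent
matches are found early without any sorting in the database.

```python
from collections import deque


class HashStore:
    """Hypothetical in-memory store of known image hashes, newest first."""

    def __init__(self, max_entries=50000):
        # A bounded deque: when full, appendleft() drops the oldest entry
        # from the right end.
        self.entries = deque(maxlen=max_entries)

    def add(self, image_hash):
        # Newest hashes go to the front so they are checked first.
        self.entries.appendleft(image_hash)

    def find(self, image_hash, max_distance=4):
        # Assumption: hashes are integers and "close" means a small
        # Hamming distance (number of differing bits).
        for known in self.entries:
            if bin(known ^ image_hash).count("1") <= max_distance:
                return known
        return None
```

The lookup is still a linear scan, but it runs against memory in one process
instead of pulling every record across the network for each new image.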


Other options are the ones described in my recent post
(Message-ID: <[EMAIL PROTECTED]>), in which similar images are basically
clustered together by means of a surrogate index.
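
As a loose illustration of the clustering idea (not the exact scheme from
that post), the Python sketch below buckets images by a coarse key so that
only nearby buckets need to be compared. The choice of width and height as
features, the CELL size, and the names surrogate_key and ClusteredIndex are
assumptions picked for the example.

```python
from collections import defaultdict
from itertools import product

CELL = 16  # hypothetical bucket size, in pixels


def surrogate_key(width, height):
    # Coarse key: images of similar size land in the same or adjacent cell.
    return (width // CELL, height // CELL)


class ClusteredIndex:
    """Hypothetical index grouping image records by surrogate key."""

    def __init__(self):
        self.buckets = defaultdict(list)

    def add(self, width, height, image_hash):
        key = surrogate_key(width, height)
        self.buckets[key].append((width, height, image_hash))

    def candidates(self, width, height):
        # Look only at the matching cell and its eight neighbours,
        # instead of iterating over every stored record.
        kx, ky = surrogate_key(width, height)
        found = []
        for dx, dy in product((-1, 0, 1), repeat=2):
            found.extend(self.buckets.get((kx + dx, ky + dy), []))
        return found
```

In a database the same key could be stored in an indexed column, so the
select returns a handful of candidates rather than the whole table.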

giampaolo

> 
> Honestly, I have been extremely impressed with having hashing turned 
> completely off.
> 
> Andy
> 
> ---
> Andy Dills
> Xecunet, Inc.
> www.xecu.net
> 301-682-9972
> ---
