On Mon, 8 Jan 2007, Jorge Valdes wrote:

> I do understand that in large environments, optimizations have to be made in
> order not to kill server performance, and expiration is probably something
> that could be done at "more convenient times".  I will commit a script that
> can safely be run as a cronjob soon.

Excellent.

> I understand that the "order" keyword in select is potentially expensive, but
> necessary because matches occur generally towards the most recent entries,
> thus increasing the possibility of a match earlier on.  When your hash count
> is in the thousands, earlier matches mean fewer queries to the database, and
> potentially faster results.

It's not just the order directive; it's the iteration through the
entire database.

Consider when the database grows to >50k records. For a new image whose
hash isn't in the database yet, that's 50k records that must be sorted and
sent from the DB server to the mail server, and then all 50k records must be
checked against the hash before we can decide that we haven't seen this
image before. That just isn't a workable algorithm. If iterating over the
entire database is a requirement, hashing is a performance hit rather than
a performance gain.

A better solution might be a separate daemon that holds the hashes in
memory, to which you submit the hash being considered.
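
A minimal sketch of the in-memory part of that idea, in Python. Everything
here is an assumption for illustration: the HashStore name, the
representation of a hash as an integer fingerprint, and the use of Hamming
distance for near matches are all hypothetical, and the real hash format and
match rule may differ. The socket or IPC layer that would make this a daemon
is left out.

```python
class HashStore:
    """In-memory store of image hashes (hypothetical; not existing code)."""

    def __init__(self, max_distance=0):
        # max_distance = 0 means exact matches only.
        self.max_distance = max_distance
        self._hashes = set()

    def add(self, h):
        """Remember an integer hash."""
        self._hashes.add(h)

    def seen(self, h):
        """Return True if h, or a hash within max_distance bits, is stored."""
        # Exact hit: a constant-time set lookup, no sort and no DB round trip.
        if h in self._hashes:
            return True
        if self.max_distance == 0:
            return False
        # Near match (assumed rule): still a full scan, but over memory
        # rather than over rows shipped from the DB server.
        return any(
            bin(h ^ known).count("1") <= self.max_distance
            for known in self._hashes
        )


store = HashStore(max_distance=2)
store.add(0b10110010)
print(store.seen(0b10110011))  # differs by 1 bit
print(store.seen(0b01001101))  # differs by 8 bits
```

With exact matching the lookup cost no longer grows with the number of
stored hashes. With near matching the scan is still linear, so the gain
there is only in avoiding the sort and the transfer between servers.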

Honestly, I have been extremely impressed with having hashing turned 
completely off.

Andy

---
Andy Dills
Xecunet, Inc.
www.xecu.net
301-682-9972
---
