-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

jdow wrote:
> From: "Andy Dills" <[EMAIL PROTECTED]>
>
>> On Sun, 7 Jan 2007, Andy Dills wrote:
>>
>>> On Sun, 7 Jan 2007, decoder wrote:
>>>
>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>> Hash: SHA1
>>>>
>>>> Hello all,
>>>>
>>>> since 3.5.0 RC1 was released, we fixed many bugs, thanks to the many
>>>> testers and bug reporters :) so big thanks.
>>>
>>> I have something I'm curious about, having run FuzzyOcr in a medium
>>> size (3-400k messages per day) mail cluster for about a week now.
>>>
>>> Why do you do database maintenance with every unmatched check?
>>>
>>> From Hashing.pm:
>>>
>>> unless ($match) {
>>>     my $then = time - ($conf->{focr_db_max_days}*86400);
>>> --->    $sql = qq(select * from $db.$dbfile order by $dbfile.check);
>>>     my $sth = $ddb->prepare($sql);
>>>     $sth->execute;
>>>     while (my @row = $sth->fetchrow_array) {
>>>         my $hash2 = $row[1] || "0:0:0:0";
>>>         $hash2 .= "::$row[0]";
>>>         if (within_threshold($digest,$hash2)) {
>>>             $txt = 'Approx';
>>>             $key = $row[0];
>>>             $next = $row[5] + 1;
>>>             $when = $row[7] || $now;
>>>             $ret = $dbfile eq $conf->{focr_mysql_hash} ? $row[8] : $row[5];
>>>             $dinfo = $row[9] || '';
>>>             infolog("Found[$dbfile]: Score='$row[8]' Info: '$row[9]'");
>>>             last;
>>>         }
>>>     }
>>>     # Expire old records...
>>> --->    $sql = qq(delete from $db.$dbfile where $dbfile.check < $then);
>>>     debuglog($sql,2);
>>>     $ddb->do($sql);
>>> }
>>>
>>> Those two queries are extremely expensive in a larger
>>> environment...I have commented this code segment out on our cluster,
>>> and have written a quick maintenance script that runs once per
>>> day...dropped the response time from 2-3s to .01-.05s on queries, and
>>> eliminated the suddenly large and customer-annoying mailqueues.
>>
>> Sorry to follow up to my own post, but now that I read this segment a
>> little closer I realize that I'm basically commenting out the matching
>> capability of the Hashing mechanism, eliminating all value of the
>> Hashing in the first place.
>>
>> So...I guess my point is, unless there is a better way of determining
>> the match than checking every single hash in the database (hoping that
>> you find one that is close enough along the way), it's more efficient
>> (in larger environments at least) to just scan each mail message
>> without hashing enabled.
>>
>> Thoughts?
>>
>> Andy
>
> Hash the hashes and store them in a suitable tree?

I explained before that you cannot hash the hashes, because a
cryptographic hash has no tolerance for small differences: two nearly
identical inputs produce unrelated outputs. Fuzzy matching on such a
hash of the actual hash is then impossible.

Chris

> {^_^}
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.1 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFFokUVJQIKXnJyDxURAlWWAKCBlIaLmg6ToOLuWQJ/As5LlWPBpQCfUoGG
rrSlnywraE1RLwK3YjEWqoc=
=7b3V
-----END PGP SIGNATURE-----
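
To illustrate the point about hashing the hashes, here is a minimal sketch in Python (not the Perl of Hashing.pm). It assumes two made-up digest strings in the colon-separated style of the "0:0:0:0" default seen in the quoted code; the values and the hamming helper are hypothetical and are not FuzzyOcr's actual digest or its within_threshold comparison. It only shows that two inputs one bit apart yield SHA-1 values that differ in a large share of their 160 bits, so closeness is lost after hashing.

```python
import hashlib


def hamming(a: bytes, b: bytes) -> int:
    """Count differing bits between two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# Hypothetical fuzzy digests that differ only in the last character.
digest_a = b"120:64:33:200"
digest_b = b"120:64:33:201"

# SHA-1 of each digest, i.e. "hashing the hash".
sha_a = hashlib.sha1(digest_a).digest()
sha_b = hashlib.sha1(digest_b).digest()

raw_bits = hamming(digest_a, digest_b)  # distance between the raw digests
sha_bits = hamming(sha_a, sha_b)        # distance after hashing (out of 160)

print("raw digests differ in", raw_bits, "bit(s)")
print("SHA-1 values differ in", sha_bits, "of 160 bits")
```

Because the hashed values carry no information about how close the original digests were, an index or tree built on them can only answer exact-match lookups, which is the limitation described above.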