-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

jdow wrote:
> From: "Andy Dills" <[EMAIL PROTECTED]>
>
>> On Sun, 7 Jan 2007, Andy Dills wrote:
>>
>>> On Sun, 7 Jan 2007, decoder wrote:
>>>
>>>> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
>>>>
>>>>
>>>> Hello all,
>>>>
>>>>
>>>> since 3.5.0 RC1 was released, we fixed many bugs, thanks to
>>>> the many testers and bug reporters :) so big thanks.
>>>
>>>
>>> I have something I'm curious about, having run FuzzyOcr in a
>>> medium size (3-400k messages per day) mail cluster for about a
>>> week now.
>>>
>>> Why do you do database maintenance with every unmatched check?
>>>
>>> From Hashing.pm:
>>>
>>> unless ($match) {
>>>     my $then = time - ($conf->{focr_db_max_days}*86400);
>>> --->    $sql = qq(select * from $db.$dbfile order by $dbfile.check);
>>>     my $sth  = $ddb->prepare($sql);
>>>     $sth->execute;
>>>     while (my @row = $sth->fetchrow_array) {
>>>         my $hash2 = $row[1] || "0:0:0:0";
>>>         $hash2 .= "::$row[0]";
>>>         if (within_threshold($digest,$hash2)) {
>>>             $txt   = 'Approx';
>>>             $key   = $row[0];
>>>             $next  = $row[5] + 1;
>>>             $when  = $row[7] || $now;
>>>             $ret   = $dbfile eq $conf->{focr_mysql_hash} ? $row[8] : $row[5];
>>>             $dinfo = $row[9] || '';
>>>             infolog("Found[$dbfile]: Score='$row[8]' Info: '$row[9]'");
>>>             last;
>>>         }
>>>     }
>>>     # Expire old records...
>>> --->    $sql = qq(delete from $db.$dbfile where $dbfile.check < $then);
>>>     debuglog($sql,2);
>>>     $ddb->do($sql);
>>> }
>>>
>>>
>>> Those two queries are extremely expensive in a larger
>>> environment...I have commented this code segment out on our
>>> cluster, and have written a quick maintenance script that runs
>>> once per day...dropped the response time from 2-3s to .01-.05s
>>> on queries, and eliminated the suddenly large and
>>> customer-annoying mailqueues.
>>
>> Sorry to follow up to my own post, but now that I read this
>> segment a little closer I realize that I'm basically commenting
>> out the matching capability of the Hashing mechanism, eliminating
>> all value of the Hashing in the first place.
>>
>> So...I guess my point is, unless there is a better way of
>> determining the match than checking every single hash in the
>> database (hoping that you find one that is close enough along the
>> way), it's more efficient (in larger environments at least) to
>> just scan each mail message without hashing enabled.
>>
>> Thoughts?
>>
>> Andy
>
> Hash the hashes and store them in a suitable tree?
I explained before that you cannot hash the hashes: a cryptographic
hash has no tolerance, so even a tiny change in the input yields a
completely different output. Fuzzy matching on such a hash of the
actual hash is therefore impossible.
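
To illustrate the point, here is a minimal sketch in Python (not the
plugin's Perl). The two hash strings are made-up values in a
colon-separated layout like the "0:0:0:0" default above, and
field_distance() is a hypothetical stand-in for a threshold
comparison, not FuzzyOcr's within_threshold(). The two inputs differ
by 1 in a single field, yet their SHA-1 digests share no usable
notion of closeness:

```python
import hashlib


def bit_difference(a: bytes, b: bytes) -> int:
    """Count the bits that differ between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def field_distance(h1: str, h2: str) -> int:
    """Hypothetical fuzzy metric: sum of absolute per-field differences."""
    pairs = zip(h1.split(":"), h2.split(":"))
    return sum(abs(int(x) - int(y)) for x, y in pairs)


# Made-up image hashes that differ by 1 in the third field (assumption).
hash_a = "120:87:43:201"
hash_b = "120:87:44:201"

# Hashing the hashes: each string goes through SHA-1.
sha_a = hashlib.sha1(hash_a.encode()).digest()
sha_b = hashlib.sha1(hash_b.encode()).digest()

print("field distance:", field_distance(hash_a, hash_b))
print("SHA-1 bits differing:", bit_difference(sha_a, sha_b),
      "of", len(sha_a) * 8)
```

The field distance stays small, so a threshold check on the raw
hashes can still match. The digests differ in a large, effectively
random number of bits, so a tree or index keyed on them can only
answer exact-match lookups.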


Chris
> {^_^}

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.1 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFFokUVJQIKXnJyDxURAlWWAKCBlIaLmg6ToOLuWQJ/As5LlWPBpQCfUoGG
rrSlnywraE1RLwK3YjEWqoc=
=7b3V
-----END PGP SIGNATURE-----
