-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Giampaolo Tomassoni wrote:
> From: decoder [mailto:[EMAIL PROTECTED]
>> Hello all,
>>
>>
>> since 3.5.0 RC1 was released, we fixed many bugs, thanks to the
>> many testers and bug reporters :) so big thanks.
>
> Excellent work. Thank you for your efforts in bringing it to us.
>
> Anyway, I'm wondering why the image hashing is made that way,
> leading to:
>
> 1) a variable-length key and
>
> 2) possibly even a very long one (depending on focr_hash_max).
>
> This is pretty inefficient to handle on SQL backends and, in fact,
> FuzzyOcr.mysql must define the "key" columns as varchar(255)...
If I had more time I'd develop a better hashing system, but I don't :(
>
> I see that the "problem" is due to the way the hash is calculated
> in FuzzyOcr/Hashing.pm:
>
> <code-snip>
> my $cnt = 0;
> my $c = scalar(@stdout_data);
> my $s = (stat($pfile))[7] || 0;
> $hash = sprintf "%d:%d:%d:%d",$s,
>     defined $pic->{height} ? $pic->{height} : 0,
>     defined $pic->{width}  ? $pic->{width}  : 0,
>     $c;
> if ($Threshold{max_hash}) {
>     foreach (@stdout_data) {
>         $_ =~ s/ +/ /g;
>         my(@d) = split(' ', $_);
>         $hash .= sprintf("::%d:%d:%d:%d:%d",@d);
>         if ($cnt++ ge $Threshold{max_hash}) { last; }
>     }
> }
> </code-snip>
>
> Why not use some form of digest? For example, something like this
> would be more interesting to me:
>
> <code-snip>
> my $cnt = 0;
> my $c = scalar(@stdout_data);
> my $s = (stat($pfile))[7] || 0;
> $hash = sprintf "%d:%d:%d:%d",$s,
>     defined $pic->{height} ? $pic->{height} : 0,
>     defined $pic->{width}  ? $pic->{width}  : 0,
>     $c;
> if ($Threshold{max_hash}) {
>     use Digest;
>     my $hctx = Digest->new('MD5');
>     my $clrcnt = 0;
>     foreach (@stdout_data) {
>         my(@d) = split(/ +/, $_);
>         $hctx->add(pack('CCCN', $d[0], $d[1], $d[2], $d[4]));
>         if (++$clrcnt >= $Threshold{max_hash}) { last; }
>     }
>     $hctx->add(pack('N', $clrcnt));
>     $hash .= '::' . $hctx->hexdigest;
> }
> </code-snip>
>
> This basically creates a digest of the first (most frequent)
> $Threshold{max_hash} palette colors instead of simply enumerating
> them. The output will be around 40-45 characters and will stay
> at this length regardless of the value of the focr_hash_max
> setting.
>
> Please note I'm neither a Perl wizard nor an SA developer, so there
> is room for optimization here. For example, Digest->new('MD5') could
> probably be defined and initialized globally, with a $hctx->reset
> issued whenever a new digest has to be computed.
>
> What are your thoughts about this?
The point is, if you use a digest, then you need an exact match, no
matter if you digest the image directly, or any of the parameters,
because digests are designed to not accept any tolerance. But the
FuzzyOcr matching algorithm depends on accepting tolerance, the hashes
are never matched 100% exactly. That is why hashing any of the
parameters will not work.

Generally, any hashing algorithm is acceptable for FuzzyOcr as long as
it has tolerance built in. Spammers never send the same picture twice;
the images are generated on the fly.
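
To illustrate the point, here is a minimal sketch. It is not the actual
FuzzyOcr matching code: the feature values, the helper names and the 5%
tolerance are all made up for the example. It compares two slightly
different parameter sets once via an MD5 digest and once field by field
with a tolerance.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical parameters (file size, height, width, color count) of two
# renderings of the same spam image; the numbers are invented.
my @seen     = (10240, 200, 300, 64);
my @incoming = (10251, 200, 300, 66);

# Exact matching: digest the joined parameters and compare the digests.
sub digest_key {
    return md5_hex(join(':', @_));
}

# Tolerant matching: every field must lie within a relative tolerance.
# The 5% default is an arbitrary assumption, not a FuzzyOcr setting.
sub within_tolerance {
    my ($x, $y, $tol) = @_;
    $tol = 0.05 unless defined $tol;
    return 0 unless @$x == @$y;
    for my $i (0 .. $#$x) {
        my $max = $x->[$i] > $y->[$i] ? $x->[$i] : $y->[$i];
        next if $max == 0;    # both fields are zero, nothing to compare
        return 0 if abs($x->[$i] - $y->[$i]) / $max > $tol;
    }
    return 1;
}

my $digest_match = digest_key(@seen) eq digest_key(@incoming) ? 1 : 0;
my $fuzzy_match  = within_tolerance(\@seen, \@incoming);

print "digest match: $digest_match\n";
print "fuzzy match:  $fuzzy_match\n";
```

The digest comparison fails as soon as a single parameter changes,
while the field-by-field comparison still accepts the two images as
the same.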


Chris


>
> Regards,
>
> giampaolo
>

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.1 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFFoSoJJQIKXnJyDxURAvVuAKCwJWgArxWYcY5OTlap+13sB8C9sACdHxOo
KflJrH4H1zMFFJj1yFB3Eb8=
=ST+n
-----END PGP SIGNATURE-----
