From: decoder [mailto:[EMAIL PROTECTED]
> Hello all,
>
> since 3.5.0 RC1 was released, we fixed many bugs, thanks to the many
> testers and bug reporters :) so big thanks.

Excellent work. Thank you for your efforts in bringing it to us.

Anyway, I'm wondering why the image hashing is done that way, leading to: 1) a variable-length key and 2) possibly even a very long one (depending on focr_hash_max). This is pretty inefficient to handle on SQL backends and, in fact, FuzzyOcr.mysql must define the "key" columns as varchar(255)...

I see that the "problem" is due to the way the hash is calculated in FuzzyOcr/Hashing.pm:

<code-snip>
my $cnt = 0;
my $c = scalar(@stdout_data);
my $s = (stat($pfile))[7] || 0;
$hash = sprintf "%d:%d:%d:%d",$s,
    defined $pic->{height} ? $pic->{height} : 0,
    defined $pic->{width} ? $pic->{width} : 0,
    $c;
if ($Threshold{max_hash}) {
    foreach (@stdout_data) {
        $_ =~ s/ +/ /g;
        my(@d) = split(' ', $_);
        $hash .= sprintf("::%d:%d:%d:%d:%d",@d);
        if ($cnt++ ge $Threshold{max_hash}) {
            last;
        }
    }
}
</code-snip>

Why not use some form of digest? For example, something like this would be more interesting to me:

<code-snip>
my $cnt = 0;
my $c = scalar(@stdout_data);
my $s = (stat($pfile))[7] || 0;
$hash = sprintf "%d:%d:%d:%d",$s,
    defined $pic->{height} ? $pic->{height} : 0,
    defined $pic->{width} ? $pic->{width} : 0,
    $c;
if ($Threshold{max_hash}) {
    use Digest;
    my $hctx = Digest->new('MD5');
    my $clrcnt = 0;
    foreach (@stdout_data) {
        my(@d) = split(/ +/, $_);
        $hctx->add(pack('CCCN', $d[0], $d[1], $d[2], $d[4]));
        if (++$clrcnt >= $Threshold{max_hash}) {
            last;
        }
    }
    $hctx->add(pack('N', $clrcnt));
    $hash .= '::' . $hctx->hexdigest;
}
</code-snip>

This basically creates a digest of the first (most frequent) $Threshold{max_hash} palette colors instead of simply enumerating them. The output will be around 40-45 characters and will stick with this length regardless of the value of the focr_hash_max setting.

Please note I'm not a Perl wizard nor an SA developer, so there is room for optimization here.
For example, Digest->new('MD5') could probably even be defined and initialized globally, with a $hctx->reset issued whenever a new digest has to be computed.

What are your thoughts?

Regards,
giampaolo
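
P.S. A minimal sketch of the "global context plus reset" idea, in case it helps the discussion. The helper name palette_digest and the sample rows are made up for illustration and are not part of FuzzyOcr; the only assumption about the input is that each row carries five whitespace-separated numeric fields, as in the snippets above.

```perl
use strict;
use warnings;
use Digest;

# One shared MD5 context, created once and reused for every image.
my $hctx = Digest->new('MD5');

# Hypothetical helper (not FuzzyOcr code): digest at most $max palette rows.
sub palette_digest {
    my ($max, @rows) = @_;

    # Explicit reset, as suggested above. hexdigest() already resets the
    # context after use, so this only guards against a half-filled context.
    $hctx->reset;

    my $clrcnt = 0;
    foreach my $row (@rows) {
        # split(' ') skips leading whitespace and collapses runs of spaces.
        my @d = split(' ', $row);
        # Same packing as the proposal above: fields 0-2 as bytes,
        # field 4 as a 32-bit big-endian integer.
        $hctx->add(pack('CCCN', $d[0], $d[1], $d[2], $d[4]));
        last if ++$clrcnt >= $max;
    }

    # Mix in the number of rows actually digested.
    $hctx->add(pack('N', $clrcnt));
    return $hctx->hexdigest;
}

# Made-up sample rows, five numeric fields each.
my @rows = ("255 255 255 255 1200", "0 0 0 0 800", "200 30 30 78 15");

my $short = palette_digest(2, @rows);
my $long  = palette_digest(3, @rows);
print length($short), " ", length($long), "\n";   # prints "32 32"
```

Whatever limit is passed in, the digest part stays at 32 hex characters, so only the "size:height:width:count" prefix contributes any variation to the key length.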