From: decoder [mailto:[EMAIL PROTECTED]
> Hello all,
> 
> 
> since 3.5.0 RC1 was released, we fixed many bugs, thanks to the many
> testers and bug reporters :) so big thanks.

Excellent work. Thank you for your efforts in bringing it to us.

Anyway, I'm wondering why the image hashing is done this way, leading to:

  1) a variable-length key and

  2) possibly even a very long one (depending on focr_hash_max).

This is pretty inefficient to handle on SQL backends and, in fact, 
FuzzyOcr.mysql must define the "key" columns as varchar(255)...

I see that the "problem" is due to the way the hash is calculated in 
FuzzyOcr/Hashing.pm:

<code-snip>
    my $cnt = 0;
    my $c = scalar(@stdout_data);
    my $s = (stat($pfile))[7] || 0;
    $hash = sprintf "%d:%d:%d:%d",$s,
        defined $pic->{height} ? $pic->{height} : 0,
        defined $pic->{width}  ? $pic->{width}  : 0,
        $c;
    if ($Threshold{max_hash}) {
        foreach (@stdout_data) {
            $_ =~ s/ +/ /g;
            my(@d) = split(' ', $_);
            $hash .= sprintf("::%d:%d:%d:%d:%d",@d);
            if ($cnt++ ge $Threshold{max_hash}) {
                last;
            }
        }
    }
</code-snip>

Why not use some form of digest? For example, something like this would seem 
more interesting to me:

<code-snip>
    my $c = scalar(@stdout_data);
    my $s = (stat($pfile))[7] || 0;
    $hash = sprintf "%d:%d:%d:%d",$s,
        defined $pic->{height} ? $pic->{height} : 0,
        defined $pic->{width}  ? $pic->{width}  : 0,
        $c;
    if ($Threshold{max_hash}) {
        use Digest;
        my $hctx = Digest->new('MD5');
        my $clrcnt = 0;
        foreach (@stdout_data) {
            # split(' ') also skips leading whitespace, like the current code
            my(@d) = split(' ', $_);
            # fields 0, 1, 2 as single bytes, field 4 as a 32-bit integer
            $hctx->add(pack('CCCN', $d[0], $d[1], $d[2], $d[4]));
            if (++$clrcnt >= $Threshold{max_hash}) {
                last;
            }
        }
        # mix in the number of entries actually hashed
        $hctx->add(pack('N', $clrcnt));
        $hash .= '::' . $hctx->hexdigest;
    }
</code-snip>

This basically creates a digest of the first (most frequent) 
$Threshold{max_hash} palette colors instead of simply enumerating them. The 
MD5 hex digest is always 32 characters, so the whole key will be roughly 45-50 
characters and will stay that short regardless of the focr_hash_max setting.
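
To make the length argument concrete, here is a minimal standalone sketch that 
compares the two key styles. The palette rows, the prefix value and the helper 
names enumerated_key/digest_key are made up for illustration and are not part 
of FuzzyOcr; the enumeration loop is a simplified version of the current code.

```perl
use strict;
use warnings;
use Digest::MD5;

# Hypothetical palette rows with five numeric fields each; the meaning of
# the fields is an assumption made for this sketch.
my @rows = map {
    sprintf "%d %d %d %d %d",
        $_ % 256, ($_ * 7) % 256, ($_ * 13) % 256, 0, 1000 - $_
} 1 .. 50;

# Hypothetical "size:height:width:count" prefix.
my $prefix = '12345:100:200:50';

# Simplified enumeration: append up to $max rows to the key.
sub enumerated_key {
    my ($base, $max, @data) = @_;
    my $key = $base;
    my $n   = 0;
    foreach my $row (@data) {
        last if $n++ >= $max;
        $key .= sprintf "::%d:%d:%d:%d:%d", split ' ', $row;
    }
    return $key;
}

# Digest variant: feed up to $max rows into MD5 and append the hex digest.
sub digest_key {
    my ($base, $max, @data) = @_;
    my $ctx = Digest::MD5->new;
    my $n   = 0;
    foreach my $row (@data) {
        last if $n >= $max;
        my @d = split ' ', $row;
        $ctx->add(pack 'CCCN', @d[0, 1, 2, 4]);
        $n++;
    }
    $ctx->add(pack 'N', $n);    # number of rows actually hashed
    return $base . '::' . $ctx->hexdigest;
}

for my $max (5, 20, 50) {
    printf "max_hash=%2d  enumerated=%4d chars  digest=%d chars\n",
        $max,
        length(enumerated_key($prefix, $max, @rows)),
        length(digest_key($prefix, $max, @rows));
}
```

The enumerated key grows with every extra row, while the digest key is always 
the prefix plus 34 characters ("::" and 32 hex digits).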

Please note I'm not a Perl wizard nor an SA developer, so there is room for 
optimization here. For example, the Digest->new('MD5') object could probably 
be defined and initialized once globally, with a $hctx->reset issued whenever 
a new digest has to be computed.
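
A rough sketch of that reuse idea, assuming Digest::MD5 is used directly; the 
helper name palette_digest and the sample rows are hypothetical. As far as I 
know, hexdigest already resets the object after returning, so the explicit 
reset mainly guards against data left over from an aborted computation.

```perl
use strict;
use warnings;
use Digest::MD5;

# Created once, e.g. at module load time, and shared by all calls.
my $hctx = Digest::MD5->new;

# Hypothetical helper: digest of the first $max palette rows.
sub palette_digest {
    my ($max, @data) = @_;
    $hctx->reset;               # start from a clean state
    my $clrcnt = 0;
    foreach my $row (@data) {
        last if $clrcnt >= $max;
        my @d = split ' ', $row;
        $hctx->add(pack 'CCCN', @d[0, 1, 2, 4]);
        $clrcnt++;
    }
    $hctx->add(pack 'N', $clrcnt);
    return $hctx->hexdigest;    # always 32 hex characters
}

# Made-up sample rows, only to show that the shared object can be reused.
my @sample = ('1 2 3 0 10', '4 5 6 0 7');
print palette_digest(5, @sample), "\n";
print palette_digest(5, @sample), "\n";    # same value as the line above
```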

What are your thoughts on this?

Regards,

        giampaolo
