FuzzyOCR hashdb tagging commonly-used images like spacer.gif as spam

Kelly Jones Sun, 17 Dec 2006 12:49:31 -0800

We turned on FuzzyOCR's experimental "hashdb" function, but had to
turn it off again after it tagged the following images (hashes) as
spam:


8:1:1:1::1:1:1:1:1
14:1:1:1::0:0:0:0:1

These appear to be "spacer.gif"-like images: small images commonly
used in HTML messages for formatting purposes.

Has anyone else run into this issue?

Related questions:

1. How does FuzzyOCR compute an image hash? Skimming FuzzyOcr.pm
suggests it isn't a SHA1/MD5 digest of the image file, but is instead
derived from the output of ppmhist (Netpbm) and identify (ImageMagick).
Is that right?

2. How do I FuzzyOCR-hash a given image? The naive way fails:

perl -le 'require "FuzzyOcr.pm"; ($foo, $bar) =
FuzzyOcr::calc_image_hash("filename.gif"); print "$foo,$bar"'

3. If a spammer attaches 1 spam image + 5 good images and the message
gets flagged as spam, do all *six* images get entered into the hashdb?
The log files imply so. Would this explain why commonly-used images
are in the hashdb?
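
To make the suspicion in question 3 concrete, here is a toy model in
Python. It is NOT FuzzyOCR's code: the fingerprint function, the
set-based "hashdb", and every name below are hypothetical, and the
"store every image from a flagged message" behavior is only what the
log files seem to imply.

```python
# Toy model only: NOT FuzzyOCR's code. All names here are hypothetical.

def toy_fingerprint(image):
    """Coarse stand-in for an image hash: (width, height, distinct colors)."""
    return (image["width"], image["height"], len(set(image["pixels"])))

def learn_spam_message(hashdb, images):
    """Assumed behavior: every image in a flagged message gets stored."""
    for image in images:
        hashdb.add(toy_fingerprint(image))

def message_hits_hashdb(hashdb, images):
    """True if any image in the message matches a stored fingerprint."""
    return any(toy_fingerprint(image) in hashdb for image in images)

# A 1x1 spacer, a made-up spam picture, and a made-up legitimate logo.
spacer = {"width": 1, "height": 1, "pixels": [(255, 255, 255)]}
spam_picture = {
    "width": 2,
    "height": 2,
    "pixels": [(0, 0, 0), (255, 0, 0), (0, 255, 0), (0, 0, 255)],
}
logo = {"width": 2, "height": 1, "pixels": [(10, 20, 30), (10, 20, 30)]}

hashdb = set()

# A spam message carries the spam picture plus an innocent spacer;
# under the assumed behavior both fingerprints end up in the db.
learn_spam_message(hashdb, [spam_picture, spacer])

# A later legitimate message reuses the same spacer and gets a hit.
print(message_hits_hashdb(hashdb, [logo, spacer]))  # True
print(message_hits_hashdb(hashdb, [logo]))          # False
```

If the real hashdb behaves like this, a single spam message that
includes a spacer image would be enough to make every later message
using that spacer match.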

--
We're just a Bunch Of Regular Guys, a collective group that's trying
to understand and assimilate technology. We feel that resistance to
new ideas and technology is unwise and ultimately futile.
