We turned on FuzzyOCR's experimental "hashdb" function, but had to turn it off again after it tagged the following images (hashes) as spam:
  8:1:1:1::1:1:1:1:1
  14:1:1:1::0:0:0:0:1

These appear to be "spacer.gif"-like images: small images commonly used in HTML messages for formatting purposes. Has anyone else run into this issue?

Related questions:

1. How does FuzzyOCR compute an image hash? Skimming FuzzyOcr.pm shows this isn't a SHA1/MD5 of the image, but instead depends on ppmhist and identify (ImageMagick)?

2. How do I FuzzyOCR-hash a given image? The naive way fails:

   perl -le 'require "FuzzyOcr.pm"; ($foo, $bar) = FuzzyOcr::calc_image_hash("filename.gif"); print "$foo,$bar"'

3. If a spammer attaches 1 spam image + 5 good images and the message gets flagged as spam, do all *six* images get entered into the hashdb? The log files imply so. Would this explain why commonly-used images are in the hashdb?

--
We're just a Bunch Of Regular Guys, a collective group that's trying to understand and assimilate technology. We feel that resistance to new ideas and technology is unwise and ultimately futile.