Someone over on the MIMEDefang list is working on an OCR mechanism for
converting the image to text.
Another person also brought up the idea of hashing the images and doing
something like an IRBL or Razor approach, but everyone came to the same
conclusion you're coming to now.
But there was one suggestion that seemed interesting:
Averaging out regions of the image to their base colors, and matching
based on that. They suggested 4 regions, but I think that's too broad.
I think maybe 16 or 64 regions might be better (4x4 or 8x8). Within
each grid section, you average out the color values of the pixels, and
you're left with a big pixelated blotch of 16 or 64 squares. This will
wash out the minor pixel variations that defeat hashing
mechanisms.
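
A rough sketch of the averaging step in Python, to make the idea concrete.
It assumes the image has already been decoded into rows of (r, g, b)
tuples by whatever imaging library you use; the function name
grid_signature and its parameters are made up for illustration.

```python
def grid_signature(pixels, grid=4):
    """Average an image down to grid x grid color values.

    pixels: list of rows, each row a list of (r, g, b) tuples
            (assumed to be decoded already by some imaging library).
    Returns a list of grid*grid (r, g, b) averages, row by row.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    if height == 0 or width == 0:
        raise ValueError("image must have at least one pixel")

    signature = []
    for gy in range(grid):
        # Integer bounds of this grid row; force at least one pixel so
        # images smaller than the grid still yield a value per region.
        y0 = gy * height // grid
        y1 = max((gy + 1) * height // grid, y0 + 1)
        for gx in range(grid):
            x0 = gx * width // grid
            x1 = max((gx + 1) * width // grid, x0 + 1)
            totals = [0, 0, 0]
            count = 0
            for y in range(y0, y1):
                for x in range(x0, x1):
                    r, g, b = pixels[y][x]
                    totals[0] += r
                    totals[1] += g
                    totals[2] += b
                    count += 1
            signature.append(tuple(t // count for t in totals))
    return signature
```

With grid=4 you get 16 values and with grid=8 you get 64. A single pixel
nudged by one color step changes an MD5 sum completely, but it barely
moves (or does not move at all) the average of the region it sits in.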
From there, I wouldn't hash the big blotched image, I would record the
16 or 64 values, and directly use those, plus the rough image size, as
your matching data (rough image size meaning: it's ok to be + or - 32
or 64 pixels on each dimension, to again account for image
variations, but you don't want to compare an 8x8 pixel image to a
640x640 pixel image). If the image in the email matches an image in
the database, then I wouldn't automatically reject it or mark it as
spam -- this is a VERY rough comparison of the images. Instead, I
would just give the message +3 or +4 to its score.
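
The matching and scoring described above might look something like this
in Python. This is only a sketch: the per-channel color tolerance is my
assumption (some slack seems necessary if the averages are to survive
small alterations), and all names and default values here are
hypothetical rather than taken from any existing plugin.

```python
def sizes_close(size_a, size_b, tolerance=64):
    """True if width and height each differ by at most `tolerance` pixels."""
    return (abs(size_a[0] - size_b[0]) <= tolerance
            and abs(size_a[1] - size_b[1]) <= tolerance)


def signatures_close(sig_a, sig_b, color_tolerance=16):
    """True if every region's r, g and b averages are within tolerance.

    color_tolerance is an assumed value, not something from the thread.
    """
    if len(sig_a) != len(sig_b):
        return False
    return all(abs(ca - cb) <= color_tolerance
               for region_a, region_b in zip(sig_a, sig_b)
               for ca, cb in zip(region_a, region_b))


def image_score(size, signature, known_images, bonus=3.0):
    """Return `bonus` points if the image roughly matches a known one.

    known_images: iterable of (size, signature) pairs for known spam
    images. A match only adds to the score; it never rejects outright.
    """
    for known_size, known_signature in known_images:
        if (sizes_close(size, known_size)
                and signatures_close(signature, known_signature)):
            return bonus
    return 0.0
```

The size check runs first so that a tiny image is never compared against
a large one, and the result is a score contribution rather than a
verdict, since the comparison is deliberately coarse.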
May not be perfect, but it may be interesting. I also wonder if it'd
be useful to do more regions (16x16 or 64x64?), and base the result on
"how many regions matched".
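
A possible shape for the "how many regions matched" variant, again as a
Python sketch under assumptions: the tolerance, the minimum fraction
and the way the score scales with the fraction are all guesses that
would need tuning against real spam, and the function names are
hypothetical.

```python
def count_matching_regions(sig_a, sig_b, color_tolerance=16):
    """Count regions whose r, g and b averages are all within tolerance."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must use the same grid size")
    return sum(
        1
        for region_a, region_b in zip(sig_a, sig_b)
        if all(abs(ca - cb) <= color_tolerance
               for ca, cb in zip(region_a, region_b))
    )


def partial_match_score(sig_a, sig_b, max_bonus=4.0, min_fraction=0.75):
    """Scale the score by the fraction of regions that matched.

    Below min_fraction (an assumed cutoff) nothing is added, so two
    unrelated images that happen to share a few regions score zero.
    """
    if not sig_a:
        raise ValueError("signature must not be empty")
    fraction = count_matching_regions(sig_a, sig_b) / len(sig_a)
    return max_bonus * fraction if fraction >= min_fraction else 0.0
```

With a finer grid, an image that has had a border or a random stripe
added would still match on most regions and pick up most of the score.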
On Apr 21, 2006, at 10:16 AM, Dirk Bonengel wrote:
Hi,
as Rob McEwen already pointed out, Bill Stearns offered image hash data
for such a project. I did write such a plugin (Bill did publish his
data via DNS, thanks again!) but am somewhat disappointed by the
results (so I didn't bother publishing the plugin).
The point is that the most annoying image spams (i.e. those you want
to catch) are deliberately defective or altered so that simple hashing
of the image MIME parts doesn't really work. Seems to me that spammers
already practise hash busting methods on images, presumably cos some
big ISP(s) do check image hashes already.
Still, if desired I can post that plugin somewhere (with appropriate
words of caution)...
Dirk
John D. Hardin wrote:
All:
A few posts back was a suggestion for checking the MD5 checksum of
attached images against a blacklist to catch the current wave of
attached-image-only stock pump-and-dump scam spams.
Taking that to its logical conclusion suggests the creation of a
public Image Realtime Block List along the lines of what SURBL
provides for URLs, and extending SA to MD5-sum attached images and
check them against the block list.
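
For illustration only, the checksum step could be sketched in Python as
below. This is not SA code: the function names are hypothetical, and a
plain in-memory set stands in for the block list, since how a real list
would be queried (DNS or otherwise) is left open here.

```python
import email
import hashlib


def image_md5_digests(raw_message):
    """Return the MD5 hex digest of every image/* part in a raw message."""
    msg = email.message_from_bytes(raw_message)
    digests = []
    for part in msg.walk():
        if part.get_content_maintype() == "image":
            # decode=True undoes base64 / quoted-printable transfer encoding
            payload = part.get_payload(decode=True)
            if payload:
                digests.append(hashlib.md5(payload).hexdigest())
    return digests


def listed_images(raw_message, blocklist):
    """Return the digests that appear in `blocklist` (a set of hex strings).

    The set is a stand-in for whatever lookup a real block list would use.
    """
    return [d for d in image_md5_digests(raw_message) if d in blocklist]
```

Changing even one byte of the attached image yields a completely
different digest, so a check like this only catches exact copies.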
Is this a good idea? Is this a bad idea? Is it pointless, as spammers
would just generate per-message images the way they are probably
generating per-message random Bayes poison now? Is it already covered
by Razor et al.?
Comments are solicited!
--
John Hardin KA7OHZ ICQ#15735746 http://www.impsec.org/~jhardin/
[EMAIL PROTECTED] FALaholic #11174 pgpk -a [EMAIL PROTECTED]
key: 0xB8732E79 - 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79