Re: Bayes and multipart messages

Amir 'CG' Caspi Thu, 09 Jan 2014 21:55:15 -0800

On Thu, January 9, 2014 9:46 pm, Karsten Bräckelmann wrote:
> Unfortunately, well, for the scumbags, the shorter it gets, the less
> likely it is to be understood. Fallen for. Or even understood to be
> actual language.


Well, not really true, because of the resurgence of image-based spam: the
number of words in text/plain or text/html is very low, and all of the spam
content is embedded in an attached image, with either regular links or
even imagemap links directing victims to the final spam site.

In fact, now that I think about it, almost all of my bayes_00 FNs are
these image spams, which have very little text... but the text content is
usually pretty generic (like "unsubscribe here" and/or a mailing address)
so one would still think it should hit near 50, not 00.  This is why I
want to see what the matched tokens are and why I'm still suspicious of a
problem in my DB.
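
To illustrate why generic text "should" land near 50, here is a minimal
sketch of plain naive-Bayes combining.  It is not SpamAssassin's actual
combiner, and the function name and per-token probabilities are made up for
illustration.  It only shows that tokens near 0.5 combine to about 0.5,
while a few tokens wrongly learned as strongly hammy drag a short message
toward 0.

```python
import math

def combine(token_probs):
    """Naive-Bayes combination of per-token spam probabilities.

    Simplified illustration only; the probabilities passed in below are
    hypothetical, not values read from a real Bayes DB.
    """
    spam = math.prod(token_probs)
    ham = math.prod(1.0 - p for p in token_probs)
    return spam / (spam + ham)

# Generic tokens ("unsubscribe", "here", a street address): all near 0.5.
generic = [0.5, 0.5, 0.45, 0.55]

# The same short message if a few tokens were mis-learned as ham.
poisoned = [0.02, 0.03, 0.05, 0.5]

print(f"generic:  {combine(generic):.4f}")
print(f"poisoned: {combine(poisoned):.6f}")
```

With so few tokens in the message, each one carries a lot of weight, which
is why looking at the matched tokens is the useful next step.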

Nonetheless, this kind of image spam is a (re-)rising problem, one that is
designed to circumvent Bayes and which is quite difficult to catch via
content rules.  (This is also why I homebrew "spammy template" rules which
hit on commonalities in some of these image spams.)  The FuzzyOCR plugin
would be a way of dealing with that, and has been discussed on this list
relatively recently, but is not currently maintained and, unfortunately
(and unavoidably), eats major CPU.  Even restricting it to emails that have
very little text but at least one largish image wouldn't work that well:
spammers could always inject a bunch of nonsense text that is technically
"displayed" but invisible (white-on-white, for example).  So it's not a
straightforward problem.
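
A rough sketch of the "little text, at least one largish image" pre-filter
described above, using Python's standard email package.  The function names
and both thresholds (30 words, 10 kB) are hypothetical, and this is not
part of FuzzyOCR or SpamAssassin.  It also has exactly the weakness noted
above: hidden filler text counts as words and defeats the check.

```python
import re
from email.message import EmailMessage

TAG_RE = re.compile(r"<[^>]+>")  # crude HTML tag stripper, illustration only

def text_word_count(msg):
    """Count whitespace-separated words in all text/* parts."""
    words = 0
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        text = part.get_content()
        if part.get_content_subtype() == "html":
            text = TAG_RE.sub(" ", text)
        # Note: white-on-white filler text is counted here too.
        words += len(text.split())
    return words

def largest_image_bytes(msg):
    """Return the decoded size of the biggest image/* part, or 0."""
    largest = 0
    for part in msg.walk():
        if part.get_content_maintype() == "image":
            payload = part.get_payload(decode=True) or b""
            largest = max(largest, len(payload))
    return largest

def worth_ocr(msg, max_words=30, min_image_bytes=10_000):
    """True if the message has little text and at least one largish image."""
    return (text_word_count(msg) <= max_words
            and largest_image_bytes(msg) >= min_image_bytes)

def make_message(body, image_size):
    """Build a test message with a text body and a dummy image attachment."""
    msg = EmailMessage()
    msg.set_content(body)
    msg.add_attachment(b"\x00" * image_size, maintype="image",
                       subtype="png", filename="offer.png")
    return msg
```

For example, `worth_ocr(make_message("unsubscribe here", 20_000))` is true,
while the same image padded with a few hundred words of filler is not, so
only the first would be handed to the expensive OCR step.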

> Rather unlikely, because auto-learn thresholds do include quite some
> additional constraints.

They do, but I've seen some FNs being autolearned as ham even after I
started actively managing my SA installation, so it could have been a
snowballing effect, i.e. a few spams got autolearned as ham, which turned
into a few more, which turned into a few more, etc...

Thanks for the info on the tokens, I'll give it a shot when I get a chance.

Cheers.

--- Amir
