On Thu, January 9, 2014 9:46 pm, Karsten Bräckelmann wrote:
> Unfortunately, well, for the scumbags, the shorter it gets, the less
> likely it is to be understood. Fallen for. Or even understood to be
> actual language.

Well, not really true, because of the resurgence of spammers using image-based spam: the number of words in text/plain or text/html is very low, and all of the spam content is embedded in an attached binary image, which uses either regular links or even imagemap links to direct victims to the final spam site.

In fact, now that I think about it, almost all of my bayes_00 FNs are these image spams, which have very little text... but the text content is usually pretty generic (like "unsubscribe here" and/or a mailing address), so one would still think it should hit near 50, not 00. This is why I want to see what the matched tokens are, and why I'm still suspicious of a problem in my DB.

Nonetheless, this kind of image spam is a (re-)rising problem, one that is designed to circumvent Bayes and is quite difficult to catch via content rules. (This is also why I homebrew "spammy template" rules which hit on commonalities in some of these image spams.)

The FuzzyOCR plugin would be a way of dealing with that, and has been discussed on this list relatively recently, but it is not currently maintained and, unfortunately (and unavoidably), eats major CPU. Even restricting it to emails that have very little text but at least one largish image wouldn't work that well, since spammers could always inject a bunch of displayable nonsense text (white-on-white, for example, so it wouldn't be visible even though it would be "displayed"). So it's not a straightforward problem.

> Rather unlikely, because auto-learn thresholds do include quite some
> additional constraints.

They do, but I've seen some FNs being autolearned as ham even after I started actively managing my SA installation, so it could have been a growing effect: a few spams got autolearned as ham, which turned into a few more, which turned into a few more, etc.

Thanks for the info on the tokens, I'll give it a shot when I get a chance.

Cheers.

--- Amir
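
P.S. To make the "very little text but at least one largish image" pre-filter concrete, here is a rough Python sketch of what such a gate might look like. It is not part of SpamAssassin or FuzzyOCR; the function names and both thresholds (MAX_TEXT_CHARS, MIN_IMAGE_BYTES) are made up for illustration, and the HTML handling is a crude tag strip rather than real rendering.

```python
import re
from email.message import EmailMessage

# Hypothetical thresholds, picked only for illustration.
MAX_TEXT_CHARS = 200      # "very little text"
MIN_IMAGE_BYTES = 10_000  # "largish image"

TAG_RE = re.compile(r"<[^>]+>")


def text_and_image_stats(msg):
    """Return (chars in the longest text part, bytes in the largest image)."""
    text_chars = 0
    largest_image = 0
    for part in msg.walk():
        if part.is_multipart():
            continue
        ctype = part.get_content_type()
        payload = part.get_payload(decode=True) or b""
        if ctype.startswith("image/"):
            largest_image = max(largest_image, len(payload))
        elif ctype in ("text/plain", "text/html"):
            charset = part.get_content_charset() or "utf-8"
            try:
                text = payload.decode(charset, errors="replace")
            except LookupError:  # unknown charset label
                text = payload.decode("utf-8", errors="replace")
            if ctype == "text/html":
                # Crude tag strip; it does not know about colors or CSS.
                text = TAG_RE.sub(" ", text)
            # Count non-whitespace characters; take the longest part so a
            # text/plain + text/html pair is not counted twice.
            text_chars = max(text_chars, len("".join(text.split())))
    return text_chars, largest_image


def is_ocr_candidate(msg):
    """True if the message has little text and at least one largish image."""
    text_chars, largest_image = text_and_image_stats(msg)
    return text_chars <= MAX_TEXT_CHARS and largest_image >= MIN_IMAGE_BYTES


def build_message(plain, html=None, image_bytes=0):
    """Helper for trying the gate out on synthetic messages."""
    msg = EmailMessage()
    msg.set_content(plain)
    if html is not None:
        msg.add_alternative(html, subtype="html")
    if image_bytes:
        msg.add_attachment(b"\x00" * image_bytes, maintype="image",
                           subtype="png", filename="offer.png")
    return msg
```

The sketch also shows the weakness: since the gate only counts characters, a message padded with white-on-white filler such as build_message("hi", '<p style="color:#fff">' + "lorem\n" * 100 + "</p>", 20000) is no longer a candidate, even though a reader would see nothing but the image.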