Mike Batchelor <[EMAIL PROTECTED]> writes:

> Note the random words within the <font> tags at the end of the spam.
> I think they lowered its Bayes score, which dropped it below my
> threshold overall. That, and the lack of any other text aside from
> the links...
Yes, the random words in that email work against Bayes (I think this
particular exploit has been well known for a while, and spammers are
definitely getting better at it). It's why I've never believed that
Bayes is a panacea. This goes back to the SA philosophy: we don't rely
on any one technique, since no one technique is perfect. Bayes is only
a small subset of the rules SpamAssassin uses. And while random words
fool simple checksum systems like Razor1, they don't fool Razor2, DCC,
or Pyzor. They also don't fool RBLs. This spam was lucky enough not to
be listed anywhere yet, but most won't be.

> Is this tactic likely to succeed for them, rendering our Bayesian
> classifiers ineffective? What do you think?

Well, it can work. (It does take a fairly smart spammer to pull it
off.) Here's the list of tokens matched (sorted by probability). It
does look like they've managed to construct a message with a very low
score all around.

debug: bayes token 'bg.jpg' => 0.997298245614035
debug: bayes token 'take-me-off' => 0.994923076923077
debug: bayes token 'dairy' => 0.988731707317073
debug: bayes token 'URI' => 0.965041112956667
debug: bayes token 'studs' => 0.958
debug: bayes token 'N:HX-Mail-Format-Warning:RFCNNNN' => 0.958
debug: bayes token 'HX-Mail-Format-Warning:header' => 0.958
debug: bayes token 'HX-Mail-Format-Warning:formatting' => 0.958
debug: bayes token 'HX-Mail-Format-Warning:RFC2822' => 0.958
debug: bayes token 'HX-Mail-Format-Warning:Bad' => 0.958
debug: bayes token 'amazing' => 0.942726598001046
debug: bayes token 'images' => 0.926845523863797
debug: bayes token 'H*r:501' => 0.925317612750241
debug: bayes token 'index.html' => 0.903306785729035
debug: bayes token 'H*c:HHHH' => 0.882993935307079
debug: bayes token 'N:H*M:NNNNNNNNNNNNNN' => 0.151858612440554
debug: bayes token 'N:H*r:NNN' => 0.146710592234121
debug: bayes token 'N:H*r:N.NN.N' => 0.143944051542205
debug: bayes token 'H*r:8.12.2' => 0.131883948595905
debug: bayes token 'N:H*M:NNNNN' => 0.107562836072879
debug: bayes token 'N:HX-Sieve:N.N' => 0.0489090909090909
debug: bayes token 'HX-Sieve:cmu-sieve' => 0.0489090909090909
debug: bayes token 'verbally' => 0.0256190476190476
debug: bayes token 'gels' => 0.0256190476190476
debug: bayes token 'hairiness' => 0.0173548387096774
debug: bayes token 'modulating' => 0.0131219512195122
debug: bayes: score = 0.564805986629628

(Any Bayesian classifier can produce this type of list once you build
a corpus. And there's no magic to Bayes that prevents spammers from
doing the same thing to figure out how many words need to be
counterbalanced, etc.)

Note that my Bayes database picked up on some tokens that were
actually added by your personal software (like X-Sieve and some of the
others, such as the Message-ID). If I removed those, my Bayes
probability would go up. I didn't get this particular spam, so I don't
know whether Bayes would have worked for me or not. Probably not
enough to catch it as spam, though. Some enhancements to Bayes might
be in order.
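To make the "build a corpus" part concrete, here is a rough Python
sketch of a Graham-style estimation step -- not SpamAssassin's actual
Perl implementation, and the function name, parameters, and corpus
counts below are all made up for illustration. Given per-token message
counts for a spam corpus and a ham corpus, it produces per-token
probabilities like the ones in the debug output above:

  from collections import Counter

  def token_probabilities(spam_counts: Counter, ham_counts: Counter,
                          n_spam: int, n_ham: int,
                          min_occurrences: int = 5,
                          floor: float = 0.01, ceiling: float = 0.99) -> dict:
      """Return {token: P(spam | token)} for tokens seen often enough."""
      probs = {}
      for token in set(spam_counts) | set(ham_counts):
          s, h = spam_counts[token], ham_counts[token]
          if s + h < min_occurrences:
              continue  # too rare for a reliable estimate
          spam_freq = s / max(n_spam, 1)  # fraction of spam containing the token
          ham_freq = h / max(n_ham, 1)    # fraction of ham containing the token
          p = spam_freq / (spam_freq + ham_freq)
          probs[token] = min(ceiling, max(floor, p))  # clamp the extremes
      return probs

  # Made-up counts, just to show the shape of the output:
  spam_counts = Counter({'dairy': 40, 'modulating': 1})
  ham_counts  = Counter({'dairy': 1, 'modulating': 30})
  print(token_probabilities(spam_counts, ham_counts, n_spam=50, n_ham=50))
  # dairy -> ~0.98, modulating -> ~0.03

Run over a large enough corpus, a word like 'dairy' ends up near 0.99
simply because it has appeared almost only in spam, while 'modulating'
ends up near 0.01.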
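And here is the counterbalancing game itself, as a minimal sketch using
Graham's product formula for clarity. SpamAssassin's Bayes code
actually uses a chi-squared combining method, so the real numbers
differ, and the probability values below are invented rather than taken
from the database above:

  from math import prod

  def combine(token_probs):
      """Combine per-token spam probabilities into a single message score."""
      spamminess = prod(token_probs)
      hamminess = prod(1.0 - p for p in token_probs)
      return spamminess / (spamminess + hamminess)

  spam_tokens  = [0.99, 0.98, 0.96, 0.95, 0.94]  # strongly spam-flavored tokens
  random_words = [0.01, 0.02, 0.04, 0.05, 0.06]  # rare, innocent dictionary words

  print(combine(spam_tokens))                 # ~1.0: clearly spam
  print(combine(spam_tokens + random_words))  # ~0.5: pulled back to neutral

With this kind of combining, each sufficiently rare "innocent" word
cancels out roughly one spammy token, which is exactly the arithmetic
a spammer needs to do.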
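On the earlier point about checksums, here is a purely hypothetical
illustration (nothing like Razor's, DCC's, or Pyzor's real algorithms)
of why random padding defeats an exact hash of the body but not a
signature computed over the parts the spammer can't easily change,
such as the advertised URL:

  import hashlib
  import re

  def exact_signature(body: str) -> str:
      """Exact-checksum idea: hash the entire body verbatim."""
      return hashlib.md5(body.encode()).hexdigest()

  def fuzzy_signature(body: str) -> str:
      """Fuzzy-checksum idea (greatly simplified): hash only stable,
      salient features -- here just the URLs, normalized and sorted."""
      urls = sorted(u.lower() for u in re.findall(r'https?://\S+', body))
      return hashlib.md5("\n".join(urls).encode()).hexdigest()

  spam_a = "Amazing studs! http://example.invalid/index.html dairy verbally"
  spam_b = "Amazing studs! http://example.invalid/index.html gels modulating"

  print(exact_signature(spam_a) == exact_signature(spam_b))  # False
  print(fuzzy_signature(spam_a) == fuzzy_signature(spam_b))  # True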
Daniel

--
Daniel Quinlan                     anti-spam (SpamAssassin), Linux, and open
http://www.pathname.com/~quinlan/  source consulting (looking for new work)