Mike Batchelor <[EMAIL PROTECTED]> writes:

> Note the random words within the <font> tags at the end of the spam.  I 
> think they lowered its Bayes score, which dropped it below my threshold 
> overall.  That, and the lack of any other text aside from the links...

Yes, the random words in that email work against Bayes (I think this
particular exploit has been well-known for a while, and spammers are
definitely getting better at it).  It's why I've never believed that
Bayes is a panacea.  This goes back to the SA philosophy: we don't rely
on any one technique, since no one technique is perfect.  SpamAssassin
uses Bayes as just one rule among many.
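
(A quick Python sketch of that philosophy, if it helps.  The rule names
below are loosely modeled on real SpamAssassin rules, but the names,
scores, and threshold are made up for illustration.  The point is that
the Bayes verdict is one additive rule, so even a poisoned Bayes score
only shaves a point or two off the total.)

  # Hypothetical rule hits and scores -- not SpamAssassin's real values.
  REQUIRED_SCORE = 5.0
  hits = {
      "RAZOR2_CHECK": 2.0,       # fuzzy checksum matched
      "RCVD_IN_SOME_RBL": 1.5,   # relay listed on a blocklist
      "HTML_IMAGE_ONLY": 1.0,    # almost no text besides images/links
      "BAYES_50": 0.1,           # Bayes came back near 0.5, worth little
  }
  total = sum(hits.values())
  verdict = "spam" if total >= REQUIRED_SCORE else "ham"
  print("%.1f points -> %s" % (total, verdict))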

And while random words fool simple checksum systems like Razor1, they
don't fool Razor2, DCC, or Pyzor.  They also don't fool RBLs.  This spam
was lucky enough not to be listed anywhere yet, but most won't be.
 
> Is this tactic likely to succeed for them, rendering our Bayesian 
> classifiers ineffective?  What do you think?

Well, it can work.  (It does take a fairly smart spammer to pull it
off.)

Here's the list of tokens matched (sorted by probability).  It does look
like they've managed to construct a message with a very low score, all
around.

debug: bayes token 'bg.jpg' => 0.997298245614035
debug: bayes token 'take-me-off' => 0.994923076923077
debug: bayes token 'dairy' => 0.988731707317073
debug: bayes token 'URI' => 0.965041112956667
debug: bayes token 'studs' => 0.958
debug: bayes token 'N:HX-Mail-Format-Warning:RFCNNNN' => 0.958
debug: bayes token 'HX-Mail-Format-Warning:header' => 0.958
debug: bayes token 'HX-Mail-Format-Warning:formatting' => 0.958
debug: bayes token 'HX-Mail-Format-Warning:RFC2822' => 0.958
debug: bayes token 'HX-Mail-Format-Warning:Bad' => 0.958
debug: bayes token 'amazing' => 0.942726598001046
debug: bayes token 'images' => 0.926845523863797
debug: bayes token 'H*r:501' => 0.925317612750241
debug: bayes token 'index.html' => 0.903306785729035
debug: bayes token 'H*c:HHHH' => 0.882993935307079
debug: bayes token 'N:H*M:NNNNNNNNNNNNNN' => 0.151858612440554
debug: bayes token 'N:H*r:NNN' => 0.146710592234121
debug: bayes token 'N:H*r:N.NN.N' => 0.143944051542205
debug: bayes token 'H*r:8.12.2' => 0.131883948595905
debug: bayes token 'N:H*M:NNNNN' => 0.107562836072879
debug: bayes token 'N:HX-Sieve:N.N' => 0.0489090909090909
debug: bayes token 'HX-Sieve:cmu-sieve' => 0.0489090909090909
debug: bayes token 'verbally' => 0.0256190476190476
debug: bayes token 'gels' => 0.0256190476190476
debug: bayes token 'hairiness' => 0.0173548387096774
debug: bayes token 'modulating' => 0.0131219512195122
debug: bayes: score = 0.564805986629628

(Any Bayesian classifier can produce this type of list once you've built
a corpus.  And there's no magic in Bayes to stop spammers from doing the
same thing to figure out how many hammy words they need to add as a
counterbalance, and so on.)
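
For the curious, here's a rough Python sketch of one standard way these
per-token probabilities get combined (Gary Robinson's inverse chi-square
method; I'm not claiming it is byte-for-byte what produced the 0.5648
above).  Feeding it the rounded probabilities from the debug output
shows how a handful of hammy padding tokens drags an otherwise
screaming-spam token set back toward the "unsure" middle:

  import math

  def chi2q(x2, v):
      # Survival function of the chi-square distribution for even
      # degrees of freedom v (a truncated Poisson sum in closed form).
      m = x2 / 2.0
      term = total = math.exp(-m)
      for i in range(1, v // 2):
          term *= m / i
          total += term
      return min(total, 1.0)

  def combine(probs):
      # Robinson-style combining: S measures how spammy the token set
      # looks, H how hammy; conflicting evidence pulls the result to 0.5.
      n = len(probs)
      S = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
      H = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
      return (1.0 + S - H) / 2.0

  # Rounded from the debug output above.
  spammy = [0.997, 0.995, 0.989, 0.965, 0.958, 0.958, 0.958, 0.958,
            0.958, 0.958, 0.943, 0.927, 0.925, 0.903, 0.883]
  hammy = [0.152, 0.147, 0.144, 0.132, 0.108, 0.049, 0.049,
           0.026, 0.026, 0.017, 0.013]

  print(combine(spammy))          # close to 1.0 on its own
  print(combine(spammy + hammy))  # dragged back toward the middle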

Note that my Bayes database picked up on some tokens that were actually
added by your own software (the X-Sieve headers, for instance, and the
shape of the Message-ID).  If I removed those, my Bayes probability
would go up.  I didn't get this particular spam, so I don't know whether
Bayes would have worked for me, but probably not well enough to catch it
as spam.
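
Continuing the Python sketch above, dropping the tokens I'm guessing
came from your setup (the two X-Sieve tokens and the two Message-ID
shape tokens) and recombining bears that out:

  # Reuses combine(), spammy, and hammy from the earlier sketch.
  site_added = [0.152, 0.108, 0.049, 0.049]   # my guess at the site-specific tokens
  remaining = [p for p in spammy + hammy if p not in site_added]
  print(combine(remaining))   # noticeably higher than with them included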

Some enhancements to Bayes might be in order.

Daniel
 
-- 
Daniel Quinlan                     anti-spam (SpamAssassin), Linux, and open
http://www.pathname.com/~quinlan/   source consulting (looking for new work)

