Re: SpamAssassin Rules Regarding Abuse of New Top Level Domains

Bill Cole Mon, 19 Oct 2015 15:22:54 -0700

On 19 Oct 2015, at 15:22, Larry Goldman wrote:

I found that much of the SPAM had a BAYES_00 score of -1.9, which was defeating the contribution of the other tests. A closer inspection of the raw source revealed invisible gibberish text which, I assume, is designed to thwart the default BAYES_00 test, very clever. I have since changed the score of that test to 0.


[ Coming at what John Hardin said from a different angle ]

The BAYES_?? rules are not discrete tests in themselves, but rather scoring ranges in a 0-100% spam probability scale. As such, it makes no sense to adjust one of the scores like BAYES_00 and leave all the others where they were, since they are just steps that should progress monotonically: BAYES_00 should have the lowest (most negative) score value and BAYES_99 should have the highest (most positive). BAYES_999 is an exception to this, in that it always triggers in addition to BAYES_99 and so is a supplement for the most spammy of all spam. If you score BAYES_50 anything other than 0, you are essentially asserting that your Bayes DB is skewed (as it may well be!). If you don't score BAYES_{00..99} in a monotonically ascending way, you are rejecting the basic soundness of the Bayesian classifier and probably should disable it instead.
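To make the monotonic-steps point concrete, here is a minimal Python sketch that checks a set of local score overrides for out-of-order Bayes buckets. It is only an illustration: the function names are made up for this example, the score values in the usage note are invented, and it handles only the single-value form of a `score` line (SpamAssassin also accepts four values per line, which this ignores).

```python
import re

# Bayes pseudo-rules in order of increasing spam probability.
# BAYES_999 is left out on purpose: it fires in addition to BAYES_99,
# so it is not one of the steps on the scale.
BAYES_RULES = ["BAYES_00", "BAYES_05", "BAYES_20", "BAYES_40", "BAYES_50",
               "BAYES_60", "BAYES_80", "BAYES_95", "BAYES_99"]

# Assumption: only the one-value form "score RULE VALUE" is recognized.
SCORE_LINE = re.compile(r"^\s*score\s+(BAYES_\d+)\s+(-?\d+(?:\.\d+)?)\s*$")


def parse_scores(config_text):
    """Collect BAYES_* scores from config text, ignoring '#' comments."""
    scores = {}
    for line in config_text.splitlines():
        line = line.split("#", 1)[0]
        match = SCORE_LINE.match(line)
        if match:
            scores[match.group(1)] = float(match.group(2))
    return scores


def monotonicity_problems(scores):
    """Return pairs of neighbouring buckets whose scores go down instead of up."""
    present = [(rule, scores[rule]) for rule in BAYES_RULES if rule in scores]
    problems = []
    for (low_rule, low), (high_rule, high) in zip(present, present[1:]):
        if low > high:
            problems.append((low_rule, high_rule))
    return problems
```

With invented values, zeroing only BAYES_00 while BAYES_05 stays negative is reported as a break in the scale: `monotonicity_problems(parse_scores("score BAYES_00 0\nscore BAYES_05 -0.5\n"))` returns `[("BAYES_00", "BAYES_05")]`.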

How a particular piece of email scores in the SpamAssassin Bayesian classifier is entirely dependent on the details of past mail received and learned by the specified Bayes database being used. There is nothing in the Bayes DB by default and different users of the same host may use different DBs (or not, depending on site-specific configs) and so score identical mail differently. A message may score entirely neutral (BAYES_50) if received today yet score at either extreme (BAYES_00 or BAYES_99+BAYES_999) tomorrow or yesterday. An empty Bayes DB will not be scored against. A mis-trained Bayes DB can be worse than useless.

When a piece of spam hits BAYES_00 or a piece of ham hits BAYES_99, the best response is NOT to change the score of the individual pseudo-rule, it is to retrain the DB: maybe with just that one message and any other mis-scored ones you notice so as to fix it over time, maybe with a wipe to rebuild with a fresh hand-classified corpus of local spam and ham.
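As a rough sketch of the "retrain with just that one message" option, the Python below builds the `sa-learn` command for relearning a single saved message as spam or ham. The helper names and the file name in the usage note are hypothetical, and it defaults to a dry run so nothing is executed unless you ask for it. It would need to run as whichever user owns the Bayes DB you want to correct.

```python
import subprocess


def sa_learn_command(message_path, is_spam):
    """Build the sa-learn argv that learns one message as spam or as ham."""
    flag = "--spam" if is_spam else "--ham"
    return ["sa-learn", flag, message_path]


def relearn(message_path, is_spam, dry_run=True):
    """Return the command; run it only when dry_run is False."""
    cmd = sa_learn_command(message_path, is_spam)
    if not dry_run:
        # Raises CalledProcessError if sa-learn exits with a non-zero status.
        subprocess.run(cmd, check=True)
    return cmd
```

For example, `relearn("missed-spam.eml", is_spam=True)` returns `["sa-learn", "--spam", "missed-spam.eml"]` without running anything. The wipe-and-rebuild option would instead start from `sa-learn --clear` and then learn the hand-classified spam and ham corpora.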

And don't worry about the gibberish tail on that spam. It actually does not do much to Bayesian classification unless it gets reused enough that the gibberish itself becomes de facto spamsign AND is full of words that don't appear so much in regular ham. For example, there's a particular spammer whose junk includes what looks like biblical passages tacked onto the end of ~75% of his messages, which has assured that he rarely escapes rejection. (Not a lot of business users exchanging bible passages...)
