On 19 Oct 2015, at 15:22, Larry Goldman wrote:

I found that much of the SPAM had a BAYES_00 score of -1.9, which was defeating the contribution of the other tests. A closer inspection of the raw source revealed invisible gibberish text which, I assume, is designed to thwart the default BAYES_00 test: very clever. I have since changed the score of that test to 0.

[ Coming at what John Hardin said from a different angle ]

The BAYES_?? rules are not discrete tests in themselves, but rather scoring ranges on a 0-100% spam probability scale. As such, it makes no sense to adjust one score, such as BAYES_00, and leave all the others where they were, since they are just steps that should progress monotonically: BAYES_00 should have the lowest (most negative) score value and BAYES_99 should have the highest (most positive). BAYES_999 is an exception to this, in that it always triggers in addition to BAYES_99 and so is a supplement for the most spammy of all spam. If you score BAYES_50 anything other than 0, you are essentially asserting that your Bayes DB is skewed (as it may well be!). If you don't score BAYES_{00..99} in a monotonically ascending way, you are rejecting the basic soundness of the Bayesian classifier and probably should disable it instead.
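To make the monotonic point concrete, here is a minimal Python sketch that checks whether a set of BAYES_* scores rises steadily from BAYES_00 to BAYES_99. The function names are my own, and the score values are made up for illustration; only the -1.9 for BAYES_00 comes from the quoted message. BAYES_999 is deliberately left out of the check because it fires on top of BAYES_99.

```python
import re


def bayes_bucket_scores(scores):
    """Return (bucket, score) pairs for the two-digit BAYES_NN rules, sorted by bucket.

    BAYES_999 is skipped: it triggers in addition to BAYES_99 rather than
    being a step of its own.
    """
    buckets = []
    for name, score in scores.items():
        match = re.fullmatch(r"BAYES_(\d\d)", name)
        if match:
            buckets.append((int(match.group(1)), score))
    return sorted(buckets)


def is_monotonic(scores):
    """True if the score never decreases as the Bayes probability bucket rises."""
    values = [score for _, score in bayes_bucket_scores(scores)]
    return all(a <= b for a, b in zip(values, values[1:]))


# Hypothetical scores for illustration; only BAYES_00 = -1.9 is from the thread.
example = {
    "BAYES_00": -1.9, "BAYES_05": -0.5, "BAYES_20": -0.1, "BAYES_40": -0.05,
    "BAYES_50": 0.0, "BAYES_60": 1.0, "BAYES_80": 2.0, "BAYES_95": 3.0,
    "BAYES_99": 3.5, "BAYES_999": 0.2,
}

# The change described in the quoted message: BAYES_00 set to 0, the rest untouched.
changed = dict(example, BAYES_00=0.0)

print(is_monotonic(example))
print(is_monotonic(changed))
```

With these made-up numbers the first set passes and the second fails, because BAYES_00 at 0 now scores higher than BAYES_05 through BAYES_40.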

How a particular piece of email scores in the SpamAssassin Bayesian classifier depends entirely on the details of the past mail received and learned into the Bayes database in use. There is nothing in the Bayes DB by default, and different users of the same host may use different DBs (or not, depending on site-specific configs) and so score identical mail differently. A message may score entirely neutral (BAYES_50) if received today yet score at either extreme (BAYES_00 or BAYES_99+BAYES_999) tomorrow or yesterday. An empty Bayes DB will not be scored against. A mis-trained Bayes DB can be worse than useless.

When a piece of spam hits BAYES_00 or a piece of ham hits BAYES_99, the best response is NOT to change the score of the individual pseudo-rule; it is to retrain the DB. That may mean relearning just that one message and any other mis-scored ones you notice, so as to fix it over time, or it may mean wiping the DB and rebuilding it from a fresh hand-classified corpus of local spam and ham.
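As a rough sketch of the "retrain with just that one message" route, the Python below builds the sa-learn command line for relearning a single message as spam or ham. The helper name and the file path are hypothetical; `--spam` and `--ham` are sa-learn's own options. The code only builds and prints the command, so pass the list to `subprocess.run` yourself once you are happy with it, and run it as the user whose Bayes DB you mean to fix.

```python
import shlex


def sa_learn_command(kind, path):
    """Build an sa-learn argv list that relearns one message as spam or ham."""
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    return ["sa-learn", "--" + kind, path]


# Hypothetical path to a spam message that wrongly hit BAYES_00.
cmd = sa_learn_command("spam", "/tmp/missed-spam.eml")
print(shlex.join(cmd))
```

For the wipe-and-rebuild route, `sa-learn --clear` empties the database before you feed it a hand-classified corpus.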

And don't worry about the gibberish tail on that spam. It actually does not do much to Bayesian classification unless it gets reused enough that the gibberish itself becomes de facto spamsign AND is full of words that don't appear so much in regular ham. For example, there's a particular spammer whose junk includes what looks like biblical passages tacked onto the end of ~75% of his messages, which has assured that he rarely escapes rejection. (Not a lot of business users exchanging bible passages...)
