On 22 Dec 2018, at 18:39, Damian wrote:

> Hi all,
>
> is there someone who has a good grasp around the mathematics of Bayes
> learning with respect to SpamAssassin?

Justin Mason would be the best person to discuss this. I do not know if he still reads this list.

> I assume that training a fresh BayesStore with a set of spam and ham
> samples is mathematically sound.

Nope.

I mean, it probably is sound for the initial static set of spam and ham it is trained with, until more training and expiration happens. So what? It will NEVER be a mathematically sound Bayesian classifier for the mail it is asked to classify. Never.

It is imperfect for any ongoing collection of spam and ham. There is no such thing as a valid sample of email which applies to the spam/ham classification of tomorrow's email. There are significant qualitative and quantitative differences over time for any target and across targets in any period of time. Exactly identical messages sent to multiple addresses may be ham for one target and spam for another.
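To make the drift point concrete, here is a minimal sketch. It is not SpamAssassin's actual code: the function name, the token, and all counts are invented for illustration, and it assumes equal class priors. It estimates a per-token spam probability from training counts and shows how the same token can score very differently once the mail stream changes.

```python
def token_spam_prob(spam_hits, ham_hits, n_spam, n_ham):
    """Estimate P(spam | token) from per-class message counts.

    Hypothetical helper for illustration only; assumes equal class priors.
    """
    spam_rate = spam_hits / n_spam  # fraction of spam messages containing the token
    ham_rate = ham_hits / n_ham     # fraction of ham messages containing the token
    if spam_rate + ham_rate == 0:
        return 0.5                  # token never seen: no evidence either way
    return spam_rate / (spam_rate + ham_rate)


# Invented counts for one token, trained on two different periods of mail.
last_month = token_spam_prob(spam_hits=90, ham_hits=10, n_spam=1000, n_ham=1000)
this_month = token_spam_prob(spam_hits=10, ham_hits=60, n_spam=1000, n_ham=1000)

print(f"last month: {last_month:.2f}, this month: {this_month:.2f}")
```

With these made-up numbers the token looks strongly spammy in one period and mostly hammy in the next, so a model frozen on the earlier sample would misjudge the later mail.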

> What bothers me a little is the
> expiration logic.

Again the question is: so what?

As is shown almost every week on this list and almost every morning in the update to the default rules channel, spam is a moving target. As investment managers are required to say in the US: past performance is not an indicator of future results.

The "Bayes" classifier in SA is an empirically useful tool, not an academic project. A better implementation might be one that conforms more rigorously to the underlying math, or it might not. What would make an implementation better is doing a better job than the existing one of classifying today's mail, based on whatever training it has and remembers.
