On 22 Dec 2018, at 18:39, Damian wrote:
> Hi all,
> is there someone who has a good grasp around the mathematics of Bayes
> learning with respect to SpamAssassin?
Justin Mason would be the best person to discuss this. I do not know if
he still reads this list.
> I assume that training a fresh BayesStore with a set of spam and ham
> samples is mathematically sound.
Nope.
I mean, it probably is sound for the initial static set of spam and ham
it is trained with, until more training and expiration happens. So what?
It will NEVER be a mathematically sound Bayesian classifier for the mail
it is asked to classify. Never.
It is imperfect for any ongoing collection of spam and ham. There is no
such thing as a valid sample of email which applies to the spam/ham
classification of tomorrow's email. There are significant qualitative
and quantitative differences over time for any target and across targets
in any period of time. Exactly identical messages sent to multiple
addresses may be ham for one target and spam for another.
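
To make the drift point concrete, here is a toy sketch in Python. It is
NOT how SA computes token probabilities; the function name, the simple
unsmoothed ratio, and both tiny corpora are made up for illustration.
It only shows that the same token can carry opposite evidence depending
on which period the training sample came from.

```python
def token_spam_probability(token, spam_msgs, ham_msgs):
    """Unsmoothed toy estimate of P(spam | token) from message counts.

    Hypothetical helper for illustration only, not SpamAssassin's formula.
    """
    # Fraction of spam (resp. ham) messages that contain the token.
    spam_rate = sum(token in m.split() for m in spam_msgs) / len(spam_msgs)
    ham_rate = sum(token in m.split() for m in ham_msgs) / len(ham_msgs)
    if spam_rate + ham_rate == 0:
        return 0.5  # token never seen: no evidence either way
    return spam_rate / (spam_rate + ham_rate)


# Invented corpus standing in for "last month's" training mail.
old_spam = ["cheap pills online", "cheap watches sale"]
old_ham = ["meeting invoice attached", "lunch tomorrow"]

# Invented corpus standing in for "this month's" mail, where spammers
# have started imitating invoices.
new_spam = ["invoice attached open now", "invoice overdue pay now"]
new_ham = ["meeting invoice attached", "cheap flights for the team trip"]

p_old = token_spam_probability("invoice", old_spam, old_ham)
p_new = token_spam_probability("invoice", new_spam, new_ham)
print(f"invoice: trained on old mail {p_old:.2f}, on new mail {p_new:.2f}")
```

A classifier that only remembers the old sample treats "invoice" as
strong ham evidence right when it has become spam evidence, however
sound the math was for the sample it was trained on.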
> What bothers me a little is the expiration logic.
Again the question is: so what?
As is shown almost every week on this list and almost every morning in
the update to the default rules channel, spam is a moving target. As
investment managers are required to say in the US: past performance is
not an indicator of future results.
The "Bayes" classifier in SA is an empirically useful tool, not an
academic project. A better implementation might be one that conforms
more rigorously to the underlying math, or it might not. What would make
an implementation better is doing a better job than the existing one of
classifying today's mail, based on whatever training it has and
remembers.