Re: Is the SA Bayes implementation mathematically sound?

RW Sat, 22 Dec 2018 17:44:29 -0800

On Sun, 23 Dec 2018 00:39:02 +0100
Damian wrote:

> Hi all,
> 
> is there someone who has a good grasp around the mathematics of Bayes
> learning with respect to SpamAssassin?
> 
> I assume that training a fresh BayesStore with a set of spam and ham
> samples is mathematically sound.


It's not so sound that it didn't need a lot of guesswork and trial and
error.

> What bothers me a little is the
> expiration logic.
> 
> The purpose of expiration seems to be a practical one, we don't want
> the BayesStore grow too much. But is there a conceptual counterpart?
> One such concept could be:
> Maintain the store as if it were trained from scratch with spam and
> ham mails up to N days into the past.

If you want to do that you can keep spam and ham corpora, trim them
periodically and recreate the database from scratch. This works for
manually maintained corpora, but won't scale to high volume
auto-training.

I do it on my own mail for a different reason: I like the idea of being
able to remove the influence of very old email.
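
A minimal sketch of the trimming step, assuming each corpus is a
directory with one message per file and that file modification time is
a good enough proxy for message age. The function name and the
directory layout are hypothetical, not part of SpamAssassin:

```python
import os
import time


def trim_corpus(directory, max_age_days, now=None):
    """Delete message files older than max_age_days; return the removed names."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # age limit as a Unix timestamp
    removed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # Only plain files are treated as messages; subdirectories are skipped.
        if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
            os.remove(path)
            removed.append(name)
    return removed
```

After trimming both corpora, the database would be cleared and
retrained on what is left, for example with sa-learn --clear followed
by sa-learn --spam and sa-learn --ham on the two directories.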

> However if I am not mistaken, that is not the implementation.
> 
> The nspam and nham magic counters mostly only increase. They will
> decrease when a message is forgotten or relearnt, but they will not
> decrease on expiration.

Nor should they, as that would affect the frequencies of the tokens
that haven't been expired. Core tokens that are never expired produce
the same frequencies they would have had if no expiry had taken place.
That's the big advantage of token expiry.
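
A toy illustration of that point, assuming a simplified per-token
estimate built from relative frequencies (spam count over nspam, ham
count over nham). This is not SA's actual code, and the tokens and
counts are made up:

```python
def token_prob(spam_count, ham_count, nspam, nham):
    """Simplified per-token spam probability from relative frequencies."""
    spam_freq = spam_count / nspam
    ham_freq = ham_count / nham
    return spam_freq / (spam_freq + ham_freq)


# token -> (times seen in spam, times seen in ham); illustrative numbers only
tokens = {"viagra": (40, 1), "meeting": (2, 30), "xq9z-tracking-id": (1, 0)}
nspam, nham = 100, 100

before = token_prob(*tokens["viagra"], nspam, nham)

# Token expiry: drop the ephemeral token, leave nspam and nham alone.
del tokens["xq9z-tracking-id"]
after_expiry = token_prob(*tokens["viagra"], nspam, nham)

# Hypothetical alternative: also lower nspam as if 20 old spams were expired.
after_decrement = token_prob(*tokens["viagra"], nspam - 20, nham)

print(before == after_expiry)      # the surviving token is unaffected
print(before == after_decrement)   # changing nspam shifts its probability
```

With the counters left alone, the surviving token's estimate is
identical before and after expiry; lowering nspam without also
adjusting that token's own count changes it.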

Ideally expiry should be light enough to only remove ephemeral tokens
that won't be seen in future and very rare tokens that hardly ever
change anything. Beyond that it's just a compromise. Those that do have
to compromise on retention aren't going to want a huge reduction in
retention in return for a minor gain in theoretical correctness.


> If I am not mistaken there are conceptual differences between some
> BayesStore implementations. PgSQL will expire tokens if configured,
> but it will not expire seen messages. Redis on the other hand expires
> both tokens and seen messages (with a huge ttl difference between
> those two in the default configuration, on top of that).

The seen message information is just a flag that keeps track of
whether a particular email was previously trained as spam or ham;
expiring it prevents very old emails from being forgotten or
retrained correctly. It doesn't have anything to do with classification.
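
A toy model of the seen flag, assuming a store that tracks only the
message counters and a per-message label. The class and method names
are hypothetical and token counts are left out for brevity:

```python
class ToyBayesStore:
    """Tracks message counters and a per-message 'seen' label only."""

    def __init__(self):
        self.nspam = 0
        self.nham = 0
        self.seen = {}  # message id -> "spam" or "ham"

    def _adjust(self, label, delta):
        if label == "spam":
            self.nspam += delta
        else:
            self.nham += delta

    def forget(self, msg_id):
        label = self.seen.pop(msg_id, None)
        if label is None:
            return False  # no seen record, so nothing can be undone
        self._adjust(label, -1)
        return True

    def learn(self, msg_id, label):
        if self.seen.get(msg_id) == label:
            return False  # already learnt with this label
        self.forget(msg_id)  # relearn: undo the old label if it is still known
        self._adjust(label, +1)
        self.seen[msg_id] = label
        return True

    def expire_seen(self, msg_id):
        self.seen.pop(msg_id, None)  # the counters are deliberately untouched


store = ToyBayesStore()
store.learn("msg-1", "ham")
store.expire_seen("msg-1")
store.learn("msg-1", "spam")  # the earlier ham training can no longer be undone
print(store.nspam, store.nham)
```

Once the seen entry is gone, retraining the message as spam adds to
nspam but leaves the earlier ham training in place, which is the
incorrect retraining described above.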
