On Sun, 23 Dec 2018 00:39:02 +0100
Damian wrote:

> Hi all,
>
> is there someone who has a good grasp around the mathematics of Bayes
> learning with respect to SpamAssassin?
>
> I assume that training a fresh BayesStore with a set of spam and ham
> samples is mathematically sound.

It's not so sound that it didn't need a lot of guesswork and trial and
error.

> What bothers me a little is the
> expiration logic.
>
> The purpose of expiration seems to be a practical one, we don't want
> the BayesStore grow too much. But is there a conceptual counterpart?
> One such concept could be:
> Maintain the store as if it were trained from scratch with spam and
> ham mails up to N days into the past.

If you want to do that you can keep spam and ham corpora, trim them
periodically and recreate the database from scratch. This works for
manually maintained corpora, but won't scale to high-volume
auto-training. I do it on my own mail for a different reason: I like
the idea of being able to remove the influence of very old email.

> However if I am not mistaken, that is not the implementation.
>
> The nspam and nham magic counters mostly only increase. They will
> decrease when a message is forgotten or relearnt, but they will not
> decrease on expiration.

Nor should they, as that would affect the frequencies of the tokens
that haven't been expired. Core tokens that are never expired produce
the same frequencies they would have had if no expiry had taken place.
That's the big advantage of token expiry.

Ideally, expiry should be light enough to remove only ephemeral tokens
that won't be seen in future and very rare tokens that hardly ever
change anything. Beyond that it's just a compromise. Those who do have
to compromise on retention aren't going to want a huge reduction in
retention in return for a minor improvement in theoretical correctness.

> If I am not mistaken there are conceptual differences between some
> BayesStore implementations. PgSQL will expire tokens if configured,
> but it will not expire seen messages. Redis on the other hand expires
> both tokens and seen messages (with a huge ttl difference between
> those two in the default configuration, on top of that).

The seen message information is just a flag that keeps track of whether
a particular email was previously trained as spam or ham; expiring it
prevents very old emails from being forgotten or retrained correctly.
It doesn't have anything to do with classification.
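
To illustrate the point about leaving nspam and nham alone on expiry,
here is a toy sketch in Python. It is not SpamAssassin's code: the
ToyStore class, its methods and the simplified per-token ratio are all
made up for the example, and the real scoring does more than this. It
only shows that dropping a token leaves the frequencies of the surviving
tokens untouched, whereas decrementing a counter at the same time would
shift them.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStore:
    """Toy token store (hypothetical; not SpamAssassin's BayesStore)."""
    nspam: int = 0
    nham: int = 0
    tokens: dict = field(default_factory=dict)  # token -> [spam_count, ham_count]

    def learn(self, words, is_spam):
        # Count the message once, and each distinct token once per message.
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1
        for word in set(words):
            counts = self.tokens.setdefault(word, [0, 0])
            counts[0 if is_spam else 1] += 1

    def prob(self, token):
        # Simplified per-token spam probability from relative frequencies.
        spam_count, ham_count = self.tokens[token]
        spam_freq = spam_count / self.nspam
        ham_freq = ham_count / self.nham
        return spam_freq / (spam_freq + ham_freq)

    def expire(self, keep):
        # Drop tokens not in `keep`; nspam and nham are deliberately untouched.
        self.tokens = {t: c for t, c in self.tokens.items() if t in keep}


store = ToyStore()
store.learn(["viagra", "x9f3k"], is_spam=True)
store.learn(["viagra", "offer"], is_spam=True)
store.learn(["offer", "q7zz1"], is_spam=True)
store.learn(["meeting", "offer"], is_spam=False)
store.learn(["meeting", "agenda"], is_spam=False)

before = store.prob("offer")
store.expire(keep={"viagra", "offer", "meeting"})
after = store.prob("offer")

# What would happen if expiry also decremented nspam (assumed variant).
skewed_store = ToyStore(nspam=store.nspam - 1, nham=store.nham,
                        tokens=dict(store.tokens))
skewed = skewed_store.prob("offer")

print(before, after, skewed)
```

Here before and after are identical because "offer" keeps its counts
and the message totals don't move; skewed differs because the same
counts are now divided by a smaller nspam.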