On 23.12.18 at 02:35, RW wrote:

>> The purpose of expiration seems to be a practical one, we don't want
>> the BayesStore grow too much. But is there a conceptual counterpart?
>> One such concept could be:
>> Maintain the store as if it were trained from scratch with spam and
>> ham mails up to N days into the past.
> 
> If you want to do that you can keep spam and ham corpora, trim them
> periodically and recreate the database from scratch. This works for
> manually maintained corpora, but won't scale to high volume
> auto-training.

Yes, scalability is an issue, but mostly a practical one. My question is
rather about the mathematical foundations of the classifier and whether
SpamAssassin has lost track of its roots. If there were a statistics
exercise with a very tiny corpus and very short mails and the professor
wanted us to calculate the probability of a tiny mail being ham or spam
via Bayes' theorem, then there would be only one correct solution.

If the professor wanted to reuse the exercise with different parameters
for various exams, maybe he would tell his assistant to create a little
software so that he could play around with training corpora and tiny
input mails in order to make the solution look elegant for an exam. This
software needs to be correct.

I just presume that SpamAssassin, at its core, along with the naive
Bayes combiner, might be such a piece of software, as long as no
expiration is involved and ignoring the usual precision/rounding errors.

Assume we take the professor's elegant exercise, feed it to a
stripped-down version of SpamAssassin that outputs probabilities instead
of scores, and see that the probability is mathematically correct.
Now we throw away all tokens and seen messages, and relearn one spam and
one ham message. We query the software for a solution for the same input mail
as before. Will it coincide with the correct one? Probably not. We can
make a game out of it. Prepare carefully designed corpora and test input
mails and see how far off the solution can get. I believe that it can
get pretty far off with small corpora, and so did the SpamAssassin
authors when they introduced min_spam_num and min_ham_num. The more
important questions are:

Is there a useful guaranteed boundary of how far off one can get?
There is a boundary of +-100%, but that is not useful.

Does the boundary decrease with respect to the nham+nspam counts, or
does it decrease with respect to the number of tokens?

> I do it on my own mail for a different reason, I like the idea of being
> able to remove the influence of very old email.  
> 
>> However if I am not mistaken, that is not the implementation.
>>
>> The nspam and nham magic counters mostly only increase. They will
>> decrease when a message is forgotten or relearnt, but they will not
>> decrease on expiration.
> 
> Nor should they as that would affect the frequencies of the tokens
> that haven't been expired. Core tokens that are never expired produce
> the same frequencies they would have had if no expiry had taken place.
> That's the big advantage of token expiry. 

This seems plausible to me, but one plausible feature does not prove the
correctness of the whole system in the above sense. I have another,
hopefully plausible design argument that is not implemented:

There might be tokens that are very rare, but when they occur, the mail
is almost certainly spam. Because they are so rare, they always expire
before they can help classify other mails. So shouldn't SpamAssassin
also keep the strongest tokens?

> Ideally expiry should be light enough to only remove ephemeral tokens
> that won't be seen in future and very rare tokens that hardly ever
> change anything. Beyond that it's just a compromise. Those that do have
> to compromise on retention aren't going to want a huge reduction in
> return for a minor change in theoretical correctness.

Does this mean that there has been an evaluation of how minor or
non-minor such a change can theoretically be?

>> If I am not mistaken there are conceptual differences between some
>> BayesStore implementations. PgSQL will expire tokens if configured,
>> but it will not expire seen messages. Redis on the other hand expires
>> both tokens and seen messages (with a huge ttl difference between
>> those two in the default configuration, on top of that).
> 
> The seen message information is just a flag that keeps track of
> whether a particular email was previously trained as spam or ham;
> expiring it prevents very old emails from being forgotten or
> retrained correctly. It doesn't have anything to do with classification.

After a seen message-id has expired, the message can be relearnt, and
relearning it increases nspam or nham and its token counters. So it does
affect classification, if only in an indirect sense.
