On Fri, 20 Mar 2015 22:08:23 -0400
David F. Skoll wrote:

> Bayes expiry is a tricky thing.  To do expiry in a way that can be
> justified mathematically, you really should expire messages, not
> individual tokens. Otherwise, you're skewing the probabilities.

The only token probabilities that can be skewed by token expiry are
those that get expired and are subsequently relearned. Even then,
once those tokens are relearned, the probabilities will end up more
or less correct provided that the ham/spam ratio in subsequent
training is similar to the overall ratio in the database. The skewing
of probabilities in relearned tokens is no worse than the skewing on new
tokens seen for the first time, and the latter happens whether you
expire or not. 
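
To make the ratio argument concrete, here is a rough sketch. It
assumes a simple per-token estimate, (spam hits / spam messages)
divided by the sum of the spam and ham rates, which is not necessarily
what any particular filter computes. The function names and all the
numbers are made up for illustration.

```python
def token_spam_probability(spam_hits, ham_hits, n_spam, n_ham):
    """Simple per-token estimate (an assumed formula, for illustration)."""
    spam_rate = spam_hits / n_spam
    ham_rate = ham_hits / n_ham
    return spam_rate / (spam_rate + ham_rate)


def relearned_estimates(old_spam, old_ham, new_spam, new_ham,
                        spam_hits, ham_hits):
    """Compare two estimates for a token that was expired and relearned.

    The token's hit counts only cover messages trained after expiry,
    but the database still divides by the totals for ALL messages.
    """
    # What the database computes: recent hits over whole-database totals.
    from_database = token_spam_probability(
        spam_hits, ham_hits, old_spam + new_spam, old_ham + new_ham)
    # Reference value: recent hits over recent message counts only.
    reference = token_spam_probability(
        spam_hits, ham_hits, new_spam, new_ham)
    return from_database, reference


# Hypothetical database: 6000 spam and 4000 ham learned before expiry.
# Case 1: later training keeps the same 3:2 spam/ham ratio.
same_ratio = relearned_estimates(6000, 4000, 600, 400, 30, 5)
# Case 2: later training is 9:1 spam/ham, unlike the database.
different_ratio = relearned_estimates(6000, 4000, 900, 100, 30, 5)

print("same ratio:      database=%.3f reference=%.3f" % same_ratio)
print("different ratio: database=%.3f reference=%.3f" % different_ratio)
```

With the same ratio the two estimates agree (both about 0.8 with these
numbers), because only the ratio of the two denominators matters. With
the mismatched ratio they diverge, which is the case where relearned
tokens really are skewed.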

Given that the effect of skewing from relearned tokens can be made
arbitrarily small compared with skewing on new tokens, I don't see
much of a theoretical case for preferring message expiry over token
expiry on the grounds you mention, provided that you only expire
ephemeral and sufficiently low-frequency tokens.

The real reason for preferring message expiry over token expiry is that
it adapts better to changing token frequencies. Token expiry gives more
accurate token probabilities for core tokens with static frequencies.

  
