On Fri, 20 Mar 2015 22:08:23 -0400 David F. Skoll wrote:
> Bayes expiry is a tricky thing. To do expiry in a way that can be
> justified mathematically, you really should expire messages, not
> individual tokens. Otherwise, you're skewing the probabilities.

The only token probabilities that can be skewed by token expiry are those
that get expired and are subsequently relearned. Even then, when those
tokens are relearned, the probabilities will end up more or less correct
provided that the ham/spam ratio in subsequent training is similar to the
overall ratio in the database.

The skewing of probabilities in relearned tokens is no worse than the
skewing on new tokens seen for the first time, and the latter happens
whether you expire or not. Given that the effect of skewing from relearned
tokens can be made arbitrarily small compared with skewing on new tokens,
I don't see much of a theoretical case for preferring message expiry over
token expiry on the grounds you mention - provided that you only expire
ephemeral and sufficiently low-frequency tokens.

The real reason for preferring message expiry over token expiry is that it
adapts better to changing token frequencies. Token expiry, by contrast,
gives more accurate token probabilities for core tokens with static
frequencies.
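
To put rough numbers on the relearning argument, here is a small sketch. It assumes a simplified frequency-ratio estimate of token probability with no smoothing, which is not necessarily what any particular filter computes, and it assumes that token expiry zeroes a token's hit counts while leaving the database's message totals untouched. The function names and the figures are made up for illustration.

```python
def spam_probability(spam_hits, ham_hits, n_spam, n_ham):
    """Simplified per-token estimate: spam rate over (spam rate + ham rate)."""
    spam_rate = spam_hits / n_spam
    ham_rate = ham_hits / n_ham
    return spam_rate / (spam_rate + ham_rate)


def relearned_probability(f_spam, f_ham, old_spam, old_ham, new_spam, new_ham):
    """Estimate for a token that was expired and then relearned.

    f_spam, f_ham: fraction of spam/ham messages containing the token
    (assumed static). The token's counts restart from zero, so only the
    new messages contribute hits, but the message totals still include
    the old messages (assumption: token expiry does not adjust totals).
    """
    spam_hits = f_spam * new_spam
    ham_hits = f_ham * new_ham
    return spam_probability(spam_hits, ham_hits,
                            old_spam + new_spam, old_ham + new_ham)


# Hypothetical token present in 20% of spam and 5% of ham.
true_p = spam_probability(0.20, 0.05, 1, 1)

# Database held 6000 spam / 4000 ham when the token was expired.
# Case 1: later training keeps the same 3:2 spam/ham ratio.
same_ratio = relearned_probability(0.20, 0.05, 6000, 4000, 600, 400)

# Case 2: later training is mostly ham (1:4), unlike the database.
shifted = relearned_probability(0.20, 0.05, 6000, 4000, 200, 800)

print(f"true={true_p:.3f} same_ratio={same_ratio:.3f} shifted={shifted:.3f}")
```

Under these assumptions the relearned estimate matches the true value when the later training mix matches the database, and drifts only when the mix changes. A brand-new token learned from the same later messages would get exactly the same counts, which is the sense in which relearned tokens are no worse off than new ones.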