On Sat, 12 Dec 2015 13:29:40 +0100
Axb wrote:

> On 12/12/2015 01:08 PM, Reindl Harald wrote:
> >> I hate stale data... that's all

But you do keep stale data in the retained tokens; what you are getting
rid of is the contribution from old mails that's least likely to make a
difference to any classifications. Expiry is about managing database
size; if it were about expiring stale information it would be
implemented differently.

> > practical reasons?
> > it's a computer
> performance... If I keep accessing X years of stale data my scanning
> times go to the roof.

The time taken to look up n tokens from a database containing m tokens
shouldn't strongly depend on m. There's something wrong if it does.

> > financial reasons?
> > if you mean performance
>
> no... money.. If I see 15 million msgs/day and keep the Bayes data
> which those millions provided over a decade or more, I'd be in the TB
> amount of data... I couldn't really justify requesting servers with
> TBs RAM. Accounting would put me in the looney house.

The number of tokens depends on how many you train, not on how many you
scan.
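
To illustrate that last point, here is a toy sketch in Python. It is not
SpamAssassin's code: ToyBayesStore and its methods are made-up names, and
it assumes scanned mail is never fed back in as training. Under that
assumption only train() writes tokens, while scan() does lookups and
nothing else, so the store stays the same size however much mail is
scanned.

```python
# Toy model only: ToyBayesStore is a hypothetical class, not SpamAssassin's API.
from collections import Counter


class ToyBayesStore:
    """Minimal token store: training writes tokens, scanning only reads them."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def train(self, tokens, is_spam):
        # Training is the only operation that adds tokens to the store.
        target = self.spam if is_spam else self.ham
        target.update(set(tokens))

    def scan(self, tokens):
        # Scanning looks tokens up and never inserts anything.
        # Returns how many distinct tokens of the message are known.
        return sum(1 for tok in set(tokens)
                   if tok in self.spam or tok in self.ham)

    def size(self):
        # Number of distinct tokens held in the store.
        return len(set(self.spam) | set(self.ham))


store = ToyBayesStore()
store.train(["cheap", "pills", "now"], is_spam=True)
store.train(["meeting", "agenda", "now"], is_spam=False)
size_after_training = store.size()

# Scan many messages, each carrying a token the store has never seen.
for i in range(10_000):
    store.scan(["cheap", f"unseen-{i}", "agenda"])

size_after_scanning = store.size()
print(size_after_training, size_after_scanning)
```

Both sizes come out the same here, because the unseen tokens from the
scanned messages are looked up and then dropped.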