Kai Schaetzl wrote:
> Many of my Bayes db's (not SQL) can't be expired anymore because the
> --force-expire run can't find a delta that is big enough or so. I
> tried several settings for max_size that would expire either only a
> few tokens or most of the db, and some steps in between. Always the
> same problem. Is there a way or script that can *really* enforce an
> expire?

Not really. If it can't find a delta, then SA has no sensible way to do
the expiry, and there's not likely any "good" expiry that can be
performed.
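(For reference, the "max_size" knob is bayes_expiry_max_db_size. A
quick rundown of the settings and commands involved, assuming 3.x-era
option names -- check the Mail::SpamAssassin::Conf docs for your
version:

    # local.cf
    bayes_expiry_max_db_size 150000  # target token count; 150000 is the default
    bayes_auto_expire 1              # allow opportunistic expiry during scans

    # from the shell, run as the user owning the bayes files
    sa-learn --dump magic            # token count, oldest/newest atimes, last expiry
    sa-learn -D --force-expire       # manual expiry attempt, with debug output

None of these will *force* anything past the delta search, though, for
the reasons below.)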
First, understand that bayes expiry works by picking a cutoff time and
dropping everything that hasn't been used since then: essentially
discarding all the stale data and keeping the fresh stuff. I forget the
exact threshold, but I think anything more recent than 12 hours is
always kept by the algorithm, to avoid flushing out stuff that's
clearly being used regularly.

When no delta can be found, it generally means your bayes DB has a
large set of data, all at more-or-less the same timestamp, and very
little other data. If SA were to pick an expiry date that drops the
large chunk, your bayes disables itself because there's no longer
enough data, which is more-or-less the same as deleting the bayes DB
entirely. On the other hand, there's no data older than the large
chunk, so any other cutoff expires nothing.

One cause of this is a single big blob of hand training that created
the "chunk" in the dates. In this situation SA will eventually drop the
chunk, but only after there are enough recently used tokens to
differentiate it.

> I don't want to throw them away as they are working quite fine. But I
> also don't want them to grow indefinitely (now at around 3-4 million
> tokens).

Hmm, do you have a high mail volume? Another possibility: a large,
diverse volume of mail is quite likely to create a broad set of
constantly fresh tokens, all under the 12-hour threshold. On the plus
side, this essentially makes SA scale the bayes DB automatically to
fit your mail volume, and it shouldn't grow indefinitely, since any
token that isn't being used regularly winds up expired after it goes
unused for a day or so. On the down side, your bayes DB can be large.

It might be worth looking at a force expire with debug on (sa-learn -D
--force-expire). The time ranges and their respective reduction counts
should tell you which scenario you're in. Can you post the reduction
count table?

> Years ago I used a very difficult way to expire such a problem db by
> dumping it and then removing the "correct" data with a script and then
> importing it again. And after that it would expire like normal again.
> However, this was quite tricky and I think it wouldn't work with the
> current format anymore.
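That trick should still work in principle, since sa-learn can dump the
db to plain text and read it back. A rough sketch, assuming the dump's
token lines start with "t" and carry the atime in the fourth
whitespace-separated field (check that against your own dump before
trusting it, and keep the unfiltered backup around):

    # dump the db to a plain-text backup (as the user owning the bayes files)
    sa-learn --backup > bayes.dump

    # keep all non-token lines, plus only tokens used in the last 30 days
    # ('date -d' is GNU date; compute the cutoff epoch differently on BSD)
    cutoff=$(date -d '30 days ago' +%s)
    awk -v c="$cutoff" '$1 != "t" || $4 >= c' bayes.dump > bayes.filtered

    # wipe the db and reload the filtered dump
    sa-learn --clear
    sa-learn --restore bayes.filtered

The usual caveat applies: if the filter drops too much, bayes just
disables itself until the token counts climb back over the minimums,
which is the same failure mode the expiry code is trying to avoid.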