On 11/29/07 5:00 AM, "Matt Kettler" <[EMAIL PROTECTED]> wrote:
> However, it's important to note that the bayes_expiry_period does not
> dictate token life. It dictates how often expiry check will run
> automatically. Basically, SA looks at the database, finds out when the
> last expire ran, and if more than bayes_expiry_period has elapsed, it
> kicks off an auto-expire. Since you're manually expiring every 3 hours,
> your modified bayes_expiry_period never comes into effect.
>
> When expiry (either due to the bayes_expiry_period or a manual
> force-expire) runs, it checks if the database has more than
> "bayes_expiry_max_db_size" tokens in it, SA will attempt to reduce the
> database to 75% of bayes_expiry_max_db_size, keeping the most recently
> used tokens.
>
> In your case, you have a high learning volume, so this means that every
> 3 hours (due to your manual sa-learn --force-expire), your database is
> going to be reduced to the 100,000 most-recently used tokens.

That's what I expected, but that's not what happens...

I have no problem being proven wrong, and perhaps there's something else
going on here, but based on digging through the documentation, the code
(including running sa-learn under the perl debugger down into the expire
module), and observations of actually running a forced expire, this is
not true.

It does try to reduce down to 75% of bayes_expiry_max_db_size, but will
not expire any tokens younger than bayes_expiry_period, even on a force
expire.

With "bayes_expiry_max_db_size 150000" set, if I run a force expire on a
database that is less than bayes_expiry_period old, but with millions of
tokens, *no* tokens are expired. If I run it on one older than
bayes_expiry_period, only tokens older than bayes_expiry_period are
expired.

At the bottom is a sample debug output from an sa-learn forced expire.
You'll notice that the target is to reduce the number of tokens down to
112,500 (token count: 4933086, final goal reduction size: 4820586).
However, look at the reduction table, and the final results: "3702653
entries kept, 1230354 deleted".

As I read the algorithm documentation and the code (as best I remember
without looking at it right now), it goes something like this:

For the first pass, it calculates the number of tokens that would be
expired for bayes_expiry_period*1, bayes_expiry_period*2, *4, ...
exponentially up to the max exponent of 9. It then picks the one that
would expire closest to bayes_expiry_max_db_size, without dropping below
75% of bayes_expiry_max_db_size. The smallest multiple is
bayes_expiry_period*1 - it expires entries older than
bayes_expiry_period*1 because that is closest to .75*150,000 - even
though that leaves 3.5 million.

For subsequent passes, if the estimated cutoff is less than
bayes_expiry_period, then use the above algorithm again.

As further evidence of this, with a static database (spamd not running),
and bayes_expiry_period set to the default of 12 hours, I ran a force
expire. It expired only tokens older than 12 hours, leaving about 7
million. I ran it again. No tokens were removed. I then dropped
bayes_expiry_period down to 6 hours and reran the expire. It then
expired another 3.5 million tokens. It never dropped to anything
approaching 150,000.

All results were verified by comparing the sa-learn debug output with
"sa-learn --dump magic", and done using the DBM module.

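The first-pass selection described above can be sketched in a few lines. This is a Python illustration of the behaviour as described, not SpamAssassin's Perl code: the function names (expiry_goal, first_pass_delta) and the dict-based reduction table are made up for the example, the fallback when every candidate overshoots is a simplification, and the input numbers are taken from the debug log at the end of this message.

```python
def expiry_goal(token_count, max_db_size, keep_fraction=0.75):
    """Return (tokens to keep, tokens to remove) for an expiry run."""
    keep = int(max_db_size * keep_fraction)
    return keep, token_count - keep


def first_pass_delta(period, reductions, goal_reduction, max_exponent=9):
    """Pick an atime delta from period * 2**0 .. period * 2**max_exponent.

    `reductions` maps a candidate delta to the number of tokens that would
    be expired at that delta (hypothetical structure, illustration only).
    """
    candidates = [period * 2 ** e for e in range(max_exponent + 1)]
    # Smaller deltas expire more tokens, so the first candidate that does
    # not overshoot the goal is the one that gets closest to it.
    for delta in candidates:
        if reductions.get(delta, 0) <= goal_reduction:
            return delta
    # Simplification (assumption): fall back to the largest candidate.
    return candidates[-1]


# Values from the debug log: the period is 21600 s and only the first row
# of the reduction table is non-zero.
keep, goal = expiry_goal(4933086, 150000)
table = {21600: 1230354}
chosen = first_pass_delta(21600, table, goal)
print(keep, goal, chosen)  # prints: 112500 4820586 21600
```

Because the smallest candidate is period*1, the chosen delta can never be shorter than bayes_expiry_period. With the log's numbers that means 1230354 tokens are removed, far short of the 4820586 goal.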
One notable thing in the debug output is "bayes: can't use estimation
method for expiry, unexpected result, calculating optimal atime delta
(first pass)", which comes from BayesStore.pm around line 300
("$self->{expiry_period}" and $start are bayes_expiry_period):

    if ( (time() - $vars[4] > 86400*30) || ($vars[8] < $self->{expiry_period}) || ($vars[9] < 1000)
         || ($newdelta < $self->{expiry_period}) || ($ratio > 1.5) ) {
      dbg("bayes: can't use estimation method for expiry, unexpected result, calculating optimal atime delta (first pass)");
      [snip]
      $newdelta = $start * $max_expire_mult;    <<<<<<<<<<<<<<<<<<<<<<<<<<<
      dbg("bayes: first pass decided on $newdelta for atime delta");
    }
    else { # use the estimation method
      dbg("bayes: can do estimation method for expiry, skipping first pass");
    }

This code is triggered because "newdelta: 10981", which is less than
bayes_expiry_period (BayesStore.pm line 281 calculates newdelta):

    # Estimate new atime delta based on the last atime delta
    my $newdelta = 0;
    if ( $vars[9] > 0 ) {
      # newdelta = olddelta * old / goal;
      # this may seem backwards, but since we're talking delta here,
      # not actual atime, we want smaller atimes to expire more tokens,
      # and visa versa.
      #
      $newdelta = int($vars[8] * $vars[9] / $goal_reduction);
    }

Again, the code appears to say "if we're expiring anything younger than
bayes_expiry_period, then recalculate so nothing younger than
bayes_expiry_period is expired".

Here's the log:

[21506] dbg: bayes: expiry starting
[21506] dbg: bayes: expiry check keep size, 0.75 * max: 112500
[21506] dbg: bayes: token count: 4933086, final goal reduction size: 4820586
[21506] dbg: bayes: first pass? current: 1196349002, Last: 1196338557, atime: 21600, count: 1049110, newdelta: 4700, ratio: 4.59492903508688, period: 21600
[21506] dbg: bayes: can't use estimation method for expiry, unexpected result, calculating optimal atime delta (first pass)
[21506] dbg: bayes: expiry max exponent: 9
[21506] dbg: bayes: atime       token reduction
[21506] dbg: bayes: ========    ===============
[21506] dbg: bayes: 21600       1230354
[21506] dbg: bayes: 43200       0
[21506] dbg: bayes: 86400       0
[21506] dbg: bayes: 172800      0
[21506] dbg: bayes: 345600      0
[21506] dbg: bayes: 691200      0
[21506] dbg: bayes: 1382400     0
[21506] dbg: bayes: 2764800     0
[21506] dbg: bayes: 5529600     0
[21506] dbg: bayes: 11059200    0
[21506] dbg: bayes: first pass decided on 21600 for atime delta
[21506] dbg: bayes: untie-ing
[21506] dbg: bayes: files locked, now unlocking lock
[21506] dbg: locker: safe_unlock: unlocked /home/smfs/.spamassassin/bayes.mutex
[21506] dbg: bayes: expiry completed
bayes: synced databases from journal in 0 seconds: 927 unique entries (927 total entries)
expired old bayes database entries in 432 seconds
3702653 entries kept, 1230354 deleted
token frequency: 1-occurrence tokens: 83.22%
token frequency: less than 8 occurrences: 12.56%

Wes
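For reference, the estimation check quoted from BayesStore.pm can be replayed against the values on the "first pass?" line of the log. The sketch below is Python rather than Perl and the function names are invented. The mapping of $vars[4], $vars[8] and $vars[9] to the log's "Last", "atime" and "count" fields is a reading of the log labels. The definition of ratio is an assumption (the larger of goal/count and count/goal), chosen because it reproduces the 4.59 shown in the log; the quoted code does not show how $ratio is computed.

```python
def estimate_newdelta(last_delta, last_count, goal_reduction):
    """newdelta = olddelta * old / goal, truncated to an integer."""
    if last_count <= 0:
        return 0
    return (last_delta * last_count) // goal_reduction


def assumed_ratio(last_count, goal_reduction):
    """ASSUMPTION: ratio is the larger of goal/count and count/goal."""
    if last_count == 0 or last_count > goal_reduction:
        return last_count / goal_reduction
    return goal_reduction / last_count


def needs_first_pass(now, last_expire, last_delta, last_count,
                     newdelta, ratio, period):
    """Mirror of the quoted condition: True means estimation is rejected."""
    return (now - last_expire > 86400 * 30
            or last_delta < period
            or last_count < 1000
            or newdelta < period
            or ratio > 1.5)


# Values from the "first pass?" log line.
goal = 4820586
newdelta = estimate_newdelta(21600, 1049110, goal)
ratio = assumed_ratio(1049110, goal)
rejected = needs_first_pass(1196349002, 1196338557, 21600, 1049110,
                            newdelta, ratio, 21600)
print(newdelta, round(ratio, 2), rejected)  # prints: 4700 4.59 True
```

With these inputs both "newdelta < period" and "ratio > 1.5" hold, so the estimate is discarded and the first pass runs, which in turn cannot pick anything shorter than bayes_expiry_period.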