Re: Mondo bayes_toks - millions of entries

Wes Thu, 29 Nov 2007 09:39:52 -0800

On 11/29/07 5:00 AM, "Matt Kettler" <[EMAIL PROTECTED]> wrote:


> However,  it's important to note that the  bayes_expiry_period  does not
> dictate token life. It dictates how often expiry check will run
> automatically. Basically, SA looks at the database, finds out when the
> last expire ran, and if more than bayes_expiry_period has elapsed, it
> kicks off an auto-expire. Since you're manually expiring every 3 hours,
> your modified bayes_expiry_period never comes into effect.
>  
> When expiry (either due to the bayes_expiry_period or a manual
> force-expire) runs, it checks if the database has more than
> "bayes_expiry_max_db_size" tokens in it. If so, SA will attempt to reduce
> the database to 75% of bayes_expiry_max_db_size, keeping the most recently
> used tokens.
> 
> In your case, you have a high learning volume, so this means that every
> 3 hours (due to your manual sa-learn --force-expire), your database is
> going to be reduced to the 100,000 most-recently used tokens.

That's what I expected, but that's not what happens...

I have no problem being proven wrong, and perhaps there's something else
going on here, but based on digging through the documentation, the code
(including running sa-learn under the perl debugger down into the expire
module), and observations of actually running a forced expire, this is not
true.  

It does try to reduce down to 75% of bayes_expiry_max_db_size, but will not
expire any tokens younger than bayes_expiry_period, even on a force expire.
With "bayes_expiry_max_db_size 150000" set, if I run a force expire on a
database that is less than bayes_expiry_period old, but with millions of
tokens, *no* tokens are expired.  If I run it on one older than
bayes_expiry_period, only tokens older than bayes_expiry_period are expired.

At the bottom is a sample debug output from sa-learn forced expire.  You'll
notice that the target is to reduce the number of tokens down to 112,500
(token count: 4933086, final goal reduction size: 4820586).  However, look
at the reduction table, and the final results: "3702653 entries kept,
1230354 deleted".
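
To make the gap between the goal and the result concrete, here is a small
Python sketch of the arithmetic. It is illustrative only and is not
SpamAssassin code; the variable names are hypothetical, and the inputs are the
bayes_expiry_max_db_size setting quoted above and the counts from the debug
output at the bottom of this message.

```python
# Illustrative arithmetic only; these names are hypothetical, not SpamAssassin's.
max_db_size = 150_000      # bayes_expiry_max_db_size as set above
token_count = 4_933_086    # "token count" from the debug output
deleted = 1_230_354        # "deleted" from the final summary line

keep_size = int(0.75 * max_db_size)        # size the expiry run aims to keep
goal_reduction = token_count - keep_size   # tokens it would need to drop
shortfall = goal_reduction - deleted       # how far the run fell short

print(keep_size)       # 112500
print(goal_reduction)  # 4820586
print(shortfall)       # 3590232
```

The first two values match the "expiry check keep size" and "final goal
reduction size" lines in the log.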

As I read the algorithm documentation and the code (as best I remember
without looking at it right now), it goes something like this:  For the first
pass, it calculates the number of tokens that would be expired for
bayes_expiry_period*1, bayes_expiry_period*2, *4, ... exponentially up to the
max exponent of 9.  It then picks the one that would expire closest to
bayes_expiry_max_db_size, without dropping below 75% of
bayes_expiry_max_db_size.  The smallest multiple is bayes_expiry_period*1, so
it expires entries older than bayes_expiry_period*1 because that is the
closest it can get to .75*150,000, even though that leaves 3.5 million.  For
subsequent passes, if the estimated cutoff is less than bayes_expiry_period,
it uses the above algorithm again.
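
The first pass as described above can be sketched in Python. This is a
simplified model of that description, not the actual BayesStore.pm logic. The
function name and the selection rule (take the candidate that expires the most
tokens without exceeding the goal) are assumptions, and the reduction table is
copied from the debug log at the bottom of this message.

```python
def first_pass_delta(period, reductions, goal_reduction):
    """Pick an atime delta from a table of {delta_seconds: tokens_expired}.

    Simplified model (an assumption, not SpamAssassin's code): choose the
    candidate that expires the most tokens without exceeding goal_reduction.
    Candidates below `period` are never considered.
    """
    best = None
    for delta in sorted(reductions):
        if delta < period:
            continue  # never expire anything younger than the period
        count = reductions[delta]
        if count > goal_reduction:
            continue  # would cut below the 75% keep size
        if best is None or count > reductions[best]:
            best = delta
    return best


period = 21600    # bayes_expiry_period in seconds (6 hours)
max_exponent = 9  # "expiry max exponent" from the log

# Candidate deltas: period*1, period*2, period*4, ... period*2**9
candidates = [period * 2**i for i in range(max_exponent + 1)]

# Token reduction per candidate, copied from the debug log
reductions = dict(zip(candidates, [1230354] + [0] * max_exponent))

chosen = first_pass_delta(period, reductions, goal_reduction=4820586)
print(chosen, reductions[chosen])  # 21600 1230354
```

Because the smallest candidate is the period itself, a model like this can
never pick a cutoff younger than bayes_expiry_period, however large the
database is.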

As further evidence of this, with a static database (spamd not running), and
bayes_expiry_period set to the default of 12 hours, I ran a force expire.
It expired only tokens older than 12 hours, leaving about 7 million.  I ran
it again.  No tokens were removed.  I then dropped bayes_expiry_period  down
to 6 hours and reran the expire.  It then expired another 3.5 million
tokens.  It never dropped to anything approaching 150,000.

All results were verified by comparing the sa-learn debug output with
"sa-learn --dump magic", and all runs were done using the DBM module.

One notable thing in the debug output is "bayes: can't use estimation
method for expiry, unexpected result, calculating optimal atime delta (first
pass)", which comes from BayesStore.pm around line 300
("$self->{expiry_period}" and $start are bayes_expiry_period):


  if ( (time() - $vars[4] > 86400*30) || ($vars[8] < $self->{expiry_period})
       || ($vars[9] < 1000)
       || ($newdelta < $self->{expiry_period}) || ($ratio > 1.5) ) {
    dbg("bayes: can't use estimation method for expiry, unexpected result, calculating optimal atime delta (first pass)");

[snip]

    $newdelta = $start * $max_expire_mult;   <<<<<<<<<<<<<<<<<<<<<<<<<<<
    dbg("bayes: first pass decided on $newdelta for atime delta");
  }
  else { # use the estimation method
    dbg("bayes: can do estimation method for expiry, skipping first pass");
  }


This code is triggered because "newdelta: 10981", which is less than
bayes_expiry_period (BayesStore.pm line 281 calculates newdelta):

  # Estimate new atime delta based on the last atime delta
  my $newdelta = 0;
  if ( $vars[9] > 0 ) {
    # newdelta = olddelta * old / goal;
    # this may seem backwards, but since we're talking delta here,
    # not actual atime, we want smaller atimes to expire more tokens,
    # and visa versa.
    #
    $newdelta = int($vars[8] * $vars[9] / $goal_reduction);
  }

Again, the code appears to say "if we're expiring anything younger than
bayes_expiry_period, then recalculate so nothing younger than
bayes_expiry_period is expired".
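
As a cross-check, the estimate and the five conditions can be replayed in
Python against the "first pass?" line in the log below. This is a sketch and
not the Perl source. The mapping of log fields to variables (Last to $vars[4],
atime to $vars[8], count to $vars[9]) is an assumption, and so is the ratio
definition, which was chosen because it reproduces the 4.59 shown in the log.

```python
def check_estimation(now, last_expire, old_delta, old_count,
                     goal_reduction, period):
    """Return (new_delta, ratio, reasons the estimation method is rejected)."""
    # newdelta = olddelta * old / goal, truncated to an integer
    new_delta = (old_delta * old_count) // goal_reduction if old_count > 0 else 0

    # Assumed definition: larger of the two counts divided by the smaller
    if old_count > 0:
        ratio = max(old_count, goal_reduction) / min(old_count, goal_reduction)
    else:
        ratio = float("inf")

    reasons = []
    if now - last_expire > 86400 * 30:
        reasons.append("last expire more than 30 days ago")
    if old_delta < period:
        reasons.append("old atime delta < period")
    if old_count < 1000:
        reasons.append("old count < 1000")
    if new_delta < period:
        reasons.append("newdelta < period")
    if ratio > 1.5:
        reasons.append("ratio > 1.5")
    return new_delta, ratio, reasons


# Values copied from the "first pass?" debug line
new_delta, ratio, reasons = check_estimation(
    now=1196349002, last_expire=1196338557,
    old_delta=21600, old_count=1049110,
    goal_reduction=4820586, period=21600,
)
print(new_delta)        # 4700
print(round(ratio, 2))  # 4.59
print(reasons)          # ['newdelta < period', 'ratio > 1.5']
```

With these inputs the estimate comes out at 4700 seconds, well under the
21600-second period, so the estimation method is rejected and the first-pass
table is used instead.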


Here's the log:

[21506] dbg: bayes: expiry starting
[21506] dbg: bayes: expiry check keep size, 0.75 * max: 112500
[21506] dbg: bayes: token count: 4933086, final goal reduction size: 4820586
[21506] dbg: bayes: first pass? current: 1196349002, Last: 1196338557, atime: 21600, count: 1049110, newdelta: 4700, ratio: 4.59492903508688, period: 21600
[21506] dbg: bayes: can't use estimation method for expiry, unexpected result, calculating optimal atime delta (first pass)
[21506] dbg: bayes: expiry max exponent: 9
[21506] dbg: bayes: atime     token reduction
[21506] dbg: bayes: ========  ===============
[21506] dbg: bayes: 21600     1230354
[21506] dbg: bayes: 43200     0
[21506] dbg: bayes: 86400     0
[21506] dbg: bayes: 172800    0
[21506] dbg: bayes: 345600    0
[21506] dbg: bayes: 691200    0
[21506] dbg: bayes: 1382400   0
[21506] dbg: bayes: 2764800   0
[21506] dbg: bayes: 5529600   0
[21506] dbg: bayes: 11059200  0
[21506] dbg: bayes: first pass decided on 21600 for atime delta
[21506] dbg: bayes: untie-ing
[21506] dbg: bayes: files locked, now unlocking lock
[21506] dbg: locker: safe_unlock: unlocked /home/smfs/.spamassassin/bayes.mutex
[21506] dbg: bayes: expiry completed
bayes: synced databases from journal in 0 seconds: 927 unique entries (927 total entries)
expired old bayes database entries in 432 seconds
3702653 entries kept, 1230354 deleted
token frequency: 1-occurrence tokens: 83.22%
token frequency: less than 8 occurrences: 12.56%


Wes
