very basic SA-Learn performance question: is 90 seconds or so per token really, really slow or roughly normal?

David Gessel Mon, 30 Oct 2017 15:38:23 -0700

FreeBSD 10.3-RELEASE FreeBSD 10.3-RELEASE #0 r322073: Sat Aug  5 01:44:09 PDT 2017
spamassassin-3.4.1_10
amavisd-new-2.11.0_2,1


I'm finding that the command

/usr/local/bin/sa-learn --spam --showdots /mail/blackrosetech.com/gessel/.Junk/{cur,new}

is taking a while to complete. By "a while" I mean it has been running for 
3 days. The folder has a few months of spam in it, 4760 "conversations" 
according to Thunderbird, which is roughly the message count since spam 
doesn't tend to thread deeply.

I was trying to track progress and...
# sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0       1646          0  non-token data: nspam
0.000          0          0          0  non-token data: nham
0.000          0     114841          0  non-token data: ntokens
0.000          0 1438503364          0  non-token data: oldest atime
0.000          0 1508955277          0  non-token data: newest atime
0.000          0 1508964658          0  non-token data: last journal sync atime
0.000          0          0          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count


.... about an hour later....

# sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0       1690          0  non-token data: nspam
0.000          0          0          0  non-token data: nham
0.000          0     114841          0  non-token data: ntokens
0.000          0 1438503364          0  non-token data: oldest atime
0.000          0 1508955277          0  non-token data: newest atime
0.000          0 1508964658          0  non-token data: last journal sync atime
0.000          0          0          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count
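
As a rough sanity check on the two dumps above, here is a small Python sketch that reads the nspam counter from each snapshot and turns the difference into a learning rate. The one-hour interval is only my approximate gap between the two runs, and the helper names (parse_magic, seconds_per_message) are made up for illustration. Note that nspam counts messages, so the rate comes out per message rather than per token.

```python
import re

def parse_magic(dump_text):
    """Return {name: value} for the 'non-token data' lines of `sa-learn --dump magic`."""
    values = {}
    for line in dump_text.splitlines():
        # Columns: probability, spam count, value, atime, then the label.
        m = re.match(r"\s*\S+\s+\S+\s+(\d+)\s+\S+\s+non-token data:\s*(.+?)\s*$", line)
        if m:
            values[m.group(2)] = int(m.group(1))
    return values

def seconds_per_message(before, after, elapsed_seconds):
    """Average seconds per learned spam message between two snapshots."""
    learned = after["nspam"] - before["nspam"]
    if learned <= 0:
        raise ValueError("nspam did not increase between snapshots")
    return elapsed_seconds / learned

# The nspam lines from the two dumps above; 3600 s is an assumed interval.
first = parse_magic("0.000          0       1646          0  non-token data: nspam")
second = parse_magic("0.000          0       1690          0  non-token data: nspam")
rate = seconds_per_message(first, second, 3600)

print(f"{rate:.0f} s per message")                          # 82 s per message
print(f"{4760 * rate / 86400:.1f} days for 4760 messages")  # 4.5 days for 4760 messages
```

At that rate, 44 messages in roughly an hour, the whole folder would take about four and a half days.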

but then 24 hours later...

# sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0          0          0  non-token data: nspam
0.000          0          0          0  non-token data: nham
0.000          0     133661          0  non-token data: ntokens
0.000          0 1438503364          0  non-token data: oldest atime
0.000          0 1508955277          0  non-token data: newest atime
0.000          0 1508964658          0  non-token data: last journal sync atime
0.000          0          0          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count

Two issues:

1) sa-learn seems really, really slow. Slow enough that spam sometimes comes 
in faster than it is learned. This seems far slower than the benchmark results 
suggest is within the range of normal. I'm sure I'm doing something really 
wrong, but I'm not sure what.

2) What happened to my hard-won spam tokens? nspam went from 1690 back to 0, 
even though ntokens rose from 114841 to 133661.


I know --no-sync should speed up the process, and if the task ever completes 
(or can be killed) I'll test that for speed on a smaller collection. Would 
specifying the mailbox format also help?
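
For concreteness, this is roughly the --no-sync test I have in mind, sketched in Python: learn each folder with --no-sync, then run a single --sync at the end. The build_commands and run_all helpers are hypothetical names of my own, and the folder paths are simply the two Maildir directories from the command above. I have not measured whether this is actually faster.

```python
import subprocess

SA_LEARN = "/usr/local/bin/sa-learn"

def build_commands(folders, sa_learn=SA_LEARN):
    """One --no-sync learning run per folder, then one journal sync at the end."""
    commands = [[sa_learn, "--no-sync", "--spam", folder] for folder in folders]
    commands.append([sa_learn, "--sync"])
    return commands

def run_all(commands):
    """Run the commands in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)

commands = build_commands([
    "/mail/blackrosetech.com/gessel/.Junk/cur",
    "/mail/blackrosetech.com/gessel/.Junk/new",
])

# Print the plan only; call run_all(commands) to actually execute it.
for cmd in commands:
    print(" ".join(cmd))
```

Printing the commands first makes it easy to check them by eye before anything touches the Bayes database.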
