Thank you very much for your help! A few answers inline.
-------- Original Message --------
Subject: Re: very basic SA-Learn performance question: is 90 seconds or so per
token really, really slow or roughly normal?
From: Matus UHLAR - fantomas <uh...@fantomas.sk>
To: users@spamassassin.apache.org
Date: Tue Oct 31 2017 11:27:47 GMT+0300 (AST)
> On 31.10.17 01:35, David Gessel wrote:
>> amavisd-new-2.11.0_2,1
>> I'm finding the command /usr/local/bin/sa-learn --spam --showdots
>> /mail/blackrosetech.com/gessel/.Junk/{cur,new} is taking a while to
>
> if you use amavis, you must train amavis' bayes database
> (/var/lib/amavis/.spamassassin/ here), not your own.
Huh. I was getting Bayes filter results, and I =think= I'm training a global
Bayes database per
https://wiki.apache.org/spamassassin/SiteWideBayesSetup
>
>> complete... by a while I mean it has been running for 3 days. The folder
>> has a few months of spam in it, 4760 "conversations" according to
>> Thunderbird, which is roughly the message count since spam doesn't tend to
>> thread deeply.
>
> It's not needed to train on all spam you have. after initial training of
> let's say 200-500 pieces of (different types of) logged spam, it may be
> enough to train only spam that does not hit BAYES_99
>
> It's much more important to train on ham, since SA must know the DIFFERENCES
> between ham
> and spam - otherwise all mail will of course look like spam.
> (Also, SA won't hit if you don't have enough of ham trained).
>
> It's much worse to have FP than FN - train everything that does not hit
> BAYES_00
>
> I have trained my DB years ago and I rarely need new training now.
Yes, I do understand that. The cron jobs I set up quite some time ago are:
# learn ham and spam
17 3 * * 0 root /usr/local/bin/sa-learn --ham --no-sync /mail/blackrosetech.com/gessel/.archives.2017/{cur,new}
22 3 * * 0 root /usr/local/bin/sa-learn --ham --no-sync /mail/blackrosetech.com/gessel/.Sent/{cur,new}
27 3 * * 0 root /usr/local/bin/sa-learn --spam --no-sync /mail/blackrosetech.com/gessel/.ManJunk/{cur,new}
22 3 * * 0 root /usr/local/bin/sa-learn --ham --no-sync /mail/blackrosetech.com/carolyn/.Archives.2017/{cur,new}
32 3 * * 0 root /usr/local/bin/sa-learn --spam --no-sync /mail/blackrosetech.com/carolyn/.ManJunk/{cur,new}
37 3 * * 0 root /usr/local/bin/sa-learn --ham --no-sync /mail/blackrosetech.com/carolyn/.Sent/{cur,new}
55 3 * * 0 root /usr/local/bin/sa-learn --sync
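For comparison, here is a minimal sketch of the same weekly schedule as a single script. It is not what is currently installed. The names SA_LEARN, MAILROOT, learn_folder and learn_all are made up for illustration. It spells out cur and new explicitly, on the assumption that cron runs commands through /bin/sh, and some /bin/sh implementations (dash, for example) do not expand {cur,new}.

```shell
#!/bin/sh
# Sketch only: one weekly job in place of the seven crontab lines.
# SA_LEARN and MAILROOT are hypothetical overrides so the script can be dry-run.
SA_LEARN="${SA_LEARN:-/usr/local/bin/sa-learn}"
MAILROOT="${MAILROOT:-/mail/blackrosetech.com}"

# learn_folder CLASS FOLDER
# CLASS is "ham" or "spam"; FOLDER is a maildir path relative to MAILROOT.
learn_folder() {
    class="$1"
    folder="$2"
    # Name cur and new explicitly instead of relying on brace expansion.
    "$SA_LEARN" "--$class" --no-sync "$MAILROOT/$folder/cur" "$MAILROOT/$folder/new"
}

# learn_all: learn every folder, then sync the journal once at the end.
learn_all() {
    learn_folder ham  gessel/.archives.2017
    learn_folder ham  gessel/.Sent
    learn_folder spam gessel/.ManJunk
    learn_folder ham  carolyn/.Archives.2017
    learn_folder spam carolyn/.ManJunk
    learn_folder ham  carolyn/.Sent
    "$SA_LEARN" --sync
}

# Only do real work when asked to, so the file can be sourced for testing.
if [ "${1:-}" = "--run" ]; then
    learn_all
fi
```

Running the folders in sequence from one job also avoids two sa-learn processes starting at the same minute. Setting SA_LEARN=echo and calling learn_all prints the commands without touching the database.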
I disabled auto-learn because non-spam would occasionally end up classified as
spam and I didn't want to train on that. The theory here was to wipe the
database, then groom the huge automatic spam folder to clear any non-spam
(manually moving it to the archives directory for later analysis as non-spam),
then run sa-learn on that large set of obvious spam, and on an ongoing basis
keep it trained with the spam that slips through (which I move to ManJunk).
The incentive to restart was registering a few domain names, which triggered a
deluge of "let me design your new logo" emails from rafts of Hotmail and Google
accounts. I thought retraining the Bayes database would improve detection of
these linguistically distinctive spam messages.
>
>> I was trying to track progress and...
>> # sa-learn --dump magic
>> 0.000 0 3 0 non-token data: bayes db version
>> 0.000 0 1646 0 non-token data: nspam
>> 0.000 0 0 0 non-token data: nham
>
>> but then 24 hours later...
>>
>> # sa-learn --dump magic
>> 0.000 0 3 0 non-token data: bayes db version
>> 0.000 0 0 0 non-token data: nspam
>> 0.000 0 0 0 non-token data: nham
>
> are you sure someone did not back up your spam DB
Not that I know of, aside from the cron jobs above; if those could have done
it, then yes.
>
>> Two issues:
>>
>> 1) sa-learn seems really, really slow. Slow enough that spam sometimes
>> comes in faster. This seems far slower than the benchmark results suggest
>> is within the range of normal. I'm sure I'm doing something really wrong,
>> but not sure what.
>>
>> 2) what happened to my hard won spam tokens?
>>
>>
>> I know --no-sync should speed up the process and if the task ever completes
>> (or can be killed) I'll test that for speed on a smaller collection.
>
> --no-sync only helps if you have "bayes_learn_to_journal 1" - it's 0 by
> default. try turning it on.
OK, will do this. It is not in my local.cf.
The Bayes section of my config reads:
# Use Bayesian classifier (default: 1)
#
use_bayes 1
# Bayesian classifier auto-learning (default: 1)
#
bayes_auto_learn 0
# Set headers which may provide inappropriate cues to the Bayesian
# classifier
#
# bayes_ignore_header X-Bogosity
bayes_ignore_header X-Spam-Flag
bayes_ignore_header X-Spam-Status
# Set the default directory for the bayes classifier
bayes_path /var/amavis/.spamassassin/bayes
bayes_file_mode 0777
Just added this:
# If this option is set, whenever SpamAssassin does Bayes learning, it will
# put the information into the journal instead of directly into the database.
# This lowers contention for locking the database to execute an update, but
# will also cause more access to the journal and cause a delay before the
# updates are actually committed to the Bayes database.
#
bayes_learn_to_journal 1
>
>> Would something like specifying the mailbox format also help?
>
> only if you use mbox format.
No, maildir. Not really relevant (I don't think) but:
dovecot2-2.2.31_1
dovecot-pigeonhole-0.4.19
postfix-3.2.2,1
Now that "bayes_learn_to_journal 1" is set, I've stopped the process, and....
# sa-learn --dump magic
0.000 0 3 0 non-token data: bayes db version
0.000 0 2326 0 non-token data: nspam
0.000 0 1 0 non-token data: nham
0.000 0 154919 0 non-token data: ntokens
0.000 0 1438503364 0 non-token data: oldest atime
0.000 0 1508964396 0 non-token data: newest atime
0.000 0 1508964658 0 non-token data: last journal sync atime
0.000 0 0 0 non-token data: last expiry atime
0.000 0 0 0 non-token data: last expire atime delta
0.000 0 0 0 non-token data: last expire reduction count
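To compare dumps like the one above without reading them by eye, a small helper can pull out one counter. This is a rough sketch, assuming the layout shown here (value in the third column, label after "non-token data: "). The function name magic_value is made up.

```shell
#!/bin/sh
# magic_value LABEL
# Read `sa-learn --dump magic` output on stdin and print the value column
# (third field) of the row whose label equals LABEL, e.g. nspam or nham.
# Prints nothing if no row has that label.
magic_value() {
    awk -v label="$1" '
        {
            marker = "non-token data: "
            idx = index($0, marker)
            if (idx == 0) next                     # not a magic row
            name = substr($0, idx + length(marker))
            sub(/[ \t\r]+$/, "", name)             # drop trailing whitespace
            if (name == label) { print $3; exit }
        }'
}
```

Usage would be along the lines of `sa-learn --dump magic | magic_value nspam`, logged with a timestamp so a sudden drop back to 0 can be pinned to a time.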
Restarting with:
# sa-learn --spam --showdots --no-sync /mail/blackrosetech.com/gessel/.Junk/{cur,new}
And will let it run for a bit to see what the rate looks like.
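One way to put a number on the rate is to time a run over a small folder and divide by the message count. This is a sketch under assumptions: SA_LEARN, seconds_per_message and time_learn are made-up names, and messages sa-learn has already seen may be skipped quickly, so a folder of unlearned mail gives the more honest figure.

```shell
#!/bin/sh
# Sketch only: SA_LEARN is a hypothetical override for dry runs.
SA_LEARN="${SA_LEARN:-/usr/local/bin/sa-learn}"

# seconds_per_message ELAPSED_SECONDS MESSAGE_COUNT
# Print the integer average; fail without output if the count is zero.
seconds_per_message() {
    [ "$2" -gt 0 ] || return 1
    echo $(( $1 / $2 ))
}

# time_learn DIR
# Learn DIR as spam and print "<elapsed seconds> <file count>".
time_learn() {
    dir="$1"
    count=$(find "$dir" -type f | wc -l | tr -d ' ')
    start=$(date +%s)
    "$SA_LEARN" --spam --no-sync "$dir" >/dev/null
    end=$(date +%s)
    echo "$(( end - start )) $count"
}
```

For scale, 3 days is 259200 seconds, so `seconds_per_message 259200 4760` gives the average the original run would have needed just to finish in that time.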