On 2024-09-12 at 14:05:11 UTC-0400 (Thu, 12 Sep 2024 18:05:11 +0000)
Grega via users <gr...@nabiralnik.eu>
is rumored to have said:

Hi.

I have SA 4.0.1 configured, and all is good except for Bayes. It IS working, it IS learning, but when it classifies mail it is not nearly as decisive as it was in v3.
I have:

dbg: bayes: corpus size: nspam = 1190, nham = 12441
dbg: bayes: DB expiry: tokens in DB: 979401, Expiry max size: 1500000, Oldest atime: 1725361640, Newest atime: 1725888528, Last expire: 0, Current time: 1725888537

So I have enough spam/ham and really enough tokens...
What I find weird is this:
BAYES_50 and BAYES_40 have about 10,000 hits EACH, which is A LOT

BAYES_80 only 600
BAYES_95 even less: 341
BAYES_99: 284
BAYES_20 only 150
BAYES_60 only 87
I have no BAYES lower than 40 at all.

What's that BAYES_20 line then?

I am training and also use autolearn.
I have also transferred a corpus trained on SA v3, where it worked correctly. Is SpamAssassin v4 really so much more conservative, or am I doing something wrong here?

There were substantial changes in the Bayes module between v3 and v4. Training the exact same corpus in the exact same order into v3.4x and 4.0x will yield different scores, due to *bug fixes* and *improvements* in parsing headers. In principle this should make scoring more consistent and accurate, which may mean fewer extreme scores. In theory, better parsing should result in some common tokens being split differently, yielding more diversity in their metrics. We also updated 'stopword' lists for various languages, removing tokens that are so common that they cannot help scoring in principle.

So, no, you are not doing anything wrong. We may need to re-examine the default scores for the BAYES_* rules to adapt but that has no concrete plan behind it.

With that said, I looked at recent logs on one system running the SA development trunk (which has no added Bayes changes relative to 4.0.1) and got this distribution:

16444 BAYES_00
  20 BAYES_05
  22 BAYES_20
  13 BAYES_40
  64 BAYES_50
   2 BAYES_60
   6 BAYES_80
   2 BAYES_95
 139 BAYES_99
 138 BAYES_999

That is a machine that excludes most blatant spam at the SMTP layer, without handing it to SA.
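
A tally like the one above can be produced in many ways. The following is only a minimal Python sketch, not the method used for the numbers in this message. It assumes each input line is the result line for one scanned message and contains that message's comma-separated rule names; the sample lines are made up for illustration, and the function name `tally_bayes_hits` is hypothetical.

```python
import re
from collections import Counter

# Matches whole rule names such as BAYES_00, BAYES_50, BAYES_99, BAYES_999.
# The word boundaries keep BAYES_999 from also being counted as BAYES_99.
BAYES_RULE = re.compile(r"\bBAYES_\d+\b")


def tally_bayes_hits(lines):
    """Count BAYES_* rule hits, assuming one scanned message per line.

    Returns (counts, no_bayes): a Counter of rule name -> messages hit,
    and the number of lines that carried no BAYES_* rule at all.
    """
    counts = Counter()
    no_bayes = 0
    for line in lines:
        hits = set(BAYES_RULE.findall(line))  # count each rule once per message
        if hits:
            counts.update(hits)
        else:
            no_bayes += 1
    return counts, no_bayes


# Made-up sample lines; real log formats vary by setup (assumption).
sample = [
    "result: Y 12 - BAYES_99,BAYES_999,HTML_MESSAGE",
    "result: . 0 - BAYES_00,SPF_PASS",
    "result: . 1 - HTML_MESSAGE",
]
counts, no_bayes = tally_bayes_hits(sample)
for rule, n in sorted(counts.items()):
    print(f"{n:7d} {rule}")
print(f"{no_bayes:7d} (no BAYES_* rule)")
```

The second return value counts messages with no BAYES_* rule at all, which bears on the follow-up question below. Note that a message hitting BAYES_999 normally hits BAYES_99 as well, so those two counts overlap.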


Also;
One more thing...
Some mails don't even have a BAYES rule in the score list; confirmed on 2 installs.

How many?

While you are initially training the Bayes DB and lack adequate ham and spam counts, you get no BAYES hits. Also, if you have any rules set to "shortcircuit" they can cause SA to stop checking before Bayes is done.
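
To rule out the first cause, the corpus counts in the debug output can be compared against the configured minimums. The sketch below is only an illustration: it parses a `corpus size` debug line of the form quoted earlier and checks it against thresholds that are assumed to match the `bayes_min_spam_num` and `bayes_min_ham_num` settings (200 each by default, as far as I recall; check your own configuration). The function name `bayes_ready` is hypothetical.

```python
import re

# Pattern for the debug line format quoted earlier in this thread.
CORPUS = re.compile(r"corpus size: nspam = (\d+), nham = (\d+)")


def bayes_ready(debug_line, min_spam=200, min_ham=200):
    """Return True if both counts meet the minimums, False if not,
    or None when the line is not a 'corpus size' debug line.

    The default thresholds of 200 are an assumption about the
    bayes_min_spam_num / bayes_min_ham_num defaults.
    """
    m = CORPUS.search(debug_line)
    if m is None:
        return None
    nspam, nham = int(m.group(1)), int(m.group(2))
    return nspam >= min_spam and nham >= min_ham


line = "dbg: bayes: corpus size: nspam = 1190, nham = 12441"
print(bayes_ready(line))
```

With the counts reported in the original question, both thresholds are comfortably exceeded, so an undertrained database is unlikely to explain the missing BAYES hits in that case.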

I *think* I've also seen Bayes skip on excess load, with too much lock contention on a file-based mechanism like Berkeley DB.


--
Bill Cole
b...@scconsult.com or billc...@apache.org
(AKA @grumpybozo@toad.social and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire
