Re: Bayes in V4 compared to V3

Grega via users Thu, 12 Sep 2024 21:53:30 -0700

Hi.


I just filtered on the last week and I have:

BAYES_20

BAYES_40

BAYES_50

BAYES_80


So no BAYES_00, _05, _90, _95, etc.


All the extreme values, which are the only ones useful for real scoring and 
marking, are missing.

Today I'm going to train Bayes manually with around 4000 spam and 4000 ham 
messages and see what happens.


And I'm reconfiguring autolearn to -4 for ham and 12 for spam, so that it 
really only auto-trains on correctly classified mail...
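
As a rough illustration of what those two thresholds do, here is a simplified 
Python sketch. It assumes the settings involved are 
bayes_auto_learn_threshold_nonspam and bayes_auto_learn_threshold_spam, and that 
a message is auto-learned as ham when its score is below the first and as spam 
when its score is above the second. The function name and arguments are made up 
for the example, and the real autolearn logic applies further conditions that 
are not modeled here.

```python
from typing import Optional


def autolearn_decision(score: float,
                       ham_below: float = -4.0,
                       spam_above: float = 12.0) -> Optional[str]:
    """Return "ham", "spam", or None for a message score.

    Simplified assumption: only the two thresholds are considered.
    The defaults mirror the -4 / 12 values mentioned in this message.
    """
    if score < ham_below:
        return "ham"    # confident enough to train as ham
    if score > spam_above:
        return "spam"   # confident enough to train as spam
    return None         # in between: do not auto-learn


if __name__ == "__main__":
    for s in (-5.2, 0.3, 7.9, 14.0):
        print(s, autolearn_decision(s))
```

With thresholds this far apart, most mail falls in the middle band and is not 
auto-learned at all, which is the point: only clear-cut messages get trained.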


You said: "There were substantial changes in the Bayes module between v3 and 
v4."

This is all I needed really :)


So I will manually adjust the BAYES scores, and this should help me achieve the 
desired results.


About BAYES missing...

I have NO load; the server is almost idle.

The Bayes DB is in MariaDB, so performance should not be a problem.

Shortcircuit is not enabled.


Regards,

Grega


________________________________
From: Bill Cole <sausers-20150...@billmail.scconsult.com>
Sent: Thursday, 12 September 2024 21:38
To: Grega via users
Subject: Re: Bayes in V4 compared to V3


On 2024-09-12 at 14:05:11 UTC-0400 (Thu, 12 Sep 2024 18:05:11 +0000)
Grega via users <gr...@nabiralnik.eu>
is rumored to have said:

Hi.

I have SA 4.0.1 configured and all is good, except for Bayes. It IS working, it 
IS learning, but when it classifies mail it is not nearly as decisive as it was 
in V3.
I have:

dbg: bayes: corpus size: nspam = 1190, nham = 12441
dbg: bayes: DB expiry: tokens in DB: 979401, Expiry max size: 1500000, Oldest atime: 1725361640, Newest atime: 1725888528, Last expire: 0, Current time: 1725888537

So I have enough spam/ham and really enough tokens...
What I find weird is this:
BAYES_50 and BAYES_40 have about 10,000 hits EACH, which is A LOT

BAYES_80 only 600
BAYES_95 even less: 341
BAYES_99: 284
BAYES_20 only 150
BAYES_60 only 87
I have no BAYES lower than 40 at all.

What's that BAYES_20 line then?

I am training and also use autolearn.
I have also transferred corpus trained on SA v3 where it worked correctly.
Is SpamAssassin v4 really so much more conservative, or am I doing something 
wrong here?

There were substantial changes in the Bayes module between v3 and v4. Training 
the exact same corpus in the exact same order into v3.4x and 4.0x will yield 
different scores, due to *bug fixes* and *improvements* in parsing headers. In 
principle this should make scoring more consistent and accurate, which may mean 
fewer extreme scores. In theory, better parsing should result in some common 
tokens being split differently, yielding more diversity in their metrics. We 
also updated 'stopword' lists for various languages, removing tokens that are 
so common that they cannot help scoring in principle.

So, no, you are not doing anything wrong. We may need to re-examine the default 
scores for the BAYES_* rules to adapt but that has no concrete plan behind it.
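
For reference when reading the counts, each BAYES_* rule corresponds to a band 
of the Bayes spam probability. The sketch below maps a probability to the 
rule(s) expected to fire. The band edges are an assumption taken from the stock 
rule definitions as I recall them (0.01, 0.05, 0.20, 0.40, 0.60, 0.80, 0.95, 
0.99, with BAYES_999 overlapping the top of BAYES_99), and the handling of 
exact boundary values is a guess. Check the rule files on your own install 
before relying on it. The function name is made up for the example.

```python
# (upper bound, rule name) pairs; the values are assumed, not read from an install.
BUCKETS = [
    (0.01, "BAYES_00"),
    (0.05, "BAYES_05"),
    (0.20, "BAYES_20"),
    (0.40, "BAYES_40"),
    (0.60, "BAYES_50"),
    (0.80, "BAYES_60"),
    (0.95, "BAYES_80"),
    (0.99, "BAYES_95"),
]


def bayes_rules(p: float) -> list:
    """Return the BAYES_* rule names expected for spam probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    for upper, name in BUCKETS:
        if p <= upper:
            return [name]
    rules = ["BAYES_99"]            # everything above 0.99
    if p > 0.999:
        rules.append("BAYES_999")   # fires in addition to BAYES_99
    return rules


if __name__ == "__main__":
    for p in (0.003, 0.30, 0.50, 0.97, 0.9995):
        print(p, bayes_rules(p))
```

Seen this way, a pile-up in BAYES_40 and BAYES_50 means most messages are 
getting probabilities between roughly 0.2 and 0.6, that is, the classifier is 
undecided about them.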

With that said, I looked at recent logs on one system running the SA 
development trunk (which has no added Bayes changes relative to 4.0.1) and got 
this distribution:

16444 BAYES_00
20 BAYES_05
22 BAYES_20
13 BAYES_40
64 BAYES_50
2 BAYES_60
6 BAYES_80
2 BAYES_95
139 BAYES_99
138 BAYES_999

That is a machine that excludes most blatant spam at the SMTP layer, without 
handing it to SA.
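
A distribution like the one above can be produced with a few lines of Python. 
This is only a sketch: it assumes each input line describes one scanned message 
and contains the names of the rules that hit (for example a spamd "result:" log 
line), which may not match your logging setup. The function name and the sample 
lines are invented for the example. It also counts lines with no BAYES_* rule 
at all, which is useful for seeing how often Bayes is skipped.

```python
import re
from collections import Counter

# Matches rule names such as BAYES_00, BAYES_50, BAYES_999.
BAYES_RULE = re.compile(r"\bBAYES_\d+\b")


def tally_bayes_hits(lines):
    """Count BAYES_* rules per line; also count lines with no BAYES rule."""
    hits = Counter()
    missing = 0
    for line in lines:
        found = set(BAYES_RULE.findall(line))  # one count per rule per message
        if found:
            hits.update(found)
        else:
            missing += 1
    return hits, missing


if __name__ == "__main__":
    # Invented sample lines, shaped like the assumed log format.
    sample = [
        "spamd: result: Y 15 - BAYES_99,BAYES_999,HTML_MESSAGE scantime=1.2",
        "spamd: result: . 0 - BAYES_00,SPF_PASS scantime=0.4",
        "spamd: result: . 1 - SPF_PASS scantime=0.3",
    ]
    hits, missing = tally_bayes_hits(sample)
    for rule in sorted(hits):
        print(hits[rule], rule)
    print(missing, "messages without a BAYES_* rule")
```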


Also, one more thing...
Some mails don't even have BAYES in the score list; confirmed on 2 installs.

How many?

While you are initially training the Bayes DB and lack adequate ham and spam 
counts, you get no BAYES hits. Also, if you have any rules set to 
"shortcircuit" they can cause SA to stop checking before Bayes is done.

I *think* I've also seen Bayes skip on excess load, with too much lock 
contention on a file-based mechanism like Berkeley DB.


   b...@scconsult.com or billc...@apache.org
   (AKA @grumpybozo@toad.social and many *@billmail.scconsult.com addresses)
   Not Currently Available For Hire
