Hi.
I just filtered the last week and I have:

BAYES_20
BAYES_40
BAYES_50
BAYES_80

So no BAYES_00, _05, _90, _95 etc... All the extreme values, which are the only ones useful for real scoring and marking, are missing.

Today I'm going to train Bayes manually with around 4000 SPAM and 4000 HAM and will see what happens. And I'm reconfiguring autolearn to -4 for HAM and 12 for SPAM to really auto-train on correct mails...

You said: "There were substantial changes in the Bayes module between v3 and v4."
This is all I needed really :) So I will manually adjust the BAYES scores and this should help me achieve the desired results.

About BAYES missing... I have NO load, the server is almost idle. BAYES is in MariaDB, so performance should not be a problem. Shortcircuit is not enabled.

Regards,
Grega

________________________________
From: Bill Cole <sausers-20150...@billmail.scconsult.com>
Sent: Thursday, 12 September 2024 21:38
To: Grega via users
Subject: Re: Bayes in V4 compared to V3

On 2024-09-12 at 14:05:11 UTC-0400 (Thu, 12 Sep 2024 18:05:11 +0000)
Grega via users <gr...@nabiralnik.eu>
is rumored to have said:

> Hi.
> I have SA 4.0.1 configured, all is good, except for bayes. It IS working, it IS learning, but when it classifies mail it is really not as decisive as it was in V3.
>
> I have:
> dbg: bayes: corpus size: nspam = 1190, nham = 12441
> dbg: bayes: DB expiry: tokens in DB: 979401, Expiry max size: 1500000, Oldest atime: 1725361640, Newest atime: 1725888528, Last expire: 0, Current time: 1725888537
>
> So I have enough spam/ham and really enough tokens...
>
> What I find weird is this:
> BAYES_50 and BAYES_40 have like 10.000 hits EACH, which is a lot
> BAYES_80 only 600
> BAYES_95 even less: 341
> BAYES_99: 284
> BAYES_20 only 150
> BAYES_60 only 87
>
> I have no BAYES lower than 40 at all.

What's that BAYES_20 line then?

> I am training and also use autolearn. I have also transferred the corpus trained on SA v3, where it worked correctly.
> Is Spamassassin v4 really so much more conservative or am I doing something wrong here?

There were substantial changes in the Bayes module between v3 and v4. Training the exact same corpus in the exact same order into v3.4x and 4.0x will yield different scores, due to *bug fixes* and *improvements* in parsing headers. In principle this should make scoring more consistent and accurate, which may mean fewer extreme scores. In theory, better parsing should result in some common tokens being split differently, yielding more diversity in their metrics. We also updated 'stopword' lists for various languages, removing tokens that are so common that they cannot help scoring in principle.

So, no, you are not doing anything wrong. We may need to re-examine the default scores for the BAYES_* rules to adapt but that has no concrete plan behind it.

With that said, I looked at recent logs on one system running the SA development trunk (which has no added Bayes changes relative to 4.0.1) and got this distribution:

  16444 BAYES_00
     20 BAYES_05
     22 BAYES_20
     13 BAYES_40
     64 BAYES_50
      2 BAYES_60
      6 BAYES_80
      2 BAYES_95
    139 BAYES_99
    138 BAYES_999

That is a machine that excludes most blatant spam at the SMTP layer, without handing it to SA.

> Also, one more thing...
> Some mails don't even have BAYES added in the score list, confirmed on 2 installs

How many?

While you are initially training the Bayes DB and lack adequate ham and spam counts, you get no BAYES hits. Also, if you have any rules set to "shortcircuit" they can cause SA to stop checking before Bayes is done. I *think* I've also seen Bayes skip on excess load, with too much lock contention on a file-based mechanism like Berkeley DB.

b...@scconsult.com or billc...@apache.org
(AKA @grumpybozo@toad.social and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire
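
For anyone wanting to produce a BAYES_* distribution like the ones quoted in this thread, and to answer the "How many?" question about messages with no BAYES rule at all, a tally along these lines may help. This is only a rough sketch: it assumes one log line per scanned message with the hit rule names somewhere in that line, and the sample lines and the `tally_bayes` helper are made up for illustration, not taken from either poster's system.

```python
import re
from collections import Counter

# Matches Bayes rule names such as BAYES_00, BAYES_50, BAYES_999.
BAYES_RULE = re.compile(r"\bBAYES_(\d{2,3})\b")


def tally_bayes(lines):
    """Count BAYES_* hits per rule, plus lines that carry no BAYES_* rule."""
    hits = Counter()
    missing = 0
    for line in lines:
        # A set avoids double counting if a rule name repeats within one line.
        rules = {"BAYES_" + digits for digits in BAYES_RULE.findall(line)}
        if rules:
            hits.update(rules)
        else:
            missing += 1
    return hits, missing


# Hypothetical sample input (assumption: one line per scanned message).
sample = [
    "result: . -2 - BAYES_00,HTML_MESSAGE",
    "result: Y 12 - BAYES_99,BAYES_999,URIBL_BLACK",
    "result: . 1 - BAYES_50,HTML_MESSAGE",
    "result: . 0 - HTML_MESSAGE",
]

hits, missing = tally_bayes(sample)

# Print in ascending order of the numeric suffix, count first.
for rule, count in sorted(hits.items(), key=lambda kv: int(kv[0].split("_")[1])):
    print(f"{count:6d} {rule}")
print(f"{missing:6d} (no BAYES_* rule)")
```

To run it on a real log, replace `sample` with an open file handle. The regular expression is deliberately loose about the surrounding format, so it should be checked against a few actual lines before trusting the counts.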