he should not compare all the tokens, but a quick survey of the tokens derived from headers can tell him how the bayes result was formed.
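For what it's worth, here is a rough Python sketch of such a survey. It is only an illustration, not part of SpamAssassin: the function name header_tokens is made up, and it assumes the debug log contains lines of the form "bayes: token '<token>' => <probability>" and that header-derived tokens start with "H" and contain a colon. Check both assumptions against your own debug.log before relying on it.

```python
import re

# Assumed shape of a bayes debug line (verify against your own debug.log):
#   ... dbg: bayes: token 'H*r:sk:mail' => 0.012
TOKEN_RE = re.compile(r"bayes: token '(.*)' => ([0-9.]+(?:e-?\d+)?)")


def header_tokens(log_text, limit=20):
    """Return (token, probability) pairs for header-derived tokens,
    most decisive (furthest from 0.5) first."""
    found = []
    for line in log_text.splitlines():
        match = TOKEN_RE.search(line)
        if not match:
            continue
        token, prob = match.group(1), float(match.group(2))
        # Heuristic assumption: header tokens look like "H<header>:<value>".
        # A body word such as "Hello:" would be picked up too.
        if token.startswith("H") and ":" in token:
            found.append((token, prob))
    found.sort(key=lambda pair: abs(pair[1] - 0.5), reverse=True)
    return found[:limit]


# Made-up sample lines, for illustration only (not real output).
sample = """\
dbg: bayes: token 'H*r:sk:mail' => 0.012
dbg: bayes: token 'viagra' => 0.998
dbg: bayes: token 'HList-Unsubscribe:sk:mailto' => 0.934
dbg: bayes: token 'HX-Mailer:foo' => 0.51
"""

for token, prob in header_tokens(sample):
    print(f"{prob:.3f}  {token}")
```

To use it on a real message, read debug.log into a string and pass it to header_tokens. Tokens near 0.0 pull the message toward ham and tokens near 1.0 toward spam, so a handful of strongly hammy mailing-list header tokens is usually enough to explain an unexpected BAYES_00.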
A couple of weeks ago some phishing reached our inboxes. Our custom rule gave the message 5 points, but I was surprised that the message was categorized as BAYES_00, -1.9. I ran the bayes debug and found that clearly spammy words were not recognized as spammy. Then I discovered that one admin had enabled auto-learning by mistake and the database was full of garbage... I cleared the db, reloaded it with our hand-selected corpus, and the message was now BAYES_50.

On Wed, Feb 15, 2023 at 3:27 PM Matus UHLAR - fantomas <uh...@fantomas.sk> wrote:

> On 15.02.23 14:53, hg user wrote:
> >If you run spamassassin with -D bayes -t xxx 2>debug.log
> >
> >in debug.log you will see all the "tokens" the bayes system extracts
> >from the headers and you will probably find a lot of them related to
> >mailing lists.
> >
> >If you teach SA that those tokens are spam and they are present in
> >both WP and Forbes, their emails will be flagged. It's normal.
>
> Don't expect anyone to manually compare tokens, unless they are deeply
> debugging bayes functionality.
>
> Simply said, bayes DOES gather all possible tokens and compare their
> occurrence with interesting effectiveness - if you train Forbes and WP
> newsletters as ham, and other newsletters as spam, bayes should be able to
> distinguish them quite nicely.
>
> However, many of the tokens in even Forbes and WP newsletters may occur in
> different spammy newsletters, so be careful when training even these.
>
> If you get the score down enough not to be classified as spam, you've won
> and should not continue (unless you are willing to check all BAYES_0 mail
> for suspicious newsletters and train those as spam, seeing how much it
> affects the mentioned Forbes and WP newsletters).
>
> Bayes training is great, but one should be careful about that.
>
> >If you want you can use bayes_ignore_header to ignore some headers.
>
> this rarely helps.
> >
> >On 2/15/23, Matus UHLAR - fantomas <uh...@fantomas.sk> wrote:
> >>>>*-1.9 BAYES_00 BODY: Bayes spam probability is 0 to 1%
> >>>> >* [score: 0.0000]
> >>>>
> >>>> This indicates a mistrained database, which means you have trained
> >>>> too many spams or spam-like messages (commercial messages) as ham.
> >>>>
> >>>> Proper training of spams should help. Just keep your spam (and
> >>>> optionally ham) corpora for retraining in case you drop the
> >>>> database.
> >>>>
> >>>> I also recommend abstaining from training commercial mail (notices
> >>>> from e-shops, companies you've done business with, etc.) as ham,
> >>>> unless they generate a BAYES_999 score and you want it lower. I often
> >>>> train them as spam so those give an uncertain BAYES_50 result.
> >>
> >> On 14.02.23 23:05, Alex wrote:
> >>>Is there any ability to distinguish a legitimate newsletter from a spam
> >>>newsletter?
> >>
> >> Very hard.
> >>
> >> That's why I recommend not training newsletters unless you know
> >> you/users want them and they produce a BAYES_99 result.
> >>
> >>>In other words, if I train emails from Forbes or Washington Post as ham,
> >>>then train similar newsletter emails from other providers that are
> >>>more suspect, will bayes still be able to distinguish Forbes and WP as
> >>>ham?
> >>
> >>>The problem is that if I avoid training newsletters or bulk email
> >>>altogether, then I'm also left with spam newsletters still only hitting
> >>>bayes50.
> >>
> >> If you only do this for Forbes or Washington Post, bayes will likely be
> >> able to distinguish other newsletters, if you train those as spam.
> >>
> >>>I'm actually in a situation now where Forbes and WP newsletters are
> >>>being marked as spam, so considering retraining, but wondering what
> >>>approach/best practices I should be following.
> >>
> >> This should be safe. There are many types of newsletters; the problem
> >> would only arise if you started training them as ham without really
> >> knowing they are welcome.
> >>
> >> --
> >> Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
> >> Warning: I wish NOT to receive e-mail advertising to this address.
> >> Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
> >> WinError #99999: Out of error messages.