On 16.03.2016 at 02:14, Ted Mittelstaedt wrote:


On 3/15/2016 2:48 PM, Reindl Harald wrote:


On 15.03.2016 at 22:24, Ted Mittelstaedt wrote:
Baloney - spamoney!!!

I do not use autolearning, and ALL my spam is either hand-selected or it
comes from honeypot addresses that have NEVER been on my domains - I get
these honeypot addresses by scanning the mail log and looking for
guesses by spammers - when I see a popular address in the "guess bin"
I set it up as a honeypot - and within 6 months it's getting thousands
of spams a week. And the ham comes from me and from a select group of
users who have large amounts of mail stored on the system that is all
clean.
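
a minimal sketch of mining the log for such guesses (assuming a
Postfix-style maillog with "User unknown" rejects; the actual log
path and format here are unknown):

  # list the non-existent addresses spammers guess at most often
  grep 'User unknown' /var/log/maillog \
    | grep -oE 'to=<[^>]+>' \
    | sort | uniq -c | sort -rn | head -20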

Bayes is NOT the answer to everything!!!!

no, but to most things, if your corpus is well maintained and you
don't forget already-learned samples - otherwise it's easy to trick
over the long run, and it won't catch seasonal junk or will end up
with misclassified seasonal ham

we have scripts that check every sample against the current Bayes
classification and skip it if it already has BAYES_99,

Is this even necessary?  I thought the learner automatically
rejected everything already tagged.

14 months of experience shows that there is a difference between "tagged" and "has BAYES_99": other, slightly similar messages in the future share tokens which would otherwise not get that spammy a classification
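
a minimal sketch of such a pre-check (an assumed shape, not the
actual script; the paths are made up) - run each candidate through
spamassassin in test mode and only feed it to sa-learn when BAYES_99
did not already fire:

  # skip samples the filter already classifies as BAYES_99
  for f in /training/new-spam/*; do
      if spamassassin -t < "$f" | grep -q 'BAYES_99'; then
          echo "skip $f - already BAYES_99"
      else
          sa-learn --spam "$f"
      fi
  done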

important is that you also train ham properly, which is shown by around 75% of all passed messages having BAYES_00 - that's finally the key to making negative Bayes a safety net and, on the other hand, BAYES_99 a near-poison pill, balanced with DNSWLs

/etc/mail/spamassassin/local-*.cf
score BAYES_00 -3.5
score BAYES_05 -2.0
score BAYES_20 -1.0
score BAYES_40 -0.5
score BAYES_50 1.5
score BAYES_60 3.5
score BAYES_80 5.5
score BAYES_95 6.5
score BAYES_99 7.5
score BAYES_999 0.4
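
with these scores, a message hitting both BAYES_99 and BAYES_999
collects 7.5 + 0.4 = 7.9 points from Bayes alone - already past the
default 5.0 spam threshold - while a BAYES_00 hit subtracts 3.5,
which together with the DNSWL scores is the safety net mentioned above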

0      62556    SPAM
0      21903    HAM
0    2588189    TOKEN

BAYES_00        16233   74.25 %
BAYES_05          466    2.13 %
BAYES_20          539    2.46 %
BAYES_40          534    2.44 %
BAYES_50         1708    7.81 %
BAYES_60          223    1.02 %     8.84 % (OF TOTAL BLOCKED)
BAYES_80          172    0.78 %     6.82 % (OF TOTAL BLOCKED)
BAYES_95          161    0.73 %     6.38 % (OF TOTAL BLOCKED)
BAYES_99         1826    8.35 %    72.43 % (OF TOTAL BLOCKED)
BAYES_999        1632    7.46 %    64.73 % (OF TOTAL BLOCKED)

DELIVERED       30464   94.33 %
DNSWL           30210   93.54 %
SPF             21158   65.51 %
SPF/DKIM WL      9122   28.24 %
SHORTCIRCUIT    10409   32.23 %

BLOCKED          2521    7.80 %
SPAMMY           2382    7.37 %    94.48 % (OF TOTAL BLOCKED)

there is
not much left to train, and with the data of a whole year, have fun
trying to bypass it, especially when it's scored properly


All my spam from 2015 fed into the Bayes learner is backed up; there's
probably about 3GB of it.  I show 369686 spams and 15128675 tokens in
the database.  I don't think I'm forgetting already-learned
samples, since the spam and token counts increase every time I
feed it.
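
for reference, those counters can be read with stock sa-learn
(a standard SpamAssassin command, nothing site-specific):

  sa-learn --dump magic | grep -E 'nspam|nham|ntoken'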

if you think there is a better way to learn it,
I can create a test db, feed last year into it, and see if it works
any better

I strip away most headers from the samples, put a generic Received header on top, and also delete the binary parts of most attachments
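
a minimal sketch of that normalization (an assumed shape using
formail from procmail, not the actual script; deleting binary MIME
parts needs a MIME-aware tool and is left out here):

  # keep only a few headers and prepend a generic Received header
  for f in /training/spam/*; do
      {
          printf 'Received: from localhost (localhost [127.0.0.1])\n'
          printf '\tby mail-gw.example with ESMTP; Thu,  1 Jan 2015 00:00:00 +0000\n'
          formail -k -X From: -X To: -X Subject: -X Content-Type: < "$f"
      } > "$f.tmp" && mv "$f.tmp" "$f"
  done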

[root@mail-gw:/training]$ disk-usage.sh
   21903 Files   351284 KB      343 MB : ham/
       1 Files        5 KB        0 MB : learn.sh
   62556 Files   408280 KB      398 MB : spam/
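
the learn.sh in that listing is not shown in the thread; a minimal
sketch of what such a script could look like:

  #!/bin/bash
  # feed both corpora to the bayes db without syncing after each
  # message, then sync the journal once at the end
  sa-learn --no-sync --spam /training/spam/
  sa-learn --no-sync --ham  /training/ham/
  sa-learn --sync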
