On 3/16/2016 2:14 AM, Ted Mittelstaedt wrote:
> On 3/15/2016 2:48 PM, Reindl Harald wrote:
>> On 3/15/2016 10:24 PM, Ted Mittelstaedt wrote:
>>> Baloney - spamoney!!! I do not use autolearning, and ALL my spam is either hand-selected or it comes from honeypot addresses that have NEVER been on my domains - I get these honeypot addresses by scanning the mail log and looking for guesses by spammers - when I see a popular address in the "guess bin" I set it up as a honeypot - and within 6 months it's getting thousands of spams a week. And the ham comes from me and from a select group of users who have large amounts of mail stored on the system that is all clean. Bayes is NOT the answer to everything!!!!
>>
>> no, but to most things, if your corpora are well maintained and you don't forget already-learned samples - otherwise it's easy to trick over the long run, and it won't catch seasonal junk or will end in misclassified seasonal ham. we have scripts checking any samples against the current Bayes classification and ignoring them if they already have BAYES_99.
>
> Is this even necessary? I thought the learner automatically rejected everything already tagged.
14 months of experience shows that there is a difference between "tagged" and "has BAYES_99", because other, slightly similar messages in the future share tokens which would otherwise not get that spammy classification
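The "skip samples that already hit BAYES_99" check mentioned above could be sketched roughly like this. This is a minimal illustration, not the actual script from the thread: it assumes the classification is read out of a standard X-Spam-Status header line, and the helper name is mine.

```python
import re

def already_bayes_99(spam_report: str) -> bool:
    """Return True if a SpamAssassin result already lists BAYES_99
    among its hit tests, i.e. feeding the sample to sa-learn would
    add little new signal. Parsing the usual 'tests=' list from an
    X-Spam-Status line is an assumption of this sketch."""
    match = re.search(r'tests=([A-Z0-9_,\s]+)', spam_report)
    if not match:
        return False
    tests = {t.strip() for t in match.group(1).split(',')}
    return 'BAYES_99' in tests

# this sample would be skipped, not fed to sa-learn:
status = 'X-Spam-Status: Yes, score=9.1 tests=BAYES_99,BAYES_999,URIBL_BLACK'
```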
important is that you also train ham properly, which is shown here by around 75% of all passed messages having BAYES_00 - that's finally the key to making negative Bayes a safety net on the one hand, and BAYES_99 a nearly poison pill on the other, balanced with DNSWLs
/etc/mail/spamassassin/local-*.cf:

score BAYES_00  -3.5
score BAYES_05  -2.0
score BAYES_20  -1.0
score BAYES_40  -0.5
score BAYES_50   1.5
score BAYES_60   3.5
score BAYES_80   5.5
score BAYES_95   6.5
score BAYES_99   7.5
score BAYES_999  0.4

0   62556  SPAM
0   21903  HAM
0 2588189  TOKEN

BAYES_00     16233  74.25 %
BAYES_05       466   2.13 %
BAYES_20       539   2.46 %
BAYES_40       534   2.44 %
BAYES_50      1708   7.81 %
BAYES_60       223   1.02 %   8.84 % (OF TOTAL BLOCKED)
BAYES_80       172   0.78 %   6.82 % (OF TOTAL BLOCKED)
BAYES_95       161   0.73 %   6.38 % (OF TOTAL BLOCKED)
BAYES_99      1826   8.35 %  72.43 % (OF TOTAL BLOCKED)
BAYES_999     1632   7.46 %  64.73 % (OF TOTAL BLOCKED)

DELIVERED    30464  94.33 %
DNSWL        30210  93.54 %
SPF          21158  65.51 %
SPF/DKIM WL   9122  28.24 %
SHORTCIRCUIT 10409  32.23 %
BLOCKED       2521   7.80 %
SPAMMY        2382   7.37 %  94.48 % (OF TOTAL BLOCKED)
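As a side note, the "(OF TOTAL BLOCKED)" column in those stats is simply each rule's hit count divided by the BLOCKED total (2521); the figures appear to be truncated to two decimals rather than rounded, which the sketch below reproduces. The dict layout is mine, not the stats script from the thread.

```python
# Recompute the "(OF TOTAL BLOCKED)" column from the raw hit counts.
blocked = 2521  # total blocked messages from the stats above
hits = {
    'BAYES_60': 223,
    'BAYES_80': 172,
    'BAYES_95': 161,
    'BAYES_99': 1826,
    'BAYES_999': 1632,
}
# truncate (not round) to two decimal places, matching the table
share = {rule: (10000 * n // blocked) / 100 for rule, n in hits.items()}
# share['BAYES_99'] -> 72.43, as in the stats above
```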
there is not much left to train, and with the data of a whole year, have fun trying to bypass it, especially when it's scored properly

> All my spam from 2015 fed into the Bayes learner is backed up; there's probably about 3 GB of it. I show 369686 spams and 15128675 tokens in the database. I don't think I'm forgetting already-learned samples, since the spam and token counts increase every time I feed it. If you think there is a different way to learn it that is better, I can create a test db and feed last year into that and see if it works any better.
i strip away most headers from the samples, put a generic Received header on top, and also delete the binary part of most attachments
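A rough sketch of that sanitizing step using Python's stdlib email package; the header whitelist and the generic Received line are my assumptions (the actual script is not shown in the thread), and this version appends the Received header rather than literally placing it first.

```python
import email

# headers kept for training; everything else is stripped (assumed list)
KEEP = {'from', 'to', 'subject', 'mime-version',
        'content-type', 'content-transfer-encoding'}

def sanitize(raw: str) -> str:
    """Strip most headers, add a generic Received header, and blank
    the body of binary attachment parts before feeding to sa-learn."""
    msg = email.message_from_string(raw)
    # drop every header that is not on the small whitelist
    for name in set(msg.keys()):
        if name.lower() not in KEEP:
            del msg[name]
    # add a generic Received header (appended here; putting it at the
    # very top would need header-order surgery)
    msg['Received'] = 'from localhost (localhost [127.0.0.1]); for training'
    # blank the body of every non-text leaf part (binary attachments)
    for part in msg.walk():
        if not part.is_multipart() and \
                not part.get_content_type().startswith('text/'):
            part.set_payload('')
    return msg.as_string()
```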
[root@mail-gw:/training]$ disk-usage.sh
21903 Files  351284 KB  343 MB : ham/
    1 Files       5 KB    0 MB : learn.sh
62556 Files  408280 KB  398 MB : spam/