Dear members,

A user repeatedly sends us spam messages to train SA. Training, at the moment, requires manual intervention: an administrator verifies that the message is really spam, then runs sa-learn.
The user then assumes the process is done and that the next time the same email arrives, it will automatically be marked as spam. However, that doesn't happen.

Before:

spamassassin -D -t <spam

Content analysis details: (0.0 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 0.0 HTML_IMAGE_RATIO_08    BODY: HTML has a low ratio of text to image area
 0.0 HTML_MESSAGE           BODY: HTML included in message

Then:

sa-learn --progress --spam --mbox /tmp/spam --dbpath /var/spool/amavisd/.spamassassin/ -u amavis

After:

Content analysis details: (0.8 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 0.0 HTML_IMAGE_RATIO_08    BODY: HTML has a low ratio of text to image area
 0.0 HTML_MESSAGE           BODY: HTML included in message
 0.8 BAYES_50               BODY: Bayes spam probability is 40 to 60%
                            [score: 0.5000]

It only assigns 0.8 points (required_hits is around 4.0).

Version: spamassassin-3.3.2-4.el6.rfx.x86_64

$ sa-learn --dump magic --dbpath /var/spool/amavisd/.spamassassin/
0.000          0          3          0  non-token data: bayes db version
0.000          0     338770          0  non-token data: nspam
0.000          0    1460807          0  non-token data: nham
0.000          0     187804          0  non-token data: ntokens
0.000          0 1512318030          0  non-token data: oldest atime
0.000          0 1518524875          0  non-token data: newest atime
0.000          0 1518524876          0  non-token data: last journal sync atime
0.000          0 1518508126          0  non-token data: last expiry atime
0.000          0      43238          0  non-token data: last expire atime delta
0.000          0     136970          0  non-token data: last expire reduction count

I can clearly see that nspam increases after sa-learn. When I tried to understand what was happening, I found the following:

# https://wiki.apache.org/spamassassin/BayesInSpamAssassin

"The Bayesian classifier in Spamassassin tries to identify spam by looking at what are called tokens; words or short character sequences that are commonly found in spam or ham.
If I've handed 100 messages to sa-learn that have the phrase penis enlargement and told it that those are all spam, when the 101st message comes in with the words penis and enlargment, the Bayesian classifier will be pretty sure that the new message is spam and will increase the spam score of that message."

My questions are:

1) Is there any way to change SpamAssassin's settings so that similar messages are marked as SPAM in the future? BAYES_50 with 0.8 points is really, really low. I know I can write custom rules based on the e-mail body content, but I had flattered myself that sa-learn would achieve this by manipulating the Bayes database.

2) Or should I tell users that the learning process doesn't necessarily mean future messages will be flagged as SPAM, and that it should rather be considered a "warning sign"?

I appreciate any feedback on this. I have already tried to find documentation that answers these questions, but no luck so far. If you know of good documentation, please send it my way; I love reading manuals.

Best regards,
Szabolcs Horvath
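P.S. To check my own understanding of the token idea quoted above, I sketched a toy Bayesian classifier in Python. This is only an illustration of the principle; it is NOT how SpamAssassin actually tokenizes messages or combines probabilities (SpamAssassin uses chi-squared combining, token expiry, and so on), and all names here are my own invention:

```python
# Toy token-based Bayesian spam classifier -- an illustration only,
# not SpamAssassin's real implementation.
from collections import Counter

class ToyBayes:
    def __init__(self):
        self.spam_tokens = Counter()  # token -> number of spam messages containing it
        self.ham_tokens = Counter()   # token -> number of ham messages containing it
        self.nspam = 0
        self.nham = 0

    def learn(self, text, is_spam):
        # Treat each whitespace-separated word as one token.
        tokens = set(text.lower().split())
        if is_spam:
            self.spam_tokens.update(tokens)
            self.nspam += 1
        else:
            self.ham_tokens.update(tokens)
            self.nham += 1

    def spam_probability(self, text):
        # Naive per-token probabilities with add-one smoothing,
        # multiplied together and normalized.
        p_spam, p_ham = 1.0, 1.0
        for tok in set(text.lower().split()):
            p_spam *= (self.spam_tokens[tok] + 1) / (self.nspam + 2)
            p_ham *= (self.ham_tokens[tok] + 1) / (self.nham + 2)
        return p_spam / (p_spam + p_ham)

clf = ToyBayes()
for _ in range(100):
    clf.learn("buy penis enlargement pills now", is_spam=True)
    clf.learn("meeting agenda attached for review", is_spam=False)

print(clf.spam_probability("penis enlargement offer"))  # near 1.0
print(clf.spam_probability("agenda for the meeting"))   # near 0.0
```

The point the wiki makes follows from the math: after 100 spam samples containing a token, that token's per-token spam probability dominates, so a new message reusing it scores very high. My confusion is why, after 338770 learned spams, a re-sent message still only reaches BAYES_50.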