On 02/13/2018 07:55 AM, Horváth Szabolcs wrote:
Dear members,
A user repeatedly sends us spam messages to train SA.
Training, at the moment, requires manual intervention: an administrator verifies
that it's really spam, then runs sa-learn.
The user then thinks the process is done and that the next time the same email
arrives, it will automatically be marked as spam.
However, that doesn't happen.
Before: spamassassin -D -t <spam
Content analysis details: (0.0 points, 5.0 required)
pts rule name description
---- ---------------------- --------------------------------------------------
0.0 HTML_IMAGE_RATIO_08 BODY: HTML has a low ratio of text to image area
0.0 HTML_MESSAGE BODY: HTML included in message
Then:
sa-learn --progress --spam --mbox /tmp/spam --dbpath /var/spool/amavisd/.spamassassin/ -u amavis
After:
Content analysis details: (0.8 points, 5.0 required)
pts rule name description
---- ---------------------- --------------------------------------------------
0.0 HTML_IMAGE_RATIO_08 BODY: HTML has a low ratio of text to image area
0.0 HTML_MESSAGE BODY: HTML included in message
0.8 BAYES_50 BODY: Bayes spam probability is 40 to 60%
[score: 0.5000]
There should be many more rule hits than just these 3. It looks like
network tests aren't happening.
Can you post the original email to pastebin.com with minimal redacting
so the rest of us can run it through our SA to see how it scores to help
with suggestions?
I suspect there needs to be some MTA tuning in front of SA along with
some SA tuning that is mentioned on this list every couple of months --
add extra RBLs, add KAM.cf, enable some SA plugins, etc.
It only assigns 0.8. (required_hits around 4.0)
You are certainly free to set a local score higher if you want but that
is probably not the main resolution to this issue.
Version: spamassassin-3.3.2-4.el6.rfx.x86_64
This is very old and no longer supported. Why not upgrade to 3.4.x?
$ sa-learn --dump magic --dbpath /var/spool/amavisd/.spamassassin/
0.000 0 3 0 non-token data: bayes db version
0.000 0 338770 0 non-token data: nspam
0.000 0 1460807 0 non-token data: nham
0.000 0 187804 0 non-token data: ntokens
0.000 0 1512318030 0 non-token data: oldest atime
0.000 0 1518524875 0 non-token data: newest atime
0.000 0 1518524876 0 non-token data: last journal sync atime
0.000 0 1518508126 0 non-token data: last expiry atime
0.000 0 43238 0 non-token data: last expire atime delta
0.000 0 136970 0 non-token data: last expire reduction count
I can clearly see that nspam increases after running sa-learn.
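To put the dump above in perspective, here is a minimal Python sketch that parses the pasted `sa-learn --dump magic` lines and derives the ham-to-spam training ratio and the dates behind the atime values. The helper name `parse_dump_magic` is hypothetical, and the sketch assumes the value of interest is always the third column of each "non-token data" line, as it is in the output shown.

```python
from datetime import datetime, timezone

# Lines copied from the dump above (the wrapped last line is joined).
DUMP = """\
0.000 0 3 0 non-token data: bayes db version
0.000 0 338770 0 non-token data: nspam
0.000 0 1460807 0 non-token data: nham
0.000 0 187804 0 non-token data: ntokens
0.000 0 1512318030 0 non-token data: oldest atime
0.000 0 1518524875 0 non-token data: newest atime
0.000 0 1518524876 0 non-token data: last journal sync atime
0.000 0 1518508126 0 non-token data: last expiry atime
0.000 0 43238 0 non-token data: last expire atime delta
0.000 0 136970 0 non-token data: last expire reduction count
"""


def parse_dump_magic(text):
    """Map each 'non-token data' label to its integer value (third column)."""
    values = {}
    for line in text.splitlines():
        head, sep, label = line.partition("non-token data:")
        if not sep:
            continue  # not a magic line
        fields = head.split()
        values[label.strip()] = int(fields[2])
    return values


stats = parse_dump_magic(DUMP)

# How many ham messages were learned for every spam message.
ham_per_spam = stats["nham"] / stats["nspam"]

# The atime values are Unix timestamps; show them as UTC dates.
oldest = datetime.fromtimestamp(stats["oldest atime"], tz=timezone.utc)
newest = datetime.fromtimestamp(stats["newest atime"], tz=timezone.utc)

print(f"nspam={stats['nspam']} nham={stats['nham']} ratio={ham_per_spam:.2f}")
print(f"token atimes span {oldest.date()} to {newest.date()}")
```

With more than four learned ham messages for every learned spam message, one additional spam message is a very small change to the database as a whole.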
When I tried to understand what was happening, I found the following:
# https://wiki.apache.org/spamassassin/BayesInSpamAssassin
The Bayesian classifier in Spamassassin tries to identify spam by looking at
what are called tokens; words or short character sequences that are commonly
found in spam or ham. If I've handed 100 messages to sa-learn that have the
phrase penis enlargement and told it that those are all spam, when the 101st
message comes in with the words penis and enlargment, the Bayesian classifier
will be pretty sure that the new message is spam and will increase the spam
score of that message.
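The idea in the wiki text can be illustrated with a toy Python sketch. This is not SpamAssassin's actual implementation: it uses a simplified per-token estimate with Robinson-style smoothing, and the names `token_spam_probability`, `s` and `x` and their default values are assumptions chosen for illustration. The nspam and nham counts are taken from the dump earlier in this message.

```python
NSPAM = 338770   # learned spam messages, from the dump above
NHAM = 1460807   # learned ham messages, from the dump above


def token_spam_probability(spam_hits, ham_hits, nspam=NSPAM, nham=NHAM,
                           s=1.0, x=0.5):
    """Estimate how 'spammy' one token is.

    spam_hits / ham_hits: number of learned spam / ham messages containing
    the token. s and x are illustrative smoothing constants (assumed here):
    with little evidence the estimate stays near the neutral value x.
    """
    n = spam_hits + ham_hits
    if n == 0:
        return x  # token never seen: neutral
    spam_ratio = spam_hits / nspam
    ham_ratio = ham_hits / nham
    raw = spam_ratio / (spam_ratio + ham_ratio)
    return (s * x + n * raw) / (s + n)


# A token never seen before stays neutral.
print(token_spam_probability(0, 0))
# A token learned from a single spam message moves only part of the way.
print(token_spam_probability(1, 0))
# A token seen in 100 spam messages and no ham is strongly spammy.
print(token_spam_probability(100, 0))
# A token learned once as spam but also present in some ham stays near neutral.
print(token_spam_probability(1, 5))
```

Under these assumptions, one training pass on one message shifts its tokens only modestly, and tokens that also occur in ham barely move, which is the "100 messages" point the wiki makes.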
My questions are:
1) is there any chance to change spamassassin settings to mark similar messages
as SPAM in the future?
BAYES_50 with 0.8 points is really, really low.
You should be hitting BAYES_95, BAYES_99, and BAYES_999 on these really
bad emails with proper training which would give it a higher probability
and thus a higher score.
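For reference, the relationship between the Bayes probability and the BAYES_xx rule names can be sketched in Python as below. Only the BAYES_50 range (40 to 60%) is confirmed by the output earlier in this thread; the other boundaries are an assumption based on the stock rule definitions as I remember them, so check them against the 23_bayes.cf shipped with your version. BAYES_999 in particular may not exist in a 3.3.x ruleset.

```python
# (low, high, rule name). Assumed boundaries, except BAYES_50 which
# matches the "40 to 60%" description in the output above.
BAYES_BUCKETS = [
    (0.00, 0.01, "BAYES_00"),
    (0.01, 0.05, "BAYES_05"),
    (0.05, 0.20, "BAYES_20"),
    (0.20, 0.40, "BAYES_40"),
    (0.40, 0.60, "BAYES_50"),
    (0.60, 0.80, "BAYES_60"),
    (0.80, 0.95, "BAYES_80"),
    (0.95, 0.99, "BAYES_95"),
    (0.99, 1.00, "BAYES_99"),
    (0.999, 1.00, "BAYES_999"),  # assumed to fire in addition to BAYES_99
]


def bayes_rules(probability):
    """Return the BAYES_xx rule names that would match a given probability."""
    hits = []
    for low, high, name in BAYES_BUCKETS:
        # Lower bound inclusive; upper bound exclusive unless it is 1.00.
        if low <= probability and (probability < high or high == 1.00):
            hits.append(name)
    return hits


print(bayes_rules(0.5000))   # the score reported above
print(bayes_rules(0.97))
print(bayes_rules(0.9995))
```

A score of exactly 0.5000 sits in the middle of the neutral bucket, so the classifier is effectively expressing no opinion on this message.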
I know that I'm able to write custom rules based on e-mail body content, but I
had hoped that sa-learn would do that by manipulating the bayes database.
I suspect that after the MTA and SA are tuned, this would be blocked
without requiring a local custom rule but I would need to see the rule
hits on my SA platform before I could say for sure. Sometimes it does
require a header or body rule combined with other hits in a local custom
meta rule to block them.
2) or should we tell users that the learning process doesn't necessarily mean
that future messages will be flagged as SPAM?
Rather, it should be considered a "warning sign".
I appreciate any feedback on this.
I have already tried to find docs that answer these questions, but had no luck so far.
If you know of good documentation, please send it to me. I love reading manuals.
Best regards,
Szabolcs Horvath
--
David Jones