Dear members,

A user repeatedly sends us spam messages to train SA.
Training, at the moment, requires manual intervention: an administrator
verifies that it really is spam, then runs sa-learn.

The user then assumes the process is done and that the next time the same
email arrives, it will automatically be marked as spam.

However, that doesn't happen.

Before: spamassassin -D -t <spam

Content analysis details:   (0.0 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 0.0 HTML_IMAGE_RATIO_08    BODY: HTML has a low ratio of text to image area
 0.0 HTML_MESSAGE           BODY: HTML included in message

Then:
sa-learn --progress --spam --mbox /tmp/spam --dbpath /var/spool/amavisd/.spamassassin/ -u amavis

After:

Content analysis details:   (0.8 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 0.0 HTML_IMAGE_RATIO_08    BODY: HTML has a low ratio of text to image area
 0.0 HTML_MESSAGE           BODY: HTML included in message
 0.8 BAYES_50               BODY: Bayes spam probability is 40 to 60%
                            [score: 0.5000]


It only assigns 0.8 points. (required_hits is around 4.0.)


Version: spamassassin-3.3.2-4.el6.rfx.x86_64

$ sa-learn --dump magic --dbpath /var/spool/amavisd/.spamassassin/
0.000          0          3          0  non-token data: bayes db version
0.000          0     338770          0  non-token data: nspam
0.000          0    1460807          0  non-token data: nham
0.000          0     187804          0  non-token data: ntokens
0.000          0 1512318030          0  non-token data: oldest atime
0.000          0 1518524875          0  non-token data: newest atime
0.000          0 1518524876          0  non-token data: last journal sync atime
0.000          0 1518508126          0  non-token data: last expiry atime
0.000          0      43238          0  non-token data: last expire atime delta
0.000          0     136970          0  non-token data: last expire reduction count

I can clearly see that nspam increased after running sa-learn.
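
To compare the counters before and after a training run without eyeballing
them, the dump can be parsed with a short script. This is only a sketch: the
function name parse_dump_magic is mine, and it assumes the column layout shown
above, with the value sitting in the third column of each "non-token data" line.

```python
from datetime import datetime, timezone


def parse_dump_magic(text):
    """Map each 'non-token data' label to its integer value.

    Assumption: the value is the third whitespace-separated column,
    as in the dump pasted above.
    """
    values = {}
    for line in text.splitlines():
        if "non-token data:" not in line:
            continue
        fields, label = line.split("non-token data:", 1)
        parts = fields.split()
        if len(parts) >= 3:
            values[label.strip()] = int(parts[2])
    return values


# A few lines copied from the dump above.
sample = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0     338770          0  non-token data: nspam
0.000          0    1460807          0  non-token data: nham
0.000          0 1512318030          0  non-token data: oldest atime
"""

stats = parse_dump_magic(sample)
total = stats["nspam"] + stats["nham"]
print("learned spam:", stats["nspam"])
print("learned ham: ", stats["nham"])
print("spam share:   %.1f%%" % (100.0 * stats["nspam"] / total))

# The atime values look like Unix timestamps; shown here in UTC.
oldest = datetime.fromtimestamp(stats["oldest atime"], tz=timezone.utc)
print("oldest atime:", oldest.isoformat())
```

Running it on a dump taken before training and on one taken after makes it
easy to confirm that nspam went up by the number of messages in the mbox.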

When I tried to understand what was happening, I found the following:
# https://wiki.apache.org/spamassassin/BayesInSpamAssassin
The Bayesian classifier in SpamAssassin tries to identify spam by looking at 
what are called tokens; words or short character sequences that are commonly 
found in spam or ham. If I've handed 100 messages to sa-learn that have the 
phrase penis enlargement and told it that those are all spam, when the 101st 
message comes in with the words penis and enlargement, the Bayesian classifier 
will be pretty sure that the new message is spam and will increase the spam 
score of that message.
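
To check my understanding of the quoted explanation, here is a rough sketch of
how a per-token spam estimate could be derived from counts. It is not
SpamAssassin's actual code: the smoothing is the Robinson-style formula
(s*x + n*p) / (s + n), and the function name and the constants s=1.0 and x=0.5
are my own assumptions for illustration. The nspam and nham figures are taken
from the dump above; the token counts are made up.

```python
NSPAM = 338770    # learned spam messages, from the dump above
NHAM = 1460807    # learned ham messages, from the dump above


def token_spam_probability(spam_hits, ham_hits, nspam=NSPAM, nham=NHAM,
                           s=1.0, x=0.5):
    """Smoothed estimate that a message containing this token is spam.

    spam_hits / ham_hits: how many learned spam / ham messages had the token.
    s: weight given to the prior (assumed value).
    x: prior probability for a token never seen before (assumed value).
    """
    spam_freq = spam_hits / nspam if nspam else 0.0
    ham_freq = ham_hits / nham if nham else 0.0
    total = spam_freq + ham_freq
    raw = spam_freq / total if total else x
    n = spam_hits + ham_hits
    return (s * x + n * raw) / (s + n)


# A token the database has never seen stays at the neutral prior.
print(token_spam_probability(0, 0))      # 0.5

# The same token after it was learned from a single spam message.
print(token_spam_probability(1, 0))      # 0.75

# After 100 spam messages, as in the wiki example, it is close to 1.
print(token_spam_probability(100, 0))

# A token that is already common in both spam and ham (hypothetical counts)
# hardly moves when one more spam message containing it is learned.
before = token_spam_probability(1000, 4000)
after = token_spam_probability(1001, 4000)
print(before, after)
```

If this model is roughly right, learning a single message only shifts tokens
that are rare in the database, while tokens that already occur in thousands of
learned messages stay almost where they were.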


My questions are: 
1) Is there any way to change the SpamAssassin settings so that similar
messages are marked as spam in the future?
BAYES_50 with 0.8 points is really, really low.

I know that I can write custom rules based on the e-mail body content, but I
had hoped that sa-learn would do that by updating the Bayes database.

2) Or should I tell users that the learning process doesn't necessarily mean
that future messages will be flagged as spam, and that it should rather be
considered a "warning sign"?

I appreciate any feedback on this. 

I have already tried to find docs that answer these questions, but no luck so
far. If you know of good documentation, just send it to me. I love reading
manuals.

Best regards,
  Szabolcs Horvath
