Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

David Jones Tue, 13 Feb 2018 08:37:20 -0800

On 02/13/2018 07:55 AM, Horváth Szabolcs wrote:

Dear members,


A user repeatedly sends us spam messages to train SA.
Training - at the moment - requires manual intervention: an administrator 
verifies that it's really spam, then runs sa-learn.

The user then thinks the process is done and that the next time the same 
email arrives, it will automatically be marked as spam.

However, that doesn't happen.

Before: spamassassin -D -t <spam

Content analysis details:   (0.0 points, 5.0 required)

  pts rule name              description
---- ---------------------- --------------------------------------------------
  0.0 HTML_IMAGE_RATIO_08    BODY: HTML has a low ratio of text to image area
  0.0 HTML_MESSAGE           BODY: HTML included in message

Content analysis details:   (0.8 points, 5.0 required)

Then:
sa-learn --progress --spam --mbox /tmp/spam --dbpath /var/spool/amavisd/.spamassassin/ -u amavis

After:

pts rule name              description
---- ---------------------- --------------------------------------------------
  0.0 HTML_IMAGE_RATIO_08    BODY: HTML has a low ratio of text to image area
  0.0 HTML_MESSAGE           BODY: HTML included in message
  0.8 BAYES_50               BODY: Bayes spam probability is 40 to 60%
                             [score: 0.5000]

There should be many more rule hits than just these 3. It looks like network tests aren't happening.

Can you post the original email to pastebin.com with minimal redacting so the rest of us can run it through our SA to see how it scores and help with suggestions?

I suspect there needs to be some MTA tuning in front of SA along with some SA tuning that is mentioned on this list every couple of months: add extra RBLs, add KAM.cf, enable some SA plugins, etc.


It only assigns 0.8. (required_hits around 4.0)

You are certainly free to set a local score higher if you want, but that is probably not the main resolution to this issue.


Version: spamassassin-3.3.2-4.el6.rfx.x86_64


This is very old and no longer supported.  Why not upgrade to 3.4.x?

$ sa-learn --dump magic --dbpath /var/spool/amavisd/.spamassassin/
0.000          0          3          0  non-token data: bayes db version
0.000          0     338770          0  non-token data: nspam
0.000          0    1460807          0  non-token data: nham
0.000          0     187804          0  non-token data: ntokens
0.000          0 1512318030          0  non-token data: oldest atime
0.000          0 1518524875          0  non-token data: newest atime
0.000          0 1518524876          0  non-token data: last journal sync atime
0.000          0 1518508126          0  non-token data: last expiry atime
0.000          0      43238          0  non-token data: last expire atime delta
0.000          0     136970          0  non-token data: last expire reduction count

I can clearly see that nspam increased after the sa-learn.
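
To compare these counters before and after a training run, the dump can be parsed with a few lines of Python. This is only a sketch: the helper name parse_dump_magic is made up for illustration, and it assumes the column layout shown in the output above.

```python
import re

def parse_dump_magic(text):
    """Return {label: value} for the 'non-token data' lines of `sa-learn --dump magic`.

    Hypothetical helper; assumes the four numeric columns shown in the output above,
    with the value of interest in the third column.
    """
    stats = {}
    pattern = re.compile(
        r"\s*[\d.]+\s+\d+\s+(\d+)\s+\d+\s+non-token data:\s+(.+?)\s*$"
    )
    for line in text.splitlines():
        match = pattern.match(line)
        if match:
            stats[match.group(2)] = int(match.group(1))
    return stats

# A few lines copied from the dump above.
sample = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0     338770          0  non-token data: nspam
0.000          0    1460807          0  non-token data: nham
0.000          0     187804          0  non-token data: ntokens
"""

stats = parse_dump_magic(sample)
ham_per_spam = stats["nham"] / stats["nspam"]
print(f"nspam={stats['nspam']} nham={stats['nham']} ham per spam={ham_per_spam:.2f}")
```

Running it on a dump taken before and after sa-learn shows whether nspam moved by the number of messages that were fed in.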

When I tried to understand what was happening, I found the following:
# https://wiki.apache.org/spamassassin/BayesInSpamAssassin
The Bayesian classifier in Spamassassin tries to identify spam by looking at 
what are called tokens; words or short character sequences that are commonly 
found in spam or ham. If I've handed 100 messages to sa-learn that have the 
phrase penis enlargement and told it that those are all spam, when the 101st 
message comes in with the words penis and enlargement, the Bayesian classifier 
will be pretty sure that the new message is spam and will increase the spam 
score of that message.
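
As a rough illustration of the token idea described above, here is a simplified per-token estimate with Robinson-style smoothing. It is a sketch only and not SpamAssassin's actual implementation: the function name, the strength and prior constants, and the 1000/4000 token counts are assumptions chosen for the example; only nspam and nham come from the dump above.

```python
# Simplified per-token spam probability with Robinson-style smoothing.
# Illustrative only: NOT SpamAssassin's actual code or constants.

NSPAM = 338770    # nspam from the dump above
NHAM = 1460807    # nham from the dump above

def token_spam_probability(spam_hits, ham_hits, nspam=NSPAM, nham=NHAM,
                           strength=1.0, prior=0.5):
    """Estimate how 'spammy' one token is from the learned messages containing it."""
    seen = spam_hits + ham_hits
    if seen == 0:
        return prior  # never seen: no evidence either way
    spam_ratio = spam_hits / nspam
    ham_ratio = ham_hits / nham
    raw = spam_ratio / (spam_ratio + ham_ratio)
    # Pull the raw estimate toward the prior when the token was seen only a few times.
    return (strength * prior + seen * raw) / (strength + seen)

# Hypothetical counts: a token learned from one spam, and a token common in both classes.
rare = token_spam_probability(spam_hits=1, ham_hits=0)
common = token_spam_probability(spam_hits=1000, ham_hits=4000)
print(f"rare token: {rare:.2f}, common token: {common:.2f}")
```

Under these assumptions, a token that appears about equally often (relative to corpus size) in spam and ham stays close to 0.5 no matter how often it is learned, which is one possible reason a mostly-image HTML message with few distinctive words could still land in BAYES_50 after a single sa-learn run.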


My questions are:
1) is there any chance to change spamassassin settings to mark similar messages 
as SPAM in the future?
BAYES_50 with 0.8 points is really, really low.

You should be hitting BAYES_95, BAYES_99, and BAYES_999 on these really bad emails with proper training, which would give it a higher probability and thus a higher score.

I know that I'm able to write custom rules based on e-mail body content, but I 
had hoped that sa-learn would do that by updating the Bayes database.

I suspect that after the MTA and SA are tuned, this would be blocked without requiring a local custom rule, but I would need to see the rule hits on my SA platform before I could say for sure. Sometimes it does require a header or body rule combined with other hits in a local custom meta rule to block them.

2) Or should we tell users that the learning process doesn't necessarily mean 
that future messages will be flagged as spam, and that it should rather be 
considered a "warning sign"?

I appreciate any feedback on this.

I have already tried to find docs that answer these questions, but no luck so far.
If you know of good documentation, just send it to me. I love reading manuals.

Best regards,
   Szabolcs Horvath


--
David Jones
