On Tue, 8 Mar 2016, Matus UHLAR - fantomas wrote:

On Mar 8, 2016, at 7:31 AM, Matus UHLAR - fantomas <uh...@fantomas.sk> wrote:
how can these two stats be different?

On 08.03.16 10:19, @lbutlr wrote:
Because one is for SPAM and one is for HAM.

On Mar 8, 2016, at 10:41 AM, Matus UHLAR - fantomas <uh...@fantomas.sk> wrote:
Why did you remove the important part?

On 08.03.16 11:16, @lbutlr wrote:
I didn’t.

yes, you did, so I've had to paste them again below:

TOP SPAM RULES FIRED

RANK RULE NAME COUNT %OFRULES %OFMAIL %OFSPAM %OFHAM

2 HTML_MESSAGE 12714 8.18 38.98 87.85 90.80

TOP HAM RULES FIRED

RANK RULE NAME COUNT %OFRULES %OFMAIL %OFSPAM %OFHAM

1 HTML_MESSAGE 16473 9.13 50.51 87.85 90.80


Why did the same rule hit 38.98% of all mail and 50.51% of all mail?

Because one is checking SPAM and one is checking HAM.

so why was %OFMAIL different from %OFSPAM in the first case and from %OFHAM
in the second case?

seems that the mail counts were different, but why?

Because there are differing amounts of SPAM and HAM?

if we are only checking spam mail for a given rule, how can the number of
all hits differ from the number of spam hits? they should all be spam,
shouldn't they?

Assuming that the OP was using Dallas Engelken's "sa-stats.pl" script
(I was), the report line for each rule (excepting the rank and count
columns) should be IDENTICAL in the spam and ham tables.

This script takes spamd's log output as input and aggregates a digest
of all the rule hits. In a given log there will be lines that are
spam results ("spamd: result: Y 75") and lines that are ham results
("spamd: result: . -3").
Each line (spam & ham) carries a list of the rules that fired on that particular message:

2016-03-08T12:37:44.833847-06:00 s-l107 spamd[10463]: spamd: result: . -3 - BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,HTML_MESSAGE,KHOP_RCVD_TRUST,L_LOCAL_MUCHO_DOT_LINES2,RCVD_IN_DNSWL_LOW,RCVD_IN_HOSTKARMA_YE,RP_MATCHES_RCVD,SPF_PASS,T__RECEIVED_1 scantime=3.5,size=11059,user=redacted,uid=115,required_score=6.0,rhost=s-l012.engr.uiowa.edu,raddr=128.255.17.253,rport=35620,mid=<redacted-000...@email.amazonses.com>,bayes=0.000000,autolearn=ham autolearn_force=no
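
For illustration, here is a rough Python sketch of pulling the verdict, score and rule list out of such a line. It is not how sa-stats.pl does it; the pattern is inferred only from the one sample line above, and the names RESULT_RE and parse_result are made up for this example.

```python
import re

# Pattern inferred from the single sample line above; other spamd versions
# or syslog formats may well differ.
RESULT_RE = re.compile(
    r"spamd: result: (?P<flag>[Y.]) (?P<score>-?\d+) - (?P<rules>\S*) ?scantime="
)

def parse_result(line):
    """Return (is_spam, score, rules) for a result line, or None otherwise."""
    m = RESULT_RE.search(line)
    if m is None:
        return None
    # An empty rule list yields [""] after split, so filter empty names out.
    rules = [r for r in m.group("rules").split(",") if r]
    return m.group("flag") == "Y", int(m.group("score")), rules

# Abbreviated copy of the log line quoted above (most rules and the
# trailing fields are dropped to keep the example short).
sample = ("spamd[10463]: spamd: result: . -3 - BAYES_00,DKIM_SIGNED,HTML_MESSAGE,"
          "SPF_PASS scantime=3.5,size=11059")

is_spam, score, rules = parse_result(sample)
print(is_spam, score, "HTML_MESSAGE" in rules)  # False -3 True
```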

So for the HTML_MESSAGE rule, I get stats of:
grep HTML_MESSAGE sa-stats-dec.out
   4    HTML_MESSAGE                    90850    79.41   68.63   86.59  0.3456
   2    HTML_MESSAGE                    171992   79.41   68.63   86.59  0.3456

This means that, for the duration of that log run, the rule hit 79.41% of all messages processed, 68.63% of the messages classified as spam (a count of 90850, giving a rank of 4), and 86.59% of the messages classified as ham (a count of 171992, giving a rank of 2).

Thus for a given rule, the %all-messages, %spam and %ham values should be
IDENTICAL in both tables (assuming they are from the same log run).
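
To make that concrete, here is a minimal Python sketch of the aggregation as I understand it (an assumption about the logic, not a port of sa-stats.pl). The function name rule_stats and the five-message input are invented for the example. Each rule gets one set of percentages, computed once, so the spam table and the ham table can only differ in rank and count.

```python
from collections import Counter

def rule_stats(results):
    """results: iterable of (is_spam, rules) pairs, one per scanned message."""
    n_spam = n_ham = 0
    spam_hits, ham_hits = Counter(), Counter()
    for is_spam, rules in results:
        if is_spam:
            n_spam += 1
            spam_hits.update(set(rules))  # count each rule once per message
        else:
            n_ham += 1
            ham_hits.update(set(rules))
    total = n_spam + n_ham
    stats = {}
    for rule in set(spam_hits) | set(ham_hits):
        s, h = spam_hits[rule], ham_hits[rule]
        stats[rule] = {
            "spam_count": s,  # count shown in the spam table
            "ham_count": h,   # count shown in the ham table
            "pct_all": 100.0 * (s + h) / total,
            "pct_spam": 100.0 * s / n_spam if n_spam else 0.0,
            "pct_ham": 100.0 * h / n_ham if n_ham else 0.0,
        }
    return stats

# Made-up input: 2 spam and 3 ham messages.
results = [
    (True, ["HTML_MESSAGE", "SOME_RULE"]),
    (True, ["SOME_RULE"]),
    (False, ["HTML_MESSAGE"]),
    (False, ["HTML_MESSAGE"]),
    (False, []),
]
stats = rule_stats(results)
print(stats["HTML_MESSAGE"])
```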

So for the OP's original post, having %spam and %ham be identical but
%all-messages differ is weird. It could be that he's got a different version of
the sa-stats script; it has an additional field, that "%of-rules" thing.
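
One possible reading of the OP's numbers, offered as a guess rather than a statement about what his script does: if %OFMAIL in his report is the per-table count divided by the total number of messages (spam + ham), the figures hang together. The arithmetic below uses only the values pasted earlier in this thread; the variable names are mine.

```python
# Values copied from the OP's two HTML_MESSAGE rows.
spam_count, spam_pct_mail, pct_spam = 12714, 38.98, 87.85
ham_count, ham_pct_mail, pct_ham = 16473, 50.51, 90.80

# Assumption: %OFMAIL = count in this table / all messages.
total_from_spam_row = spam_count / (spam_pct_mail / 100)
total_from_ham_row = ham_count / (ham_pct_mail / 100)

# %OFSPAM and %OFHAM give the size of each class.
n_spam = spam_count / (pct_spam / 100)
n_ham = ham_count / (pct_ham / 100)

print(round(total_from_spam_row), round(total_from_ham_row), round(n_spam + n_ham))
```

All three estimates of the total land within a few messages of each other (roughly 32,600), which is what you would expect from rounding if that reading is right.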

So to Charles Sprickman, which sa-stats script did you use to generate your rules report?


--
Dave Funk                                  University of Iowa
<dbfunk (at) engineering.uiowa.edu>        College of Engineering
319/335-5751   FAX: 319/384-0549           1256 Seamans Center
Sys_admin/Postmaster/cell_admin            Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{
