On Fri, May 22, 2009 at 9:06 PM, Henrik K <h...@hege.li> wrote:
> On Fri, May 22, 2009 at 09:28:55PM +0200, Karsten Bräckelmann wrote:
>> > The EmailBL test zone period has been extended to July 1st.

[snip]

> Thanks. And this is just a small scale test. If we used more domains, feeds,
> and submissions, it could be even nicer. ;-) Keep the reports coming in. It
> would be nice to also know how much of spam are generally from freemails, so
> FREEMAIL_FROM/BODY/REPLYTO figures would be nice also when reporting. It
> might differ from user to user.

I just spent some time putting together some stats.  I'm going to try
to follow the excellent lead of Karsten, and provide some overlap
figures based on the cool grep formula that Dan Mcdonald showed.  The
short version is that it hits about 12% of spam scoring under 15.

The time period is somewhat short: May 22 to May 25.  It's also a
little skewed, because I only implemented the rule at noon on May 22,
so the May 22 side carries about 12 hours of extra mail, but...

As I mentioned before, this is from a mid-sized install of Canadian
government & education users (somewhere around 100 000 mailboxes).  SA
only sees a filtered mail-stream in my setup -- to give an idea how
filtered, 75% of the mail that SA sees is classified as ham.  The
total volumes were 192 530 Spam, 564 483 Ham.


24.5% of the spam that's tagged scores between 5 and 10.
2.76% of that mail hit EMAILBL_TEST_LEM.
0.95% hit FREEMAIL_REPLYTO

22.9% of the spam that's tagged is between 10 and 15.
8.97% of that mail hit EMAILBL_TEST_LEM.
1.20% hit FREEMAIL_REPLYTO

52.5% of the spam that's tagged is above 15.
21.41% of that mail hit EMAILBL_TEST_LEM.
2.36% hit FREEMAIL_REPLYTO
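
For anyone who wants to produce the same kind of per-band figures from
their own logs, here is a minimal Python sketch.  It is not the method
I used for the numbers above.  It assumes spamd log lines of the form
"spamd: result: Y <score> - RULE_A,RULE_B ...", and the band edges
(lower bound inclusive, upper bound exclusive) and the names band_stats
and BANDS are my own choices for illustration.

```python
import re

# Assumed spamd log shape: "... spamd: result: Y 7 - RULE_A,RULE_B scantime=..."
# Only lines flagged "Y" (tagged as spam) are considered.
RESULT_RE = re.compile(r"spamd: result: Y (-?\d+) - (\S+)")

# (label, lower bound inclusive, upper bound exclusive); edges are an assumption.
BANDS = [("5-10", 5, 10), ("10-15", 10, 15), ("15+", 15, float("inf"))]


def band_stats(lines, rule):
    """Return {band: (percent of tagged spam in band, percent of band hitting rule)}."""
    totals = {label: 0 for label, _, _ in BANDS}
    hits = {label: 0 for label, _, _ in BANDS}
    for line in lines:
        m = RESULT_RE.search(line)
        if not m:
            continue
        score = int(m.group(1))
        rules = m.group(2).split(",")
        for label, low, high in BANDS:
            if low <= score < high:
                totals[label] += 1
                if rule in rules:
                    hits[label] += 1
                break
    tagged = sum(totals.values())
    stats = {}
    for label in totals:
        share = 100.0 * totals[label] / tagged if tagged else 0.0
        hit_rate = 100.0 * hits[label] / totals[label] if totals[label] else 0.0
        stats[label] = (share, hit_rate)
    return stats
```

Usage would be something like band_stats(open("spamd_since_22nd"),
"EMAILBL_TEST_LEM").  Note that the hit rate is relative to each band,
not to all spam.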

I also saw 0.05% hits of EMAILBL_TEST_LEM on mail classified as ham.
I hand-verified the 35 messages of 299 that weren't obvious spam.
About 9 of those were FPs (and those came down to 3 distinct messages
from lists I sure wouldn't choose to be on).  I can provide them
off-list if desired.

I saw even fewer FREEMAIL_REPLYTO hits on mail classified as ham: 56,
or 0.01%.  About 22 of those look legit, judging by subject line only
(sorry, it's the end of the day).

Here are the overlap numbers for spam with a score of 10 or less:
$ grep EMAILBL_TEST_LEM spamd_since_22nd \
    | perl -ne 'if (/spamd: result: Y (\d+)/) { print if $1 <= 10 }' \
    | cut -d' ' -f11 | egrep -o '[A-Z0-9_:\.]+?,' \
    | sort | uniq -c | sort -rn | head -n15
   1304 EMAILBL_TEST_LEM,
    728 RAZOR2_CHECK,
    643 RAZOR2_CF_RANGE_51_100,
    629 RAZOR2_CF_RANGE_E4_51_100,
    612 BAYES_50,
    590 FORGED_YAHOO_RCVD,
    582 BAYES_99,
    282 HTML_MESSAGE,
    199 FREEMAIL_FROM,
    157 ADVANCE_FEE_2,
    132 FORGED_MUA_OUTLOOK,
    114 FREEMAIL_REPLYTO,
    103 RCVD_IN_BRBL,
     72 SPF_PASS,

And here they are for all hits on EMAILBL_TEST_LEM:
$ grep EMAILBL_TEST_LEM spamd_since_22nd \
    | cut -d' ' -f11 | egrep -o '[A-Z0-9_:\.]+?,' \
    | sort | uniq -c | sort -rn | head -n15
  41503 EMAILBL_TEST_LEM,
  38987 BAYES_99,
  36782 FORGED_MUA_OUTLOOK,
  36028 ADVANCE_FEE_2,
  33746 RCVD_IN_BRBL,
  33506 JM_SOUGHT_FRAUD_3,
  33214 JM_SOUGHT_FRAUD_2,
  33186 HTML_MESSAGE,
  32281 RCVD_IN_BL_SPAMCOP_NET,
  31953 JM_SOUGHT_FRAUD_1,
  31914 RDNS_NONE,
  31893 RCVD_IN_SBL,
  31883 MIME_HTML_ONLY,
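
If the grep/cut pipeline is awkward on your system, the same overlap
tally can be sketched in Python.  This is an approximation, not a
drop-in replacement: it assumes the same "spamd: result: <flag> <score>
- <rules>" log shape, matches the rule name exactly instead of as a
substring of the line, and counts every rule on the line, so the
numbers may differ slightly from the shell version.  The function name
rule_overlap and its parameters are made up for this example.

```python
import re
from collections import Counter

# Assumed spamd log shape: "... spamd: result: Y 7 - RULE_A,RULE_B scantime=..."
# The flag is "Y" for spam; anything else is treated as ham here.
RESULT_RE = re.compile(r"spamd: result: (\S) (-?\d+) - (\S+)")


def rule_overlap(lines, rule, max_score=None, spam_only=False):
    """Count the rules that fire on the same messages as `rule`."""
    counts = Counter()
    for line in lines:
        m = RESULT_RE.search(line)
        if not m:
            continue
        flag, score = m.group(1), int(m.group(2))
        rules = m.group(3).split(",")
        if rule not in rules:
            continue
        if spam_only and flag != "Y":
            continue
        if max_score is not None and score > max_score:
            continue
        counts.update(rules)
    return counts


def print_top(counts, n=15):
    """Print the n most common rules in `uniq -c` style."""
    for name, count in counts.most_common(n):
        print(f"{count:7d} {name}")
```

For the low-scoring listing you would call rule_overlap(open(
"spamd_since_22nd"), "EMAILBL_TEST_LEM", max_score=10, spam_only=True)
and pass the result to print_top; dropping the last two arguments
gives the all-hits variant.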

Phew.  Hopefully those numbers are useful.
