btw guys, note that hit-frequencies can also produce rule-overlap reports using the "-o" switch....
--j.

On Tue, May 26, 2009 at 00:57, Mandy <messaging.director...@gmail.com> wrote:
> On Fri, May 22, 2009 at 9:06 PM, Henrik K <h...@hege.li> wrote:
>> On Fri, May 22, 2009 at 09:28:55PM +0200, Karsten Bräckelmann wrote:
>>>
> The EmailBL test zone period has been extended to July 1st.
>
> [snip]
>
>> Thanks. And this is just a small scale test. If we used more domains, feeds,
>> and submissions, it could be even nicer. ;-) Keep the reports coming in. It
>> would be nice to also know how much of spam are generally from freemails, so
>> FREEMAIL_FROM/BODY/REPLYTO figures would be nice also when reporting. It
>> might differ from user to user.
>
> I just spent some time putting together some stats. I'm going to try
> to follow the excellent lead of Karsten, and provide some overlap
> figures based on the cool grep formula that Dan Mcdonald showed. The
> short version is that it hits about 12% of spam scoring under 15.
>
> The time period is somewhat short: May 22 to May 25. It's a little
> inaccurate too, due to 12 hours of extra mail in the May 22 side
> because I implemented at noon, but...
>
> As I mentioned before, this is from a mid-sized install of Canadian
> government & education users (somewhere around 100 000 mailboxes). SA
> only sees a filtered mail-stream in my setup -- to give an idea how
> filtered, 75% of the mail that SA sees is classified as ham. The
> totals volumes were 192 530 Spam, 564 483 Ham.
>
> 24.5% of the spam that's tagged is between 5 & 10 score.
>     2.76% of that mail hit EMAILBL_TEST_LEM.
>     0.95% hit FREEMAIL_REPLYTO
>
> 22.9% of the spam that's tagged is between 10 and 15.
>     8.97% of that mail hit EMAILBL_TEST_LEM.
>     1.20% hit FREEMAIL_REPLYTO
>
> 52.5% of the spam that's tagged is above 15.
>     21.41% of that mail hit EMAILBL_TEST_LEM.
>     2.36% hit FREEMAIL_REPLYTO
>
> I also saw 0.05% hits of EMAILBL_TEST_LEM on mail classified as ham.
> I hand-verified the 35 messages of 299 that weren't obvious spam.
> About 9 of those were FPs (and those came down to 3 distinct messages
> from lists I sure wouldn't choose to be on). I can provide them
> off-list if desired.
>
> I saw even fewer FREEMAIL_REPLYTO hits on mail classified as ham. 56,
> or 0.01%. About 22 of those (based on subject line -- sorry it's the
> end of the day) look legit.
>
> Here are the overlap numbers for mail with score less than 10:
> $ grep EMAILBL_TEST_LEM spamd_since_22nd |
>     perl -ne 'if (/spamd: result: Y (\d+)/) { print if $1 <= 10 }' |
>     cut -d' ' -f11 | egrep -o '[A-Z0-9_:\.]+?,' |
>     sort | uniq -c | sort -rn | head -n15
>    1304 EMAILBL_TEST_LEM,
>     728 RAZOR2_CHECK,
>     643 RAZOR2_CF_RANGE_51_100,
>     629 RAZOR2_CF_RANGE_E4_51_100,
>     612 BAYES_50,
>     590 FORGED_YAHOO_RCVD,
>     582 BAYES_99,
>     282 HTML_MESSAGE,
>     199 FREEMAIL_FROM,
>     157 ADVANCE_FEE_2,
>     132 FORGED_MUA_OUTLOOK,
>     114 FREEMAIL_REPLYTO,
>     103 RCVD_IN_BRBL,
>      72 SPF_PASS,
>
> And here they are for all hits on EMAILBL_TEST_LEM:
> $ grep EMAILBL_TEST_LEM spamd_since_22nd | cut -d' ' -f11 |
>     egrep -o '[A-Z0-9_:\.]+?,' |
>     sort | uniq -c | sort -rn | head -n15
>   41503 EMAILBL_TEST_LEM,
>   38987 BAYES_99,
>   36782 FORGED_MUA_OUTLOOK,
>   36028 ADVANCE_FEE_2,
>   33746 RCVD_IN_BRBL,
>   33506 JM_SOUGHT_FRAUD_3,
>   33214 JM_SOUGHT_FRAUD_2,
>   33186 HTML_MESSAGE,
>   32281 RCVD_IN_BL_SPAMCOP_NET,
>   31953 JM_SOUGHT_FRAUD_1,
>   31914 RDNS_NONE,
>   31893 RCVD_IN_SBL,
>   31883 MIME_HTML_ONLY,
>
> Phew. Hopefully those numbers are useful.
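
For anyone wanting to rerun the quoted overlap counts with a different rule or score cutoff, here is a rough sketch of the same idea as one shell function. The name `overlap` and its argument order are made up for illustration. It assumes an awk is available and that the log lines look like the ones in the thread (`... spamd: result: Y <score> - RULE_A,RULE_B,... scantime=...`). Unlike the grep/cut pipeline, it looks for the `result:` token rather than relying on field 11, and it also counts the last rule in each list (the one without a trailing comma), so the totals may come out slightly different.

```shell
#!/bin/sh
# overlap RULE MAXSCORE LOGFILE   (hypothetical helper, not part of SpamAssassin)
# Counts which rules fire together with RULE on messages that spamd
# tagged as spam ("result: Y") with an integer score <= MAXSCORE.
# Assumed line layout:
#   ... spamd: result: Y <score> - RULE_A,RULE_B,... scantime=...
overlap() {
    rule=$1 max=$2 log=$3
    awk -v rule="$rule" -v max="$max" '
        {
            # locate the "result:" token instead of hard-coding field 11
            for (i = 1; i <= NF; i++) if ($i == "result:") break
            # skip lines without a result, ham, or scores above the cutoff
            if (i > NF || $(i + 1) != "Y" || $(i + 2) + 0 > max) next
            # the comma-separated rule list sits after the "-" placeholder
            n = split($(i + 4), hits, ",")
            found = 0
            for (j = 1; j <= n; j++) if (hits[j] == rule) found = 1
            if (!found) next
            for (j = 1; j <= n; j++) count[hits[j]]++
        }
        END { for (r in count) print count[r], r }
    ' "$log" | sort -k1,1nr -k2,2 | head -n 15
}
```

Something like `overlap EMAILBL_TEST_LEM 10 spamd_since_22nd` should then correspond to the first listing, give or take the differences noted above.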