On Tue, Feb 2, 2010 at 18:21, Warren Togami <wtog...@redhat.com> wrote:
> On 02/02/2010 12:07 PM, Adam Katz wrote:
>>
>> That is quite different from our masscheck stats. Today's results at
>> http://ruleqa.spamassassin.org/20100201/%2FJM_SOUGHT look like this:
>>
>>    SPAM%    HAM%    S/O   RANK  SCORE  NAME
>>   9.8564  0.0042  1.000   0.94   0.01  T_JM_SOUGHT_3
>>   8.1587  0.0068  0.999   0.93   0.01  T_JM_SOUGHT_2
>>  11.6464  0.0289  0.998   0.89   0.01  T_JM_SOUGHT_1
>>        0       0  0.500   0.48   0.00  JM_SOUGHT_FRAUD_1
>>        0       0  0.500   0.48   0.00  JM_SOUGHT_FRAUD_2
>>        0       0  0.500   0.48   0.00  JM_SOUGHT_FRAUD_3
>>
>
> FWIW the nightly masscheck is often very unbalanced especially on the spam
> side. Sometimes we have only 50k spam, sometimes over 500k spam. Some spam
> corpora contain a disproportionate amount of high scoring spam trap mail. I
> personally randomly filter out a large percentage of high scoring mail in an
> attempt to balance my spam corpus. But ultimately we need more masscheck
> participants to have better results.
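
For anyone reading the quoted table: the S/O column there is consistent with
SPAM% / (SPAM% + HAM%), with 0.500 shown for rules that hit nothing. A minimal
Python sketch under that assumption (the function name s_over_o is made up
here, it is not part of any SpamAssassin tool):

```python
def s_over_o(spam_pct, ham_pct):
    """Share of a rule's hits that are spam, from the SPAM% and HAM% columns.

    Assumes S/O = SPAM% / (SPAM% + HAM%), which matches the quoted figures.
    """
    total = spam_pct + ham_pct
    if total == 0:
        # A rule with no hits at all is listed as 0.500 in the quoted table.
        return 0.5
    return spam_pct / total


# (name, SPAM%, HAM%) rows copied from the quoted masscheck results.
rows = [
    ("T_JM_SOUGHT_3", 9.8564, 0.0042),
    ("T_JM_SOUGHT_2", 8.1587, 0.0068),
    ("T_JM_SOUGHT_1", 11.6464, 0.0289),
    ("JM_SOUGHT_FRAUD_1", 0.0, 0.0),
]

for name, spam_pct, ham_pct in rows:
    print(f"{s_over_o(spam_pct, ham_pct):.3f}  {name}")
```

Because both inputs are percentages of their own corpus, the ratio itself does
not move when the spam corpus is 50k on one night and 500k on another; what
changes is which mail the percentages were measured on.
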

The corpus quality for that masscheck doesn't look too bad though:

http://ruleqa.spamassassin.org/20100201-r905213-n/T_JM_SOUGHT_1/detail?s_corpus=1#corpus

--
--j.