On Thu, Sep 17, 2009 at 02:34:24PM +0200, Mark Martinec wrote:
> Austin,
>
> > > now hope to do this Thursday/Friday. I should be able to scan my
> > > million or so messages in a day on my cluster.
> >
> > Wow, that makes me feel inadequate :) I'm struggling to clean up my
> > little ham sample of 3600 messages, and looking at another couple
> > thousand that I'll do if I've got time...
>
> Thanks, that will be nice to have. As the rulesqa site can distinguish
> results based on a corpus submitter, even a small but carefully checked
> collection is worth having.
>
> I found it valuable to double-check ham samples which fire the rules
> URIBL_JP_SURBL, URIBL_WS_SURBL, URIBL_OB_SURBL,
> RCVD_IN_PBL, RCVD_IN_XBL, RCVD_IN_PSBL, RCVD_IN_SSBL
There's a lot one can do:

- analyze corpora through dspam_train, which spots misfiles quite nicely
  (might also use crm114, haven't tried)
- clamscan hams with sanesecurity etc.
- grep ham/spam.log for rules with S/O >= ~0.98 (most likely includes all
  that Mark said and more)
- grep Subjects from spams and grep for all of those in ham (and vice versa)
- fuzzily hash duplicate mails away, so miscategorized mails have a smaller
  effect on the totals (or does that make good rules seem worse? heh..);
  you can also spot similar mails that appear in both ham and spam for
  double checking

Sadly I don't have a cleanly defined process yet; it's all scripts and
memorized one-liners. Finding FPs in the spam corpus is more important,
but harder..
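The S/O grep above could be scripted roughly like this. It assumes masscheck-style ham.log/spam.log lines where the third whitespace-separated field is the message path and each line carries a tests=RULE1,RULE2,... field; that field layout is an assumption of this sketch, not gospel:

```python
import re
from collections import Counter

TESTS_RE = re.compile(r'tests=([A-Z0-9_,]+)')

def rule_hits(log_lines):
    """Count, per rule, how many log lines (messages) fired it."""
    hits = Counter()
    for line in log_lines:
        m = TESTS_RE.search(line)
        if m:
            hits.update(t for t in m.group(1).split(',') if t)
    return hits

def suspect_ham(ham_lines, spam_lines, threshold=0.98):
    """List (path, rules) for ham messages that fire a rule whose
    S/O = spam_hits / (spam_hits + ham_hits) is >= threshold."""
    ham_hits, spam_hits = rule_hits(ham_lines), rule_hits(spam_lines)
    hot = set()
    for rule in set(ham_hits) | set(spam_hits):
        s, h = spam_hits.get(rule, 0), ham_hits.get(rule, 0)
        if s + h and s / float(s + h) >= threshold:
            hot.add(rule)
    suspects = []
    for line in ham_lines:
        m = TESTS_RE.search(line)
        if not m:
            continue
        fired = hot.intersection(m.group(1).split(','))
        if fired:
            fields = line.split()
            path = fields[2] if len(fields) > 2 else '?'  # assumed field position
            suspects.append((path, sorted(fired)))
    return suspects
```

Anything it prints is just a candidate for manual review, of course; a ham that legitimately fires one high-S/O rule isn't necessarily misfiled.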
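The fuzzy-hash dedup idea might look something like the sketch below. Real fuzzy hashers such as ssdeep do this far better; the letters-only normalization here is just a crude illustrative stand-in that collapses bodies differing only in dates, numbers, spacing, or case:

```python
import hashlib
import re

def fuzzy_key(body):
    """Crude fuzzy hash: lowercase, keep only letters, hash the result,
    so near-identical bodies collapse to one key."""
    norm = re.sub(r'[^a-z]+', '', body.lower())
    return hashlib.sha1(norm.encode('utf-8')).hexdigest()

def dedupe(messages):
    """Keep the first message per fuzzy key; messages is [(path, body), ...]."""
    seen = {}
    for path, body in messages:
        seen.setdefault(fuzzy_key(body), path)
    return sorted(seen.values())

def cross_hits(ham, spam):
    """Fuzzy keys present in both corpora -- candidates for double checking."""
    return {fuzzy_key(b) for _, b in ham} & {fuzzy_key(b) for _, b in spam}
```

Deduping before a masscheck keeps one heavily-duplicated misfile from skewing a rule's totals, and `cross_hits` surfaces the "same mail in both ham and spam" cases mentioned above.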