On Thu, Sep 17, 2009 at 02:34:24PM +0200, Mark Martinec wrote:
> Austin,
>
> > > now hope to do this Thursday/Friday. I should be able to scan my
> > > million or so messages in a day on my cluster.
> >
> > Wow, that makes me feel inadequate :) I'm struggling to clean up my
> > little ham sample of 3600 messages, and looking at another couple
> > thousand that I'll do if I've got time...
>
> Thanks, that will be nice to have. As the rulesqa site can distinguish
> results based on a corpus submitter, even a small but carefully checked
> collection is worth having.
>
> I found it valuable to double-check ham samples which fire the rules
> URIBL_JP_SURBL, URIBL_WS_SURBL, URIBL_OB_SURBL,
> RCVD_IN_PBL, RCVD_IN_XBL, RCVD_IN_PSBL, RCVD_IN_SSBL
There's a lot one can do:

- analyze corpora through dspam_train, which spots misfiles quite nicely
  (might also use crm114, haven't tried)
- clamscan hams with sanesecurity etc.
- grep ham/spam.log for rules with S/O >= ~0.98 (most likely includes all
  that Mark said and more)
- grep Subjects from spams and grep for all of those in ham (and vice versa)
- fuzzily hash duplicate mails away, so miscategorized mails have a smaller
  effect on the totals (or does that make good rules seem worse? heh..);
  you can also spot similar mails that appear in both ham and spam for
  double checking

Sadly I don't have a cleanly defined process yet; it's all scripts and
memorized one-liners. Finding FPs in the spam corpus is more important,
but harder..
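The S/O grep above could be scripted roughly like this. It assumes masscheck-style ham.log/spam.log lines where the third whitespace-separated field is the message path and each line carries a tests=RULE1,RULE2,... field; that field layout is an assumption of this sketch, not gospel:

```python
import re
from collections import Counter

TESTS_RE = re.compile(r'tests=([A-Z0-9_,]+)')

def rule_hits(log_lines):
    """Count, per rule, how many log lines (messages) fired it."""
    hits = Counter()
    for line in log_lines:
        m = TESTS_RE.search(line)
        if m:
            hits.update(t for t in m.group(1).split(',') if t)
    return hits

def suspect_ham(ham_lines, spam_lines, threshold=0.98):
    """List (path, rules) for ham messages that fire a rule whose
    S/O = spam_hits / (spam_hits + ham_hits) is >= threshold."""
    ham_hits, spam_hits = rule_hits(ham_lines), rule_hits(spam_lines)
    hot = set()
    for rule in set(ham_hits) | set(spam_hits):
        s, h = spam_hits.get(rule, 0), ham_hits.get(rule, 0)
        if s + h and s / float(s + h) >= threshold:
            hot.add(rule)
    suspects = []
    for line in ham_lines:
        m = TESTS_RE.search(line)
        if not m:
            continue
        fired = hot.intersection(m.group(1).split(','))
        if fired:
            fields = line.split()
            path = fields[2] if len(fields) > 2 else '?'  # assumed field position
            suspects.append((path, sorted(fired)))
    return suspects
```

Anything it prints is just a candidate for manual review, of course; a ham that legitimately fires one high-S/O rule isn't necessarily misfiled.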
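The fuzzy-hash dedup idea might look something like the sketch below. Real fuzzy hashers such as ssdeep do this far better; the letters-only normalization here is just a crude illustrative stand-in that collapses bodies differing only in dates, numbers, spacing, or case:

```python
import hashlib
import re

def fuzzy_key(body):
    """Crude fuzzy hash: lowercase, keep only letters, hash the result,
    so near-identical bodies collapse to one key."""
    norm = re.sub(r'[^a-z]+', '', body.lower())
    return hashlib.sha1(norm.encode('utf-8')).hexdigest()

def dedupe(messages):
    """Keep the first message per fuzzy key; messages is [(path, body), ...]."""
    seen = {}
    for path, body in messages:
        seen.setdefault(fuzzy_key(body), path)
    return sorted(seen.values())

def cross_hits(ham, spam):
    """Fuzzy keys present in both corpora -- candidates for double checking."""
    return {fuzzy_key(b) for _, b in ham} & {fuzzy_key(b) for _, b in spam}
```

Deduping before a masscheck keeps one heavily-duplicated misfile from skewing a rule's totals, and `cross_hits` surfaces the "same mail in both ham and spam" cases mentioned above.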