Re: [SAtalk] Fun with corpus pollution

Justin Mason Tue, 24 Sep 2002 04:38:47 -0700


James R. Van Zandt said:

> Matt Kettler <[EMAIL PROTECTED]> writes:
> 
> > Ahh, you are deceived by truncation. They actually can match one
> > nonspam and still be 0.000% because the nonspam corpus is > 100k
> > messages :)
> 
> I think that's a bug.  The output precision should be increased.


Probably a good idea for marginal cases.
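
To make the precision point concrete, here is a minimal sketch of how a single hit disappears at three decimal places. The function name and the corpus size of 250,000 are hypothetical, chosen only to stand in for "a nonspam corpus of more than 100k messages"; this is not the actual reporting code. It also uses ordinary format rounding rather than truncation, which gives the same display for this value.

```python
def hit_rate_percent(hits, corpus_size, decimals):
    """Format the share of corpus messages a rule hit, as a percentage string."""
    return f"{100.0 * hits / corpus_size:.{decimals}f}"

# Hypothetical corpus size, assumed only for illustration.
nonspam_corpus_size = 250_000

# One nonspam hit is 0.0004% of the corpus.
three_places = hit_rate_percent(1, nonspam_corpus_size, 3)
five_places = hit_rate_percent(1, nonspam_corpus_size, 5)

print(three_places + "%")  # the single hit is invisible at this precision
print(five_places + "%")   # two more digits make it visible
```

At three decimals, one hit and zero hits print identically, which is exactly the marginal case where more output precision helps.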

> > If a given rule has 1 misplaced nonspam, it will outweigh 99 correctly
> > placed spam mails matching THAT RULE. Note that's not 1% of the
> > corpus, that's 1% of the overall for the rule.
> 
> I suggest cross-validation: Run the GA using half the corpus.  Using
> that set of scores, check the other half of the corpus.  Examine the
> FPs and FNs for misplacement.  Repeat starting with the other half of
> the corpus.  

We do. ;)
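
For readers who have not seen the procedure, the two-fold check described above can be sketched roughly as follows. This is an illustration under loud assumptions, not the SpamAssassin tooling: the GA is replaced by a trivial stand-in (a threshold midway between the mean nonspam and mean spam score), each message is reduced to a hypothetical `(score, is_spam)` pair, and the corpus is split by alternating index.

```python
def fit_threshold(train):
    """Stand-in for the GA: midpoint between mean nonspam and mean spam score."""
    ham = [score for score, is_spam in train if not is_spam]
    spam = [score for score, is_spam in train if is_spam]
    if not ham or not spam:
        raise ValueError("training half needs both spam and nonspam")
    return (sum(ham) / len(ham) + sum(spam) / len(spam)) / 2


def two_fold_suspects(corpus):
    """Train on each half in turn and collect FPs and FNs from the other half."""
    # Alternating split; assumes corpus order is not correlated with the label.
    folds = (corpus[0::2], corpus[1::2])
    suspects = {"fp": [], "fn": []}
    for train, held_out in ((folds[0], folds[1]), (folds[1], folds[0])):
        threshold = fit_threshold(train)
        for score, is_spam in held_out:
            flagged = score >= threshold
            if flagged and not is_spam:
                suspects["fp"].append((score, is_spam))
            elif not flagged and is_spam:
                suspects["fn"].append((score, is_spam))
    return suspects


# Hypothetical corpus: clean nonspam, clean spam, and one message filed
# as nonspam that scores like spam (the "pollution").
corpus = (
    [(i * 0.1, False) for i in range(20)]
    + [(5 + i * 0.1, True) for i in range(20)]
    + [(7.77, False)]
)
print(two_fold_suspects(corpus))
```

The FPs and FNs that come back are the candidates to inspect by hand for misplacement; a message that looks like spam but sits in the nonspam half shows up as an FP when it is held out.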

--j.


