Re: [SAtalk] Fun with corpus pollution

James R. Van Zandt Mon, 23 Sep 2002 16:19:17 -0700


Matt Kettler <[EMAIL PROTECTED]> writes:


> Ahh, you are deceived by truncation. They actually can match one
> nonspam and still be 0.000% because the nonspam corpus is > 100k
> messages :)

I think that's a bug.  The output precision should be increased.
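
To illustrate the truncation effect, here is a minimal sketch. The
corpus size of 250,000 and the helper name hit_rate_percent are
assumptions made for illustration; this is not the actual reporting
code or its output format.

```python
def hit_rate_percent(hits: int, total: int, decimals: int) -> str:
    """Format hits/total as a percentage with a fixed number of decimals."""
    return f"{100.0 * hits / total:.{decimals}f}"


# Hypothetical nonspam corpus size (the thread only says "> 100k").
NONSPAM_TOTAL = 250_000

# One nonspam hit is indistinguishable from zero hits at three decimals...
three_decimals = hit_rate_percent(1, NONSPAM_TOTAL, 3)
# ...but becomes visible once the precision is increased.
five_decimals = hit_rate_percent(1, NONSPAM_TOTAL, 5)

print(three_decimals, five_decimals)
```

Printing the raw hit count next to the percentage would avoid the
ambiguity regardless of corpus size.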

> If a given rule has 1 misplaced nonspam, it will outweigh 99 correctly
> placed spam mails matching THAT RULE. Note that's not 1% of the
> corpus, that's 1% of the overall for the rule.

I suggest cross-validation: Run the GA using half the corpus.  Using
that set of scores, check the other half of the corpus.  Examine the
resulting FPs and FNs for misplaced messages.  Then swap the halves
and repeat.
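
A rough sketch of that two-fold procedure follows. The names
two_fold_review, fit_scores, and is_spam are hypothetical: fit_scores
stands in for the GA run and is_spam for applying the resulting scores
to one message. Neither is a real SpamAssassin interface.

```python
from typing import Callable, List, Sequence, Tuple

Message = str
Labeled = Tuple[Message, bool]  # (message, is_spam) as filed in the corpus


def two_fold_review(
    corpus: Sequence[Labeled],
    fit_scores: Callable[[Sequence[Labeled]], object],
    is_spam: Callable[[object, Message], bool],
) -> Tuple[List[Message], List[Message]]:
    """Return (false_positives, false_negatives) to inspect by hand."""
    # Alternate messages between the halves so both get a similar mix.
    halves = [list(corpus[0::2]), list(corpus[1::2])]
    false_positives: List[Message] = []
    false_negatives: List[Message] = []
    # First pass trains on half 0 and checks half 1; second pass swaps them.
    for train, held_out in (halves, halves[::-1]):
        scores = fit_scores(train)  # hypothetical stand-in for the GA run
        for message, labeled_spam in held_out:
            predicted_spam = is_spam(scores, message)
            if predicted_spam and not labeled_spam:
                false_positives.append(message)  # possibly misfiled spam
            elif labeled_spam and not predicted_spam:
                false_negatives.append(message)  # possibly misfiled nonspam
    return false_positives, false_negatives
```

Because each message is scored only by a model that never saw it, a
misfiled message cannot pull the scores toward its own wrong label, so
it is more likely to surface in one of the two lists.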

         - Jim Van Zandt


_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
