On Wed, Nov 24, 2004 at 01:19:49AM -0500, Matt Kettler wrote:
> Quite frankly, I suspect corpus pollution. It really only takes 1 high
> scoring spam in the nonspam corpus to really screw up the message scores.

That's quite possible. I don't think anyone has 100% non-polluted corpus,
though try we might. :(

> 1) DRUGS_PAIN_OBFU actually hit some nonspam? I find that odd, but it could
> be a typo.

Looking at the submitted results:

dave.log:. /home/dave/corpus/cooked-ham.43366468
jm.log:. /home/jm/Mail/deld.priv/34675
jm.log:. /home/jm/Mail/deld.priv/34682
jm.log:. /home/jm/Mail/deld.priv/34699
jm.log:. /home/jm/Mail/deld.priv/34703
quinlan.log:. /home/corpus/mail/ham/166370
quinlan.log:. /home/corpus/mail/ham/166400
quinlan.log:. /home/corpus/mail/ham/166430
quinlan.log:. /home/corpus/mail/ham/166437

> 2) DRUGS_SMEAR1 hit some nonspam? I find that damn near impossible. I don't
> think any nonspam email other than one quoting spam will ever hit that
> rule. It seems there's one drug spam, or drug spam quote in somebody's
> corpus, and it was run in all 4 sets. (If anyone can show me the nonspam
> matching that rule and it's not spam or a spam quote or discussion of SA's
> rules, I'll send em $20. Really.)

jm.log:. /home/jm/Mail/deld.priv/26352

> 4) NIGERIAN_BODY3? could be a finance newsletter, but very unlikely.

That was mine:

theo.log:Y ham/misc200405-200407.33861588

Unfortunately I took those misc ham mboxes and converted them to dir format
a while ago, so I don't know what message that was.

> 6) PERCENT_RANDOM? Very unlikely. What would have %rnd_x in it?

jm.log:. /home/jm/Mail/deld.pub/12701

--
Randomly Generated Tagline:
Choosy modemers choose GIF.
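
For anyone wanting to repeat the lookup above on their own logs, here is a
minimal sketch in Python. It assumes each log line has the form
"<flag> <score> <path> <comma-separated rules>", where the flag is "Y" or "."
as in the hits listed above. That layout is an assumption on my part (the
output quoted in this message is trimmed to flag and path), and the function
name ham_hits and the sample lines are hypothetical, so adjust the field
positions to whatever your logs actually contain.

```python
def ham_hits(log_lines, rule):
    """Return (flag, path) pairs for log lines whose rule list contains `rule`.

    ASSUMPTION: each line looks like "<flag> <score> <path> <rule,rule,...>".
    Lines that are blank, start with "#", or have fewer than four fields
    are skipped.
    """
    hits = []
    for line in log_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue
        flag, _score, path, rules = fields[:4]
        # Exact match on the rule name, so "DRUGS" does not match "DRUGS_SMEAR1".
        if rule in rules.split(","):
            hits.append((flag, path))
    return hits


# Hypothetical sample lines, made up for illustration only.
sample = [
    "# comment line",
    ". 1 /example/ham/1 DRUGS_SMEAR1,BAYES_00",
    "Y 7 /example/ham/2 NIGERIAN_BODY3,DRUGS_PAIN_OBFU",
    ". 0 /example/ham/3 BAYES_00",
]

for flag, path in ham_hits(sample, "DRUGS_SMEAR1"):
    print(flag, path)
```

Running this over a ham log and eyeballing each path it prints is the quickest
way to tell a real false positive from a spam that slipped into the ham corpus.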