On Wed, Nov 24, 2004 at 01:19:49AM -0500, Matt Kettler wrote:
> Quite frankly, I suspect corpus pollution. It really only takes 1 high 
> scoring spam in the nonspam corpus to really screw up the message scores.

That's quite possible.  I don't think anyone has a 100% non-polluted corpus,
though try we might. :(

> 1) DRUGS_PAIN_OBFU actually hit some nonspam? I find that odd, but it could 
> be a typo.

Looking at the submitted results:

dave.log:. /home/dave/corpus/cooked-ham.43366468
jm.log:. /home/jm/Mail/deld.priv/34675
jm.log:. /home/jm/Mail/deld.priv/34682
jm.log:. /home/jm/Mail/deld.priv/34699
jm.log:. /home/jm/Mail/deld.priv/34703
quinlan.log:. /home/corpus/mail/ham/166370
quinlan.log:. /home/corpus/mail/ham/166400
quinlan.log:. /home/corpus/mail/ham/166430
quinlan.log:. /home/corpus/mail/ham/166437
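
For anyone who wants to repeat this kind of lookup, here is a minimal
Python sketch of pulling the entries that list a given rule out of a set
of result logs. It is not the tool used to produce the list above. The
line layout (flag, score, path, comma-separated rule names), the function
name ham_hits, and the sample data are all assumptions made up for
illustration, so adjust the field split to whatever the real logs contain.

```python
def ham_hits(rule, logs):
    """Return 'logname:flag path' strings for every entry listing `rule`.

    `logs` maps a log name to an iterable of lines. Each line is ASSUMED
    to look like '<flag> <score> <path> <rule1,rule2,...> [extra]'.
    """
    hits = []
    for name, lines in logs.items():
        for line in lines:
            # Skip comments and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            fields = line.split(None, 3)
            if len(fields) < 4:
                continue  # entry with no rule list
            flag, _score, path, rest = fields
            # The rule list is the first token after the path.
            rules = rest.split()[0].split(",")
            if rule in rules:
                hits.append(f"{name}:{flag} {path}")
    return hits


# Hypothetical sample data, not taken from any real submission.
sample = {
    "a.log": [
        "# comment line",
        ". 2 /tmp/ham/1 FOO,DRUGS_PAIN_OBFU time=1",
        "Y 7 /tmp/ham/2 BAR",
    ],
}

for hit in ham_hits("DRUGS_PAIN_OBFU", sample):
    print(hit)
```

Matching on whole rule names after splitting on commas avoids false
hits from rules that merely share a prefix, which a plain substring
search would pick up.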

> 2) DRUGS_SMEAR1 hit some nonspam? I find that damn near impossible. I don't 
> think any nonspam email other than one quoting spam will ever hit that 
> rule. It seems there's one drug spam, or drug spam quote in somebody's 
> corpus, and it was run in all 4 sets. (If anyone can show me the nonspam 
> matching that rule and it's not spam or a spam quote or discussion of SA's 
> rules, I'll send em $20. Really.)

jm.log:. /home/jm/Mail/deld.priv/26352

> 4) NIGERIAN_BODY3? could be a finance newsletter, but very unlikely.

That was mine:

theo.log:Y ham/misc200405-200407.33861588

Unfortunately I took those misc ham mboxes and converted them to dir
format a while ago, so I don't know what message that was.

> 6) PERCENT_RANDOM? Very unlikely. What would have %rnd_x in it?

jm.log:. /home/jm/Mail/deld.pub/12701

-- 
Randomly Generated Tagline:
Choosy modemers choose GIF.
