Michael Bell said:

> Kinda hard to say. Most of it IS spammy and valid MIME as far as I
> could tell. I did catch a few clearly-non-spam (evite) things in the
> corpus.
>
> The lack of Received lines does mess up quite a few DNS related tests
> (RBL, MX records) but I wouldn't think that alone made a 23%
> difference (83% on Justin's sample) in success. Remember - I ran SA
> 2.43 in both cases with -L so most of the stuff relying on that isn't
> relevant.

I haven't looked yet, but:

(a) if they're not well cleaned (i.e. if there is valid nonspam in there), it's going to seriously impact the archive's usefulness.

(b) on the other issue: a lot of SpamAssassin's top tests use header info, even in the -L case, so with the headers removed, a 20% accuracy drop would be about right.

--j.
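
For anyone who wants to check how much of a corpus is actually missing its Received lines before comparing hit rates, here is a rough sketch using only Python's standard email module. The function names (count_received, fraction_without_received) are made up for illustration and have nothing to do with SpamAssassin itself; it only inspects headers and assumes each message is available as a raw string.

```python
from email import message_from_string


def count_received(raw_message: str) -> int:
    """Return the number of Received: headers in one raw message."""
    msg = message_from_string(raw_message)
    # get_all returns the fallback (an empty list here) when the header is absent.
    return len(msg.get_all("Received", []))


def fraction_without_received(raw_messages) -> float:
    """Return the fraction of messages that carry no Received: header."""
    messages = list(raw_messages)
    if not messages:
        return 0.0
    missing = sum(1 for raw in messages if count_received(raw) == 0)
    return missing / len(messages)
```

Running fraction_without_received over each corpus separately would at least show whether the two samples differ in how much header information survived, which is the first thing to rule out before blaming the rules.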