Michael Bell <[EMAIL PROTECTED]> writes:

> Kinda hard to say. Most of it IS spammy and valid MIME as far as I
> could tell. I did catch a few clearly-non-spam (evite) things in the
> corpus.

You caught or SA caught? ;-)

> The lack of Received lines does mess up quite a few DNS related tests
> (RBL, MX records) but I wouldn't think that alone made a 23%
> difference (83% on Justin's sample) in success. Remember - I ran SA
> 2.43 in both cases with -L so most of the stuff relying on that isn't
> relevant.
>
> but you can check it out yourself at
> ftp://ftp.spamarchive.org/archives.

If the messages are not pristine (or very close to pristine), then any
accuracy comparison is almost completely meaningless. There are two
primary reasons why that's the case:

 1. Inability to do any DNS tests. As you noted, testing with local
    tests only would remove that factor, so let's move on to the big
    problem.

 2. There are many local Received: header tests and the GA is tuned to
    run with them working. Without the local Received: tests working,
    the GA is completely mistuned.

    All of these tests use Received: headers:

      FAKED_IP_IN_RCVD
      FORGED_EUDORAMAIL_RCVD
      FORGED_GW05_RCVD
      FORGED_HOTMAIL_RCVD
      FORGED_JUNO_RCVD
      FORGED_MX_HOTMAIL
      FORGED_RCVD_TRAIL
      FORGED_TELESP_RCVD
      FORGED_YAHOO_RCVD
      GENUINE_EBAY_RCVD
      MDAEMON_2_7_4
      POST_IN_RCVD
      RATWARE_EMWAC
      RCVD_BY_QVES_COM
      RCVD_FAKE_HELO_DOTCOM
      RECEIVED_IDENT_SQUID
      ROUND_THE_WORLD
      ROUND_THE_WORLD_LOCAL
      SHORT_RECEIVED_LINE
      SMTPD_IN_RCVD
      T_IDENT_CACHEFLOW
      T_IDENT_NOBODY
      VAR_REF_IN_RECEIVED
      YAHOO_MSGID_ADDED
      __EVITE_RCVD
      __RCVD_BY_HOTMAIL

    And that's not counting the many eval: tests that use Received:
    internally: date difference tests, MTA tests, forged Received:
    header tests, HELO tests, the round the world test, message-id
    timestamp tests, etc.

Think of it this way: How much do you think the average score of spam
drops without those tests? It's not an insignificant amount.

You also need to bear in mind that a significant percentage of our
development effort is aimed at using Received: headers. Removing them
does not exactly level the playing field.

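A minimal sketch of how one might check a corpus for this, assuming
Python and its standard email module (this is not part of SpamAssassin,
and the function names count_received and strip_received are
hypothetical). It counts the Received: headers in a raw message, and
builds a copy with them removed so the same message can be scored both
ways:

```python
import email
from email import policy


def count_received(raw: bytes) -> int:
    """Return how many Received: headers a raw message carries."""
    msg = email.message_from_bytes(raw, policy=policy.compat32)
    return len(msg.get_all("Received", []))


def strip_received(raw: bytes) -> bytes:
    """Return a copy of the raw message with every Received: header removed."""
    msg = email.message_from_bytes(raw, policy=policy.compat32)
    # Deleting by header name removes all occurrences; it is a no-op
    # when the header is absent.
    del msg["Received"]
    return msg.as_bytes()
```

A message for which count_received() returns 0 has already lost its
trace headers, so none of the Received:-based tests listed above can
fire on it.
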
If you want to see how badly removing all of the Received: headers
affects SA, just remove them from Justin's corpus and then see how well
SA does. I bet the results will then be comparable to this other
"corpus".

This, of course, goes back to the near necessity of everyone developing
their own ham and spam corpus. Well, at least that's what I'm trying to
drum into everyone's head.

 - Any munging whatsoever leads to less-representative messages. For
   example, a number of Justin's ham and spam messages have been munged
   to use "example.com". Unfortunately, this triggers NO_MX_FOR_FROM
   because there aren't MX records for the example.com domain!

 - Published corpuses are much more likely to end up in DNS blacklist
   databases, distributed message checksum databases (such as Razor),
   etc.

 - You need to test on real email received by real people, not
   spamtraps and public mailing lists. Non-real email used in
   development leads to less-optimal GA solutions for real email.

Dan

-- 
Daniel Quinlan                     Linux, open source, and
http://www.pathname.com/~quinlan/  anti-spam consulting