Re: [SAtalk] spamarchive.org corpuses have quite low success rates with SA

Daniel Quinlan Sat, 30 Nov 2002 22:48:07 -0800

Michael Bell <[EMAIL PROTECTED]> writes:

> Kinda hard to say. Most of it IS spammy and valid MIME as far as I
> could tell. I did catch a few clearly-non-spam (evite) things in the
> corpus.


You caught or SA caught?  ;-)
 
> The lack of Received lines does mess up quite a few DNS related tests
> (RBL, MX records) but I wouldn't think that alone made a 23%
> difference (83% on Justin's sample) in success. Remember - I ran SA
> 2.43 in both cases with -L so most of the stuff relying on that isn't
> relevant.
> 
> but you can check it out yourself at
> ftp://ftp.spamarchive.org/archives.

If the messages are not pristine (or very close to pristine), then any
accuracy comparison is almost completely meaningless.  There are two
primary reasons why that's the case:

1. Inability to do any DNS tests.

   As you noted, testing with local tests only would remove that factor,
   so let's move on to the big problem.

2. There are many local Received: header tests and the GA is tuned to
   run with them working.  Without the local Received: tests working,
   the GA is completely mistuned.

   All of these tests use Received: headers:

     FAKED_IP_IN_RCVD FORGED_EUDORAMAIL_RCVD FORGED_GW05_RCVD
     FORGED_HOTMAIL_RCVD FORGED_JUNO_RCVD FORGED_MX_HOTMAIL
     FORGED_RCVD_TRAIL FORGED_TELESP_RCVD FORGED_YAHOO_RCVD
     GENUINE_EBAY_RCVD MDAEMON_2_7_4 POST_IN_RCVD RATWARE_EMWAC
     RCVD_BY_QVES_COM RCVD_FAKE_HELO_DOTCOM RECEIVED_IDENT_SQUID
     ROUND_THE_WORLD ROUND_THE_WORLD_LOCAL SHORT_RECEIVED_LINE
     SMTPD_IN_RCVD T_IDENT_CACHEFLOW T_IDENT_NOBODY VAR_REF_IN_RECEIVED
     YAHOO_MSGID_ADDED __EVITE_RCVD __RCVD_BY_HOTMAIL

   And that's not counting the many eval: tests that use Received:
   internally: date difference tests, MTA tests, forged Received: header
   tests, HELO tests, the round the world test, message-id timestamp
   tests, etc.

   Think of it this way: How much do you think the average score of spam
   drops without those tests?  It's not an insignificant amount.

   You also need to bear in mind that a significant percentage of our
   development effort is aimed at using Received: headers.  Removing them
   does not exactly level the playing field.

If you want to see how badly removing all of the Received: headers
affects SA, just remove them from Justin's corpus and then see how well
SA does.  I bet the results will then be comparable to this other
"corpus".
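
The experiment above could be set up with something like the following
rough sketch.  It assumes a corpus stored as one raw message per file;
the function names strip_received and strip_corpus are made up for
illustration and are not part of SA.  Run SA with -L over both the
original and the stripped copy and compare the hit rates.

```python
import email
from email import policy
from pathlib import Path


def strip_received(raw: bytes) -> bytes:
    """Return the message with every Received: header removed."""
    # compat32 keeps the remaining headers as close to the original as possible.
    msg = email.message_from_bytes(raw, policy=policy.compat32)
    # Deleting a header by name removes all occurrences, folded lines included.
    del msg["Received"]
    return msg.as_bytes()


def strip_corpus(src_dir: str, dst_dir: str) -> int:
    """Write a Received:-free copy of each file in src_dir into dst_dir.

    Assumes one message per file (not an mbox). Returns the number of
    messages written.
    """
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(Path(src_dir).iterdir()):
        if path.is_file():
            (dst / path.name).write_bytes(strip_received(path.read_bytes()))
            count += 1
    return count
```

The copy goes to a separate directory so that the pristine corpus stays
untouched.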

This, of course, goes back to the near necessity of everyone developing
their own ham and spam corpus.  Well, at least that's what I'm trying to
drum into everyone's head.

  - Any munging whatsoever leads to less-representative messages.

    For example, a number of Justin's ham and spam messages have been
    munged to use "example.com".  Unfortunately, this triggers
    NO_MX_FOR_FROM because there aren't MX records for the example.com
    domain!

  - Published corpuses are much more likely to end up in DNS blacklist
    databases, distributed message checksum databases (such as Razor),
    etc.

  - You need to test on real email received by real people, not
    spamtraps and public mailing lists.  Non-real email used in
    development leads to less-optimal GA solutions for real email.
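
As a rough illustration of the points above, a corpus can be screened
for obvious signs of munging before it is used for any accuracy
comparison.  This is only a sketch: the function name munging_signs and
the set of placeholder domains are assumptions made for the example, and
it catches just the two problems mentioned in this thread (missing
Received: headers and a From address rewritten to a placeholder domain).

```python
import email
from email import policy
from email.utils import parseaddr

# Assumed list of placeholder domains; extend it to match your corpus.
PLACEHOLDER_DOMAINS = {"example.com", "example.org", "example.net"}


def munging_signs(raw: bytes) -> list:
    """Return human-readable reasons why a message looks munged."""
    msg = email.message_from_bytes(raw, policy=policy.compat32)
    signs = []

    # Mail that was really delivered carries at least one Received: header.
    if not msg.get_all("Received"):
        signs.append("no Received: headers")

    # A From domain rewritten to a placeholder has no real DNS records.
    _, addr = parseaddr(str(msg.get("From", "")))
    domain = addr.rpartition("@")[2].lower()
    if domain in PLACEHOLDER_DOMAINS:
        signs.append("placeholder From domain: " + domain)

    return signs
```

An empty list does not prove a message is pristine; it only means that
neither of these two checks fired.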

Dan

-- 
Daniel Quinlan                      Linux, open source, and
http://www.pathname.com/~quinlan/    anti-spam consulting

