[SAtalk] Re: [SAdev] fully-public corpus of mail available

Justin Mason Thu, 10 Oct 2002 05:45:18 -0700


(trimmed cc list)


Daniel Quinlan said:

> 1. These messages could end up being falsely (or incorrectly) reported
>    to Razor, DCC, Pyzor, etc.  Certain RBLs too.  I don't think the
>    results for these distributed tests can be trusted in any way,
>    shape, or form when running over a public corpus.

I'll note that in the README.

> 2. These messages could also be submitted (more than once) to projects
>    like SpamAssassin that rely on filtering results submission for GA
>    tuning and development.
> The second problem could be alleviated somewhat by adding a Nilsimsa
> signature (or similar) to the mass-check file (the results format used
> by SpamAssassin) and giving the message files unique names (MD5 or
> SHA-1 of each file).

OK; maybe rewriting the Message-Ids will help here; that should allow
us to pick them out.  I'll do that.
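
As a rough illustration of the two ideas above (naming each file by a
digest of its contents, and rewriting the Message-Id so corpus mail can be
recognised later), here is a minimal Python sketch.  It is not the script
used for the corpus; the marker domain `corpus.invalid` and both function
names are assumptions made up for this example.

```python
import hashlib
from email import message_from_bytes
from email.policy import compat32

# Hypothetical marker domain; any reserved, recognisable name would do.
CORPUS_DOMAIN = "corpus.invalid"


def digest_name(raw: bytes) -> str:
    """Return the SHA-1 hex digest of the raw message, usable as a file name."""
    return hashlib.sha1(raw).hexdigest()


def rewrite_message_id(raw: bytes) -> bytes:
    """Return the message with its Message-Id replaced by a corpus marker.

    The new id embeds the SHA-1 of the ORIGINAL bytes, so the same source
    message always maps to the same rewritten id.
    """
    msg = message_from_bytes(raw, policy=compat32)
    new_id = "<%s@%s>" % (digest_name(raw), CORPUS_DOMAIN)
    if "Message-Id" in msg:
        # Header lookup is case-insensitive; order and field name are kept.
        msg.replace_header("Message-Id", new_id)
    else:
        msg["Message-Id"] = new_id
    return msg.as_bytes()
```

A results collator could then drop or flag any submission whose
Message-Id ends in the marker domain, and duplicate files would collide on
name.  Note that an exact digest only catches byte-identical copies; a
fuzzy signature such as Nilsimsa would be needed for near-duplicates.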

> 3. Spammers could adopt elements of the good messages to throw off
>    filters.  And, of course, there's always progression in technology
>    (by both spammers and non-spammers).
> The third problem doesn't really worry me.

Nah, me neither.

> These problems (and perhaps others I have not identified) are unique
> to spam filtering.  Compression corpuses and other performance-related
> corpuses have their own set of problems, of course.
> 
> In other words, I don't think there's any replacement for having
> multiple independent corpuses.  Finding better ways to distribute
> testing and collate results seems like a more viable long-term solution
> (and I'm glad we're working on exactly that for SpamAssassin).  If
> you're going to seriously work on filter development, building a corpus
> of 10000-50000 messages (half spam/half non-spam) is not really that
> much work.  If you don't get enough spam, creating multi-technique
> spamtraps (web, usenet, replying to spam) is pretty easy.  And who
> doesn't get thousands of non-spam every week?  ;-)

Yep.  The primary reason I released this was to provide a good, big
corpus for academic testing of filter systems; it allows results to
be compared between filters using a known corpus.

For SpamAssassin development, everyone has to maintain their own corpus.

--j.


