(trimmed cc list)
Daniel Quinlan said:

> 1. These messages could end up being falsely (or incorrectly) reported
>    to Razor, DCC, Pyzor, etc. Certain RBLs too. I don't think the
>    results for these distributed tests can be trusted in any way,
>    shape, or form when running over a public corpus.

I'll note that in the README.

> 2. These messages could also be submitted (more than once) to projects
>    like SpamAssassin that rely on filtering results submission for GA
>    tuning and development.
>
> The second problem could be alleviated somewhat by adding a Nilsimsa
> signature (or similar) to the mass-check file (the results format used
> by SpamAssassin) and giving the message files unique names (MD5 or
> SHA-1 of each file).

OK; maybe rewriting the message-ids will help here, that should allow
us to pick them out. I'll do that.

> 3. Spammers could adopt elements of the good messages to throw off
>    filters. And, of course, there's always progression in technology
>    (by both spammers and non-spammers).
>
> The third problem doesn't really worry me.

nah, me neither.

> These problems (and perhaps others I have not identified) are unique
> to spam filtering. Compression corpuses and other performance-related
> corpuses have their own set of problems, of course.
>
> In other words, I don't think there's any replacement for having
> multiple independent corpuses. Finding better ways to distribute
> testing and collate results seems like a more viable long-term solution
> (and I'm glad we're working on exactly that for SpamAssassin). If
> you're going to seriously work on filter development, building a corpus
> of 10000-50000 messages (half spam/half non-spam) is not really that
> much work. If you don't get enough spam, creating multi-technique
> spamtraps (web, usenet, replying to spam) is pretty easy. And who
> doesn't get thousands of non-spam every week? ;-)

Yep.
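As a rough illustration of the unique-naming suggestion quoted above, here is a minimal sketch that copies each message file to a name derived from the SHA-1 of its raw bytes. The function names and the flat source/destination directory layout are assumptions made for this example; none of it is taken from SpamAssassin or the mass-check tooling. It only catches byte-identical duplicates; a fuzzy signature such as Nilsimsa would be needed to spot near-duplicates and is not implemented here.

```python
import hashlib
import shutil
from pathlib import Path


def sha1_name(data: bytes) -> str:
    """Return the hex SHA-1 digest of the raw message bytes, used as a file name."""
    return hashlib.sha1(data).hexdigest()


def copy_with_unique_names(src_dir, dst_dir):
    """Copy every file in src_dir to dst_dir under its SHA-1 digest.

    Returns a dict mapping each digest to the list of original file names
    that produced it, so exact duplicates show up as lists longer than one.
    (Hypothetical helper for illustration only.)
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    seen = {}
    for path in sorted(src_dir.iterdir()):
        if not path.is_file():
            continue
        name = sha1_name(path.read_bytes())
        seen.setdefault(name, []).append(path.name)
        target = dst_dir / name
        # Identical content maps to the same name, so copy it only once.
        if not target.exists():
            shutil.copyfile(path, target)
    return seen
```

Any digest whose list has more than one entry marks messages that are exact copies of each other, which is the case a results collector would want to count only once.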
The primary reason I released this was to provide a good, big corpus
for academic testing of filter systems; it allows results to be compared
between filters using a known corpus. For SpamAssassin development,
everyone has to maintain their own corpus.

--j.

_______________________________________________
Spamassassin-talk mailing list
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk