> (Please feel free to forward this message to other possibly-interested > parties.)
Some caveats (in decending order of concern): 1. These messages could end up being falsely (or incorrectly) reported to Razor, DCC, Pyzor, etc. Certain RBLs too. I don't think the results for these distributed tests can be trusted in any way, shape, or form when running over a public corpus. 2. These messages could also be submitted (more than once) to projects like SpamAssassin that rely on filtering results submission for GA tuning and development. 3. Spammers could adopt elements of the good messages to throw off filters. And, of course, there's always progression in technology (by both spammers and non-spammers). The second problem could be alleviated somewhat by adding a Nilsimsa signature (or similar) to the mass-check file (the results format used by SpamAssassin) and giving the message files unique names (MD5 or SHA-1 of each file). The third problem doesn't really worry me. These problems (and perhaps others I have not identified) are unique to spam filtering. Compression corpuses and other performance-related corpuses have their own set of problems, of course. In other words, I don't think there's any replacement for having multiple independent corpuses. Finding better ways to distribute testing and collate results seems like a more viable long-term solution (and I'm glad we're working on exactly that for SpamAssassin). If you're going to seriously work on filter development, building a corpus of 10000-50000 messages (half spam/half non-spam) is not really that much work. If you don't get enough spam, creating multi-technique spamtraps (web, usenet, replying to spam) is pretty easy. And who doesn't get thousands of non-spam every week? ;-) Dan ------------------------------------------------------- This sf.net email is sponsored by:ThinkGeek Welcome to geek heaven. http://thinkgeek.com/sf _______________________________________________ Spamassassin-talk mailing list [EMAIL PROTECTED] https://lists.sourceforge.net/lists/listinfo/spamassassin-talk