(Please feel free to forward this message to other possibly-interested parties.)
Hi all,

One of the big problems in working with spam classification is finding good mail to test with. There are few public corpora available; Ion Androutsopoulos' "Ling-spam" corpus is one (hi Ion!), but unfortunately it does not contain all of the mail message data, so it would not be useful to a SpamAssassin-style system (which relies heavily on header data), for example.

Another effect of not having a common, shared corpus is the difficulty of comparing accuracy rates between spam filters; since everyone tests using different corpora, statistics measured on one corpus do not carry over to another.

Building public corpora is difficult, as it typically involves saving your own (classified) mail. This brings privacy problems, as the people who sent you that mail may not wish to see it made public.

But what the heck, that's what I've done anyway ;)

Here's a public corpus I've assembled from my own corpora, removing messages which were not public in the first place. Please feel free to download it and use it for spam-filter development. It's quite small, but should be big enough for use as a reference corpus, at least, so that hit-rate statistics can be compared across tools. Hope it helps.

It lives here:

  http://spamassassin.org/publiccorpus/

and here's the README.txt:

Welcome to the SpamAssassin public mail corpus. This is a selection of mail messages, suitable for use in testing spam filtering systems. Pertinent points:

  - All headers are reproduced in full. Some address obfuscation has taken place; hostnames in some cases have been replaced with "example.com", which should have a valid MX record (if I recall correctly). In most cases though, the headers appear as they were received.

  - All of these messages were posted to public fora, were sent to me in the knowledge that they may be made public, were sent by me, or originated as newsletters from public news web sites.

  - Copyright for the text in the messages remains with the original senders.
OK, now onto the corpus description. It's split into three parts, as follows:

  - spam: 500 spam messages, all received from non-spam-trap sources.

  - easy_ham: 350 non-spam messages. These are typically quite easy to differentiate from spam, since they frequently do not contain any spammish signatures (like HTML etc).

  - hard_ham: 250 non-spam messages which are closer in many respects to typical spam: use of HTML, unusual HTML markup, coloured text, "spammish-sounding" phrases etc.

The corpora are prefixed with "200210", because that's the date when I assembled them, so it's as good a version string as anything else ;) . They are compressed using "bzip2".

This corpus lives at http://spamassassin.org/publiccorpus/ . Mail jm - public - corpus AT jmason dot org if you have questions, or to donate mail.

(Oct 9 2002 jm)
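
For anyone who wants to report comparable hit-rate figures against the three parts described above, here is a minimal sketch of how that could be done. It assumes the archives have already been unpacked into one directory per part (named "spam", "easy_ham" and "hard_ham" under a common root) with one message per file; that layout, the function names and the toy classifier are illustrative assumptions, not something the README specifies.

```python
import email
from email import policy
from pathlib import Path


def load_corpus(root):
    """Return {part_name: [EmailMessage, ...]} for each subdirectory of root.

    Assumes one raw message per file; the directory names are whatever
    the archives were unpacked as (assumed: spam, easy_ham, hard_ham).
    """
    corpus = {}
    for part in sorted(Path(root).iterdir()):
        if not part.is_dir():
            continue
        messages = []
        for path in sorted(part.iterdir()):
            if path.is_file():
                # Parse as bytes: real-world mail is not reliably valid UTF-8.
                with path.open("rb") as fh:
                    messages.append(
                        email.message_from_binary_file(fh, policy=policy.default)
                    )
        corpus[part.name] = messages
    return corpus


def score(classifier, corpus):
    """Hit rate on the spam part, false-positive rate on every other part.

    classifier is any callable taking a message and returning True for spam.
    """
    spam = corpus.get("spam", [])
    ham = [m for name, msgs in corpus.items() if name != "spam" for m in msgs]
    hits = sum(1 for m in spam if classifier(m))
    false_positives = sum(1 for m in ham if classifier(m))
    return {
        "hit_rate": hits / len(spam) if spam else 0.0,
        "false_positive_rate": false_positives / len(ham) if ham else 0.0,
    }


def toy_classifier(msg):
    """Hypothetical stand-in for a real filter: flags 'free' in the Subject."""
    return "free" in str(msg.get("Subject", "")).lower()
```

Calling score(toy_classifier, load_corpus("path/to/unpacked/corpus")) returns the two rates as fractions; swapping in a real filter for toy_classifier is the only change needed to compare tools on the same mail.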