I've definitely got the volume required. I'm currently selecting a random sample of 1% of my site's inbound & outbound mail on both ham & spam sides, and will still be reviewing the corpus for several (many?) hours today to make sure it's clean. I'll see how soon I can get all of the pieces in place and fire you a link to the files. Collecting all of the samples together seems to be taking me quite a bit longer than I thought (of course). Don't hold things up on my account, but I'm hoping to have some results to share by the deadline.
I've had the wiki page open since Justin sent the initial request, but hadn't gotten around to the soul crushing work of reviewing thousands of messages yet...

On Wed, Sep 16, 2009 at 11:43 AM, Warren Togami <wtog...@redhat.com> wrote:
> On 09/16/2009 01:01 PM, Austin wrote:
>>
>> Would it be worth contributing data from a brand-new corpus of mail
>> from the last few days? That's the best I can do presently.
>>
>> I have plenty of dreams of creating a good, hand verified, corpus of
>> mail from the last several months, but the development work keeps
>> getting bumped...
>>
>
> Do you have > 1000+ ham, human verified to contain no spam? If so I suppose
> it is worthwhile.
>
> http://wiki.apache.org/spamassassin/RescoreDetails
> If you follow these instructions and put your logs somewhere I can grab them
> (preferably via HTTP) I can upload your logs for this one-time rescoring
> masscheck.
>
> http://wiki.apache.org/spamassassin/NightlyMassCheck
> If you want to participate in nightly masscheck you should request your own
> account.
>
> Warren Togami
> wtog...@redhat.com
>