On 12 Dec 2003, [EMAIL PROTECTED] moaned:
> On Thu, 11 Dec 2003 09:10:29 -0500, Adam Denenberg <[EMAIL PROTECTED]>
> posted to spamassassin-talk:
> > What i want to start is a Bayes Corpus Project. I would like to be
> > able to allow people to submit confirmed ham and/or spam to a large
> > bayes corpus repository (or maybe just spam) where people could then
> > download (or somehow do an sa-learn remotely) to an ongoing updated
> > bayes corpus.
>
> There are various efforts to collect representative email corpora for
> spam testing but none of them are very successful IMHO.
>
> The main problem, as others already pointed out, is to get a hold of
> good, representative ham email. Privacy issues and everything
> notwithstanding, I think it would be beneficial to collect
> +something+, on a regular basis, to test against.

Nah, what's really needed is a tool that merges Bayes DBs together.
That way someone could learn from a pile of ham and hand the resulting
DB to other people to merge into their own databases. That should be a
lot less confidential than the raw emails, because the ordering of the
tokens has been lost :)

The only problem then would be that some of the spammy tokens (the
header ones in particular) might never hit at any other site: but in
that case, expiry will zap them soon enough.

(If you're paranoid, you could make sure that you don't have
confidential single tokens in there: bank account numbers and
important passwords, i.e., non-Mailman ones.)

-- 
`...some suburbanite DSL customer who thinks kernel patches are some
 form of military insignia.' --- Bob Apthorpe
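
To make the merge idea above a bit more concrete, here is a minimal
sketch in Python. It assumes a Bayes DB can be reduced to two message
totals plus a per-token (spam count, ham count) pair. The BayesDB
class and the merge() and scrub() helpers are hypothetical names made
up for illustration; this is not SpamAssassin's on-disk format or any
existing tool.

```python
from dataclasses import dataclass, field


@dataclass
class BayesDB:
    """Hypothetical in-memory token database (illustration only)."""
    nspam: int = 0   # number of spam messages learned
    nham: int = 0    # number of ham messages learned
    # token -> (times seen in spam, times seen in ham)
    tokens: dict = field(default_factory=dict)


def merge(local: BayesDB, donated: BayesDB) -> BayesDB:
    """Return a new DB whose totals and per-token counts are the sums
    of both inputs. Neither input is modified."""
    merged = BayesDB(
        nspam=local.nspam + donated.nspam,
        nham=local.nham + donated.nham,
        tokens=dict(local.tokens),
    )
    for tok, (spam, ham) in donated.tokens.items():
        old_spam, old_ham = merged.tokens.get(tok, (0, 0))
        merged.tokens[tok] = (old_spam + spam, old_ham + ham)
    return merged


def scrub(db: BayesDB, confidential: set) -> BayesDB:
    """Drop tokens the donor does not want to hand out
    (account numbers, passwords, ...) before sharing the DB."""
    kept = {t: c for t, c in db.tokens.items() if t not in confidential}
    return BayesDB(nspam=db.nspam, nham=db.nham, tokens=kept)


if __name__ == "__main__":
    mine = BayesDB(nspam=10, nham=5,
                   tokens={"viagra": (8, 0), "meeting": (0, 4)})
    # A ham-only DB donated by someone else, scrubbed before sharing.
    theirs = scrub(
        BayesDB(nham=20, tokens={"meeting": (0, 12),
                                 "invoice": (0, 9),
                                 "hunter2": (0, 1)}),
        confidential={"hunter2"},
    )
    print(merge(mine, theirs))
```

Summing counts keeps the merged DB consistent with what you would get
by learning both piles of mail on one machine, as long as no message
was learned at both sites. Expiry is not modelled here.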