Well, as regular readers will know, Justin Mason, our fearless leader, has fearlessly buggered off and left me in charge, but we forgot to think through some of the details of how I'd continue to run the GA to update the scores which SA depends on. I think I have a possible solution to this problem:
I have set up a server on my machine to accept nonspam.log output from mass-check, in order to build a non-spam corpus (or at least a list of the rules which match against individuals' corpora). The idea is that each of a number of trusted individuals runs mass-check against their own non-spam mail archive on a regular-ish basis, then submits the results via rsync to my server, which acts as a collection point. I can then merge the various nonspam.log files into an uber-nonspam.log, which I can feed to the GA along with the spam corpus (for which collection systems already exist).

We particularly need the output of mass-check over the following:

* Generic non-spam email (whatever's in your mailboxes)
* False-positive prone email (stuff like jm's crackmice archive, or mailing list stuff)
* Non-techie (business) non-spam email (stuff with attached Word docs, PowerPoint, lots of tables with dollar signs in them, etc)
* Non-techie (personal) non-spam email (email between your grandma and your dad, for example)
* Foreign spam (French, English but distributed overseas, etc)
* Foreign non-spam (any language, etc)

The point of the middle two is that the people I'm sending this to probably get a disproportionately small amount of those types of email. The point of the last two is to possibly weight different rules differently for different locales (think 60_scores_es.cf or 60_scores_fr.cf).

Preferably there should be a high degree of certainty that the mail being scanned is in fact either all "clean" or all "dirty".

Note that I don't need the actual emails at all, so there isn't really much of a privacy-sensitivity issue here; all we'd be doing is running mail through a script which matches it against all the SA rules and records which ones get triggered.

If you're interested in contributing to this worthy task, please hit "reply" and I'll give more details on exactly how to submit mass-check output to the system.
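For the curious, the merge step described above (combining everyone's nonspam.log into one uber-nonspam.log) could look roughly like the sketch below. It is written in Python purely for illustration and is not the script I actually run. It assumes each submitted log is plain text with one message per line, and that lines starting with "#" are comments to be dropped. The function name merge_logs and the directory layout in the usage comment are made up for this example.

```python
from pathlib import Path


def merge_logs(log_paths, merged_path):
    """Concatenate several per-contributor logs into one merged log.

    Assumptions (not taken from mass-check itself): logs are line-oriented
    text, blank lines carry no data, and lines starting with '#' are comments.
    Returns the number of lines written to the merged file.
    """
    kept = 0
    with open(merged_path, "w", encoding="utf-8") as out:
        for path in log_paths:
            # errors="replace" keeps one oddly encoded log from aborting the merge
            with open(path, encoding="utf-8", errors="replace") as src:
                for line in src:
                    line = line.rstrip("\n")
                    if not line.strip() or line.startswith("#"):
                        continue  # skip blank lines and assumed comment lines
                    out.write(line + "\n")
                    kept += 1
    return kept


# Hypothetical usage, assuming each contributor's upload lands in its own
# subdirectory of "incoming/":
#   logs = sorted(Path("incoming").glob("*/nonspam.log"))
#   merge_logs(logs, "uber-nonspam.log")
```

The sketch deliberately treats each log line as opaque, so it does not depend on the exact fields mass-check writes.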
Don't bother clogging up the list with responses; I think it's probably better done on a 1-on-1 basis than broadcasting to the world.

Thanks,
C

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk