[SAtalk] Collecting non-spam data for GA

Craig Hughes Mon, 04 Feb 2002 20:15:40 -0800

Well, as regular readers will know, Justin Mason, our fearless leader,
has fearlessly buggered off and left me in charge, but we forgot to
think through some of the details of how I'd continue to run the GA to
update the scores which SA depends on.  I think I have a possible
solution to this problem:



I have set up a server on my machine to accept nonspam.log output from
mass-check, in order to build a non-spam corpus (or at least a list of
the rules which match against individuals' corpora).  The idea is that
each of a number of trusted individuals runs mass-check against their
own non-spam mail archive on a regular-ish basis, then submits the
results via rsync to my server, which acts as a collection point.  I can
then merge the various nonspam.log files into one uber-nonspam.log and
feed that to the GA along with the spam corpus (for which collection
systems already exist).
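To make the merge step concrete, here is a minimal sketch of how several
submitted logs might be combined into one file.  It is an illustration
under assumptions, not the script actually used: the function name
merge_logs is hypothetical, each log line is treated as opaque text (no
particular mass-check line format is assumed), and dropping blank lines
and exact duplicate lines (e.g. from a resubmitted log) is my own choice.

```python
from pathlib import Path
from typing import Iterable, List


def merge_logs(paths: Iterable[Path]) -> List[str]:
    """Combine several log files into one list of lines.

    Assumption: lines are opaque text. Blank lines and exact duplicates
    are skipped; the first occurrence of each line keeps its position.
    """
    seen = set()
    merged: List[str] = []
    for path in paths:
        # errors="replace" keeps going if a log contains odd bytes
        with open(path, encoding="utf-8", errors="replace") as handle:
            for raw in handle:
                line = raw.rstrip("\n")
                if not line.strip() or line in seen:
                    continue
                seen.add(line)
                merged.append(line)
    return merged


def write_merged(paths: Iterable[Path], destination: Path) -> int:
    """Write the merged lines to destination and return how many were written."""
    lines = merge_logs(paths)
    destination.write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return len(lines)
```

With each contributor's log rsynced into its own directory, something
like write_merged(sorted(Path("incoming").glob("*/nonspam.log")),
Path("uber-nonspam.log")) would produce the combined file; the
"incoming" directory layout is likewise only an assumption.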


We particularly need the output of mass-check over the following:

* Generic non-spam email (whatever's in your mailboxes)
* False-positive prone email (stuff like jm's crackmice archive, or
mailing list stuff)
* Non-techie (business) non-spam email (stuff with attached Word docs,
PowerPoint, lots of tables with dollar signs in them, etc.)
* Non-techie (personal) non-spam email (email between your grandma and
your dad for example)
* Foreign spam (French, English but distributed overseas, etc.)
* Foreign non-spam (any language, etc.)

The point of the middle two is that we probably get a disproportionately
small amount of those types of email amongst the people I'm sending this
to.

The point of the last two is to make it possible to weight rules
differently for different locales (think 60_scores_es.cf or
60_scores_fr.cf).

Preferably there should be a high degree of certainty that the mail
being scanned is in fact either all "clean" or all "dirty".

Note that I don't need the actual emails at all -- there isn't really
much of a privacy-sensitivity issue here -- all we'd be doing is running
mail through a script which matches it against all the SA rules and
records which ones get triggered.
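
As a rough illustration of why only rule names leave your machine, here
is a toy sketch of "match a message against rules and record which ones
fire".  It is not how mass-check is implemented: the two rules and the
function name triggered_rules are made up for this example and are not
real SA rules.

```python
import re
from typing import Dict, List, Pattern

# Hypothetical rules for illustration only; these are not real SA rules.
RULES: Dict[str, Pattern[str]] = {
    "EXAMPLE_FREE_OFFER": re.compile(r"free offer", re.IGNORECASE),
    "EXAMPLE_ALL_CAPS_SUBJECT": re.compile(r"^Subject: [A-Z !]+$", re.MULTILINE),
}


def triggered_rules(message: str) -> List[str]:
    """Return the sorted names of the rules that match the message text.

    Only rule names are returned; none of the message content is kept.
    """
    return sorted(name for name, pattern in RULES.items() if pattern.search(message))
```

The result for any message is just a list of rule names, which is the
only kind of information a submitted log would need to carry.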

If you're interested in contributing to this worthy task, please hit
"reply" and I'll give more details on exactly how to submit mass-check
output to the system.  Don't bother clogging up the list with responses;
I think it's probably better done on a 1-on-1 basis than by broadcasting
to the world.

Thanks,

C


_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
