On Thu, 29 Oct 2009 15:07:08 -0400
Adam Katz <antis...@khopis.com> wrote:

> Since we have corpus data, we should be able to restrict it to the
> body and a few select headers (presumably done for the GA regarding Bayes
> rule scoring anyway?) and present it to the community in some way...
> 
> Of course, this might present (single-word?) privacy issues for the
> corpus providers and might not be as useful as one might like, but it
> might also be more useful than starting from zero.

I think if you are going to do that then it would be sensible to massage
the ham/spam counts to keep them small, but representative. The
database needs to have a rough representation of the general spamminess
of tokens, but it also needs to adapt quickly to the actual local
ratios.
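
As a rough illustration of that kind of massaging, here is a minimal
Python sketch. It assumes the token table has already been loaded into
a dict of token -> (spam_count, ham_count); the name scale_counts and
the target of 500 messages are made up for the example and are not
anything SpamAssassin provides.

```python
from typing import Dict, Tuple

Counts = Dict[str, Tuple[int, int]]  # token -> (spam_count, ham_count)


def scale_counts(tokens: Counts, nspam: int, nham: int,
                 target: int = 500) -> Tuple[Counts, int, int]:
    """Shrink all counts so the larger message total is about `target`.

    Hypothetical helper: every count is multiplied by the same factor,
    so per-token spam/ham ratios are roughly preserved.
    """
    factor = target / max(nspam, nham, 1)
    if factor >= 1.0:
        # Already small enough; return an unchanged copy.
        return dict(tokens), nspam, nham

    scaled: Counts = {}
    for tok, (spam, ham) in tokens.items():
        s, h = round(spam * factor), round(ham * factor)
        if s or h:
            # Keep only tokens that still carry some evidence.
            scaled[tok] = (s, h)
    return scaled, round(nspam * factor), round(nham * factor)
```

Tokens whose counts both round to zero are dropped, so rare tokens
disappear while common ones keep roughly the same ratio, and locally
learned mail can outweigh the seed data fairly quickly.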

Another thing is that token atimes should be faked, spreading them
evenly over the last few weeks. If you don't do that then, depending
on which headers you strip, either they'll all be the same or they will
be historic. SpamAssassin's auto-expiry algorithm doesn't work well with
delta-function distributions or tokens over 256 days old.
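
A minimal sketch of the atime faking, again in Python. The function
name spread_atimes and the default of three weeks are assumptions for
the example; it only generates evenly spaced Unix timestamps and leaves
assigning them to tokens up to whatever tool rewrites the database.

```python
import time
from typing import List, Optional


def spread_atimes(n_tokens: int, weeks: int = 3,
                  now: Optional[int] = None) -> List[int]:
    """Return n_tokens Unix timestamps spread evenly over the last
    `weeks` weeks, newest first (hypothetical helper)."""
    now = int(time.time()) if now is None else now
    if n_tokens <= 1:
        return [now] * max(n_tokens, 0)

    span = weeks * 7 * 86400           # window length in seconds
    step = span / (n_tokens - 1)       # gap between adjacent atimes
    return [now - round(i * step) for i in range(n_tokens)]
```

Assigning these to tokens in random order would avoid tying a token's
apparent age to its position in the dump.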

 
> There are a few somewhat older "Bayes Starter DB" files served by
> http://www.fsl.com/index.php/resources  (Fort Systems Ltd. makes a
> product based on MailScanner, which uses SpamAssassin).  I don't know
> what they do to that database to ensure it's clear of received headers
> and other muddying data or if it's even at all worthwhile.

Odd way to distribute it: not only haven't they bothered to use
sa-learn --backup, but they've included the seen and journal files.

As regards privacy, don't forget that the database contains truncated
SHA1 hashes, not strings. You can't read tokens back out of it; you
would need to already know what you are looking for.
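
To illustrate, checking for a known word amounts to hashing it and
looking the result up. This Python sketch assumes the stored key is the
last 5 bytes of the SHA1 digest of the token, which should be verified
against the SpamAssassin source for the version in use; token_key and
might_contain are hypothetical names.

```python
import hashlib
from typing import Iterable


def token_key(token: str) -> str:
    """Hex form of the truncated hash (assumed: last 5 bytes of SHA1)."""
    return hashlib.sha1(token.encode("utf-8")).digest()[-5:].hex()


def might_contain(db_keys: Iterable[str], word: str) -> bool:
    """True if the word's truncated hash appears among the hex keys.

    A hit is only probable evidence, since truncated hashes can collide.
    """
    return token_key(word) in set(db_keys)
```

Going the other way, from a stored key to the original string, means
guessing candidate tokens and hashing each one.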
