If people do assemble non-English corpora, I think it'd be swell to roll them into the distribution (or at least the score files generated from them). The framework for it is even already there, thanks to the work done on internationalizing the test descriptions.
C
On Wed, 2002-01-30 at 18:59, Duncan Findlay wrote:
On Thu, Jan 31, 2002 at 09:46:15AM +0700, Olivier Nicole wrote:
> Greg,
>
> > You don't run SpamAssassin's genetic algorithm -- I gather that only
> > Justin Mason, the prime developer, does that currently. He has a big
> > huge pile ("the corpus") of mail, spam and non-spam, that is used to
> > feed the GA and generate the scores in everyone's
> > /usr/share/spamassassin/*.cf files.
> >
> > Clever, eh? I'm sure it would be possible for everyone to have their
> > own corpus of mail, and if Justin released the GA code (or has he
> > already?) then we could all run the GA ourselves and come up with our
> > own score sets. But why bother?
>

One other problem is that the GA currently (IIRC) doesn't process the messages, just the tests hit. Of course, the tests are now different from those of 2 versions ago, which messes up the GA.

Furthermore, everyone has a different idea of what spam is. Is commercial e-mail that was sent by a company that legitimately has your e-mail address spam?

I imagine that the size of the corpus is not as important as the variety of messages, its currentness, and the accuracy of its filing.

--
Duncan Findlay

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk