The code is available.  There are actually two different GAs in the /masses directory: one, evolve.cxx, is Justin's, based on a library called galib; the other, craig-evolve.cxx, is mine, based on pgapack so it can make use of multiple CPUs (and even multiple nodes, if your computational desires swing that way).  My fitness function is also somewhat different from Justin's, and I've been tinkering with deviations from the standard mutation/crossover operators, sort of combining elements of both at once (with some regression stuff too), to make things converge faster while still doing a reasonably good job of not getting stuck in local minima.

Anyway, both programs are really quite readable if you're familiar with genetic algorithms at all, particularly Justin's (which is the one currently used for generating the distribution's scores).  With Justin off on extended leave for the next 6 months or so, though, odds are pretty good that I'll be switching to craig-evolve very shortly, since in my experience it runs considerably faster and produces better results.
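For anyone who hasn't looked at a GA before, here is a rough sketch of the general idea in Python.  It is not taken from evolve.cxx or craig-evolve.cxx: the toy corpus, the threshold of 5.0, the plain misclassification count used as the fitness function, and all the names are assumptions made up for illustration.  Each candidate is a list of per-rule scores, and the GA uses crossover and mutation to search for scores that separate spam from non-spam.

```python
import random

# Toy corpus, invented for illustration: (is_spam, indices of rules that hit).
CORPUS = [
    (True, {0, 1}),
    (True, {1, 2}),
    (True, {0, 2}),
    (True, {2}),
    (False, {3}),
    (False, set()),
    (False, {0}),
    (False, {1, 3}),
]
NUM_RULES = 4
THRESHOLD = 5.0  # assumed spam threshold for this sketch


def misclassified(scores):
    """Fitness (lower is better): number of messages on the wrong side."""
    errors = 0
    for is_spam, hits in CORPUS:
        total = sum(scores[i] for i in hits)
        if (total >= THRESHOLD) != is_spam:
            errors += 1
    return errors


def crossover(a, b, rng):
    """Uniform crossover: take each rule's score from either parent."""
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]


def mutate(scores, rng, rate=0.25, sigma=1.0):
    """Nudge each score with probability `rate` by Gaussian noise."""
    return [s + rng.gauss(0, sigma) if rng.random() < rate else s
            for s in scores]


def evolve(generations=200, pop_size=30, seed=0):
    """Keep the better half each generation and refill it with children."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-2, 6) for _ in range(NUM_RULES)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=misclassified)
        survivors = pop[:pop_size // 2]
        children = [
            mutate(crossover(rng.choice(survivors),
                             rng.choice(survivors), rng), rng)
            for _ in range(pop_size - len(survivors))
        ]
        pop = survivors + children
    return min(pop, key=misclassified)


if __name__ == "__main__":
    best = evolve()
    print("scores:", [round(s, 2) for s in best])
    print("errors:", misclassified(best))
```

Because the better half of the population always survives, the best candidate never gets worse from one generation to the next.  A real fitness function would likely be more elaborate than a bare error count (for instance, treating false positives differently from false negatives), but the sketch makes no claim about how either program actually does that.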

Now, having just said that, I've realized that one thing Justin didn't give me access to (I don't think) is the corpus before it's been passed through mass-check!  Hopefully you're still there, Justin, and we can figure something out.

C

On Wed, 2002-01-30 at 18:33, Greg Ward wrote:
On 31 January 2002, Olivier Nicole said:
> I wonder if/how I should/could update the weights (scores) that are
> given by the genetic algorithm.
> 
> I know little about GA, but I think I remember (from some 12 or 15
> years ago) that it needed quite big samples.
> 
> So I believe I should keep all incoming messages, mark them as spam or
> not spam and run the GA on them.

You don't run SpamAssassin's genetic algorithm -- I gather that only
Justin Mason, the prime developer, does that currently.  He has a big
huge pile ("the corpus") of mail, spam and non-spam, that is used to
feed the GA and generate the scores in everyone's
/usr/share/spamassassin/*.cf files.

Clever, eh?  I'm sure it would be possible for everyone to have their
own corpus of mail, and if Justin released the GA code (or has he
already?)  then we could all run the GA ourselves and come up with our
own score sets.  But why bother?

        Greg
-- 
Greg Ward - software developer                [EMAIL PROTECTED]
MEMS Exchange                            http://www.mems-exchange.org

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk

