By stating that GAs need large samples, I assume you are referring to
the population size.  For years, people have been arguing whether a
large or small population is better.  The best theory we have on the
inner workings of a GA, the Building Block Hypothesis, grew out of the
work of John Holland, who is widely regarded as the father of GAs.

In my experience, a population size of approximately 50-200 chromosomes
is sufficient.  I'm not sure how the author of SpamAssassin uses the GA
to assign values, but I suppose his fitness function is related to how
well SpamAssassin tags an e-mail as spam (i.e. +1 for a correctly tagged
message, and -1 for an incorrectly tagged one).  Each chromosome in the
GA probably consists of the entire ruleset, with a different value
assigned to each rule (the genes).  The GA would then take these
chromosomes (rulesets) and apply crossover and mutation, generation
after generation, until it found a candidate with a high fitness score.
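
To make that guess concrete, here is a rough sketch in Python.  It is
not SpamAssassin's actual code: the function names, the threshold of
5.0, the mutation settings and the "keep the best half" selection are
all my own assumptions, picked only to illustrate the idea.  A message
is represented as the list of rule indices it triggered plus a
spam/not-spam label.

```python
import random

def fitness(weights, corpus, threshold=5.0):
    """+1 for every correctly tagged message, -1 for every mistake.
    The threshold value is an assumption made for this sketch."""
    score = 0
    for hits, is_spam in corpus:
        total = sum(weights[i] for i in hits)   # sum the scores of the rules that fired
        tagged_spam = total >= threshold
        score += 1 if tagged_spam == is_spam else -1
    return score

def crossover(a, b, rng):
    """One-point crossover: head of parent a, tail of parent b.
    Needs at least two genes (rules) per chromosome."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]

def mutate(chrom, rng, rate=0.1, step=1.0):
    """Nudge each gene with probability `rate` by Gaussian noise."""
    return [w + rng.gauss(0, step) if rng.random() < rate else w
            for w in chrom]

def evolve(corpus, n_rules, pop_size=50, generations=100, seed=0):
    """Each chromosome is one complete ruleset: a list of n_rules scores."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-5, 5) for _ in range(n_rules)]
           for _ in range(pop_size)]
    for _ in range(generations):
        # Rank by fitness and keep the better half unchanged (elitism).
        pop.sort(key=lambda c: fitness(c, corpus), reverse=True)
        elite = pop[:pop_size // 2]
        # Refill the population with mutated offspring of the elite.
        children = [mutate(crossover(rng.choice(elite), rng.choice(elite), rng), rng)
                    for _ in range(pop_size - len(elite))]
        pop = elite + children
    return max(pop, key=lambda c: fitness(c, corpus))

# Toy hand-labelled corpus: (indices of rules that fired, is_spam)
corpus = [([0], True), ([1], False), ([0, 2], True), ([2], False)]
best = evolve(corpus, n_rules=3)
print(best, fitness(best, corpus))
```

A real run would use thousands of hand-labelled messages instead of
four, and would probably penalise a false positive more heavily than a
missed spam.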

One way of (possibly) increasing the accuracy of SpamAssassin would be
to give it many possible rules to work with, and let the GA sort out
which ones are good and which are bad (kind of like what it does now).
I would assume that the more rules you have, the better the accuracy
you will obtain, as the GA has more rules to work with.  This should
work up to a point.  There is probably some threshold beyond which
having too many rules would decrease the efficiency of the GA, because
the GA would spend most of its time sorting through the ruleset to
determine which rules are worthy and which are not.  I'm sure this is
a gross oversimplification of what actually occurs... =)

Gene Ruebsamen

On Wed, 2002-01-30 at 17:47, Olivier Nicole wrote:
> Hello,
> 
> I wonder if/how I should/could update the ponderations that are given
> by the genetic algorithm.
> 
> I know little about GA, but I think I remember (some 12 or 15 years
> ago) that it needed quite big samples.
> 
> So I believe I should keep all incoming messages, mark them as spam or
> not spam and run GA on it.
> 
> How big should the sample be? (not the bigger the better, I know if I
> accumulate samples for a life long, I should be able to find
> parameters that are 100% correct for my case, but then it will be the
> end of my life so i won't receive email anymore :).
> 
> The reason I am asking is because it seems that I get false positives
> just because the way I write English (OK I am not native speaker) and
> that is annoying.
> 
> Thanks
> 
> Olivier
> 
> _______________________________________________
> Spamassassin-talk mailing list
> [EMAIL PROTECTED]
> https://lists.sourceforge.net/lists/listinfo/spamassassin-talk


