By stating that GAs need large samples, I assume you are referring to the population size. People have argued for years over whether a large or a small population is better. The best theory we have on the inner workings of a GA, the Building Block Theory, comes from John Holland, who is the father of GAs.
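As a rough illustration of where the population size enters, here is a minimal Python sketch of a GA that tunes per-rule scores. It is not SpamAssassin's actual code: the function names (fitness, crossover, mutate, tournament, evolve), the threshold of 5.0, the toy corpus, and the choice of tournament selection with elitism are all assumptions made for the example. Each chromosome is a list with one score per rule, and the fitness is +1 for every correctly tagged message and -1 for every incorrectly tagged one.

```python
import random

THRESHOLD = 5.0  # assumed spam threshold, chosen only for illustration


def fitness(scores, corpus):
    """+1 per correctly tagged message, -1 per incorrectly tagged one."""
    total = 0
    for hits, is_spam in corpus:
        tagged_spam = sum(scores[i] for i in hits) >= THRESHOLD
        total += 1 if tagged_spam == is_spam else -1
    return total


def crossover(a, b, rng):
    """Single-point crossover: head of one parent, tail of the other."""
    point = rng.randrange(1, len(a))  # needs at least two rules
    return a[:point] + b[point:]


def mutate(scores, rng, rate=0.1, sigma=1.0):
    """With probability `rate`, nudge a gene by Gaussian noise."""
    return [s + rng.gauss(0, sigma) if rng.random() < rate else s
            for s in scores]


def tournament(population, corpus, rng, k=3):
    """Return the fittest of k randomly chosen chromosomes."""
    return max(rng.sample(population, k), key=lambda c: fitness(c, corpus))


def evolve(corpus, n_rules, pop_size=50, generations=40, seed=0):
    rng = random.Random(seed)
    # Each chromosome is a whole ruleset: one candidate score per rule.
    population = [[rng.uniform(-5, 5) for _ in range(n_rules)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        best = max(population, key=lambda c: fitness(c, corpus))
        children = [best]  # elitism: the best chromosome always survives
        while len(children) < pop_size:
            child = crossover(tournament(population, corpus, rng),
                              tournament(population, corpus, rng), rng)
            children.append(mutate(child, rng))
        population = children
    return max(population, key=lambda c: fitness(c, corpus))


# Toy hand-labelled corpus: (indices of the rules that fired, is_spam).
CORPUS = [({0, 1}, True), ({0}, True), ({2}, False),
          ({1, 2}, False), (set(), False)]

best = evolve(CORPUS, n_rules=3)
print(fitness(best, CORPUS), "out of", len(CORPUS))
```

The pop_size argument is the population size in question; the generation count and mutation settings are arbitrary starting values, not recommendations.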
In my experience, a population size of approximately 50-200 is sufficient.

I'm not sure how the author of SpamAssassin uses the GA to assign values, but I suppose his fitness function is related to how well SpamAssassin tags an e-mail as spam (i.e. +1 for a correctly tagged spam and -1 for an incorrectly tagged message). Each chromosome in the GA probably consists of the entire ruleset, with a different value assigned to each rule (the genes). The GA would then take these chromosomes (rulesets) and apply crossover and mutation until it found a good candidate with a high fitness score.

One way of (possibly) increasing the accuracy of SpamAssassin would be to give it many possible rules to work with and let the GA sort out which ones are good and which are bad (kind of like what it does now). I would assume that the more rules you have, the better the accuracy you will obtain, as the GA has more rules to work with. This should work up to a point. There probably exists some threshold beyond which having too many rules would decrease the efficiency of the GA, because it would spend most of its time sorting through the ruleset to determine which rules are worthy and which are not.

I'm sure this is a gross oversimplification of what actually occurs... =)

Gene Ruebsamen

On Wed, 2002-01-30 at 17:47, Olivier Nicole wrote:
> Hello,
>
> I wonder if/how I should/could update the ponderations that are given
> by the genetic algorithm.
>
> I know little about GA, but I think I remember (some 12 or 15 years
> ago) that it needed quite big samples.
>
> So I believe I should keep all incoming messages, mark them as spam or
> not spam and run GA on it.
>
> How big should the sample be? (not the bigger the better, I know if I
> accumulate samples for a life long, I should be able to find
> parameters that are 100% correct for my case, but then it will be the
> end of my life so i won't receive email anymore :).
>
> The reason I am asking is because it seems that I get false positives
> just because the way I write English (OK I am not native speaker) and
> that is annoying.
>
> Thanks
>
> Olivier
>
> _______________________________________________
> Spamassassin-talk mailing list
> [EMAIL PROTECTED]
> https://lists.sourceforge.net/lists/listinfo/spamassassin-talk