No, I think by large samples he means it needs a lot of examples of what is and what is not spam. In fact, the corpus we feed to the GA includes some 75,000+ spam messages and 46,000+ non-spam messages, weighted towards messages which tend to yield false positives (mailing lists, etc). The summary of the results of running these messages through the GA is listed at the top of every version of the scores file, e.g.:

# SUMMARY:            19 / 2932
#
# Correctly non-spam:  45324   99.96%  (40.40% overall)
# Correctly spam:      63916   95.61%  (56.97% overall)
# False positives:        19    0.04%  (0.02% overall, 10740 adjusted)
# False negatives:      2932    4.39%  (2.61% overall, 75 adjusted)
# TOTAL:              112191  100.00%

FYI, justin's evolve.cxx uses a population size of 100 by default, and I generally use a population of ~100 as well, though frequently ~30 works pretty well for my algorithm.

AFAIK no one has yet done any work on trying to evolve a set of rules; all we're doing is evolving optimal scores for human-determined rules. In practice this works pretty darned well, because humans can be quite good at creating regular expressions which match the kinds of natural language strings observed in spam. Humans are also reasonably good at making those rules likely to cover new spam which has not yet been seen, by generalizing from things actually observed -- there would be a strong danger, if the GA evolved its own rules, that it would severely overfit.

C

On Wed, 2002-01-30 at 18:31, Gene Ruebsamen wrote:

By stating that GAs need large samples, I assume you are referring to the population size. For years, people have been arguing whether a large or small population is better. The best theory we have on the inner workings of a GA, the Building Block Theory, is by John Holland, who is the father of GAs. In my experience, a population size of approximately 50-200 is sufficient.

I'm not sure how the author of SpamAssassin uses the GA to assign values, but I suppose his fitness function is related to how well SpamAssassin tags an e-mail as spam (i.e. +1 for a correctly tagged spam, and -1 for an incorrectly tagged message). Each chromosome in the GA probably consists of the entire ruleset, with different values assigned to each rule (the genes). The GA would then take these different chromosomes (rulesets) and, using crossover and mutation, run until it found a good candidate with a high fitness score.
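
As a rough illustration of the scheme Gene supposes above (each chromosome is one full set of rule scores, fitness is +1 per correctly tagged message and -1 per mistake, and crossover plus mutation produce the next generation), here is a minimal Python sketch. It is not justin's evolve.cxx: the toy corpus format, the 5.0 threshold, the keep-the-better-half selection, and all function names are assumptions made for the example.

```python
import random

def fitness(scores, corpus, threshold=5.0):
    """+1 for each correctly classified message, -1 for each mistake.

    corpus is a list of (hits, is_spam) pairs, where hits is the set of
    rule indices that fired on the message (a hypothetical format).
    """
    total = 0
    for hits, is_spam in corpus:
        tagged = sum(scores[i] for i in hits) >= threshold
        total += 1 if tagged == is_spam else -1
    return total

def crossover(a, b, rng):
    """Single-point crossover: head of one parent, tail of the other."""
    point = rng.randrange(1, len(a))  # needs at least two rules
    return a[:point] + b[point:]

def mutate(scores, rng, rate=0.1, sigma=1.0):
    """Perturb each score with probability `rate` by Gaussian noise."""
    return [s + rng.gauss(0, sigma) if rng.random() < rate else s
            for s in scores]

def evolve(corpus, n_rules, pop_size=100, generations=50, seed=0):
    """Evolve one score per rule; the rules themselves stay fixed."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-1.0, 5.0) for _ in range(n_rules)]
           for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the better half unchanged, refill with mutated offspring.
        pop.sort(key=lambda c: fitness(c, corpus), reverse=True)
        parents = pop[: pop_size // 2]
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        pop = parents + children
    return max(pop, key=lambda c: fitness(c, corpus))
```

Given a toy corpus such as [({0, 1}, True), ({2}, False)], calling evolve(corpus, n_rules=3) returns one score per rule. Because the better half of each generation is carried over unchanged, the best fitness never decreases from one generation to the next; a real run would also need held-out messages to check for the overfitting mentioned above.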

One way of (possibly) increasing the accuracy of SpamAssassin would be to give it many possible rules to work with, and let the GA sort out which ones are good and bad (kind of like what it does now). I would assume the more rules you have, the better accuracy you will obtain, as the GA has more rules to work with. This should work up to a point. There probably exists some threshold where having too many rules would decrease the efficiency of the GA, because the GA would spend most of its time sorting through the ruleset to determine which rules are worthy and which ones are not.

I'm sure this is a gross oversimplification of what actually occurs... =)

Gene Ruebsamen

On Wed, 2002-01-30 at 17:47, Olivier Nicole wrote:
> Hello,
>
> I wonder if/how I should/could update the ponderations that are given
> by the genetic algorithm.
>
> I know little about GA, but I think I remember (some 12 or 15 years
> ago) that it needed quite big samples.
>
> So I believe I should keep all incoming messages, mark them as spam or
> not spam and run GA on it.
>
> How big should the sample be? (not the bigger the better, I know if I
> accumulate samples for a life long, I should be able to find
> parameters that are 100% correct for my case, but then it will be the
> end of my life so I won't receive email anymore :).
>
> The reason I am asking is because it seems that I get false positives
> just because the way I write English (OK I am not native speaker) and
> that is annoying.
>
> Thanks
>
> Olivier
>
> _______________________________________________
> Spamassassin-talk mailing list
> [EMAIL PROTECTED]
> https://lists.sourceforge.net/lists/listinfo/spamassassin-talk