No, I think by large samples he means it needs a lot of examples of what
is and what is not spam.  In fact, the corpus we feed to the GA includes
some 75,000+ spam messages and 46,000+ non-spam messages, weighted
towards messages which tend to yield false-positives (mailing lists,
etc.).  A summary of the results on these messages is listed at the top
of every version of the scores file, e.g.:

# SUMMARY:                19 /   2932
#
# Correctly non-spam:  45324  99.96%  (40.40% overall)
# Correctly spam:      63916  95.61%  (56.97% overall)
# False positives:        19  0.04%  (0.02% overall,  10740 adjusted)
# False negatives:      2932  4.39%  (2.61% overall,     75 adjusted)
# TOTAL:              112191  100.00%
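
For anyone wondering how the percentages in that summary relate to the raw counts, here is a small Python sketch that recomputes them. The variable and function names are my own, and the "adjusted" figures are left out because how they are derived is not shown here.

```python
# Counts copied from the SUMMARY block above.
correct_nonspam = 45324
correct_spam = 63916
false_positives = 19    # non-spam wrongly tagged as spam
false_negatives = 2932  # spam that slipped through


def pct(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


total_nonspam = correct_nonspam + false_positives
total_spam = correct_spam + false_negatives
total = total_nonspam + total_spam

# Each row: (label, count, size of its own class).
rows = [
    ("Correctly non-spam", correct_nonspam, total_nonspam),
    ("Correctly spam", correct_spam, total_spam),
    ("False positives", false_positives, total_nonspam),
    ("False negatives", false_negatives, total_spam),
]

for label, count, class_total in rows:
    print(f"# {label + ':':<20} {count:>6}  "
          f"{pct(count, class_total):.2f}%  "
          f"({pct(count, total):.2f}% overall)")
print(f"# {'TOTAL:':<20} {total:>6}  100.00%")
```

The first percentage on each line is relative to the message's own class (all non-spam or all spam); the "overall" percentage is relative to the whole corpus.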

FYI, Justin's evolve.cxx uses a population size of 100 by default, and I
generally use a population of ~100 as well, though frequently ~30 works
pretty well for my algorithm.

AFAIK no one has yet done any work on trying to evolve a set of rules;
all we're doing is evolving optimal scores for human-determined rules.
In practice this works pretty darned well, because humans can be quite
good at creating regular expressions which match the kinds of natural
language strings observed in spam.  Humans are also reasonably good at
making those rules likely to cover new spam which has not yet been seen,
by generalizing from things actually observed.  If the GA evolved its own
rules, there would be a strong danger that it would severely overfit.
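
To make "evolving scores for fixed rules" concrete, here is a toy Python sketch. It is not evolve.cxx and not my algorithm: the corpus, the rule count, the threshold of 5.0, the truncation selection, and the +1/-1 fitness (borrowed from Gene's guess quoted below) are all assumptions picked for illustration.

```python
import random

THRESHOLD = 5.0  # assumed: a message scoring at or above this is tagged spam
POP_SIZE = 100   # population size mentioned above
N_RULES = 4      # hypothetical number of hand-written rules

# Made-up corpus: (which rules hit the message, is it really spam?).
CORPUS = [
    ((1, 1, 0, 0), True),
    ((1, 0, 1, 0), True),
    ((0, 1, 1, 0), True),
    ((0, 0, 0, 1), False),
    ((0, 0, 1, 1), False),
    ((0, 0, 0, 0), False),
]


def classify(scores, hits):
    """Sum the scores of the rules that hit and compare to the threshold."""
    return sum(s for s, h in zip(scores, hits) if h) >= THRESHOLD


def fitness(scores):
    """+1 per correctly tagged message, -1 per incorrectly tagged one."""
    return sum(1 if classify(scores, hits) == is_spam else -1
               for hits, is_spam in CORPUS)


def crossover(a, b, rng):
    """Single-point crossover of two score vectors."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def mutate(scores, rng, rate=0.1, sigma=1.0):
    """Nudge each score by Gaussian noise with probability `rate`."""
    return [s + rng.gauss(0, sigma) if rng.random() < rate else s
            for s in scores]


def evolve(generations=50, seed=0):
    rng = random.Random(seed)
    pop = [[rng.uniform(-5, 5) for _ in range(N_RULES)]
           for _ in range(POP_SIZE)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:POP_SIZE // 2]  # keep the fitter half unchanged
        children = [mutate(crossover(rng.choice(parents),
                                     rng.choice(parents), rng), rng)
                    for _ in range(POP_SIZE - len(parents))]
        pop = parents + children
    return max(pop, key=fitness)


best = evolve()
print(fitness(best), [round(s, 2) for s in best])
```

Note that the rules themselves never change; only the per-rule scores do. A real fitness function would presumably punish false positives much harder than false negatives rather than treating them equally as this toy does.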

C

On Wed, 2002-01-30 at 18:31, Gene Ruebsamen wrote:
    By stating that GAs need large samples, I assume you are referring to
    the population size.  For years, people have been arguing whether a
    large or small population is better.  The best theory we have on the
    inner workings of a GA, the Building Block Theory, is by John Holland,
    who is the father of GAs.
    
    In my experience, a population size of approximately 50-200 is
    sufficient.  I'm not sure how the author of SpamAssassin uses the GA to
    assign values, but I suppose his fitness function is related to how well
    SpamAssassin tags an e-mail as spam (i.e. +1 for a correctly tagged spam,
    and -1 for an incorrectly tagged message).  Each chromosome in the GA
    probably consists of the entire ruleset, with different values assigned
    to each rule (the genes).  Then the GA would take these different
    chromosomes (rulesets) and, using crossover and mutation, would run until
    it found a good candidate with a high fitness score.
    
    One way of (possibly) increasing the accuracy of SpamAssassin would be
    to give it many possible rules to work with, and let the GA sort out
    which ones are good and bad (kind of like what it does now).  I would
    assume the more rules you have, the better accuracy you will obtain, as
    the GA has more rules to work with.  This should work up to a point.
    There probably exists some threshold where having too many rules would
    decrease the efficiency of the GA, because the GA would spend most of
    its time sorting through the ruleset to determine which rules are
    worthy and which ones are not.  I'm sure this is a gross
    oversimplification of what actually occurs... =)
    
    Gene Ruebsamen
    
    On Wed, 2002-01-30 at 17:47, Olivier Nicole wrote:
    > Hello,
    > 
    > I wonder if/how I should/could update the weights that are given
    > by the genetic algorithm.
    > 
    > I know little about GA, but I think I remember (some 12 or 15 years
    > ago) that it needed quite big samples.
    > 
    > So I believe I should keep all incoming messages, mark them as spam or
    > not spam and run the GA on them.
    > 
    > How big should the sample be? (Not "the bigger the better": I know
    > that if I accumulate samples for a lifetime, I should be able to find
    > parameters that are 100% correct for my case, but then it will be the
    > end of my life so I won't receive email anymore :).
    > 
    > The reason I am asking is that it seems I get false positives
    > just because of the way I write English (OK, I am not a native speaker)
    > and that is annoying.
    > 
    > Thanks
    > 
    > Olivier
    > 
    > _______________________________________________
    > Spamassassin-talk mailing list
    > [EMAIL PROTECTED]
    > https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
    
    
    
