Re: [SAtalk] Updating ponderations given by the GA

2002-01-31 Thread Justin Mason
> Now having just said that, I've realized that one thing Justin didn't > give me access to (I don't think) is the corpus before it's been passed > through mass-check! Hopefully you're still there Justin, and we can > figure something out there. Craig -- still here, ish -- on dialup and webmail

Re: [SAtalk] Updating ponderations given by the GA

2002-01-30 Thread Craig Hughes
It's hard/impossible to *optimize* the scores by hand.  You can still set them by hand using intuition and do "well enough", particularly if you're only setting/adjusting a couple rules and you guide your scoring based off the existing scores.  I wouldn't want to go through and hand-score all 3

Re: [SAtalk] Updating ponderations given by the GA

2002-01-30 Thread Craig Hughes
On Wed, 2002-01-30 at 19:29, Olivier Nicole wrote: Now I may be wrong, but how can new tests be introduced if they are not accounted for by the GA to get some weight? You can create new rules in your own config files, and assign them any score you like.  You can get a sense of what kind o

Re: [SAtalk] Updating ponderations given by the GA

2002-01-30 Thread Craig Hughes
These are all good points, but I think the fact that many people find SpamAssassin useful, because it identifies spam with a low false-positive rate, shows that most of the time most people consider the same things to be spam.  There are definitely some weird sample points in the corpus, but by a

Re: [SAtalk] Updating ponderations given by the GA

2002-01-30 Thread Craig Hughes
You could either run the GA yourself (which requires building up a corpus of spam and non-spam to feed it, not sure how large it would need to be), or, more easily, you can just tweak scores on individual rules (this is very simple).  For example, you could just create a file called /path/to/spam
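
A minimal sketch of why adjusting a single rule is simple: SpamAssassin adds up the scores of the rules a message hits and compares the total with a threshold. The rule names, the score values, and the threshold of 5.0 below are illustrative assumptions, not values taken from the distributed score file.

```python
# Hypothetical default scores; real values come from SpamAssassin's score files.
DEFAULT_SCORES = {"DEAR_SOMETHING": 2.6, "ALL_CAPS_SUBJECT": 1.9, "NO_REAL_NAME": 1.0}

def total_score(tests_hit, overrides=None):
    """Sum the scores of the rules a message hit; local overrides win."""
    scores = {**DEFAULT_SCORES, **(overrides or {})}
    return sum(scores.get(name, 0.0) for name in tests_hit)

def is_spam(tests_hit, overrides=None, required=5.0):
    """A message is flagged when its total reaches the required threshold."""
    return total_score(tests_hit, overrides) >= required

hits = ["DEAR_SOMETHING", "ALL_CAPS_SUBJECT", "NO_REAL_NAME"]
print(is_spam(hits))                            # evaluated with the default scores
print(is_spam(hits, {"DEAR_SOMETHING": 0.5}))   # same message after lowering one rule
```

Lowering one score in a local override is enough to move a borderline message back under the threshold, which is what hand-tweaking amounts to.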

Re: [SAtalk] Updating ponderations given by the GA

2002-01-30 Thread Craig Hughes
No, I think by large samples he means it needs a lot of examples of what is and what is not spam. In fact, the corpus we feed to the GA includes some 75,000+ spam messages and 46,000+ non-spam messages, weighted towards messages which tend to yield false-positives (mailing lists, etc). The summa
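
To illustrate what a labelled corpus is used for, here is a small sketch that counts false positives and false negatives for one set of scores. The corpus entries, rule names, scores, and threshold are made up for the example; the real corpus and mass-check output are not reproduced here.

```python
def evaluate(corpus, scores, required=5.0):
    """Return (false_positives, false_negatives) for a labelled corpus.

    corpus: list of (tests_hit, is_spam) pairs; tests_hit is a list of rule names.
    """
    fp = fn = 0
    for tests_hit, is_spam in corpus:
        flagged = sum(scores.get(t, 0.0) for t in tests_hit) >= required
        if flagged and not is_spam:
            fp += 1          # legitimate mail flagged as spam
        elif is_spam and not flagged:
            fn += 1          # spam that slipped through
    return fp, fn

# Hypothetical scores and a four-message toy corpus.
scores = {"RULE_A": 3.0, "RULE_B": 2.5, "RULE_C": 1.0}
corpus = [
    (["RULE_A", "RULE_B"], True),    # spam, total 5.5
    (["RULE_C"], False),             # non-spam, total 1.0
    (["RULE_A", "RULE_B"], False),   # non-spam (say, a mailing list), total 5.5
    (["RULE_B", "RULE_C"], True),    # spam, total 3.5
]
print(evaluate(corpus, scores))
```

Including many non-spam messages that tend to trip rules gives a count like this more chances to expose false positives.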

Re: [SAtalk] Updating ponderations given by the GA

2002-01-30 Thread Craig Hughes
The code is available.  There are actually two different GAs in the /masses directory -- one, evolve.cxx, is Justin's, based on a library called galib; the other, craig-evolve.cxx, is mine, based on pgapack, so it can make use of multiple CPUs (and even multiple nodes if your computational desires

Re: [SAtalk] Updating ponderations given by the GA

2002-01-30 Thread Olivier Nicole
> Running the GA yourself would likely > yield better results, but at least you have an option now :-). Well I guess if GA was used, it is because it is practically infeasible to achieve proper scoring by hand :) In fact I'd rather run SA that way than tamper with the scores. I plan to quaranti

Re: [SAtalk] Updating ponderations given by the GA

2002-01-30 Thread Olivier Nicole
Duncan, >One other problem is that the GA currently (IIRC) doesn't process the >messages, just the tests hit. Of course, now, the test are different from >those 2 versions ago, messing up the GA. Replacing the message by the result of the test would be pretty simple I believe. X-Spam-Status:

Re: [SAtalk] Updating ponderations given by the GA

2002-01-30 Thread dman
On Thu, Jan 31, 2002 at 09:46:15AM +0700, Olivier Nicole wrote: | One of my concerns, for example, as a sys admin, I start my email by | "Dear user" this is quite highly pondered. One more hit and my message | would get lost. | | I receive quite some emails from Indian/Sri Lankan/Pakistani peo

Re: [SAtalk] Updating ponderations given by the GA

2002-01-30 Thread Duncan Findlay
On Thu, Jan 31, 2002 at 09:46:15AM +0700, Olivier Nicole wrote: > Greg, > > > You don't run SpamAssassin's genetic algorithm -- I gather that only > > Justin Mason, the prime developer, does that currently. He has a big > > huge pile ("the corpus") of mail, spam and non-spam, that is used to > >

Re: [SAtalk] Updating ponderations given by the GA

2002-01-30 Thread Olivier Nicole
Greg, > You don't run SpamAssassin's genetic algorithm -- I gather that only > Justin Mason, the prime developer, does that currently. He has a big > huge pile ("the corpus") of mail, spam and non-spam, that is used to > feed the GA and generate the scores in everyone's > /usr/share/spamassassin

Re: [SAtalk] Updating ponderations given by the GA

2002-01-30 Thread Greg Ward
On 31 January 2002, Olivier Nicole said: > I wonder if/how I should/could update the ponderations that are given > by the genetic algorithm. > > I know little about GA, but I think I remember (some 12 or 15 years > ago) that it needed quite big samples. > > So I believe I should keep all incoming

Re: [SAtalk] Updating ponderations given by the GA

2002-01-30 Thread Gene Ruebsamen
By stating that GA's need large samples, I assume you are referring to the population size. For years, people have been arguing whether a large or small population is better. The best theory we have on the inner workings of a GA, the Building Block Theory, is by John Holland, who is the father o
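
For readers unfamiliar with the term, a toy genetic algorithm with an explicit population size might look like the following. This is a generic sketch, not the code in SpamAssassin's /masses directory; the rule names, the tiny corpus, the threshold, and every parameter value are assumptions chosen only to keep the example small.

```python
import random

RULES = ["RULE_A", "RULE_B", "RULE_C"]
# Toy labelled corpus: (rules hit, is_spam).
CORPUS = [
    (["RULE_A", "RULE_B"], True),
    (["RULE_A", "RULE_C"], True),
    (["RULE_A"], True),
    (["RULE_B"], False),
    (["RULE_C"], False),
    ([], False),
]

def fitness(scores, required=5.0):
    """Number of corpus messages this score vector classifies correctly."""
    table = dict(zip(RULES, scores))
    correct = 0
    for hits, is_spam in CORPUS:
        flagged = sum(table[r] for r in hits) >= required
        correct += flagged == is_spam
    return correct

def evolve(pop_size=20, generations=30, seed=0):
    """Evolve one score per rule; pop_size is the population size."""
    rng = random.Random(seed)
    population = [[rng.uniform(0, 5) for _ in RULES] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]        # keep the better half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(RULES))        # one-point crossover
            child = a[:cut] + b[cut:]
            i = rng.randrange(len(RULES))             # mutate a single score
            child[i] = min(10.0, max(0.0, child[i] + rng.gauss(0, 0.5)))
            children.append(child)
        population = parents + children
    return max(population, key=fitness)

best = evolve()
print(fitness(best), "of", len(CORPUS), "classified correctly")
```

Here "population" means the number of candidate score vectors kept per generation (pop_size), which is a separate knob from the number of messages in the corpus.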

[SAtalk] Updating ponderations given by the GA

2002-01-30 Thread Olivier Nicole
Hello, I wonder if/how I should/could update the ponderations that are given by the genetic algorithm. I know little about GA, but I think I remember (some 12 or 15 years ago) that it needed quite big samples. So I believe I should keep all incoming messages, mark them as spam or not spam and ru