> Now having just said that, I've realized that one thing Justin didn't
> give me access to (I don't think) is the corpus before it's been passed
> through mass-check! Hopefully you're still there Justin, and we can
> figure something out there.
Craig --
still here, ish -- on dialup and webmail
It's hard/impossible to *optimize* the scores by hand. You can still set them by hand using intuition and do "well enough", particularly if you're only setting/adjusting a couple rules and you guide your scoring based off the existing scores. I wouldn't want to go through and hand-score all 3
On Wed, 2002-01-30 at 19:29, Olivier Nicole wrote:
Now I may be wrong, but how can new tests be introduced if they are
not accounted for by the GA and given some weight?
You can create new rules in your own config files, and assign them any score you like. You can get a sense of what kind o
These are all good points, but many people find SpamAssassin useful because it identifies spam with a low false-positive rate, and I think it follows from that fact that, most of the time, most people consider the same things to be spam. There are definitely some weird sample points in the corpus, but by a
You could either run the GA yourself (which requires building up a corpus of spam and non-spam to feed it; I'm not sure how large it would need to be), or, more easily, you can just tweak the scores on individual rules (this is very simple). For example, you could just create a file called /path/to/spam
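To make the effect of a hand-tweaked score concrete, here is a minimal sketch of the scoring idea: the scores of all rules that hit are summed and compared with a threshold. The rule names, the score values, and the 5.0 threshold are assumptions chosen for illustration, not values from SpamAssassin's shipped score files.

```python
# Illustrative only: rule names and scores are hypothetical.
DEFAULT_SCORES = {
    "EXAMPLE_DEAR_USER": 2.5,
    "EXAMPLE_CAPS_SUBJECT": 1.8,
    "EXAMPLE_REMOVE_LINK": 1.2,
}
# A local, hand-set override for one rule (assumed value).
LOCAL_OVERRIDES = {"EXAMPLE_DEAR_USER": 0.5}


def total_score(hits, defaults, overrides):
    """Sum the score of every rule that hit, preferring local overrides."""
    scores = {**defaults, **overrides}
    return sum(scores.get(rule, 0.0) for rule in hits)


def is_spam(hits, defaults, overrides, threshold=5.0):
    """Flag the message when the summed score reaches the threshold."""
    return total_score(hits, defaults, overrides) >= threshold


hits = ["EXAMPLE_DEAR_USER", "EXAMPLE_CAPS_SUBJECT", "EXAMPLE_REMOVE_LINK"]
print(is_spam(hits, DEFAULT_SCORES, {}))               # stock scores
print(is_spam(hits, DEFAULT_SCORES, LOCAL_OVERRIDES))  # with the override
```

Lowering a single rule's score is enough to move this example message from above the threshold to below it, which is why adjusting only a couple of rules by hand can be workable.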
No, I think by large samples he means it needs a lot of examples of what
is and what is not spam. In fact, the corpus we feed to the GA includes
some 75,000+ spam messages and 46,000+ non-spam messages, weighted
towards messages which tend to yield false-positives (mailing lists,
etc). The summa
The code is available. There are actually two different GAs in the /masses directory: one, evolve.cxx, is Justin's, based on a library called galib; the other, craig-evolve.cxx, is mine, based on pgapack so it can make use of multiple CPUs (and even multiple nodes if your computational desires
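For readers unfamiliar with the general approach, the following is a toy sketch of a genetic algorithm that evolves per-rule scores against a labelled corpus of rule hits. It is not a port of evolve.cxx or craig-evolve.cxx; the corpus, the threshold, the false-positive weight, and every GA parameter are assumptions made up for illustration.

```python
import random

# Toy corpus: (indices of rules that hit, True if the message is spam).
# Entirely made up for illustration.
CORPUS = [
    ({0, 1}, True), ({0, 2}, True), ({1, 2}, True), ({0, 1, 2}, True),
    ({3}, False), ({0, 3}, False), ({1, 3}, False), (set(), False),
]
N_RULES = 4
THRESHOLD = 5.0
FP_WEIGHT = 5  # assumed: a false positive costs more than a missed spam


def cost(scores):
    """Weighted count of misclassified messages (lower is better)."""
    total = 0
    for hits, is_spam in CORPUS:
        flagged = sum(scores[i] for i in hits) >= THRESHOLD
        if flagged and not is_spam:
            total += FP_WEIGHT
        elif is_spam and not flagged:
            total += 1
    return total


def evolve(generations=50, pop_size=30, seed=0):
    """Evolve a score vector; returns the best individual found."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-5, 5) for _ in range(N_RULES)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=cost)
        next_pop = [pop[0][:]]  # elitism: keep the best unchanged
        while len(next_pop) < pop_size:
            # Tournament selection of two parents.
            a = min(rng.sample(pop, 3), key=cost)
            b = min(rng.sample(pop, 3), key=cost)
            # Uniform crossover, then Gaussian mutation on each gene.
            child = [rng.choice(pair) + rng.gauss(0, 0.3)
                     for pair in zip(a, b)]
            next_pop.append(child)
        pop = next_pop
    return min(pop, key=cost)


best = evolve()
print(best, cost(best))
```

Note that the cost function only looks at which rules hit each message, never at the message text, so a corpus of recorded test hits is all such a search needs.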
> Running the GA yourself would likely
> yield better results, but at least you have an option now :-).
Well, I guess if a GA was used, it is because it is practically
infeasible to achieve proper scoring by hand :)
In fact I'd rather run SA that way than tamper with the scores.
I plan to quaranti
Duncan,
>One other problem is that the GA currently (IIRC) doesn't process the
>messages, just the tests hit. Of course, now, the tests are different from
>those 2 versions ago, messing up the GA.
Replacing the message with the results of the tests would be pretty simple,
I believe.
X-Spam-Status:
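As a rough sketch of how the tests hit could be recovered from a stored message, the snippet below parses an X-Spam-Status value. The layout it expects ("Yes, hits=... required=... tests=A,B") is an assumption about the header format, the value is assumed to be already unfolded onto one line, and the rule names are hypothetical.

```python
import re

# Assumed layout: "Yes, hits=6.3 required=5.0 tests=RULE_A,RULE_B"
_STATUS_RE = re.compile(
    r"\s*(Yes|No),\s*hits=(-?[\d.]+)\s+required=(-?[\d.]+)"
    r"(?:\s+tests=(\S*))?"
)


def parse_spam_status(header_value):
    """Return (is_spam, hits, required, tests) from an X-Spam-Status value."""
    m = _STATUS_RE.match(header_value)
    if m is None:
        raise ValueError("unrecognized X-Spam-Status value: %r" % header_value)
    verdict, hits, required, tests = m.groups()
    # An absent or empty tests= field yields an empty list.
    names = [t for t in (tests or "").split(",") if t]
    return verdict == "Yes", float(hits), float(required), names


sample = "Yes, hits=6.3 required=5.0 tests=EXAMPLE_A,EXAMPLE_B"
print(parse_spam_status(sample))
```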
On Thu, Jan 31, 2002 at 09:46:15AM +0700, Olivier Nicole wrote:
| One of my concerns, for example: as a sys admin, I start my emails with
| "Dear user", and this is quite heavily weighted. One more hit and my message
| would get lost.
|
| I receive quite a few emails from Indian/Sri Lankan/Pakistani peo
On Thu, Jan 31, 2002 at 09:46:15AM +0700, Olivier Nicole wrote:
> Greg,
>
> > You don't run SpamAssassin's genetic algorithm -- I gather that only
> > Justin Mason, the prime developer, does that currently. He has a big
> > huge pile ("the corpus") of mail, spam and non-spam, that is used to
> >
Greg,
> You don't run SpamAssassin's genetic algorithm -- I gather that only
> Justin Mason, the prime developer, does that currently. He has a big
> huge pile ("the corpus") of mail, spam and non-spam, that is used to
> feed the GA and generate the scores in everyone's
> /usr/share/spamassassin
On 31 January 2002, Olivier Nicole said:
> I wonder if/how I should/could update the weights that are given
> by the genetic algorithm.
>
> I know little about GAs, but I think I remember (from some 12 or 15 years
> ago) that they needed quite big samples.
>
> So I believe I should keep all incoming
By stating that GAs need large samples, I assume you are referring to
the population size. For years, people have been arguing whether a
large or small population is better. The best theory we have on the
inner workings of a GA, the Building Block Theory, is by John Holland,
who is the father o
Hello,
I wonder if/how I should/could update the weights that are given
by the genetic algorithm.
I know little about GAs, but I think I remember (from some 12 or 15 years
ago) that they needed quite big samples.
So I believe I should keep all incoming messages, mark them as spam or
not spam and ru