These are all good points, but many people find SpamAssassin useful precisely because it identifies spam with a low false-positive rate. I take that as evidence that, most of the time, most people consider the same things to be spam.  There are definitely some weird sample points in the corpus, but by and large it seems pretty solid for English-as-a-first-language email recipients.

If people do assemble non-English corpora, I think it'd be swell to roll them into the distribution (or at least the score files generated therefrom).  The framework for that is even there already, thanks to the work done in internationalizing the test descriptions.

C

On Wed, 2002-01-30 at 18:59, Duncan Findlay wrote:
On Thu, Jan 31, 2002 at 09:46:15AM +0700, Olivier Nicole wrote:
> Greg,
> 
> > You don't run SpamAssassin's genetic algorithm -- I gather that only
> > Justin Mason, the prime developer, does that currently.  He has a big
> > huge pile ("the corpus") of mail, spam and non-spam, that is used to
> > feed the GA and generate the scores in everyone's
> > /usr/share/spamassassin/*.cf files.
> > 
> > Clever, eh?  I'm sure it would be possible for everyone to have their
> > own corpus of mail, and if Justin released the GA code (or has he
> > already?)  then we could all run the GA ourselves and come up with our
> > own score sets.  But why bother?
> 

One other problem is that the GA currently (IIRC) doesn't process the
messages themselves, just the list of tests each one hit.  Of course, the
tests are now different from those of 2 versions ago, which messes up the GA.
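
As a rough illustration of that point (this is not SpamAssassin's actual GA
code; the test names, the 5.0 threshold, and the simple mutate-and-keep loop
standing in for a real genetic algorithm are all assumptions made up for the
example), a score optimizer that only ever sees the tests hit might look like
this:

```python
import random

# Hypothetical corpus: each message is reduced to (set of tests hit, is_spam).
# The optimizer never sees the message text. Test names are invented.
CORPUS = [
    ({"ALL_CAPS_SUBJECT", "REMOVE_LINK"}, True),
    ({"REMOVE_LINK", "HTML_ONLY"}, True),
    ({"ALL_CAPS_SUBJECT", "HTML_ONLY"}, True),
    ({"HTML_ONLY"}, False),
    ({"ALL_CAPS_SUBJECT"}, False),
    (set(), False),
]
TESTS = sorted({t for hits, _ in CORPUS for t in hits})
THRESHOLD = 5.0  # assumed cutoff: a total at or above this counts as spam


def total(scores, hits):
    """Sum the scores of the tests a message hit."""
    return sum(scores[t] for t in hits)


def errors(scores, corpus=CORPUS):
    """Count misclassified messages (false positives plus false negatives)."""
    return sum((total(scores, hits) >= THRESHOLD) != is_spam
               for hits, is_spam in corpus)


def evolve(generations=2000, seed=0):
    """Mutate one score at a time; keep the change if errors do not increase."""
    rng = random.Random(seed)
    best = {t: 1.0 for t in TESTS}
    best_err = errors(best)
    for _ in range(generations):
        cand = dict(best)
        t = rng.choice(TESTS)
        cand[t] = max(0.0, cand[t] + rng.uniform(-1.0, 1.0))
        err = errors(cand)
        if err <= best_err:
            best, best_err = cand, err
    return best, best_err


if __name__ == "__main__":
    scores, err = evolve()
    print(err, scores)
```

Because the input is only the recorded hits, changing or renaming the tests
makes those records stale; the corpus has to be run through the new tests
again before the scores mean anything.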

Furthermore, everyone has a different idea of what spam is.  Is commercial
e-mail sent by a company that legitimately has your e-mail address spam?

I imagine that the size of the corpus is not as important as the variety of
its messages, how current it is, and how accurately it is filed.
-- 
Duncan Findlay

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk

