RE: [SAtalk] scoring system and values...

Dan Kohn Sat, 08 Nov 2003 22:57:08 -0800

I think one of the greatest areas of confusion about SpamAssassin today
is how well Bayes can work with absolutely no manual training whatsoever.
Specifically, because it autotrains only on very spammy and very hammy
messages, Bayes learns quite well without any hand-selected corpora.


The magic of SpamAssassin is that the Bayes bootstraps its learning off
of the several hundred non-Bayes rules, including the use of DNS
blocklists. So, spammy messages that hit certain rules train the Bayes
to find similar spam in the future, even though that future spam may not hit
those rules.  The rules are necessary to auto-train the Bayes, but the
Bayes is what can catch all borderline (and custom-crafted) spam going
forward.
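
To make the bootstrapping idea concrete, here is a minimal sketch in
Python. It is not SpamAssassin's actual code: the class TinyBayes, the
function auto_learn, and the two threshold constants are names and
values I made up for illustration, and the classifier is a plain naive
Bayes with add-one smoothing rather than whatever combining SA really
uses. The only point is that the rule score decides which messages get
learned, and the learned tokens then score mail on their own.

```python
import math
from collections import Counter

# Assumed, illustrative thresholds: only learn from messages the
# non-Bayes rules are very sure about.
SPAM_THRESHOLD = 12.0   # rule score at or above this: learn as spam
HAM_THRESHOLD = 0.1     # rule score at or below this: learn as ham


class TinyBayes:
    """Toy naive Bayes over whitespace-separated tokens (hypothetical)."""

    def __init__(self):
        self.spam_tokens = Counter()  # messages containing token, spam side
        self.ham_tokens = Counter()   # messages containing token, ham side
        self.nspam = 0
        self.nham = 0

    def learn(self, text, is_spam):
        tokens = set(text.lower().split())
        if is_spam:
            self.spam_tokens.update(tokens)
            self.nspam += 1
        else:
            self.ham_tokens.update(tokens)
            self.nham += 1

    def spam_probability(self, text):
        # Sum per-token log-odds, with add-one smoothing so unseen
        # tokens are neutral instead of dividing by zero.
        log_odds = 0.0
        for tok in set(text.lower().split()):
            p_spam = (self.spam_tokens[tok] + 1) / (self.nspam + 2)
            p_ham = (self.ham_tokens[tok] + 1) / (self.nham + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1.0 / (1.0 + math.exp(-log_odds))


def auto_learn(bayes, text, rule_score):
    """Train only when the rule score is extreme; skip the middle."""
    if rule_score >= SPAM_THRESHOLD:
        bayes.learn(text, is_spam=True)
        return "spam"
    if rule_score <= HAM_THRESHOLD:
        bayes.learn(text, is_spam=False)
        return "ham"
    return None  # borderline: the rules are unsure, so do not train
```

Feed it a handful of messages with extreme rule scores through
auto_learn(), then call spam_probability() on a message that hit no
rules at all: tokens seen only in the auto-learned spam push the
probability up even though no rule fired on the new message.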

You still probably want to give users a way to manually create
whitelists to avoid false positives, and I recommend allowing
mistake-based training of the small number of false negatives (I use
procmail for this as described at
<http://www.dankohn.com/archives/000323.htm>).  

But there's no reason to think that your average user ever needs to
understand what Bayes is in order to be able to take full advantage of it.

As to performance, I can't speak to administering a Bayes machine with
thousands of users, but if 99.9+% true positives with essentially no
false negatives is important to them, then it may be worth finding a
$1000 server to just run spamd.  For me, SA is now eliminating over 200
spam a day, so email would be utterly unusable without it.

          - dan
--
Dan Kohn <mailto:[EMAIL PROTECTED]>
<http://www.dankohn.com/>  <tel:+1-650-327-2600>
-----Original Message-----
From: Terry Milnes [mailto:[EMAIL PROTECTED]]
Sent: Saturday, November 08, 2003 05:47
Cc: [EMAIL PROTECTED]
Subject: Re: [SAtalk] scoring system and values...


>>Now I don't expect SA to know dutch; that would be unfair. But what I
>>would like is some way to score those english terms way higher than an
>>american would or could.  For an american, mortgage does not spell spam
>>per se.  But for ME it does, and I can practically guarantee I will not
>>ever get an email that mentions "mortgage" together with "you have been
>>approved" which won't be spam.
>
> At the risk of being repetitive, this is precisely the sort of thing
> bayes excels at.  Give it a shot (hopefully you have some ham'n'spam
> saved up already), I think you will be pleased.
>
>>Well, none of this is your concern of course. But I would really really
>
> Perhaps it's true that your success is not directly anyone's concern but
> your own.  However, the regulars on this list are basically a buncha SA
> users who are trying to improve their results and help others do the
> same along the way.
>

And herein lies the problem: sure, anybody who is willing to spend time
tweaking their personal setups, training bayes etc. will have great
success at filtering out spam.

Some of us though are system administrators and need a solution to offer
to the end users.  The typical end user wants to open their email and
see no spam, period.

Presently, without the tweaks and training, all we can do is reduce his
spam by about 50-60%.  Settings have to be left conservative in
order not to get the phone calls complaining about false positives.

The bayes filtering works great, but the typical user is not going to 
want to jump through what he would consider the huge obstacles to train 
a corpus. Furthermore implementing bayes on a system that incorporates 
thousands of users can be a daunting task, and isn't even an available 
option to some of us.

Therefore when someone asks if there is a method to improve on the basic
ruleset we should pay more attention, not just recommend he use bayes.

tm.


>>really like if there was a way to have those typical english spam-words
>>score way higher than they do now.  Could we maybe envision two rulesets,
>>one for english-speaking residents and one for non-english speaking
>>residents...?
>>I edited the score file myself but not only is it a hard, long and
>>error-prone task, but by editing it I throw away much of the valuable
>>knowhow which assembled that score-list in the first place.  But I am
>>faced with the fact that over 95% of my spam is in english and that I
>>cannot sit back while the online pharmacies fly around me, so to speak.
>>Put yourself in my (our, if i'd be speaking for all non-english
>>countries) place and ask yourself this question: Would you accept a score
>>of only 0.5 for a rule that says "gratis hypotheekadvies" or "vijf
>>miljoen emailadressen"??  No, of course you wouldn't, because you'd know
>>that a company that pretends to sell you a mortgage from 12000 miles away
>>will never ever be a genuine offer...
> 
> 
> 
> Knowing that there are regulars on this list whose primary language is
> NOT English, anyone care to share how their setup handles English and
> non-English spam?
> 
> 
> 
> 
> 



-------------------------------------------------------
This SF.Net email sponsored by: ApacheCon 2003,
16-19 November in Las Vegas. Learn firsthand the latest
developments in Apache, PHP, Perl, XML, Java, MySQL,
WebDAV, and more! http://www.apachecon.com/
_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk




