Re: [SAtalk] Whitelist ignored for auto-learn?

David B Funk Wed, 23 Jul 2003 12:41:06 -0700

On Wed, 23 Jul 2003, Joe Julian wrote:

> I have a list of specific trusted addresses in my whitelist, but it
> still won't autolearn from them. Why not? Their scores are quite
> negative, way below -2, but it still won't autolearn from them. It looks
> like it's ignoring the whitelist when checking whether or not it should
> autolearn. What can I do to change that?


Um, you probably -don't- want to change that; there's a good reason for
that logic.
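
To make "that logic" concrete, here is a rough sketch in Python (not SpamAssassin's actual code). It assumes what the SpamAssassin documentation describes: the autolearn decision uses a score that leaves out rules flagged 'userconf' (such as whitelist hits), 'learn' (the Bayes rules) or 'noautolearn'. The rule scores and both thresholds below are made up for illustration; only the -2 ham threshold echoes the figure in the question.

```python
# Hypothetical rule hits for one message: (rule name, score, tflags).
# The scores are illustrative and not taken from a real ruleset.
hits = [
    ("USER_IN_WHITELIST", -100.0, {"userconf"}),
    ("BAYES_90", 2.5, {"learn"}),
    ("SUBJ_ALL_CAPS", 1.2, set()),
    ("HTML_MESSAGE", 0.9, set()),
]

# Flags whose rules are assumed to be skipped when deciding on autolearn.
IGNORED_FLAGS = {"userconf", "learn", "noautolearn"}


def final_score(hits):
    """Score used for the spam/ham verdict: every rule counts."""
    return sum(score for _, score, _ in hits)


def autolearn_score(hits):
    """Score used for the autolearn decision: flagged rules are left out."""
    return sum(score for _, score, flags in hits if not (flags & IGNORED_FLAGS))


def autolearn_decision(hits, ham_threshold=-2.0, spam_threshold=15.0):
    """Return 'ham', 'spam' or 'no' (thresholds are assumed values)."""
    score = autolearn_score(hits)
    if score <= ham_threshold:
        return "ham"
    if score >= spam_threshold:
        return "spam"
    return "no"


print(final_score(hits) < -2.0)   # the whitelist dominates the verdict
print(autolearn_decision(hits))   # but the autolearn score ignores it
```

In this sketch the message is delivered as ham thanks to the whitelist, yet it is not autolearned, because without the whitelist hit its content still looks mildly spammish.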

Think about what you whitelist and why you whitelist those senders.
It's usually because those sources send out 'ham' that "looks spammish"
(if it didn't "look spammish" you wouldn't need to whitelist it).
If the messages contain lots of "spammish" content and you autolearn
them as 'ham', then your Bayes database will contain "spammish" tokens
with 'ham' scores, which defeats the whole purpose of that facility.

For example, suppose you're subscribed to a mailing list that discusses
spam fighting techniques (hmm, do we know one of those ;).
On that list it's not unusual for people to post example spam messages
while discussing "why did this one get through?", so traditional scoring
mechanisms would mark those messages as 'spam', which is why you
whitelist the list.
However, if you autolearn those posts, you'll add lots of "spammish"
tokens to your ham list.
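
A toy illustration of that effect, in Python. This is a simplified per-token ratio with add-one smoothing, not SpamAssassin's real Bayes formula, and the tokens and counts are invented:

```python
from collections import Counter


def token_spam_prob(token, spam_counts, ham_counts, n_spam, n_ham):
    """Simplified per-token spam probability (smoothed frequency ratio)."""
    s = (spam_counts[token] + 1) / (n_spam + 2)  # how often spam has it
    h = (ham_counts[token] + 1) / (n_ham + 2)    # how often ham has it
    return s / (s + h)


# Invented training state: 100 spam and 100 ham messages seen so far.
spam_counts = Counter({"viagra": 90, "mortgage": 80})
ham_counts = Counter({"viagra": 0, "mortgage": 2})
n_spam, n_ham = 100, 100

before = token_spam_prob("viagra", spam_counts, ham_counts, n_spam, n_ham)

# Now autolearn 50 whitelisted list posts that quote spam, all as ham.
for _ in range(50):
    ham_counts.update(["viagra", "mortgage"])
n_ham += 50

after = token_spam_prob("viagra", spam_counts, ham_counts, n_spam, n_ham)

print(f"before: {before:.2f}  after: {after:.2f}")
```

The token starts out as a near-certain spam indicator and loses much of its weight once the quoted spam has been learned as ham.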

When you get good 'ham' from those sources just feed it to
"sa-learn --ham" and you're done. If you always get 'ham' from those
sources, you don't need the whitelist.

The flip side of this is to beware of feeding 'hammish' spam to
"sa-learn --spam". I fouled up my Bayes database by feeding lots of
'Nigerian' spam into "sa-learn --spam". For the most part 'Nigerian'
spam looks like business mail, so I ended up with lots of business-like
tokens that had strong spam scores. Thus I started seeing all kinds of
strictly 'ham' mail end up with 99% Bayes scores.
I had to dump the database and start from scratch.
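
The same kind of toy model shows how this plays out for a whole message. The sketch below (Python, invented counts, a plain naive-Bayes combination rather than SpamAssassin's actual combiner) scores an ordinary business message before and after a large batch of business-sounding spam is learned:

```python
from math import prod


def token_prob(spam_hits, ham_hits, n_spam, n_ham):
    """Simplified per-token spam probability with add-one smoothing."""
    s = (spam_hits + 1) / (n_spam + 2)
    h = (ham_hits + 1) / (n_ham + 2)
    return s / (s + h)


def combine(probs):
    """Naive-Bayes style combination of per-token probabilities."""
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)


# A legitimate message containing four business-like tokens.
tokens = ["transfer", "account", "urgent", "business"]

# Invented state: each token seen in 5 of 200 spam and 30 of 200 ham.
before = combine([token_prob(5, 30, 200, 200) for _ in tokens])

# After learning 300 'Nigerian' messages as spam, each using all four tokens.
after = combine([token_prob(5 + 300, 30, 200 + 300, 200) for _ in tokens])

print(f"before: {before:.3f}  after: {after:.3f}")
```

With these made-up numbers the same business message goes from looking clearly like ham to scoring close to 1, which is the failure described above.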

So be sure that anything you "learn" is clearly representative
of the type of message (spam or ham) that you want to recognize
in the future. (And remember that Bayes looks at individual words or
small phrases.)

-- 
Dave Funk                                  University of Iowa
<dbfunk (at) engineering.uiowa.edu>        College of Engineering
319/335-5751   FAX: 319/384-0549           1256 Seamans Center
Sys_admin/Postmaster/cell_admin            Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{


