On 13 Oct 2002, Daniel Quinlan wrote:

> [EMAIL PROTECTED] writes:
> 
> > Can anyone give me any ideas why SA is so inconsistent between different
> > releases?  For example I picked a spam to test a new installation of SA
> > with.  It had scored over 10 on a previous install.  When the message
> > arrived on my new box, it was scored at only 8.4.  I downgraded to 2.40
> > and tried it again and again it was over 10 but not as high as it was with
> > 2.41.  The test spam is in NANAS:
> 
> Looking at a single message (which was, by the way, marked as spam in
> both releases) is not a good measure of anything.

I only used the first message in my spam box, one that scored highly the
first time around.  I'm sure I could pick half a dozen at random and see
similar results.

> The only worthwhile measures are false positive and false negative rates
> over a large sample size.  There are various ways to measure those two
> attributes (and ways to combine the two into a single number), but our
> focus is on improving both from release to release.

I really don't see it that way.  I don't think just looking at the false
positives and negatives gives the whole picture.  Ignoring the hits in the
middle would be like ignoring successes (complete or partial) and only
focusing on failures, IMHO.

> A single message's score is liable to change quite a bit if a rule is
> deleted, or added, or the GA algorithm is changed.  The scores changed
> quite a bit across the 2.4x series because the GA was being improved.
> The GA sometimes finds its way into local maxima/minima (or maybe
> that's all that's possible given the search space), so if it manages to
> pop out and find a more optimal solution, the scores may change quite a
> bit.  Frankly, we don't worry too much about individual messages.  We
> test rules on tens of thousands of messages and the GA runs on hundreds
> of thousands of messages.  Changes are made when they seem likely to
> improve SA in general.  Optimizing for any small set of messages would
> destroy SA's overall performance.

If 2.42 scored that message only a few points lower, I wouldn't be
concerned.  However, it almost halved the score.  I can't see that as a
good thing.  Let's look at a couple of the individual rules that scored
differently.

"Message has X-MSMail-Priority, but no X-MimeOLE" scored 1.6 previously.
2.42 scored it at 0.5, less than a third the previous score.

"BODY: Spam phrases score is 13 to 21 (high) [score: 19]:" scored 3 in
2.41.  2.42 also gave a phrases score of 19 but only scored the message
at 0.4.  I really don't get this one.

"BODY: Claims you can be removed from the list" scored 1.9 previously.
2.42 also found the same string but it only scored at 1.9, almost one
fifth the previous score.

I don't understand how these rules can score so far apart.  A small
difference isn't a big deal.  If it were found that claiming you can be
removed from the list hit a lot of legitimate mail, lowering the score to
1.0 wouldn't surprise me.  0.4, though?

> > Now I can understand it scoring higher over time as SA's rules get better
> > and better at matching spam.  However I really don't understand why a new
> > release would score it lower, especially looking at the specific rules
> > that were scored lower.  Can anyone shed any light on this?
> 
> After a certain point, higher scores don't help much.  But, if those
> lower scores reduced false positives by a significant amount, it's
> really significant.  Or, if by lowering those scores, we could raise
> others, maybe that will catch more spam without more false positives.
> The GA optimizes for correctly categorizing messages, not scoring spam
> with ever-higher scores.

The ever-higher score isn't really what I'm after, unless SA were still in
its infancy and finding more spam traits in the same test bed of spam.
What I'm wondering about is why the sudden and drastic change.

> Again, single message scores are not really important.  Look at overall
> spam vs. nonspam accuracy if you want to do any sort of comparison.  And
> yes, that means you need to do your comparison using a "real email"
> corpus that has been hand-cleaned -- no false positives and no false
> negatives.

If I get time I'll try running my spambox through a couple of different SA
releases.  To be honest I don't foresee results different from what I've
already seen.
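
For what it's worth, here is roughly how I'd run that comparison.  This is
only a rough sketch of my own, not an official SA tool: it assumes a local
`spamassassin` command on the PATH, an mbox file called "spambox", and
that the score shows up in the X-Spam-Status header as hits= or score=.

    # Rough sketch (my own, not part of SA): score every message in an
    # mbox with whatever `spamassassin` is installed, so two versions can
    # be compared side by side.  The mbox path and the header format
    # (hits= in 2.x, score= later) are assumptions.
    import mailbox
    import re
    import subprocess

    SCORE_RE = re.compile(r"(?:hits|score)=(-?\d+(?:\.\d+)?)")

    def score_mbox(path):
        scores = []
        for msg in mailbox.mbox(path):
            out = subprocess.run(["spamassassin"], input=msg.as_bytes(),
                                 capture_output=True).stdout
            m = SCORE_RE.search(out.decode(errors="replace"))
            if m:
                scores.append(float(m.group(1)))
        return scores

    if __name__ == "__main__":
        scores = score_mbox("spambox")
        print("messages scored:", len(scores))
        if scores:
            print("average score: %.1f" % (sum(scores) / len(scores)))

Run once per installed release and compare the score distributions rather
than any single message.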

I thought about this problem over the weekend and something else occurred
to me.  If SA scores are raised and lowered by more than just a few
tenths, that could drastically shift the thresholds our users have defined
in their MUA filters.  For example, let's say that I'm a user and I create
a filter to delete everything scoring above 10.  I also create a filter
that takes everything above 5 (really 5-9, since my first filter got
everything greater) and moves it to a spam folder to browse through later.
Everything above 10 I assume to be junk.  I also know that some of my
company bulk memos occasionally fall into the 5-9 range, so I don't want
to delete those right away.

If many scores change as they seem to have between 2.41 and 2.42, my
thresholds are now different.  They might now be 3-6 and 6+, so I have to
change my filters.  First I have to realize there's a problem, and either
call tech support to find out why I'm suddenly getting more spam or happen
to read SA-talk and know why (highly unlikely).  Then I'll probably need
tech support's assistance to change my filter thresholds.  Everything is
fine for a month or two after that, until the mail admin upgrades to a new
SA release.  This time the scores are raised.  Now some of my company bulk
memos don't get through.  However, I don't notice this right away.  All I
know is that I'm getting very little spam.  I won't find out that the
thresholds have changed again until I miss something that was in one of
those memos (which could be a bad thing).  Again I have to call tech
support to find out about the change and to get help changing my
thresholds.
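
To make that concrete, the user-side filter logic amounts to something
like the following.  It's only an illustration of the thresholds from the
example above, not code anyone ships; the MUA is assumed to have already
pulled the SA score out of the X-Spam-Status header.

    # Illustration only: the two filters from the scenario above, with the
    # SA score already extracted from the X-Spam-Status header.
    def route_message(score):
        if score >= 10:
            return "delete"        # assumed to be unquestionably junk
        if score >= 5:
            return "spam-folder"   # browse later; bulk memos can land here
        return "inbox"

The point is that both numbers are baked into the user's filter, so any
release that shifts the score distribution silently changes what those
filters do.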

I don't see that scenario as unlikely at all.  My university is considering
rolling out SA.  The campus techs have been asked to provide the
assistance users need to set up the filters.  The techs aren't going to
want to go back out multiple times to keep updating every user's filters
each time SA gets upgraded.

As an admin I've done about all I can to filter out spam.  I've taken
every non-subjective step I know about.  I use DNSBLs for open relays and
known spam sources.  I also maintain a huge list of purely spamming
domains.  I've done everything I can think of that isn't very subjective.
SA is the next step.  It's subjective, though, and every user will have
their own thresholds.  Because of that I have to pass the filtering down
to my users' MUAs.  I can give them a limited amount of assistance in
setting up their filters.  Few will be able to do it on their own.  I
don't want major changes to the SA scores to force those filters to be
changed.  Make sense?

To me this scenario is very strong reasoning for not changing an SA score
by more than some fixed percentage between releases.  That would mean more
time and thought would have to be put into each rule's score as rules are
added, and large scores would rarely be used.  I can't think of any
caveats to this off the top of my head, but I'm sure there are some.  I
think I've shown that there are also caveats to making major score changes
to SA rules.
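
If it helps, the kind of check I'm imagining could be as simple as diffing
the shipped score files between releases and flagging anything that moved
too far.  This is only my own rough sketch: the file names, the 50%
cutoff, and reading just the first value on each "score" line are all
assumptions, not how the SA project actually works.

    # Rough sketch: compare two SA score files and flag rules whose score
    # moved by more than max_pct.  File names and cutoff are assumptions.
    def load_scores(path):
        scores = {}
        with open(path) as fh:
            for line in fh:
                parts = line.split()
                if len(parts) >= 3 and parts[0] == "score":
                    try:
                        scores[parts[1]] = float(parts[2])  # first score only
                    except ValueError:
                        pass
        return scores

    def flag_drift(old_path, new_path, max_pct=50.0):
        old, new = load_scores(old_path), load_scores(new_path)
        for rule in sorted(old.keys() & new.keys()):
            if old[rule] == 0:
                continue
            change = abs(new[rule] - old[rule]) / abs(old[rule]) * 100
            if change > max_pct:
                print("%s: %.1f -> %.1f (%.0f%% change)"
                      % (rule, old[rule], new[rule], change))

    flag_drift("50_scores-2.41.cf", "50_scores-2.42.cf")  # hypothetical paths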

My $.02, thanks for the reply
 Justin


