Re: [SAdev] Re: [SAtalk] RE: Mack Truck

Daniel Quinlan Tue, 18 Mar 2003 02:23:20 -0800

"Tony L. Svanstrom" <[EMAIL PROTECTED]> writes:

> It's an "agree to disagree"-situation, methinks;


You could be completely right and the right course of action would be to
remove all negative scoring rules.  Further, I care less about being
right than I do about improving SA.  My goal here isn't really to
convince you that you're wrong, but to explain my thinking and some of
the general direction of SA development in this area (not that I speak
for the other developers, of course).

> where I claim that the negative scores will hit unevenly (hugely
> benefiting some mailclients, indirectly very slightly hurting those of
> unknown/less known mailers)

The percentage of legitimate clients that we hit is very large.
However, most of the MUA rules are there as prerequisites to detect
forgery, not as compensation (negative scoring) rules.

Also, I think you are excessively focusing on the mail client rules.  To
be perfectly honest, I suspect those specific rules should probably go
because they are too easy to forge.  However, there are many other
negative rules.  Some of the negative rules are needed to compensate for
certain bad behaviors found in specific legitimate clients, otherwise we
would lose some of our otherwise effective spam-detection rules.  For
example, we have a negative-scoring rule for Evites.

Also, some negative rules are nearly foolproof, and a few planned
ones will be even better (like Message-ID tracking).  There's no
reason to get rid of all negative rules which is what you're claiming is
a good idea -- with no basis, but lots of rhetoric.

I have data that shows these rules work well.  For example,

OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
  66734    30802    35932    0.462   0.00    0.00  (all messages)
100.000  46.1564  53.8436    0.462   0.00    0.00  (all messages as %)
 11.087   0.0065  20.5861    0.000   0.99   -6.60  REFERENCES

(Corpus data from theo, rODbegbie, and myself.)
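As a rough cross-check of the table above, here is a minimal sketch in
Python.  It assumes S/O is computed as spam% / (spam% + ham%), which
matches the printed figures but is not spelled out in this message.  The
helper names (s_over_o, hits) are made up for illustration.

```python
# Corpus sizes taken from the "(all messages)" row of the table.
SPAM_TOTAL = 30802
HAM_TOTAL = 35932


def s_over_o(spam_pct, ham_pct):
    """Assumed definition: share of a rule's hit percentage that is spam."""
    total = spam_pct + ham_pct
    return spam_pct / total if total else 0.0


def hits(pct, corpus_size):
    """Convert a hit percentage back into an approximate message count."""
    return round(pct / 100.0 * corpus_size)


# S/O for the whole corpus and for the REFERENCES rule.
all_so = s_over_o(46.1564, 53.8436)
ref_so = s_over_o(0.0065, 20.5861)

# Approximate number of spam and ham messages that hit REFERENCES.
ref_spam_hits = hits(0.0065, SPAM_TOTAL)
ref_ham_hits = hits(20.5861, HAM_TOTAL)

# Recompute the OVERALL% column from those counts.
overall_pct = 100.0 * (ref_spam_hits + ref_ham_hits) / (SPAM_TOTAL + HAM_TOTAL)

print(f"all messages S/O: {all_so:.3f}")
print(f"REFERENCES   S/O: {ref_so:.3f}")
print(f"REFERENCES hits: {ref_spam_hits} spam, {ref_ham_hits} ham")
print(f"REFERENCES overall: {overall_pct:.3f}%")
```

Under that assumption, the REFERENCES row works out to a couple of spam
hits against several thousand ham hits, which is why its S/O rounds to
0.000.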

Supplement the REFERENCES one with a Message-ID tracking REFERENCE_SEEN
test, then it's even better.  I suspect the attribution/quote rules will
probably get phased out at some point.

Also, I think the negative scores are too low, but again, we're already
talking about fixing that for 2.5x and 2.60.

> and at the same time allow the more clever spammers to bypass SA (as
> well as those less clueful spammers that just follow the current
> trends), while the local gods (meant in a very friendly and nice way)
> claim that at the bottomline the good outweighs the bad.
> Sure, I could try to prove it, but then I'd need to set up an as close
> to as possible identical environment to that of the SA-developers; and
> unless the resulting data would show anything hugely different the
> data would just be discarded as not proving anything relevant.

Not really.  Yes, you need a good corpus and you need to use statistics
rather than guesses, but we don't have a monopoly on those things.

We can't just make changes because you *believe* some statistical fact
is true without actually having the statistical data.

> The second problem would be that I claim that it will even things out
> for most people, which results in that even if I'm right the
> endresults might be very very close to that of a "standard" SA if you
> just run it with a large enough corpus...

If the results are going to be the same, then why should anyone care?
Maybe it would save me some time discussing the issue, but that's just
speculation.  Even more people might complain on the other side of the
issue if we made that change for no reason.  :-)

> [...] 
>> Finally, does anyone have evidence that our FN rates are going up from
>> release to release?  Mine certainly aren't.  (Objective results matter.)

>  I think it's a combination of two things already talked about on satalk:
> 
>  #1: An increase in spamtraffic.
>  #2: More and more ISPs using spamfilters.
> 
> The result is that to some people it will feel like they're getting
> the same number of spam, but more FPs; a situation caused by ISP
> already having removed the easiest to spot spams.

When ISPs use spamfilters (and then the user runs SpamAssassin), it's
the SpamAssassin FN rates, not FP rates, that skyrocket.  Spam traffic
is higher, which is why I still receive about one uncaught spam per day,
but I get 2-3x as much spam as a year ago.  But I agree that these are
factors that affect user perception of our performance negatively.
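To illustrate the FN-rate point with made-up numbers, here is a small
sketch in Python.  Every figure in it is hypothetical, and it assumes
the upstream ISP filter removes only spam that SpamAssassin would have
caught anyway (the "easiest to spot" spam).

```python
def fn_rate(missed, total_spam):
    """False-negative rate: missed spam divided by all spam seen."""
    return missed / total_spam


# Hypothetical figures, not measurements.
spam_sent = 1000     # spam addressed to the user
sa_missed = 20       # spam SpamAssassin fails to catch
isp_removed = 800    # easy spam dropped upstream (assumed: all catchable by SA)

# Without an ISP filter, SpamAssassin sees every spam message.
rate_without_isp = fn_rate(sa_missed, spam_sent)

# With an ISP filter, the same misses are spread over far fewer messages.
rate_with_isp = fn_rate(sa_missed, spam_sent - isp_removed)

print(f"FN rate without ISP filter: {rate_without_isp:.1%}")
print(f"FN rate with ISP filter:    {rate_with_isp:.1%}")
```

The number of uncaught spams is the same in both cases; only the
denominator shrinks, so the measured FN rate rises while ham (and hence
the FP rate) is unaffected.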

I guess when you're close to perfect, then people will start demanding
perfection.  ;-)

> Personally I'd also like to think that it's because people are being
> hit by different groups of spammers, which are more or less clever
> when it comes to designing their spam... Something that would to some
> extent support my claim that negative scores should be avoided.

To paraphrase Craig: we don't have to catch spam from every spammer.  We
just have to catch spam from most (or the average) spammers.  The cost
of catching 100.0% of spam is a significant number of false positives
because *some* spam will always manage to look sufficiently similar to
ham.
 
>  That last part is pure guessing and, IMNSHO, borderline trollish. =)

I think it's the earlier part of your message that was the troll,
actually.  You can't win an argument by claiming it's unwinnable and
therefore you must be right.  ;-)

Daniel

-- 
Daniel Quinlan                     anti-spam (SpamAssassin), Linux, and open
http://www.pathname.com/~quinlan/   source consulting (looking for new work)

