-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hello Darren,

Monday, September 22, 2003, 4:43:33 PM, you wrote:

DM> 256 Ham, 1040 Probably Spam (>5 points), 256 Almost Certainly Spam
DM> (>15 points), and 269 false negatives, 0 false positives.  Bayes was
DM> trained with 16680 Spam, 4092 Ham, 125776 tokens.  I have
DM> auto-learning enabled, and feed all the false negatives back into
DM> sa-learn the same day...

Sounds like you're doing the right things.  Which version of SpamAssassin
are you using? That can be VERY important to this situation. Stats: your
FNs are about 15% of your email load.

DM> Philosophical question #1:  Am I expecting too much to be
DM> disappointed with so many false negatives?  I'm [obviously] nowhere
DM> near the numbers you guys are quoting.  A lot of my ham doesn't have
DM> an X-Spam-Status header at all for some unknown reason.  Should every
DM> non-spam?  I thought I initially had a configuration problem, but
DM> other mail was working (and tagged good or bad) and it seems to have
DM> died down with bayes training.

If you're using SpamAssassin 2.53 or older, then I wouldn't be concerned
about this number of false negatives. 2.54 or newer should have brought
you under the 10% FN level, and more likely closer to 5%.

Yes, ALL of your emails should have the X-Spam-Status header. If a
message doesn't, then it was never scanned by SA at all. I suspect that
many of your FNs are also missing this header, and are being treated as
ham only because SA never saw them. Can you verify this?
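One quick way to check is to list messages that lack the header. Here's a self-contained sketch using two sample message files; in real use you'd point grep -L at your own mail store (the /tmp path and filenames are examples only):

```shell
# Create two sample messages: one that SA scanned (has the header),
# one that SA never saw (no header).
mkdir -p /tmp/sa-check && cd /tmp/sa-check
printf 'Subject: a\nX-Spam-Status: No, score=0.1\n\nbody\n' > msg1
printf 'Subject: b\n\nbody\n' > msg2

# grep -L lists the files that do NOT contain a match,
# i.e. the messages SpamAssassin never touched.
grep -L '^X-Spam-Status:' msg1 msg2
```

If that prints a large fraction of your ham (and your FNs), SA is being bypassed rather than misclassifying.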

Early 2.6 test releases had this problem: under heavy load they would
skip messages entirely. Other versions may be affected as well. That
could cause this type of symptom.

DM> Philosophical question #1.5:  Are the network tests (RAZOR, etc.)
DM> essentially required?  I haven't installed them yet (was worried
DM> about processor and network impact), but could do so if my results
DM> will get much better.

Definitely watch your network impact, but I would say yes, the network
tests are VERY helpful.  I'd guess that half my spam is detected
primarily by network tests, with non-network tests being just the edge
which pushes almost-spam into the probably-spam category.
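If you decide to turn them on, the relevant settings live in local.cf. A minimal sketch, assuming you've installed the Razor2/DCC/Pyzor client tools first (values shown are the obvious ones, not a tuned recommendation):

```
# local.cf -- enable the network tests
skip_rbl_checks 0     # run the DNS blocklist (RBL) tests
use_razor2      1     # requires the razor-agents package
use_dcc         1     # requires dccproc
use_pyzor       1     # requires the pyzor client
```

Start with one of them (Razor2 is a common first choice) and watch your per-message scan times before enabling the rest.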

DM> Philosophical question #2:  I feel I could do much better tweaking
DM> some of the rules (already made MIME_HTML_ONLY 3 points) that most of
DM> my spam hits that never are in my ham, but should I start there or
DM> just lower my overall spam threshold?  Has anyone already done a
DM> "more aggressive" prefs file, especially anti-HTML mail so that I
DM> don't have to start from scratch?

Definitely tweak some rule scores.  I can send you my list of score
tweaks if it will help (about 150 scores -- you'll need to take them as
only a sample -- copying mine as is would probably be a bad idea).

Mine isn't specifically anti-HTML, quite far from it, since most of the
ham received here contains HTML segments. I have tweaked it quite a bit,
though.
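For reference, score overrides go in user_prefs or local.cf, one rule per line. A small sketch using the rule Darren already mentioned (the second line is a hypothetical placeholder, not a real tweak of mine):

```
# user_prefs -- sample score overrides; tune against YOUR corpus
score MIME_HTML_ONLY  3.0    # the tweak mentioned above
# score SOME_OTHER_RULE 1.5  # hypothetical: raise rules that hit
                             # your spam but never your ham
```

The safe procedure is to pick rules that hit many of your FNs and none of your ham, and raise them in small steps.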

DM> Philosophical question #2.5:  How are the default scores chosen?  I
DM> thought I read they were determined mathematically based on their
DM> frequency in the test spam corpus?  Is that true?  If so, why is my
DM> corpus so different?

Default scores are chosen by running the rules against a huge corpus of
email, both spam and ham, and then adjusting the scores over and over
again until they catch the most spam while flagging the least ham (with
the emphasis on avoiding false positives).

Your corpus is so different because you and your users are so different.
If we compare you and me, it's unlikely you receive much email
discussing plumbing materials, roofing insulation, or ancient Egyptian
philosophy and religion, whereas those categories are a large portion
of my ham. Likewise, the spam my family signs up for by accident is bound
to be different from the spam your professional client firms sign up for
by accident.

DM> Philosophical question #3:  One of the things I liked about
DM> SpamBouncer was feeding it your legitimate email addresses and
DM> mailing list addresses and then it would consider items sent TO those
DM> (missing or specifically there) in the overall scoring.  I don't
DM> think SA offers anything like that... it's not whitelisting (since
DM> that's From:), and it fails on BCCs (hence the need for positive
DM> weighting of other factors)... would be nice to have?  Anyone written
DM> a rule like that?  Any suggestions?  I'm not sure how highly to score
DM> it.

There is an ability to "whitelist to" (subtract points from all email
sent to specific addresses).  I also have several header rules which test
To and Cc for invalid or high-spam addresses, and adjust scores on those.
You'll see some of my ToCc rules at http://www.exit0.us
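As a sketch of what both look like in local.cf (all addresses here are hypothetical examples, not rules I actually run):

```
# local.cf -- subtract points from mail sent to a known-good address
whitelist_to  [email protected]

# A ToCc header rule: add points when mail is addressed to a
# retired account that now receives only spam
header   TOCC_DEAD_ADDR  ToCc =~ /old-account\@example\.com/i
score    TOCC_DEAD_ADDR  2.0
describe TOCC_DEAD_ADDR  Addressed to a retired, spam-only address
```

ToCc is a SpamAssassin pseudo-header that matches against both To and Cc, which covers the BCC gap partially: a BCC'd message matches neither, which is itself a usable signal.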

DM> Philosophical question #4:  Should I convert purely to bayes-type
DM> filters?  I can't believe it's worth throwing out some of the basic
DM> SA heuristics, but the Bayes scores coming from SA have been pretty
DM> accurate.  To start with, has anybody already written a prefs file
DM> favoring bayes heavier than default?  Alternatively, can somebody
DM> explain to me the differences in the DEFAULT SCORES (local, net, with
DM> bayes, with bayes+net) column on the tests page?

I have modified the Bayes scores: BAYES_70 = 1/3 of the spam threshold,
BAYES_80 = 55% of it, BAYES_90 = 75%, and BAYES_99 = the full threshold.

No, I wouldn't use Bayes alone. Bayes is wonderful, but it should be used
along with the basic rule set and with the network tests for best
results.

DM> Philosophical question #5:  Should I try to get my bayes ham vs. spam
DM> ratio closer as many suggest?  If so, why exactly?  It seems a waste
DM> to throw out spam since it can only further prove the frequency of
DM> spam tokens and lack of hammy ones... maybe I'm missing the math
DM> behind it?

I feed all false-negatives into Bayes. I feed all flagged spam with
Bayes_70 or less into Bayes. I feed all ham into Bayes. I figure I'm
running about 1 ham for every 4 spam fed, manually or automatically. I
agree: do not throw out spam, but that which Bayes is already very
confident about probably doesn't need strengthening.
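The feeding itself is just sa-learn invocations; a hypothetical daily pass might look like this (the mailbox paths are examples only):

```
# Retrain Bayes; adjust paths to wherever you file your mail
sa-learn --spam ~/mail/false-negatives   # missed spam, fed back as spam
sa-learn --ham  ~/mail/ham-archive       # known-good mail
sa-learn --dump magic                    # sanity-check token/msg counts
```

sa-learn skips messages it has already learned, so refeeding an archive is safe.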

DM> Philosophical question #6:  Why autolearn only on the certainly spam?
DM>  Most of them already score high on Bayes, why not train on the
DM> borderlines where bayes could push it over the edge? I get a lot of
DM> 3.9s and 4.2s with no (or little) affecting score from bayes.

Auto-learn only on the almost-certainly-spam range because that's what
the automatic system can be sure of.  "Probably spam" isn't "almost
certainly spam", and you DO NOT want to accidentally feed any ham in as
spam -- that can seriously foul up your Bayes results.
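The auto-learn cutoffs are configurable if you do want to move them. A sketch using the 2.x option names (later releases renamed these with a bayes_ prefix, e.g. bayes_auto_learn_threshold_spam; the values shown are examples, not recommendations):

```
# local.cf -- only auto-learn at the extremes
auto_learn                  1
auto_learn_threshold_spam   15.0   # only "almost certainly spam"
auto_learn_threshold_nonspam 0.1   # only very clearly ham
```

Lowering the spam threshold widens what gets auto-learned, but every misfiled ham does disproportionate damage, so err high.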

Bob Menschel

-----BEGIN PGP SIGNATURE-----
Version: PGP 8.0

iQA/AwUBP2+mbpebK8E4qh1HEQKedgCfXoY2r8WwcfDDYcm3OituBOiQhCkAoICc
Oj8KV2Uzhg6W1fS0PGqrViYv
=ZPRZ
-----END PGP SIGNATURE-----




_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
