Note in advance: I'm assuming SA 2.55, since this is a recent install.
Philosophical question #1: Am I expecting too much by being disappointed with so many false negatives? I'm [obviously] nowhere near the numbers you guys are quoting. A lot of my ham doesn't have an X-Spam-Status header at all, for some unknown reason. Shouldn't every non-spam message have one? I initially thought I had a configuration problem, but other mail was working (and tagged good or bad), and it seems to have died down with bayes training.
If you've changed always_add_headers to 0, then SA won't add one; otherwise all email should have an X-Spam-Status header. If some messages don't, take a close look at their Received: headers. Did they come in through a secondary MX and bypass SA on your primary that way?
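For reference, something like this in local.cf makes the behavior explicit (1 is the normal default; check `perldoc Mail::SpamAssassin::Conf` on your install to confirm):

```
# local.cf -- ensure SA always adds its status headers, even to ham
always_add_headers 1
```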
Philosophical question #1.5: Are the network tests (RAZOR, etc.) essentially required? I haven't installed them yet (was worried about processor and network impact), but could do so if my results will get much better.
No, although I find that a few of the RBLs are very helpful. I'd install Net::DNS first to get the RBLs going, and worry about razor etc. later.
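If you want to enable the RBLs while holding off on the heavier network tests, a local.cf fragment along these lines should do it. The option names below are from the 2.5x-era docs as I recall them, so verify them against `perldoc Mail::SpamAssassin::Conf` on your version:

```
# local.cf -- illustrative: RBLs on, razor/dcc/pyzor off for now
skip_rbl_checks 0    # run DNS blocklist tests (requires Net::DNS)
use_razor2      0    # leave the heavier network tests off until later
use_dcc         0
use_pyzor       0
```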
Overall I view razor as better than dcc or pyzor (and STATISTICS.txt backs this up somewhat), but I view it only as a modest improvement.
Philosophical question #2: I feel I could do much better by tweaking some of the rules (I've already made MIME_HTML_ONLY 3 points) that most of my spam hits but my ham never does. Should I start there, or just lower my overall spam threshold? Has anyone already done a "more aggressive" prefs file, especially anti-HTML-mail, so that I don't have to start from scratch?
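For anyone following along, both knobs live in user_prefs or local.cf. The values here are just the ones mentioned in the question, not recommendations:

```
# user_prefs / local.cf -- illustrative values only
score MIME_HTML_ONLY 3.0   # bump an individual rule's score
required_hits 4.5          # or lower the overall spam threshold (default 5.0)
```

Lowering required_hits affects everything at once, so it raises false-positive risk across the board; bumping individual rules is more surgical but more work.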
I haven't, someone else might have.
Philosophical question #2.5: How are the default scores chosen? I thought I read they were determined mathematically based on their frequency in the test spam corpus? Is that true? If so, why is my corpus so different?
The rule scores are defined based on THE corpus, not yours. The developers maintain a rather massive corpus, and the results of their analysis can be seen in STATISTICS.txt.
The bayes engine learns its tokens from your training, but sa-learn does not affect the rule scores, only which BAYES_xx result a given message will receive.
Philosophical question #3: One of the things I liked about SpamBouncer was feeding it your legitimate email addresses and mailing list addresses and then it would consider items sent TO those (missing or specifically there) in the overall scoring. I don't think SA offers anything like that... it's not whitelisting (since that's From:), and it fails on BCCs (hence the need for positive weighting of other factors)... would be nice to have? Anyone written a rule like that? Any suggestions? I'm not sure how highly to score it.
SpamAssassin has both whitelist_to and whitelist_from, and has had both for a long time (at least as far back as 2.31).
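For example (addresses are placeholders; the more_spam_to/all_spam_to variants exist in SA as weaker/stronger forms of whitelist_to, but double-check availability on your version):

```
# user_prefs / local.cf -- hypothetical addresses
whitelist_from friend@example.com    # trusted senders
whitelist_to   me@example.com        # mail addressed to you gets a negative score
more_spam_to   lists@example.com     # weaker form, for addresses that get some spam
```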
However, SA may fail to recognize the To: address unless your MTA inserts sufficient hints into the headers. (It looks at a lot of headers to try to figure this out, not just To:.)
Philosophical question #4: Should I convert purely to bayes-type filters? I can't believe it's worth throwing out some of the basic SA heuristics, but the Bayes scores coming from SA have been pretty accurate. To start with, has anybody already written a prefs file favoring bayes heavier than default? Alternatively, can somebody explain to me the differences in the DEFAULT SCORES (local, net, with bayes, with bayes+net) column on the tests page?
Each test in SA gets 4 scores. Which scoreset SA uses depends on whether or not you have bayes and/or network tests enabled.
As an example consider this score line:
score LOCAL_DEMO 1.0 2.0 3.0 4.0
This means that if you have neither bayes nor network tests enabled, a message matching the rule gets 1 point added.
If you have network tests but no bayes, it gets 2 points.
If you have bayes but no network tests, it gets 3 points.
And if you have both, it gets 4 points.
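The selection logic above can be sketched in a few lines of Python. This is an illustration of the behavior just described, not SA's actual source; `pick_score` is a made-up name:

```python
# Sketch: map the four per-rule scores to the active scoreset.
# Scoreset index: 0 = neither, 1 = net only, 2 = bayes only, 3 = both.

def pick_score(scores, net_tests, bayes):
    """scores is the 4-tuple from a 'score NAME a b c d' config line."""
    scoreset = (1 if net_tests else 0) + (2 if bayes else 0)
    return scores[scoreset]

local_demo = (1.0, 2.0, 3.0, 4.0)
print(pick_score(local_demo, net_tests=False, bayes=False))  # 1.0
print(pick_score(local_demo, net_tests=True,  bayes=False))  # 2.0
print(pick_score(local_demo, net_tests=False, bayes=True))   # 3.0
print(pick_score(local_demo, net_tests=True,  bayes=True))   # 4.0
```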
The reason for the different scores is that in SpamAssassin, rule scores are not assigned one rule at a time; they are based on the combinations of matches.
Adding bayes and/or network tests shifts the balance of the ruleset around. In order to properly avoid excessive false positives, the ruleset is evaluated and the scores are balanced 4 separate times, once per scoreset.
Philosophical question #5: Should I try to get my bayes ham vs. spam ratio closer as many suggest? If so, why exactly? It seems a waste to throw out spam since it can only further prove the frequency of spam tokens and lack of hammy ones... maybe I'm missing the math behind it?
Ideally, your bayes ham vs. spam ratio should match what you get in your inbox. In a perfect world, your bayes training is a perfect reflection of the typical spam and non-spam it should see, with no misclassifications and no lopsidedness.
You've got a spam:ham ratio of about 4:1 in your training, and about 4:1 in your actual received email, so that's very good.
If you over-train spam relative to what you realistically get, you'll effectively start shifting the center line of the bayes results toward the "high chance of spam" side (i.e. toward BAYES_100).
If you over-train ham relative to reality, you'll effectively shift things the other way.
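A deliberately oversimplified toy model shows the intuition. This is not SA's actual bayes estimator (the real one normalizes per-token counts by the number of messages trained, which softens exactly this effect), just an illustration of why raw lopsided counts pull probabilities to one side:

```python
# Toy model (NOT SA's real estimator): estimate a token's spamminess
# as the raw fraction of its sightings that were in spam.

def naive_token_spam_prob(spam_hits, ham_hits):
    return spam_hits / (spam_hits + ham_hits)

# A "neutral" token appearing in 10% of spam and 10% of ham:
print(naive_token_spam_prob(10, 10))  # balanced 100:100 training -> 0.5
print(naive_token_spam_prob(40, 10))  # lopsided 400:100 training -> 0.8
```

With balanced training the neutral token sits at 0.5; over-train spam 4:1 and the same token looks 80% spammy, which is the center-line shift described above.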
Philosophical question #6: Why autolearn only on the certainly-spam? Most of those already score high on bayes, so why not train on the borderline cases where bayes could push them over the edge? I get a lot of 3.9s and 4.2s with little or no contribution from bayes.
Just because a message already has a high bayes score doesn't mean there aren't new tokens in it that the bayes engine can learn from and apply toward future spam.
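The auto-learn thresholds are tunable if you do want to learn from less-certain mail. The option names below are the 2.5x-era ones as I remember them and the values are purely illustrative, so verify both against `perldoc Mail::SpamAssassin::Conf`:

```
# local.cf -- illustrative auto-learn tuning, verify option names locally
auto_learn 1
auto_learn_threshold_nonspam 0.1    # learn as ham below this score
auto_learn_threshold_spam 15.0      # learn as spam above this score
```

The spam threshold is kept deliberately high: pulling it down toward your 3.9s and 4.2s means auto-learning from borderline messages, and one misclassified ham learned as spam can poison the bayes database.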
Thanks in advance! And I in no way mean this to be a negative statement on the work everyone has done on SA so far. I have nothing but respect for the code that's there! I just want to make it work the best way possible for me.
_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk