Hello Darren,
Monday, September 22, 2003, 4:43:33 PM, you wrote:

DM> 256 Ham, 1040 Probably Spam (>5 points), 256 Almost Certainly Spam
DM> (>15 points), and 269 false negatives, 0 false positives. Bayes was
DM> trained with 16680 Spam, 4092 Ham, 125776 tokens. I have
DM> auto-learning enabled, and feed all the false negatives back into
DM> sa-learn the same day...

Sounds like you're doing the right things. Which version of
SpamAssassin are you using? That can be VERY important to this
situation. Stats: your FNs are about 15% of your email load.

DM> Philosophical question #1: Am I expecting too much to be
DM> disappointed with so many false negatives? I'm [obviously] nowhere
DM> near the numbers you guys are quoting. A lot of my ham doesn't have
DM> an X-Spam-Status header at all for some unknown reason. Should every
DM> non-spam? I thought I initially had a configuration problem, but
DM> other mail was working (and tagged good or bad) and it seems to have
DM> died down with bayes training.

If you're using SpamAssassin 2.53 or older, then I wouldn't be
concerned with this number of false negatives. 2.54 or newer should
have brought you under the 10% FN level, and more likely to 5%.

Yes, ALL of your emails should have the X-Spam-Status header. If an
email doesn't, then it was missed by SA entirely. I suspect that many
of your FNs are also missing this header, and are being treated as ham
only because SA never saw them at all. Can you verify this? Early 2.6
test releases had this problem, where they would miss emails under
heavy load; other versions may have that problem as well. That could
cause this type of symptom.

DM> Philosophical question #1.5: Are the network tests (RAZOR, etc.)
DM> essentially required? I haven't installed them yet (was worried
DM> about processor and network impact), but could do so if my results
DM> will get much better.

Definitely watch your network impact, but I would say yes, the network
tests are VERY helpful.
I'd guess that half my spam is detected primarily by network tests,
with non-network tests being just the edge which pushes almost-spam
into the probably-spam category.

DM> Philosophical question #2: I feel I could do much better tweaking
DM> some of the rules (already made MIME_HTML_ONLY 3 points) that most of
DM> my spam hits that never are in my ham, but should I start there or
DM> just lower my overall spam threshold? Has anyone already done a
DM> "more aggressive" prefs file, especially anti-HTML mail so that I
DM> don't have to start from scratch?

Definitely tweak some rule scores. I can send you my list of score
tweaks if it will help (about 150 scores -- you'll need to take them as
only a sample; copying mine as-is would probably be a bad idea). Mine
isn't specifically anti-HTML, quite far from it, since most of the ham
received here contains HTML segments. I have tweaked it quite a bit,
though.

DM> Philosophical question #2.5: How are the default scores chosen? I
DM> thought I read they were determined mathematically based on their
DM> frequency in the test spam corpus? Is that true? If so, why is my
DM> corpus so different?

Default scores are chosen by running the rules against a huge corpus
of email, both spam and ham, and then modifying the scores over and
over again until they catch the highest number of spams while flagging
the lowest number of hams (with the emphasis on avoiding false
positives).

Your corpus is so different because you and your users are so
different. If we compare you and me, it's unlikely you receive much
email discussing plumbing materials, roofing insulation, or ancient
Egyptian philosophy and religion, whereas those categories are a large
portion of my ham. Likewise, the spam my family signs up for by
accident is bound to be different from the spam your professional
client firms sign up for by accident.
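To illustrate the kind of score tweaking discussed under question #2,
a couple of lines in user_prefs (or local.cf) might look like this --
the values are examples only, not recommendations:

    # Raise the weight of a rule that reliably hits your spam
    # (the MIME_HTML_ONLY tweak mentioned above):
    score MIME_HTML_ONLY 3.0

    # Or lower the overall spam threshold (required_hits in the
    # 2.x series) from the default 5.0:
    required_hits 4.5

Score tweaks are usually safer than lowering the threshold, since they
only affect mail hitting that one rule rather than shifting everything.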
DM> Philosophical question #3: One of the things I liked about
DM> SpamBouncer was feeding it your legitimate email addresses and
DM> mailing list addresses and then it would consider items sent TO those
DM> (missing or specifically there) in the overall scoring. I don't
DM> think SA offers anything like that... it's not whitelisting (since
DM> that's From:), and it fails on BCCs (hence the need for positive
DM> weighting of other factors)... would be nice to have? Anyone written
DM> a rule like that? Any suggestions? I'm not sure how highly to score
DM> it.

There is an ability to "whitelist to" (subtract points from all email
sent to specific addresses). I also have several header rules which
test To and Cc for invalid or high-spam addresses, and adjust scores
on those. You'll see some of my ToCc rules at http://www.exit0.us

DM> Philosophical question #4: Should I convert purely to bayes-type
DM> filters? I can't believe it's worth throwing out some of the basic
DM> SA heuristics, but the Bayes scores coming from SA have been pretty
DM> accurate. To start with, has anybody already written a prefs file
DM> favoring bayes heavier than default? Alternatively, can somebody
DM> explain to me the differences in the DEFAULT SCORES (local, net, with
DM> bayes, with bayes+net) column on the tests page?

I have modified my Bayes scores so that BAYES_70 is worth 1/3 of the
spam threshold, BAYES_80 55% of it, BAYES_90 75%, and BAYES_99 the
full threshold. No, I wouldn't use Bayes alone. Bayes is wonderful,
but it should be used along with the basic rule set and with the
network tests for best results.

DM> Philosophical question #5: Should I try to get my bayes ham vs. spam
DM> ratio closer as many suggest? If so, why exactly? It seems a waste
DM> to throw out spam since it can only further prove the frequency of
DM> spam tokens and lack of hammy ones... maybe I'm missing the math
DM> behind it?

I feed all false negatives into Bayes. I feed all flagged spam scoring
BAYES_70 or less into Bayes. I feed all ham into Bayes.
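(For reference, the "whitelist to" mechanism and To/Cc rules mentioned
above might look roughly like this -- the addresses, the rule name,
and the scores are made up for illustration:

    # Subtract points from all mail addressed to this address:
    whitelist_to darren@example.com

    # A home-grown rule against SA's combined To/Cc pseudo-header;
    # name and pattern are hypothetical:
    header   LOCAL_TOCC_TRAP ToCc =~ /old-trap-address\@example\.com/i
    describe LOCAL_TOCC_TRAP Sent to a known high-spam address
    score    LOCAL_TOCC_TRAP 2.0

And the Bayes score adjustments described above, assuming the default
5.0 threshold, would come out to approximately:

    score BAYES_70 1.67   # 1/3 of threshold
    score BAYES_80 2.75   # 55%
    score BAYES_90 3.75   # 75%
    score BAYES_99 5.00   # full threshold

End of aside; back to the ham/spam ratio question.)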
I figure I'm running about 1 ham for every 4 spam fed, manually or
automatically. I agree: do not throw out spam, but spam which Bayes is
already very confident about probably doesn't need strengthening.

DM> Philosophical question #6: Why autolearn only on the certainly spam?
DM> Most of them already score high on Bayes, why not train on the
DM> borderlines where bayes could push it over the edge? I get a lot of
DM> 3.9s and 4.2s with no (or little) affecting score from bayes.

Auto-learn only on the very-certainly-spam because that's what the
automatic system can be sure of. The merely probable spam isn't almost
certainly spam, and you DO NOT want to accidentally feed any ham in as
spam -- that can seriously foul up your Bayes results.

Bob Menschel

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk