Hello Stewart,

Sunday, September 12, 2004, 4:42:13 PM, you wrote:

>> Adding custom rules is among the last things you want to do. I do them,
>> and I can help you with the process (provided you can run bash scripts
>> under cron), but there are things you want to do first.

SN> I had considered running SpamAssassin from a background job, but there
SN> seemed to be a bad interaction with IMAP (see below).

I don't think you should do that anyway, since SpamAssassin is being run
automatically by your host. I'd be concerned about such a system
corrupting your emails.

>> Step 1: If False Positives are your major problem,
>> a) identify which rules are causing the false positives and lower their
>> scores, or
>> b) raise your required_hits, or
>> c) both. I use required_hits of 9.0, and have modified the scores of
>> several dozen rules.

SN> We don't have an FP problem at all. Mail sent by individuals almost
SN> always gets a negative score, and our users know that they need to
SN> make a whitelist entry if they don't want to miss "Sex News Daily" ;)
SN> It's the dozen spams per user per day that leak through that is our
SN> problem.

Good. That's easier to deal with. Sorry for misreading your original
email.

>> Step 2: Having done step 1, you'll increase the amount of spam that comes
>> through. Identify which distribution rules hit that spam, and raise their
>> scores enough to score the spam, without causing false positives.

SN> Well, a typical false negative shows:
SN> X-Spam-Status: No, hits=3.7 required=5.0 tests=BAYES_50,HTML_90_100,
SN> HTML_IMAGE_ONLY_02,HTML_MESSAGE,MIME_HTML_ONLY,RCVD_IN_SBL
SN> The only difference, unfortunately, between this and much commercial
SN> ham is the SBL, but that gets too polluted with ham sources to assign
SN> it a much bigger score.

So the best solution for those is Bayes.

>> Step 3: Bayes is your friend. Identify all email as guaranteed spam,
>> guaranteed not-spam, spam discussions, and uncertain.
>> Feed the first two into the Bayes system consistently and accurately,
>> and that will help enormously.
>> So enormously that some people will recommend doing step 3 before steps 1
>> and 2.

SN> Yes. I made a big mistake here, naively thinking that the autolearn
SN> feature would do an adequate job. I now suspect that the bayes_* files
SN> on my server are garbage. Should I save and delete them before feeding
SN> the spam and ham corpora to sa-learn? Is it necessary to run sa-learn
SN> on mail that SpamAssassin has already correctly classified?

Actually, unless you're regularly getting spam flagged as BAYES_00, or
non-spam flagged as BAYES_99, you don't yet have a problem. If spam is
sneaking through with BAYES_50 as above, then no, your Bayes files are
not garbage -- they just haven't learned about the questionable emails
yet.

Unless you have the 00/99 problem causing emails to be mis-classified,
do not delete your bayes files. Simply train them better.

It's not necessary to run sa-learn on mail that SpamAssassin has already
auto-learned, but it doesn't hurt. If SpamAssassin correctly classified
an email but did not auto-learn it, then it's not *necessary* to sa-learn
it, but it helps. The more correctly classified emails you feed to Bayes,
the more accurately Bayes will be able to score emails going forward.

I don't worry here about whether an email was classified correctly, nor
whether it was auto-learned. I sa-learn EVERY email after manual
classification.

>> Step 4: Your system does allow for whitelist and blacklist entries. Maybe
>> this should be in front of step 1 also: identify from your false
>> positives those sites that can be reliably whitelisted with
>> whitelist_from_rcvd (use the _rcvd version rather than just
>> whitelist_from whenever possible). Copy William Stearns' blacklist file
>> from http://www.stearns.org/sa-blacklist/sa-blacklist.current.cf into
>> your user_prefs.

SN> Many thanks for this link.
SN> I manually checked some uncaught spam against it, and found hits on
SN> about 75%! I'll be installing this right away.
SN> However, it is IMO unfortunate that we are forced to blacklist by name.
SN> Bill Waggoner alone accounts for about 1000 domains on Mr. Stearns' list.
SN> If we could say blacklist_from_rcvd 69.42.96.0/19, one line would do the
SN> job of 1000. More importantly, it would last a lot longer, because
SN> this A-hole got his IPs directly from ARIN and they are unlikely to change
SN> any time soon. OTOH, he registers a dozen new domains every day!

Agreed. That's why SARE has begun using our SARE_RECV_IP_* rules. The
best of those may eventually end up in the distribution set.

>> Bayes: Do your people retrieve their email using POP3 (in which case
>> they probably get the inbox mail only), or do they use webmail? If the
>> latter, have them create two more folders: spam and notspam. Have them
>> move all spam into the spam folder. Have them copy (not move) all
>> non-spam into the notspam folder. Have a cron job which runs sa-learn
>> against these mbox files on a regular basis (mine runs hourly), deleting
>> the mbox files when done.

SN> We don't use POP3 at all; it's mostly IMAP and occasionally webmail.
SN> The good news is that the folders you describe are easily accessible;
SN> I'll try that in the next couple of days and let you know how it works.
SN> The bad news (I think) is that when users leave their Outlook open,
SN> then new mail appears on the desktop within seconds of when it is
SN> delivered to the server. This would prevent a cron-based task from
SN> resorting the mail properly.

But you don't want to run sa-learn on un-verified emails. You want your
users to check the emails, and you want someone to manually put the spam
into a spam folder for sa-learn, and to manually copy the not-spam into
a not-spam folder for sa-learn. Automating this without manual
verification /will/ corrupt your Bayes files.
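
A minimal sketch of the kind of job cron could run hourly against those
two folders. It is an illustration, not a tested production script. It
assumes the spam and notspam folders are plain mbox files; the paths
(~/mail/spam, ~/mail/notspam) and the function names are hypothetical.
The only SpamAssassin pieces used are sa-learn's --spam, --ham and --mbox
options.

```python
import os
import subprocess


def learn_commands(spam_mbox, ham_mbox):
    """Build one sa-learn command per mbox file that exists and is non-empty."""
    commands = []
    for flag, path in (("--spam", spam_mbox), ("--ham", ham_mbox)):
        if os.path.isfile(path) and os.path.getsize(path) > 0:
            commands.append(["sa-learn", flag, "--mbox", path])
    return commands


def train_and_clear(spam_mbox, ham_mbox, run=subprocess.run):
    """Run sa-learn on each mbox; delete the mbox only if sa-learn succeeded."""
    for cmd in learn_commands(spam_mbox, ham_mbox):
        result = run(cmd)
        if result.returncode == 0:
            os.remove(cmd[-1])  # the mbox path is the last argument


if __name__ == "__main__":
    # Hypothetical locations; adjust to wherever the server keeps the folders.
    home = os.path.expanduser("~")
    train_and_clear(os.path.join(home, "mail", "spam"),
                    os.path.join(home, "mail", "notspam"))
```

The job only ever sees mail that a person has already sorted into one of
the two folders, so nothing unverified reaches Bayes. Removing an mbox
while an IMAP client has it open may need locking, which this sketch
does not attempt.
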

>> No, under your setup there's no way for each mailbox to have its own
>> user_prefs; there's one user_prefs for each master domain and that's it.
>> There's also no way for each mailbox to have its own bayes database --
>> there's one bayes database for the entire master domain.

SN> I realize that this is true for my present setup. However, I hope that
SN> the new setup won't have those restrictions. If it's possible to run
SN> SpamAssassin via cron or whatever, it should also be possible to run
SN> a private copy that is installed in my home directory. I hope that by
SN> determining the recipient and setting up an appropriate environment
SN> prior to invoking SpamAssassin, independent bayes and prefs will work.
SN> If not, hey, SpamAssassin is made of this amazing stuff called open source
SN> -- you can change the code and make it do what you want. Of course,
SN> it may take more effort than the improvement in performance would justify,
SN> so I'll first see how much improvement sa-learn gives.

Several people are making progress with SQL-based user_prefs and rules;
their systems might be adaptable to yours.

Bob Menschel