Hi,

On Thu, 10 Dec 2009 20:28:27 +0100, Johannes Bauer wrote:
> Eduardo M KALINOWSKI wrote:
>> On Thu, 10 Dec 2009, Johannes Bauer wrote:
>>> I'm thinking about filtering all such encoded subjects (as there's no
>>> reason to encode them US-ASCII), but suppose it were UTF-8 or something:
>>> how can I filter on the actual content, not the encoded subject? Surely
>>> someone has solved that problem already?
>>
>> Yes, such as the guys behind SpamAssassin, or dspam, or any of the many
>> spam filtering programs that exist. Actually, they make much more
>> complicated decisions instead of only looking for bad words in the
>> subject field. I'd suggest you try installing one of them.
>
> I had SpamAssassin running once and was pretty disappointed. All those
> complicated rules and scoring and "smart" bayesian filtering did not
> work very well, although I taught it in around 50k mails right from
> wrong. I had both lots of false-positives and lots of false-negatives,
> which was kind of annoying.
>
> However, analyzing 274 spam mails I deleted in the last 5 months I can
> conclude that by using that extremely simple filter list I'd catch 258
> of them (that's 94%). So I'd like to stick to KISS in this case.

That must have been a configuration issue. SpamAssassin works pretty
well if configured correctly, but I admit it's a monster (both in terms
of configuration and resource usage).

You could go for bogofilter (purely Bayesian). I have been using it for
years on my private mail server with very good results. I like to use
tri-state filtering, where there is not just one threshold value, but
two. Mail with a spam probability ("bogosity") of 0.35 or below goes
into my inbox, mail with a bogosity between 0.35 and 0.65 goes into
Spam/Unsure, and everything above 0.65 goes directly into Spam. That way
I get something like 10-20 mails per week in Spam/Unsure; these are
usually spam (false negatives), rarely legitimate mail (false
positives). For comparison, around 1000 mails per week currently end up
in Spam. To my knowledge there has never been a false positive in Spam.

Of course, initial training is necessary. For ongoing training and
feedback I have set up two mailboxes, Spam/Learn-Spam and
Spam/Learn-Ham, into which I move false negatives and false positives
respectively. A cron script then runs the mails found in those (maildir)
mailboxes through bogofilter again, with the command-line option that
registers each mail as spam or ham, and afterwards moves them to the
correct mailbox (Spam or inbox). This works with any MUA, because
training the filter only requires moving mail between IMAP folders.

The solution was inspired by a Gentoo Wiki article
(http://www.gentoo-wiki.info/Bogofilter).

Patrick.

-- 
STAR Software (Shanghai) Co., Ltd.    http://www.star-group.net/
Phone: +86 (21) 3462 7688 x 826       Fax: +86 (21) 3462 7779
PGP key E883A005  https://stshacom1.star-china.net/keys/patrick_nagel.asc
Fingerprint: E09A D65E 855F B334 E5C3  5386 EF23 20FC E883 A005
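
For illustration only, the two-threshold rule described above can be
sketched in a few lines of Python. This is not how bogofilter is
implemented or configured; it just restates the rule. The function name
route() and the constants HAM_CUTOFF and SPAM_CUTOFF are made up for
this sketch, and the boundary handling (exactly 0.35 goes to the inbox,
exactly 0.65 goes to Spam/Unsure) follows the wording "0.35 and below"
and "above 0.65".

```python
# Hypothetical names; only the threshold values and the folder names
# come from the description above.
HAM_CUTOFF = 0.35   # at or below this bogosity: deliver to the inbox
SPAM_CUTOFF = 0.65  # strictly above this bogosity: deliver to Spam


def route(bogosity: float) -> str:
    """Return the target mailbox for a mail with the given bogosity."""
    if not 0.0 <= bogosity <= 1.0:
        raise ValueError(f"bogosity must be in [0, 1], got {bogosity}")
    if bogosity <= HAM_CUTOFF:
        return "inbox"
    if bogosity > SPAM_CUTOFF:
        return "Spam"
    # Everything in between is left for a human to review.
    return "Spam/Unsure"


if __name__ == "__main__":
    for score in (0.02, 0.35, 0.50, 0.65, 0.99):
        print(f"{score:.2f} -> {route(score)}")
```

The point of the middle band is that uncertain mail is never silently
filed as spam or as ham; it stays visible until someone moves it into
one of the Learn mailboxes.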