Dear all,

Thanks for all the replies to my question; they were all useful to read. Thank you for your time.

I wasn't sure whom to reply to, but I've been tinkering with my setup, and I think that many spam messages are getting through which should be caught by the so-called "Bayesian" text-based classifier. For instance, there are 499 "spam" messages containing "Instacheat" in the subject, and no such "ham" messages... The most recent such message in my "spam" folder has BAYES_999, but there are two in my inbox, one with BAYES_95 and one with BAYES_50. I've pasted the second one below. There are a bunch of spam messages with similar properties, e.g. with obvious erectile dysfunction words in the subject, getting through.

My spam folder has about 40,000 messages and my inbox maybe 30,000. I tried changing my "train-spamassassin" script to clear the Bayesian database first, and I also remembered the "sa-update" command and put it in there for good measure. I disabled the mail fetch line in my crontab while I ran it, so I'm sure it's not misclassifying things (to answer Mr. Hardin's implied question).

#!/bin/zsh
sudo sa-update --channel updates.spamassassin.org --verbose
sudo -u spamd sa-learn --clear
sudo -u spamd sa-learn --showdots -D 1 --spam --dir ~/mail/folders/spam
sudo -u spamd sa-learn --showdots -D 1 --ham --dir ~/mail/folders/inbox

After running it, I saw a message with a penny stock subject go from BAYES_60 to BAYES_95, so it is now classified correctly; but all the ones I described above were misclassified even after the latest training run (which took hours).

It seems a bit unfortunate, at least from my perspective, that it's not so easy to train the weights for various rules on a per-user basis, not just automatic textual features but things like HTML_MESSAGE or T_REMOTE_IMAGE... There are algorithms to do this reweighting very quickly (e.g. a logistic GLM, which should take seconds) while also incorporating some prior beliefs, corresponding to the default weights. But I think nobody in the machine learning community seems to be really interested in the problem of spam...
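
To make the reweighting idea a bit more concrete, here is a rough sketch of what I have in mind: a logistic regression over "which rules fired" features, with a Gaussian prior centred on the default scores, fitted by plain gradient descent. Everything in it is an assumption for illustration. The function name fit_rule_weights is made up, the "default" scores are invented numbers rather than SpamAssassin's real ones, the six training rows are toy data, and nothing like this exists in SpamAssassin as far as I know.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_rule_weights(X, y, prior_w, prior_b, prior_strength=1.0,
                     lr=0.1, steps=2000):
    """MAP logistic regression with a Gaussian prior centred on the defaults.

    X: rows of 0/1 flags, one column per rule; y: 1 for spam, 0 for ham.
    prior_w, prior_b: default per-rule scores and offset (hypothetical values).
    prior_strength: how strongly the fit is pulled back towards the defaults.
    """
    n, d = len(X), len(prior_w)
    w, b = list(prior_w), prior_b
    step = lr / (n + prior_strength)  # keeps the step size stable for any prior
    for _ in range(steps):
        # Gradient of the prior term: pulls each weight towards its default.
        grad_w = [prior_strength * (w[j] - prior_w[j]) for j in range(d)]
        grad_b = prior_strength * (b - prior_b)
        # Gradient of the log-loss term, summed over all messages.
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [w[j] - step * grad_w[j] for j in range(d)]
        b -= step * grad_b
    return w, b

# Toy example. Rule names come from the sample header; the scores and the
# data rows are invented purely for illustration.
rules = ["HTML_MESSAGE", "T_REMOTE_IMAGE", "RDNS_DYNAMIC"]
prior_w = [0.1, 0.1, 1.0]   # made-up "default" scores
prior_b = -5.0              # plays the role of the required=5.0 threshold
X = [[1, 1, 1], [1, 0, 1], [0, 1, 1],   # spam
     [1, 1, 0], [1, 0, 0], [0, 0, 0]]   # ham
y = [1, 1, 1, 0, 0, 0]

w, b = fit_rule_weights(X, y, prior_w, prior_b)
for name, default, fitted in zip(rules, prior_w, w):
    print(f"{name}: default {default:.2f} -> fitted {fitted:.2f}")
```

A large prior_strength keeps the fitted scores near the defaults, and a small one lets my own mail dominate. With tens of thousands of messages one would want a proper solver rather than this loop, but the problem is convex and small, so it should still be fast.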

It would also be nice to see what the Bayesian classifier is doing, but the database is all hashes, so one is left guessing when it goes wrong.

Of course, even Gmail's spam classifier is pretty bad in my experience. I'm still waiting for a "semi-supervised" or "active learning" solution, which can take some large corpora and query me only about the labels of the boundary cases. Maybe Google has already tried this and it doesn't work for some reason that escapes my imagination.

Given that SpamAssassin has gone through the effort of coding up so many useful rules, it should be easy for a machine learning researcher to take these and use them as features in a more modern algorithm. Maybe I'm being totally unhelpful by pointing this out, in which case I apologise for not knowing better. I think I tried to make the same point here some years ago and it didn't go anywhere.

Best regards,

Frederick

Here's the sample spam:

From tfioxmns...@mariupol.us Fri Dec 16 20:30:08 2016
Return-Path: <tfioxmns...@mariupol.us>
X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on thutmose
X-Spam-Level: ***
X-Spam-Status: No, score=4.0 required=5.0 tests=BAYES_50,
        HEADER_FROM_DIFFERENT_DOMAINS,HELO_DYNAMIC_IPADDR,HTML_MESSAGE,
        MIME_QP_LONG_LINE,RDNS_DYNAMIC,T_REMOTE_IMAGE,T_SPF_HELO_TEMPERROR,
        T_SPF_TEMPERROR autolearn=no autolearn_force=no version=3.4.1
X-Original-To: frede...@ofb.net
Delivered-To: frede...@ofb.net
Received: from host-173-230-94-183.fltapsf.clients.pavlovmedia.com (host-173-230-94-183.fltapsf.clients.pavlovmedia.com [173.230.94.183])
        by ofb.net (Postfix) with SMTP id 1CF1D3FFB7
        for <frede...@ofb.net>; Fri, 16 Dec 2016 20:30:07 -0800 (PST)
Message-ID: <756871361203-qgaxslpamnpdlkenbja...@pyzgb78.ezmicro.com>
From: Alexandra Smith <smith_ryle...@ezmicro.com>
Subject: Re: 1 Instacheat Request is Pending
To: frede...@ofb.net
Date: Sat, 17 Dec 2016 10:25:25 +0600
Mime-Version: 1.0
Content-Type: multipart/alternative;
        boundary="_av-gJPw4bNeVCqYrAQlhC5agA"
X-My-Tags: inbox
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable

*1 Instacheat Request is Pending* ❤ ❤ ❤

--_av-Hmri4xobxH07rQj8ufhPIg
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: quoted-printable

On Thu, Dec 15, 2016 at 08:42:36AM -0800, John Hardin wrote:
> On Thu, 15 Dec 2016, frede...@ofb.net wrote:
>
> > sudo -u spamd sa-learn --showdots -D 1 --ham --dir ~/mail/folders/inbox
>
> Bad idea. That learns as ham any FNs you haven't yet noticed and removed
> from your inbox.
>
> You should only learn as ham messages that you have explicitly reviewed and
> judged as ham.
>
> --
> John Hardin KA7OHZ http://www.impsec.org/~jhardin/
> jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org
> key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
> -----------------------------------------------------------------------
> It is not the place of government to make right every tragedy and
> woe that befalls every resident of the nation.
> -----------------------------------------------------------------------
> Today: Bill of Rights day
>