Dear all,

Thanks for all the replies to my question; I found all of them useful
to read. Thank you all for your time.

I wasn't sure whom to reply to, but I've been tinkering with my setup
and I think that many spam messages are getting through which should
be caught by the so-called "Bayesian" text-based classifier. For
instance, there are 499 "spam" messages containing "Instacheat" in the
subject, and no such "ham" messages... The most recent such message in
my "spam" folder has BAYES_999, but there are two in my inbox, one
with BAYES_95 and one with BAYES_50. I've pasted the second one below.
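
Counts like the ones above can be reproduced with a small shell helper. This is only a sketch: `count_subject` is a hypothetical name, it assumes each folder is a directory holding one message per file, and it will also count a message whose body happens to contain a matching "Subject:" line.

```shell
#!/bin/sh
# count_subject DIR WORD
# Prints how many files under DIR contain a "Subject:" line mentioning WORD
# (case-insensitive). WORD is treated as a grep regular expression.
# Assumption: one message per file, as in a maildir or MH style folder.
count_subject() {
    dir=$1
    word=$2
    # -r recurse, -l list each matching file once, -i ignore case, -s hide read errors
    grep -rlis -- "^Subject:.*$word" "$dir" | wc -l | tr -d ' '
}

# Example calls (paths taken from the training script in this message):
#   count_subject ~/mail/folders/spam  Instacheat
#   count_subject ~/mail/folders/inbox Instacheat
```

Running it on both folders gives a quick per-keyword picture of how many messages ended up on each side.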

There are a bunch of spam messages with similar properties, with e.g.
obvious erectile dysfunction words in the subject, getting through. My
spam folder has about 40,000 messages and my inbox maybe 30,000. I
tried changing my "train-spamassassin" script to clear the Bayesian
database first, and I also remembered the "sa-update" command and put
it in there for good measure. I disabled the mail fetch line in my
crontab while I ran it, so I'm sure it's not misclassifying things (to
answer Mr. Hardin's implied question).

    #!/bin/zsh
    sudo sa-update --channel updates.spamassassin.org --verbose
    sudo -u spamd sa-learn --clear
    sudo -u spamd sa-learn --showdots -D 1 --spam --dir ~/mail/folders/spam
    sudo -u spamd sa-learn --showdots -D 1 --ham --dir ~/mail/folders/inbox

After running it, I saw a message with a penny stock subject go from
BAYES_60 to BAYES_95, so it is now classified correctly; but all the
messages I described above are still misclassified after the latest
training run (which took hours).
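
One sanity check after a long training run is to confirm that the database really holds roughly as many spam and ham messages as were fed in. `sa-learn --dump magic` prints that summary; the helper below, with the hypothetical name `bayes_counts`, just pulls out the two counts. It assumes the usual layout of that output, where the count is the third column and the line ends in "non-token data: nspam" or "non-token data: nham"; if a given version prints something different, the awk patterns need adjusting.

```shell
#!/bin/sh
# bayes_counts
# Reads `sa-learn --dump magic` output on stdin and prints the learned
# spam and ham message counts on one line.
# Assumption: the count is field 3 and the label is at the end of the line.
bayes_counts() {
    awk '
        /non-token data: nspam$/ { s = $3 }
        /non-token data: nham$/  { h = $3 }
        END { print "nspam=" s, "nham=" h }
    '
}

# Example call, using the same user as the training script in this message:
#   sudo -u spamd sa-learn --dump magic | bayes_counts
```

By default sa-learn works on the database of the user it runs as, so it is worth checking these counts as whichever user actually scans the incoming mail; if they differ, the scanner may not be reading the database that was just trained.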

It seems a bit unfortunate, at least from my perspective, that it's
not so easy to train the weights for various rules on a per-user
basis, not just automatic textual features but things like
HTML_MESSAGE or T_REMOTE_IMAGE... There are algorithms to do this
reweighting very quickly - e.g. using a logistic GLM which should take
seconds - while also incorporating some prior beliefs, corresponding
to default weights. But I think nobody in the machine learning
community seems to be really interested in the problem of spam... It
would also be nice to see what the Bayesian classifier is doing, but
the database is all hashes so one is left guessing when it goes wrong.
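
The reweighting idea in this paragraph can be written down as a penalized logistic regression. This is only one common formulation (a MAP estimate with a Gaussian prior), not something SpamAssassin provides, and all the symbols are introduced here for illustration: x_i is the 0/1 vector of rules that fired on message i, y_i is +1 for spam and -1 for ham, w_0 holds the default rule scores, t is the spam threshold (required=5.0 in the sample below), and lambda sets how strongly the fit is pulled back toward the defaults.

```latex
\[
  \hat{w} \;=\; \arg\min_{w}\;
  \sum_{i=1}^{n} \log\!\Bigl(1 + \exp\bigl(-y_i\,(w^{\top} x_i - t)\bigr)\Bigr)
  \;+\; \frac{\lambda}{2}\,\lVert w - w_0 \rVert_2^{2}
\]
```

With lambda very large the solution stays at the default scores; with lambda small it follows the user's own labelled mail. The objective is convex, which is why a fit over a few hundred rule features should be fast.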

Of course, even Gmail's spam classifier is pretty bad in my
experience. I'm still waiting for a "semi-supervised" or "active
learning" solution, which can take some large corpora and query me
only about labels of the boundary cases. Maybe Google has already
tried this and it doesn't work for some reason that escapes my
imagination.

Given that SpamAssassin has gone through the effort of coding up so
many useful rules, it should be easy for a machine learning researcher
to take these and use them as features in a more modern algorithm. Maybe
I'm being totally unhelpful by pointing this out, in which case I
apologise for not knowing better. I think I tried to make the same
point here some years ago and it didn't go anywhere.

Best regards,

Frederick

Here's the sample spam:

    From tfioxmns...@mariupol.us  Fri Dec 16 20:30:08 2016
    Return-Path: <tfioxmns...@mariupol.us>
    X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on thutmose
    X-Spam-Level: ***
    X-Spam-Status: No, score=4.0 required=5.0 tests=BAYES_50,
            HEADER_FROM_DIFFERENT_DOMAINS,HELO_DYNAMIC_IPADDR,HTML_MESSAGE,
            MIME_QP_LONG_LINE,RDNS_DYNAMIC,T_REMOTE_IMAGE,T_SPF_HELO_TEMPERROR,
            T_SPF_TEMPERROR autolearn=no autolearn_force=no version=3.4.1      
    X-Original-To: frede...@ofb.net
    Delivered-To: frede...@ofb.net
    Received: from host-173-230-94-183.fltapsf.clients.pavlovmedia.com
            (host-173-230-94-183.fltapsf.clients.pavlovmedia.com [173.230.94.183])
            by ofb.net (Postfix) with SMTP id 1CF1D3FFB7
            for <frede...@ofb.net>; Fri, 16 Dec 2016 20:30:07 -0800 (PST)
    Message-ID: <756871361203-qgaxslpamnpdlkenbja...@pyzgb78.ezmicro.com>
    From: Alexandra Smith <smith_ryle...@ezmicro.com>
    Subject: Re: 1 Instacheat Request is Pending
    To: frede...@ofb.net
    Date: Sat, 17 Dec 2016 10:25:25 +0600
    Mime-Version: 1.0
    Content-Type: multipart/alternative; boundary="_av-gJPw4bNeVCqYrAQlhC5agA"
    X-My-Tags: inbox

    Content-Type: text/plain; charset=utf-8
    Content-Transfer-Encoding: quoted-printable

            *1 Instacheat Request is Pending*
      ❤ ❤ ❤

    --_av-Hmri4xobxH07rQj8ufhPIg
    Content-Type: text/html; charset=utf-8
    Content-Transfer-Encoding: quoted-printable


On Thu, Dec 15, 2016 at 08:42:36AM -0800, John Hardin wrote:
> On Thu, 15 Dec 2016, frede...@ofb.net wrote:
> 
> > sudo -u spamd sa-learn --showdots -D 1 --ham --dir ~/mail/folders/inbox
> 
> Bad idea. That learns as ham any FNs you haven't yet noticed and removed
> from your inbox.
> 
> You should only learn as ham messages that you have explicitly reviewed and
> judged as ham.
> 
> -- 
>  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
>  jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
>  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
> -----------------------------------------------------------------------
>   It is not the place of government to make right every tragedy and
>   woe that befalls every resident of the nation.
> -----------------------------------------------------------------------
>  Today: Bill of Rights day
> 
