On Tue, 15 Jan 2013, Ben Johnson wrote:
On 1/14/2013 8:16 PM, John Hardin wrote:
On Mon, 14 Jan 2013, Ben Johnson wrote:
Question: do you have any SMTP-time hard-reject DNSBL tests in place? Or
are they all performed by SA?
In postfix's main.cf:
smtpd_recipient_restrictions =
    permit_mynetworks,
    permit_sasl_authenticated,
    check_recipient_access mysql:/etc/postfix/mysql-virtual_recipient.cf,
    reject_unauth_destination,
    reject_rbl_client bl.spamcop.net
Do you recommend something more?
Unfortunately I have no experience administering Postfix. Perhaps one of
the other listies can help.
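(For what it's worth, additional SMTP-time DNSBL checks in Postfix are normally
just further reject_rbl_client entries appended to the same restriction list;
the zone below is only an illustrative, commonly used example, not a
recommendation from this thread:

    smtpd_recipient_restrictions =
        permit_mynetworks,
        permit_sasl_authenticated,
        ...
        reject_rbl_client bl.spamcop.net,
        reject_rbl_client zen.spamhaus.org
)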
http://www.greylisting.org/
Hmm, very interesting. No, I have no greylisting in place as yet, and
no, my userbase doesn't demand immediate delivery. I will look into
greylisting further.
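(A typical Postfix greylisting setup delegates the decision to a policy daemon
such as postgrey; a minimal sketch, assuming postgrey is running on its default
port 10023 on localhost:

    # main.cf -- add a policy check after the existing restrictions
    smtpd_recipient_restrictions =
        permit_mynetworks,
        permit_sasl_authenticated,
        ...
        reject_unauth_destination,
        check_policy_service inet:127.0.0.1:10023
)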
One other thing you might try is publishing an SPF record for your domain.
There is anecdotal evidence that this reduces the raw spam volume to that
domain a bit.
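(An SPF record is simply a DNS TXT record on the domain; a hedged example for a
hypothetical domain that sends mail only from its MX hosts:

    example.com.   IN TXT   "v=spf1 mx -all"
)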
Given this information, it concerns me that Bayes scores hardly seem to
budge when I feed sa-learn nearly identical messages 3+ times. We'll get
into that below.
If so, then I guess the only remedy here is to focus on why Bayes seems
to perform so miserably.
Agreed.
It must be a configuration issue, because I've sa-learn-ed messages
that are incredibly similar for two days now and not only do their
Bayes scores not change significantly, but sometimes they decrease.
And I have a hard time believing that one of my users is training
these messages as ham and negating my efforts.
This is why you retain your Bayes training corpora: so that if Bayes
goes off the rails you can review your corpora for misclassifications,
wipe and retrain. Do you have your training corpora? Or do you discard
messages once you've trained them?
I had the good sense to retain the corpora.
Yay!
_Do_ you allow your users to train Bayes? Do they do so unsupervised or
do you review their submissions? And if the process is automated, do you
retain what they have provided for training so that you can go back
later and do a troubleshooting review?
Yes, users are allowed to train Bayes, via Dovecot's Antispam plug-in.
They do so unsupervised. Why this could be a problem is obvious. And no,
I don't retain their submissions. I probably should. I wonder if I can
make a few slight modifications to the shell script that Antispam calls,
such that it simply sends a copy of the message to an administrator
rather than calling sa-learn on the message.
That would be a very good idea if the number of users doing training is
small. At the very least, the messages should be captured to a permanent
corpus mailbox.
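(A minimal sketch of such a wrapper, assuming the Antispam plug-in pipes the
message on stdin and that /var/spool/sa-corpus/spam is a maildir set aside for
review; the paths and the choice to keep training immediately are placeholders:

    #!/bin/sh
    # Hypothetical trainer wrapper: keep a copy of every user submission
    # in a review corpus, then train as before.
    CORPUS=/var/spool/sa-corpus/spam
    msg=$(mktemp "$CORPUS/new/msg.XXXXXX") || exit 1
    cat > "$msg"                  # capture the submitted message verbatim
    exec sa-learn --spam -u amavis "$msg"
)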
Do your users also train ham? Are the procedures similar enough that your
users could become easily confused?
Do you have autolearn turned on? My opinion is that autolearn is only
appropriate for a large and very diverse userbase where a sufficiently
"common" corpus of ham can't be manually collected. but then, I don't
admin a Really Large Install, so YMMV.
No, I was sure to disable autolearn after the last Bayes fiasco. :)
OK.
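(For reference, the relevant setting in local.cf is the one below; 0 disables
auto-learning:

    bayes_auto_learn 0
)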
Do you use per-user or sitewide Bayes? If per-user, then you need to
make sure that you're training Bayes as the same user that the MTA is
running SA as.
Site-wide. And I have hard-coded the username in the SA configuration to
prevent confusion in this regard:
bayes_sql_override_username amavis
What user does your MTA run SA as? What user do you train Bayes as?
The MTA should pass scanning off to "amavis". I train the DB in two
ways: via Dovecot Antispam and by calling sa-learn on my training
mailbox. Given that I have hard-coded the username, the output of
"sa-learn --dump magic" is the same whether I issue the command under my
own account or "su" to the "amavis" user.
OK, good.
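(A quick way to confirm both accounts really hit the same site-wide database,
run with enough privilege to su to amavis; the temp-file names are just
placeholders:

    sa-learn --dump magic > /tmp/magic.me
    su -s /bin/sh amavis -c 'sa-learn --dump magic' > /tmp/magic.amavis
    diff /tmp/magic.me /tmp/magic.amavis && echo "same Bayes DB"
)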
I have ensured that the spam token count increases when I train these
messages. That said, I do notice that the token count does not *always*
change; sometimes, sa-learn reports "Learned tokens from 0 message(s) (1
message(s) examined)". Does this mean that all tokens from these
messages have already been learned, thereby making it pointless to
continue feeding them to sa-learn?
No, it means a message with that Message-ID has already been learned from.
I see. So, when this happens, it means that one of my users has already
dragged the message from Inbox to Junk (which triggers the Antispam
plug-in and feeds the message to sa-learn).
Very likely.
The extremely odd thing is that you say you sometimes train a message as
spam, and its Bayes score goes *down*. Are you training a message and
then running it through spamc to see if the score changed, or is this
about _similar_ messages rather than _that_ message?
When this scenario occurs, my efforts in feeding the same message to
sa-learn are wasted, right? Bayes doesn't "learn more" from the message
the second time, or increase its tokens' "weight", right? It would be
nice if I could eliminate this duplicate effort.
Correct, no new information is learned.
Based on my responses, what's the next move? Backup the Bayes DB, wipe
it, and feed my corpus through the ol' chipper?
That, and configure the user-based training to at the very least capture
what they submit to a corpus so you can review it. Whether you do that
review pre-training or post-bayes-is-insane is up to you.
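(A rough outline of that procedure, with placeholder corpus paths; the --backup
copy is just a safety net before the wipe:

    sa-learn -u amavis --backup > /root/bayes-backup.txt
    sa-learn -u amavis --clear
    sa-learn -u amavis --spam /path/to/corpus/spam/
    sa-learn -u amavis --ham  /path/to/corpus/ham/
    sa-learn -u amavis --dump magic    # sanity check: nspam/nham counts
)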
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
The difference is that Unix has had thirty years of technical
types demanding basic functionality of it. And the Macintosh has
had fifteen years of interface fascist users shaping its progress.
Windows has the hairpin turns of the Microsoft marketing machine
and that's all. -- Red Drag Diva
-----------------------------------------------------------------------
2 days until Benjamin Franklin's 307th Birthday