On 1/14/2013 8:16 PM, John Hardin wrote:
> On Mon, 14 Jan 2013, Ben Johnson wrote:
>
>> I understand that snowshoe spam may not hit any net tests. I guess my
>> confusion is around what, exactly, classifies spam as "snowshoe".
>
> http://www.spamhaus.org/faq/section/Glossary
>
> Basically, a large number of spambots sending the message so that no one
> sending IP can be easily tagged as evil.
>
> Question: do you have any SMTP-time hard-reject DNSBL tests in place? Or
> are they all performed by SA?

In postfix's main.cf:

smtpd_recipient_restrictions =
    permit_mynetworks,
    permit_sasl_authenticated,
    check_recipient_access mysql:/etc/postfix/mysql-virtual_recipient.cf,
    reject_unauth_destination,
    reject_rbl_client bl.spamcop.net

Do you recommend something more?

> Recommendation: consider using the Spamhaus ZEN DNSBL as a hard-reject
> SMTP-time DNS check in your MTA. It is well-respected and very reliable.
> One thing it includes is ranges of IP addresses that should not ever be
> sending email, so it may help reduce snowshoe spam.
>
> http://www.spamhaus.org/zen/

This article looks to be pretty thorough:

http://www.cyberciti.biz/faq/howto-configure-postfix-dnsrbls-under-linux-unix/

I'll add Spamhaus ZEN and a few others to the list.

> Another tactic that many report good results from is Greylisting. Do you
> have greylisting in place? Does your userbase demand no delays in mail
> delivery? In addition to blocking spam from spambots that do not retry,
> it can delay mail enough for the BLs to get a chance to list new
> IPs/domains, which can reduce the leakage if you happen to be at the
> leading edge of a new delivery campaign.
>
> http://www.greylisting.org/

Hmm, very interesting. No, I have no greylisting in place as yet, and
no, my userbase doesn't demand immediate delivery. I will look into
greylisting further.

>> Are most/all of the BL services hash-based?
>
> Generally:
>
> DNSBL: Blacklist of IP addresses
> URIBL: Blacklist of domain and host names appearing in URIs
> EMAILBL: (not widely used) Blacklist of email addresses (e.g.
> phishing response addresses)
> Razor, Pyzor: Blacklist of message content checksums/hashes

Perfect; that answers my question.

>> In other words, if a known spam message was added yesterday, will it
>> be considered "snowshoe" spam if the spammer sends the same message
>> today and changes only one character within the body?
>
> No, the diverse IP addresses are the hallmark of "snowshoe", not so much
> the specific message content.
> If you see identical or generally-similar
> (e.g.) pharma spam coming from a wide range of different IP addresses,
> that's snowshoe.

I see. Given this information, it concerns me that Bayes scores hardly
seem to budge when I feed sa-learn nearly identical messages 3+ times.
We'll get into that below.

>> If so, then I guess the only remedy here is to focus on why Bayes seems
>> to perform so miserably.
>
> Agreed.
>
>> It must be a configuration issue, because I've sa-learn-ed messages
>> that are incredibly similar for two days now and not only do their
>> Bayes scores not change significantly, but sometimes they decrease.
>> And I have a hard time believing that one of my users is sa-train-ing
>> these messages as ham and negating my efforts.
>
> This is why you retain your Bayes training corpora: so that if Bayes
> goes off the rails you can review your corpora for misclassifications,
> wipe and retrain. Do you have your training corpora? Or do you discard
> messages once you've trained them?

I had the good sense to retain the corpora.

> _Do_ you allow your users to train Bayes? Do they do so unsupervised or
> do you review their submissions? And if the process is automated, do you
> retain what they have provided for training so that you can go back
> later and do a troubleshooting review?

Yes, users are allowed to train Bayes, via Dovecot's Antispam plug-in.
They do so unsupervised. Why this could be a problem is obvious. And no,
I don't retain their submissions. I probably should. I wonder if I can
make a few slight modifications to the shell script that Antispam calls,
such that it simply sends a copy of the message to an administrator
rather than calling sa-learn on the message.

> Do you have autolearn turned on? My opinion is that autolearn is only
> appropriate for a large and very diverse userbase where a sufficiently
> "common" corpus of ham can't be manually collected. But then, I don't
> admin a Really Large Install, so YMMV.

No, I was sure to disable autolearn after the last Bayes fiasco. :)

> Do you use per-user or sitewide Bayes? If per-user, then you need to
> make sure that you're training Bayes as the same user that the MTA is
> running SA as.

Site-wide. And I have hard-coded the username in the SA configuration to
prevent confusion in this regard:

bayes_sql_override_username amavis

> What user does your MTA run SA as? What user do you train Bayes as?

The MTA should pass scanning off to "amavis". I train the DB in two
ways: via Dovecot Antispam and by calling sa-learn on my training
mailbox. Given that I have hard-coded the username, the output of
"sa-learn --dump magic" is the same whether I issue the command under my
own account or "su" to the "amavis" user.

> One possibility is that the MTA is running SA as a different user than
> you are training Bayes as, and you have autolearn turned on, and Bayes
> has been running in its own little world since day one regardless of
> what you think you're telling it to do.

That is what happened last year. I hope to have eliminated those issues
this time around. (I dumped the old DB and started over after that
debacle.) The X-Spam-Status header always displays "autolearn=disabled".

>> I have ensured that the spam token count increases when I train these
>> messages. That said, I do notice that the token count does not *always*
>> change; sometimes, sa-learn reports "Learned tokens from 0 message(s) (1
>> message(s) examined)". Does this mean that all tokens from these
>> messages have already been learned, thereby making it pointless to
>> continue feeding them to sa-learn?
>
> No, it means that Message-ID has been learned from before.

I see. So, when this happens, it means that one of my users has already
dragged the message from Inbox to Junk (which triggers the Antispam
plug-in and feeds the message to sa-learn). When this scenario occurs,
my efforts in feeding the same message to sa-learn are wasted, right?
Bayes doesn't "learn more" from the message the second time, or increase
its tokens' "weight", right? It would be nice if I could eliminate this
duplicate effort.

>> Finally, I added the test you supplied to my SA configuration, restarted
>> Amavis, and all messages appear to be tagged with RCVD_IN_HITALL=0.001.
>
> So this proves DNS lookups are indeed working for all messages.

Okay, good to know. I think we're "all clear" in the DNS/network test
department.

Based on my responses, what's the next move? Back up the Bayes DB, wipe
it, and feed my corpus through the ol' chipper?

Thanks again!

-Ben
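
On the duplicate-effort point above: a minimal Python sketch of filtering
a training corpus so that each Message-ID is handed to sa-learn only
once. It assumes the Message-ID header is a good enough stand-in for
whatever identifier sa-learn uses to recognise an already-learned
message, which may not hold exactly. The names message_id and unseen are
hypothetical, and the sketch only filters raw messages; it never calls
sa-learn itself.

```python
import email
from typing import Iterable, Iterator, Optional, Set


def message_id(raw: bytes) -> Optional[str]:
    """Return the Message-ID header of a raw message, or None if absent/empty."""
    msg = email.message_from_bytes(raw)
    value = msg.get("Message-ID")  # header lookup is case-insensitive
    if value is None:
        return None
    value = str(value).strip()
    return value or None


def unseen(messages: Iterable[bytes],
           seen: Optional[Set[str]] = None) -> Iterator[bytes]:
    """Yield each message whose Message-ID has not been encountered before.

    Messages without a Message-ID are always yielded, since there is
    nothing to compare them by (an assumption of this sketch).
    """
    seen = set() if seen is None else seen
    for raw in messages:
        mid = message_id(raw)
        if mid is not None:
            if mid in seen:
                continue  # already queued or trained; skip the duplicate
            seen.add(mid)
        yield raw
```

Passing the same "seen" set across runs (for example, one loaded from a
file of previously trained IDs) would let the filter skip messages that
users have already dragged to Junk, provided those IDs were recorded at
the time.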