Since last Saturday we have been receiving a lot of spam which gets through
our institute-wide SpamAssassin.
(We run SpamAssassin version 3.1.3 with bayes_auto_learn,
use_auto_whitelist, razor2, pyzor and dcc in a configuration recommended
by GARR for Italian academic institutions; rejected spam (score above 4.5)
is archived system-wide in a daily quarantine folder; users MAY forward
spam which gets through to an area where a daily crontab picks it up for
sa-learn. We have been happy with the entire arrangement for a couple
of years.)
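As a rough illustration of the daily training step described above, here is a minimal Python sketch. It assumes the submission area is a directory of mbox files, which the description above does not actually say; the directory layout and the helper names are hypothetical. The only real command used is `sa-learn --spam --mbox <file>`.

```python
import subprocess
from pathlib import Path


def build_sa_learn_command(mbox_path, as_spam=True):
    """Return the argv list that trains Bayes on one mbox file."""
    kind = "--spam" if as_spam else "--ham"
    return ["sa-learn", kind, "--mbox", str(mbox_path)]


def commands_for_area(area):
    """One sa-learn command per *.mbox file in the (hypothetical) area."""
    return [build_sa_learn_command(p) for p in sorted(Path(area).glob("*.mbox"))]


def train_all(area):
    """Run the commands; meant to be called from the daily cron job."""
    for cmd in commands_for_area(area):
        subprocess.run(cmd, check=True)
```

Building the argument lists separately from running them makes it possible to print and inspect what the cron job would do before letting it touch the bayes db.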
The spam which gets through is a short HTML message containing a
mailto: href (usually a gmail address), and a slightly variable text in
(bad) Italian (with spelling and grammar errors) stating that "80% of the
people in your [country|city|region...] is unhappy with their monthly
income" and offering a job in internet advertising.
I've seen that SOME of the messages are caught by spamassassin (some 60
messages on Saturday, 80 on Sunday, 130 yesterday and some 100 this
morning), but many more get through (I counted 37 yesterday alone, after I
installed a procmail filter of my own based on a single key phrase, and a
colleague counted 56). My estimate is that spamassassin blocks only about
10-20% of the messages.
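For readers who want to count such messages themselves, the single-key-phrase idea can be sketched in Python. This is not the procmail recipe mentioned above, only an analogue of it; the phrase shown is a made-up placeholder, and the function name is hypothetical.

```python
import email
from email import policy

# Placeholder only: the real key phrase is not given in this post.
KEY_PHRASE = "example key phrase"


def contains_key_phrase(raw_bytes, phrase=KEY_PHRASE):
    """True if the HTML (or plain text) body contains the phrase."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    body = msg.get_body(preferencelist=("html", "plain"))
    if body is None:
        return False
    # Collapse whitespace so a phrase broken across lines still matches.
    text = " ".join(body.get_content().split())
    return phrase.lower() in text.lower()
```

Matching is case-insensitive and ignores line breaks, since the text of these messages varies slightly from one copy to the next.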
(In support of this, I've noted that the total daily rate of messages
is higher by some 700 messages per day ... yesterday 3000 vs the usual
2200-2500, Sunday 1700 vs 800 the previous Sunday; and that the
percentage of spamassassin spam vs messages getting through has changed
from the usual ratio of 55% to 25% to something like 40%-40% [the
difference to 100% is given by rejections for other reasons, greet pause,
invalid users etc.])
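The estimate can be cross-checked with a few lines of arithmetic. The sketch below assumes that all of the excess daily traffic belongs to this one spam run, which is a simplification; the function name is only illustrative, while the figures are the ones quoted above.

```python
def catch_rate(caught, total_today, usual_total):
    """Fraction of the excess traffic that spamassassin caught."""
    excess = total_today - usual_total
    if excess <= 0:
        raise ValueError("no excess traffic to attribute to the spam run")
    return caught / excess


# Sunday: 80 caught, 1700 messages vs 800 the previous Sunday.
sunday = catch_rate(80, 1700, 800)            # 80 / 900
# Yesterday: 130 caught, 3000 messages vs the usual 2200-2500.
yesterday_low = catch_rate(130, 3000, 2200)   # 130 / 800
yesterday_high = catch_rate(130, 3000, 2500)  # 130 / 500

for label, rate in [("Sunday", sunday),
                    ("yesterday (usual=2200)", yesterday_low),
                    ("yesterday (usual=2500)", yesterday_high)]:
    print(f"{label}: {rate:.0%}")
```

Under that assumption the catch rate comes out between roughly 9% and 26%, the same order of magnitude as the 10-20% guessed above.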
Now if I look at the relatively few messages which have been blocked by
spamassassin, I see they usually have a rather low score (around 6, while
typical spam scores 10-15) with triggered tests
BAYES_00,
HELO_DYNAMIC_IPADDR2, HELO_DYNAMIC_SPLIT_IP, HTML_30_40, HTML_MESSAGE,
MIME_HTML_ONLY, RCVD_NUMERIC_HELO
and only sometimes higher scores with DCC_CHECK, RAZOR2_CHECK etc. So it
looks like the DCC and razor servers are also catching up very slowly with
these messages.
What looks suspicious to me is BAYES_00. Most other spam has BAYES_99.
So I suspect the messages which get through also receive a (wrong)
BAYES_00, which gives a negative score and brings them below threshold.
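To see how much a wrong Bayes verdict could matter, here is a small Python sketch. The two rule scores are ASSUMED placeholder values, not taken from our configuration, and should be checked against the installed rule files; only the 4.5 threshold and the observed score of about 6 come from the description above.

```python
THRESHOLD = 4.5            # rejection threshold quoted in this post

# Assumed placeholder scores; verify against the local rule files.
ASSUMED_BAYES_00 = -2.6
ASSUMED_BAYES_99 = 3.5


def rescored(observed, old_rule_score, new_rule_score):
    """Total score if one rule hit were swapped for another."""
    return observed - old_rule_score + new_rule_score


def is_rejected(score, threshold=THRESHOLD):
    """Spam is rejected when it scores above the threshold."""
    return score > threshold


# A caught message at about 6, had it hit BAYES_99 instead of BAYES_00:
flipped = rescored(6.0, ASSUMED_BAYES_00, ASSUMED_BAYES_99)
# A hypothetical missed message at 4.0, with the BAYES_00 hit removed:
without_bayes = rescored(4.0, ASSUMED_BAYES_00, 0.0)
print(round(flipped, 1), is_rejected(4.0), is_rejected(without_bayes))
```

With these assumed values a caught message would move from about 6 into the 10-15 range of typical spam, and a message sitting just below 4.5 would cross the threshold once the negative Bayes contribution disappears.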
I've looked at the SpamAssassin wiki, but I see no obvious way of telling
WHY a given message is assigned a specific bayes score.
I've also run sa-learn --dump magic, which reports
0.000 0 31125 0 non-token data: nspam
0.000 0 239162 0 non-token data: nham
0.000 0 310271 0 non-token data: ntokens
I'm not sure how to interpret those numbers. The amount of daily spam is
usually above 50% of the total traffic, and users submit to the crontab
sa-learn only the spam which goes through, and only a minority of users do
it. So why is nham much larger than nspam ?
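To put a number on the imbalance, the counters can be pulled out of the dump with a short Python sketch. It reads the third column of each "non-token data" line as the counter value, which matches the layout of the three lines quoted above; the function name is hypothetical, and the remaining magic lines of the real output are not reproduced here.

```python
DUMP = """\
0.000 0 31125 0 non-token data: nspam
0.000 0 239162 0 non-token data: nham
0.000 0 310271 0 non-token data: ntokens
"""


def parse_magic(text):
    """Map counter name -> integer value from 'sa-learn --dump magic' lines."""
    counters = {}
    for line in text.splitlines():
        head, sep, name = line.partition("non-token data:")
        if not sep:
            continue  # not a magic line
        fields = head.split()
        counters[name.strip()] = int(fields[2])
    return counters


counters = parse_magic(DUMP)
ham_per_spam = counters["nham"] / counters["nspam"]
print(counters, f"nham/nspam = {ham_per_spam:.2f}")
```

On the figures quoted above, nham is between 7 and 8 times nspam, which is the imbalance the question is about.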
A colleague argued that autolearn is also feeding the bayes db, and since
ham messages are all different, while spam messages repeat, a new spam
increases the score of its entry, but not the number of entries. This
looks plausible. If so, the --dump magic output would not be anomalous.
But then what is the best way to force bayes to "change its mind" from 00
to 99 (or at least above 50) on this sort of spam, other than waiting for
it to catch up on the few user submissions (myself, I won't be making any
further submissions, since my procmail filter diverts them to /dev/null)?
--
Lucio Chiappetti - INAF/IASF - via Bassini 15 - I-20133 Milano (Italy)
For more info : http://www.iasf-milano.inaf.it/~lucio/personal.html
-----------------------------------------------------------------------
"Nature" on government cuts to research http://snipurl.com/4erid
"Nature" on government cuts to research (in Italian) http://snipurl.com/4erko