Since last Saturday we have been receiving a lot of spam which gets through our institute-wide SpamAssassin.

(We run SpamAssassin version 3.1.3 with bayes_auto_learn, use_auto_whitelist, razor2, pyzor and dcc, in a configuration recommended by GARR for Italian academic institutions. Rejected spam (score above 4.5) is archived system-wide in a daily quarantine folder. Users MAY forward spam which gets through to an area where a daily crontab job picks it up for sa-learn. We have been happy with the entire arrangement for a couple of years.)

The spam which gets through is a short HTML message containing a mailto: href with an e-mail address (usually gmail), and a slightly variable text in (bad) Italian (with spelling and grammar errors) stating that "80% of the people in your [country|city|region...] is unhappy with their monthly income" and offering a job in internet advertising.

I've seen that SOME of the messages are caught by SpamAssassin (some 60 messages on Saturday, 80 on Sunday, 130 yesterday and some 100 this morning), but many more get through (I counted 37 yesterday alone, after I installed a procmail filter of my own based on a single key phrase, and a colleague counted 56). My estimate is that SpamAssassin blocks only about 10-20% of these messages.

(In support of this, I've noted that the total daily volume of messages is higher by some 700 messages per day: yesterday 3000 vs. the usual 2200-2500, Sunday 1700 vs. 800 the previous Sunday. Also, the split between spam caught by SpamAssassin and messages getting through has changed from the usual 55% to 25% to something like 40% to 40% [the remainder up to 100% is rejections for other reasons: greet pause, invalid users etc.])
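The 10-20% figure can be cross-checked against the traffic numbers with a little arithmetic. What follows is only a back-of-the-envelope sketch: it assumes that the whole excess of some 700 messages per day belongs to this one spam campaign (an assumption, not a measurement), and the function name catch_rate is made up for illustration.

```python
def catch_rate(caught, campaign_total):
    """Fraction of the campaign's messages that SpamAssassin blocked.

    campaign_total is ASSUMED to equal the excess daily traffic,
    i.e. every extra message is taken to be part of this campaign.
    """
    return caught / campaign_total


# Daily counts of blocked messages quoted in the text above.
caught_per_day = {"Saturday": 60, "Sunday": 80, "yesterday": 130}

# Approximate excess in daily traffic quoted in the text above (assumed flat).
EXCESS_PER_DAY = 700

for day, caught in caught_per_day.items():
    print(f"{day}: {catch_rate(caught, EXCESS_PER_DAY):.0%} blocked")
```

With these inputs the blocked fraction comes out between roughly 9% and 19%, which is in line with the estimate given above.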

Now if I look at the relatively few messages which have been blocked by SpamAssassin, I see they usually have a rather low score (around 6, while typical spam scores 10-15), with these tests triggered:

 BAYES_00,
 HELO_DYNAMIC_IPADDR2, HELO_DYNAMIC_SPLIT_IP, HTML_30_40, HTML_MESSAGE,
 MIME_HTML_ONLY, RCVD_NUMERIC_HELO

Only sometimes do they get higher scores, with DCC_CHECK, RAZOR2_CHECK etc. So it looks like the DCC and Razor servers are also catching up very slowly with these messages.

What looks suspicious to me is BAYES_00. Most other spam gets BAYES_99.
So I suspect the messages which get through are also assigned a (wrong) BAYES_00, which contributes a negative score and brings them below the threshold.

I've looked at the SpamAssassin wiki, but I see no obvious way of telling WHY a given message was assigned a specific Bayes score.

I've also run sa-learn --dump magic, which reports

0.000          0      31125          0  non-token data: nspam
0.000          0     239162          0  non-token data: nham
0.000          0     310271          0  non-token data: ntokens

I'm not sure how to interpret those numbers. The amount of daily spam is usually above 50% of the total traffic, users submit to the crontab sa-learn only the spam which gets through, and only a minority of users do so. So why is nham so much larger than nspam?
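To put the three figures side by side, the dump can be parsed with a few lines of Python. This is a minimal sketch, not a SpamAssassin tool: the function name parse_dump_magic is hypothetical, the sample text is simply the three lines quoted above, and the column layout is assumed from that output (the third column is taken to be the value). As far as I understand, nspam and nham are the numbers of messages learned as spam and as ham, and ntokens is the number of tokens in the database.

```python
import re

# Assumed line layout, taken from the output quoted above:
#   <prob> <col2> <value> <col4>  non-token data: <name>
MAGIC_LINE = re.compile(
    r"\s*\S+\s+\S+\s+(\d+)\s+\S+\s+non-token data:\s*(.+?)\s*$"
)


def parse_dump_magic(text):
    """Map each 'non-token data' name to the integer in the third column."""
    counters = {}
    for line in text.splitlines():
        match = MAGIC_LINE.match(line)
        if match:
            counters[match.group(2)] = int(match.group(1))
    return counters


SAMPLE = """\
0.000          0      31125          0  non-token data: nspam
0.000          0     239162          0  non-token data: nham
0.000          0     310271          0  non-token data: ntokens
"""

counters = parse_dump_magic(SAMPLE)
ratio = counters["nham"] / counters["nspam"]
print(counters)
print("ham/spam ratio: %.1f" % ratio)
```

On the figures above this gives a ham-to-spam ratio of about 7.7, i.e. the database has learned far more ham than spam.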

A colleague argued that autolearn is also feeding the Bayes db and that, since ham messages are all different while spam messages repeat, a new spam increases the score of its entry but not the number of entries. This looks plausible. If so, the --dump magic output would not be anomalous.

But then what is the best way to force Bayes to "change its mind" from 00 to 99 (or at least above 50) on this sort of spam, other than waiting for it to catch up on the few user submissions? (Myself, I won't be making further submissions, since my procmail filter diverts these messages to /dev/null.)

--
Lucio Chiappetti - INAF/IASF - via Bassini 15 - I-20133 Milano (Italy)
For more info : http://www.iasf-milano.inaf.it/~lucio/personal.html
-----------------------------------------------------------------------
"Nature" on government cuts to research       http://snipurl.com/4erid
"Nature" e i tagli del governo alla ricerca   http://snipurl.com/4erko
