quirks with bayes ?

Lucio Chiappetti Tue, 31 Mar 2009 05:38:02 -0700

Since last Saturday we have been receiving a lot of spam which gets through our institute-wide SpamAssassin.

(We run SpamAssassin version 3.1.3 with bayes_auto_learn, use_auto_whitelist, razor2, pyzor and dcc, in a configuration recommended by GARR for Italian academic institutions; rejected spam (score above 4.5) is archived system-wide in a daily quarantine folder; users MAY forward spam which gets through to an area where a daily crontab picks it up for sa-learn ... we have been happy with the entire arrangement for a couple of years.)

The spam which gets through is a short HTML message containing an href:mailto e-mail address (usually gmail), and a slightly variable text in (bad) Italian (with spelling and grammar errors) stating that "80% of the people in your [country|city|region...] is unhappy with their monthly income" and offering a job in internet advertising.

I've seen that SOME of the messages are caught by SpamAssassin (some 60 messages on Saturday, 80 on Sunday, 130 yesterday and some 100 this morning), but many more get through (I counted 37 yesterday alone, after I installed a procmail filter of my own based on a single key phrase), and a colleague counted 56. My estimate is that SpamAssassin blocks only about 10-20% of the messages.

(In support of this, I've noted that the total daily number of messages is higher by some 700 messages per day ... yesterday 3000 vs the usual 2200-2500, Sunday 1700 vs 800 the previous Sunday; and that the split between spam blocked by SpamAssassin and messages getting through has changed from the usual 55% vs 25% to something like 40% vs 40% [the remainder up to 100% is given by rejections for other reasons: greet pause, invalid users etc.])
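The 10-20% estimate can be cross-checked with a rough calculation from the counts quoted above. This is only a sketch: it assumes that all the excess traffic over the usual daily volume belongs to this spam campaign, and it takes 2300 as the "usual" weekday volume so that the excess matches the "some 700" figure. The function name block_rate is made up for the illustration.

```python
# Rough block-rate estimate from the counts quoted in the message.
# Assumption: every message in excess of the usual daily volume is this spam.

def block_rate(caught, total_today, total_usual):
    """Fraction of the excess messages that SpamAssassin caught."""
    excess = total_today - total_usual
    return caught / excess

# Sunday: 80 caught, 1700 messages vs 800 the previous Sunday
sunday = block_rate(caught=80, total_today=1700, total_usual=800)

# Yesterday: 130 caught, 3000 messages vs an assumed usual 2300
# (the usual range is 2200-2500; 2300 gives the "some 700" excess)
yesterday = block_rate(caught=130, total_today=3000, total_usual=2300)

print(f"Sunday: {sunday:.0%}, yesterday: {yesterday:.0%}")
# prints "Sunday: 9%, yesterday: 19%"
```

Under these assumptions both days land at roughly 10-20%. Picking a different "usual" volume within the 2200-2500 range moves yesterday's figure to between about 16% and 26%.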

Now if I look at the relatively few messages which have been blocked by SpamAssassin, I see they usually have a rather low score (around 6, while typical spam scores 10-15), with these triggered tests:


 BAYES_00,
 HELO_DYNAMIC_IPADDR2, HELO_DYNAMIC_SPLIT_IP, HTML_30_40, HTML_MESSAGE,
 MIME_HTML_ONLY, RCVD_NUMERIC_HELO

and only sometimes higher scores with DCC_CHECK, RAZOR2_CHECK etc. So it looks like the DCC and razor servers are also catching up very slowly with these messages.


What looks suspicious to me is BAYES_00. Most other spam has BAYES_99.

So I suspect the messages which get through also receive a (wrong) BAYES_00, which gives a negative score and brings them below threshold.
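To see how much a wrong Bayes verdict can matter, here is a back-of-the-envelope sketch. The two rule scores in it are assumed, illustrative values (the real ones depend on the installed rule set and any local overrides); only the 4.5 threshold and the observed total of about 6 come from the figures above. The helper rescore is hypothetical.

```python
# Illustrative only: the two Bayes rule scores below are ASSUMED values,
# not read from this site's configuration.
ASSUMED_SCORES = {"BAYES_00": -2.6, "BAYES_99": 3.5}
REQUIRED_SCORE = 4.5  # rejection threshold quoted in the message

def rescore(observed_total, old_rule, new_rule, scores=ASSUMED_SCORES):
    """Total score if old_rule had not fired and new_rule had fired instead."""
    return observed_total - scores[old_rule] + scores[new_rule]

caught_total = 6.0  # typical score of the few caught messages (with BAYES_00)
flipped = rescore(caught_total, "BAYES_00", "BAYES_99")
margin = caught_total - REQUIRED_SCORE

print(round(flipped, 1))  # score the same message would get with BAYES_99
print(round(margin, 1))   # points the caught messages are above threshold
```

With these assumed scores, a caught message would move from about 6 into the 10-15 range of typical spam if bayes classified it as BAYES_99. Conversely, the caught messages sit only 1.5 points above threshold, so a message missing one or two of the other tests would slip through.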

I've looked at the SpamAssassin wiki, but I see no obvious way of telling WHY a given message was assigned a specific bayes score.


I've also run sa-learn --dump magic, and this reports

0.000          0      31125          0  non-token data: nspam
0.000          0     239162          0  non-token data: nham
0.000          0     310271          0  non-token data: ntokens

I'm not sure how to interpret those numbers. The amount of daily spam is usually above 50% of the total traffic, users submit to the crontab sa-learn only the spam which gets through, and only a minority of users do it. So why is nham much larger than nspam ?

A colleague argued that autolearn is also feeding the bayes db, and since ham messages are all different while spam messages repeat, a new spam increases the score of its entry but not the number of entries. This looks plausible. If so, the --dump magic output would not be anomalous.
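For keeping an eye on these counters over time, the three lines quoted above can be parsed with a few lines of Python. This is a minimal sketch which assumes the layout shown in the dump (counter value in the third column, label after "non-token data: "). The function name parse_magic is made up, and the text is pasted in rather than obtained by running sa-learn.

```python
# The three counter lines exactly as reported by "sa-learn --dump magic" above.
DUMP = """\
0.000          0      31125          0  non-token data: nspam
0.000          0     239162          0  non-token data: nham
0.000          0     310271          0  non-token data: ntokens
"""

PREFIX = "non-token data: "

def parse_magic(text):
    """Return {label: value} for each 'non-token data' line.

    Assumes the value sits in the third whitespace-separated column,
    as in the lines quoted above.
    """
    counters = {}
    for line in text.splitlines():
        fields = line.split(None, 4)  # four columns, then the free-text label
        if len(fields) == 5 and fields[4].startswith(PREFIX):
            counters[fields[4][len(PREFIX):]] = int(fields[2])
    return counters

counters = parse_magic(DUMP)
ratio = counters["nham"] / counters["nspam"]
print(f"nham/nspam = {ratio:.1f}")  # prints "nham/nspam = 7.7"
```

So the database currently holds roughly 7.7 times as many ham as spam counts, which is the imbalance in question.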

But then what is the best way to force bayes to "change its mind" from 00 to 99 (or at least above 50) on this sort of spam, other than waiting for it to catch up through the few user submissions (as for myself, I won't be making further submissions, since my procmail filter diverts these messages to /dev/null) ?


--
Lucio Chiappetti - INAF/IASF - via Bassini 15 - I-20133 Milano (Italy)
For more info : http://www.iasf-milano.inaf.it/~lucio/personal.html
-----------------------------------------------------------------------
"Nature" on government cuts to research       http://snipurl.com/4erid
"Nature" e i tagli del governo alla ricerca   http://snipurl.com/4erko
