Using SpamAssassin 3.0.2 on Solaris 2.6, Perl 5.8.6.

For some reason, a lot of our Nigerian scam mail (and sometimes lottery
scams) is hitting BAYES_00.  Most other spam lands in reasonably high Bayes
ranges (BAYES_95, BAYES_80, or at worst BAYES_50).  Most of the training has
been done by autolearning with the default autolearn parameters, but I have
also manually trained some spam, including lots of Nigerian spam (probably
dozens of messages).  Here is some data:

# sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0       3560          0  non-token data: nspam
0.000          0     104457          0  non-token data: nham
0.000          0     660517          0  non-token data: ntokens
0.000          0 1106229013          0  non-token data: oldest atime
0.000          0 1106331575          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0 1106284398          0  non-token data: last expiry atime
0.000          0      55318          0  non-token data: last expire atime delta
0.000          0     277915          0  non-token data: last expire reduction count

bayes_store_module      Mail::SpamAssassin::BayesStore::SQL
bayes_sql_dsn           dbi:mysql:spamdb
bayes_auto_learn        1
bayes_auto_expire       0               (done once a day by cron)
bayes_expiry_max_db_size  500000
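
To put a number on how lopsided the database is, here is a rough sketch that
pulls nspam and nham out of the dump above and prints the ham-to-spam ratio.
The function name parse_magic is made up for illustration, and the sketch
assumes the value sits in the third column of each "non-token data" line, as
it does in the output shown here.

```python
# Two lines copied from the `sa-learn --dump magic` output above.
DUMP = """\
0.000          0       3560          0  non-token data: nspam
0.000          0     104457          0  non-token data: nham
"""


def parse_magic(text):
    """Return {name: value} for each 'non-token data' line (hypothetical helper)."""
    counts = {}
    for line in text.splitlines():
        if "non-token data:" not in line:
            continue
        fields, name = line.split("non-token data:")
        # Assumption: the value is the third whitespace-separated column.
        counts[name.strip()] = int(fields.split()[2])
    return counts


counts = parse_magic(DUMP)
ratio = counts["nham"] / counts["nspam"]
print(f"ham:spam ratio is about {ratio:.1f} to 1")
```

With the counts above this comes out to roughly 29 ham messages learned for
every spam message.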


I have the sa-users list mail whitelisted.  Could that be throwing off the
Bayes data?  I see that USER_IN_WHITELIST has the "noautolearn" tflag set.

Here's a debug output from a Nigerian message I manually scanned a few days ago:

debug: bayes corpus size: nspam = 3009, nham = 80430
debug: tokenize: header tokens for X-Envelope-From = " <[EMAIL PROTECTED]>"
debug: tokenize: header tokens for *F = "U*ian_elsworth D*sapibon.com D*com"
debug: tokenize: header tokens for X-Originating-IP = " 195.166.238.114"
debug: tokenize: header tokens for Bcc = " "
debug: tokenize: header tokens for Message-id = "  18688558e3e5f39da1a45ce72e56d102 sapibon com "
debug: tokenize: header tokens for MIME-version = " 1.0"
debug: tokenize: header tokens for *x = " Merak Web Mail 5.1.0"
debug: tokenize: header tokens for Content-type = " text/plain; charset=us-ascii"
debug: tokenize: header tokens for Content-transfer-encoding = " 7bit"
debug: tokenize: header tokens for *RT = " "
debug: tokenize: header tokens for *RT = " "
debug: tokenize: header tokens for *RU = " [ ip=68.157.93.133 rdns=liveradio.sapibon.com helo=mail.sapibon.com by=emroute2.cind.ornl.gov ident= envfrom= intl=0 [EMAIL PROTECTED] auth= ] [ ip=127.0.0.1 rdns=localhost helo=localhost by=mail.sapibon.com ident= envfrom= intl=0 id=UGHFF auth= ]"
debug: tokenize: header tokens for *r = "   localhost ([127.0.0 ip*127.0.0.1 ]) by mail.sapibon.com (Merak 7.0.1)     id UGHFF; "
debug: tokenize: header tokens for *r = "   localhost ([127.0.0 ip*127.0.0.1 ]) by mail.sapibon.com (Merak 7.0.1)     id UGHFF;     mail.sapibon.com (liveradio.sapibon.com [68.157.93 ip*68.157.93.133 ]) by emroute2.cind.ornl.gov (PMDF V6.2-X27 #30899)   ESMTPS id <[EMAIL PROTECTED]>   johnsonck@ornlmail.ornl.gov (ORCPT [EMAIL PROTECTED]); "
debug: bayes: tok_get_all: Token Count: 328
debug: bayes token 'encourage' => 6.61665231828803e-05
debug: bayes token 'Ian' => 0.0001289858547111
debug: bayes token 'naturally' => 0.000215977519068647
debug: bayes token 'IAN' => 0.000314436002337814
debug: bayes token 'NUMBER' => 0.000358427714856762
debug: bayes token 'UD:bbc.co.uk' => 0.000471516213847502
debug: bayes token 'H*r:sk:0IAK00A' => 0.000511893434823977
debug: bayes token 'UD:stm' => 0.00057173219978746
debug: bayes token 'news.bbc.co.uk' => 0.000639714625445898
debug: bayes token 'instability' => 0.000639714625445898
debug: bayes token 'UD:news.bbc.co.uk' => 0.000639714625445898
debug: bayes token 'newsbbccouk' => 0.000639714625445898
debug: bayes token 'H*r:Merak' => 0.000852614896988907
debug: bayes token '849' => 0.00114225053078556
debug: bayes token 'decree' => 0.00121995464852608
debug: bayes token '1,400' => 0.00127790973871734
debug: bayes token 'tobacco' => 0.00127790973871734
debug: bayes token 'THROUGH' => 0.00137595907928389
debug: bayes token 'H*RU:sk:mail.sa' => 0.00149030470914127
debug: bayes token 'seize' => 0.00162537764350453
debug: bayes token 'farming' => 0.00184879725085911
debug: bayes token 'invaded' => 0.00232900432900433
debug: bayes token 'relocate' => 0.00243438914027149
debug: bayes token 'tractors' => 0.0075774647887324
debug: bayes token 'programe' => 0.00881967213114754
debug: bayes token 'Robert' => 0.00898242781809268
debug: bayes token 'H*r:sk:mail.sa' => 0.0131219512195122
debug: bayes token 'TELEPHONE' => 0.0131219512195122
debug: bayes token 'reported' => 0.020717855380927
debug: bayes token 'robert' => 0.0230435753994182
debug: bayes token 'defying' => 0.0256190476190476
debug: bayes token 'Zimbabwean' => 0.967326875775477
debug: bayes token 'zimbabwean' => 0.967326875775477
debug: bayes token 'leadership' => 0.0365506903562349
debug: bayes token 'requesting' => 0.041283032262005
debug: bayes token 'H*r:[EMAIL PROTECTED]' => 0.958
debug: bayes token 'conducive' => 0.0489090909090909
debug: bayes token 'fax' => 0.0526481939583251
debug: bayes token 'CONTACT' => 0.0532709570465476
debug: bayes token 'UD:uk' => 0.0550826950193176
debug: bayes token 'started' => 0.0554349695291097
debug: bayes token 'foreigner' => 0.935030650701486
debug: bayes token 'H*Ad:D*com' => 0.0687425145128682
debug: bayes token 'ian' => 0.0737595699016877
debug: bayes token '133' => 0.924163973411938
debug: bayes token 'mugabe' => 0.923953341040042
debug: bayes token 'far' => 0.0765081581490758
debug: bayes token 'H*UA:Mail' => 0.0809198239113451
debug: bayes token 'through' => 0.0819616447705944
debug: bayes token 'H*x:Mail' => 0.0827118234155287
debug: bayes token 'importance' => 0.0842254278621433
debug: bayes token 'behalf' => 0.0855812743715005
debug: bayes token 'stopping' => 0.0867394028210582
debug: bayes token 'arrangements' => 0.0900352643927864
debug: bayes token 'UD:co.uk' => 0.0904687946905501
debug: bayes token 'Also' => 0.0905783982661814
debug: bayes token 'Zimbabwe' => 0.909168472418891
debug: bayes token 'zimbabwe' => 0.909168472418891
debug: bayes token 'reform' => 0.908339442469175
debug: bayes token 'understanding' => 0.0924641950495834
debug: bayes token 'cheers' => 0.095641615300942
debug: bayes token 'purpose' => 0.105034207333891
debug: bayes token '1400' => 0.105537213508803
debug: bayes token 'number' => 0.105840522750578
debug: bayes token 'H*RU:sk:0IAK00A' => 0.106799672202446
debug: bayes token 'proxy' => 0.113114103572482
debug: bayes token 'Sir' => 0.882023429730427
debug: bayes token 'profits' => 0.120764146821792
debug: bayes token 'again' => 0.123492746510655
debug: bayes token 'strict' => 0.123549403126786
debug: bayes token 'meet' => 0.12556227161196
debug: bayes token 'having' => 0.128795396930409
debug: bayes token 'CNN' => 0.864930121254322
debug: bayes token 'Mugabe' => 0.863326701961039
debug: bayes token 'introduce' => 0.138054044414987
debug: bayes token 'guaranteed' => 0.859059755777108
debug: bayes token 'telephone' => 0.144765522632496
debug: bayes token 'asked' => 0.146502626334225
debug: bayes token 'seek' => 0.853172987599368
debug: bayes token 'met' => 0.147818403733398
debug: bayes token 'period' => 0.150285292200346
debug: bayes: score = 3.57800525874197e-08

It looks like suspicious words like Zimbabwe are being overwhelmed by
words like encourage, Ian, cheers, purpose...
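
A quick way to check that impression is to tally how many tokens fall on
each side of 0.5.  This is only a rough sketch: the tally function, the
regular expression, and the 0.5 cutoff are my own choices for illustration
and are not how SpamAssassin combines probabilities.  The sample holds a few
lines copied from the debug output above; paste in the full list to count
everything.

```python
import re

# A handful of lines copied from the debug output above.
DEBUG = """\
debug: bayes token 'encourage' => 6.61665231828803e-05
debug: bayes token 'Zimbabwean' => 0.967326875775477
debug: bayes token 'cheers' => 0.095641615300942
debug: bayes token 'mugabe' => 0.923953341040042
debug: bayes token 'purpose' => 0.105034207333891
"""

# Matches lines of the form: debug: bayes token '<token>' => <probability>
TOKEN_RE = re.compile(r"debug: bayes token '(.*)' => ([0-9.eE+-]+)")


def tally(text, cutoff=0.5):
    """Split tokens into (hammy, spammy) lists around an assumed cutoff."""
    hammy, spammy = [], []
    for match in TOKEN_RE.finditer(text):
        token, prob = match.group(1), float(match.group(2))
        (spammy if prob > cutoff else hammy).append((token, prob))
    return hammy, spammy


hammy, spammy = tally(DEBUG)
print(f"{len(hammy)} tokens lean ham, {len(spammy)} lean spam")
```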

I could blow away my Bayes database and start over, but I suspect I'd 
just run into the same problem again.  Any ideas?
