I'm using SpamAssassin 3.0.2 with Perl 5.8.6 on Solaris 2.6. For some reason, a lot of our Nigerian scam mail (and sometimes lottery scams) is hitting BAYES_00. Most other spam hits reasonably high Bayes rules (BAYES_95, BAYES_80, or at worst BAYES_50). Most of the training has been done by autolearning with the default autolearn parameters, but I have also manually trained some spam, including probably dozens of Nigerian scam messages. Here is some data:

# sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0       3560          0  non-token data: nspam
0.000          0     104457          0  non-token data: nham
0.000          0     660517          0  non-token data: ntokens
0.000          0 1106229013          0  non-token data: oldest atime
0.000          0 1106331575          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0 1106284398          0  non-token data: last expiry atime
0.000          0      55318          0  non-token data: last expire atime delta
0.000          0     277915          0  non-token data: last expire reduction count

bayes_store_module Mail::SpamAssassin::BayesStore::SQL
bayes_sql_dsn dbi:mysql:spamdb
bayes_auto_learn 1
bayes_auto_expire 0 (done once a day by cron)
bayes_expiry_max_db_size 500000

I have the sa-users list mail whitelisted. Could that be throwing off the Bayes data? I see USER_IN_WHITELIST has the "noautolearn" parameter set.

Here's a debug output from a Nigerian message I manually scanned a few days ago:

debug: bayes corpus size: nspam = 3009, nham = 80430
debug: tokenize: header tokens for X-Envelope-From = " <[EMAIL PROTECTED]>"
debug: tokenize: header tokens for *F = "U*ian_elsworth D*sapibon.com D*com"
debug: tokenize: header tokens for X-Originating-IP = " 195.166.238.114"
debug: tokenize: header tokens for Bcc = " "
debug: tokenize: header tokens for Message-id = " 18688558e3e5f39da1a45ce72e56d102 sapibon com "
debug: tokenize: header tokens for MIME-version = " 1.0"
debug: tokenize: header tokens for *x = " Merak Web Mail 5.1.0"
debug: tokenize: header tokens for Content-type = " text/plain; charset=us-ascii"
debug: tokenize: header tokens for Content-transfer-encoding = " 7bit"
debug: tokenize: header tokens for *RT = " "
debug: tokenize: header tokens for *RT = " "
debug: tokenize: header tokens for *RU = " [ ip=68.157.93.133 rdns=liveradio.sapibon.com helo=mail.sapibon.com by=emroute2.cind.ornl.gov ident= envfrom= intl=0 [EMAIL PROTECTED] auth= ] [ ip=127.0.0.1 rdns=localhost helo=localhost by=mail.sapibon.com ident= envfrom= intl=0 id=UGHFF auth= ]"
debug: tokenize: header tokens for *r = " localhost ([127.0.0 ip*127.0.0.1 ]) by mail.sapibon.com (Merak 7.0.1) id UGHFF; "
debug: tokenize: header tokens for *r = " localhost ([127.0.0 ip*127.0.0.1 ]) by mail.sapibon.com (Merak 7.0.1) id UGHFF; mail.sapibon.com (liveradio.sapibon.com [68.157.93 ip*68.157.93.133 ]) by emroute2.cind.ornl.gov (PMDF V6.2-X27 #30899) ESMTPS id <[EMAIL PROTECTED]> johnsonck @ornlmail.ornl.gov (ORCPT [EMAIL PROTECTED]); "
debug: bayes: tok_get_all: Token Count: 328
debug: bayes token 'encourage' => 6.61665231828803e-05
debug: bayes token 'Ian' => 0.0001289858547111
debug: bayes token 'naturally' => 0.000215977519068647
debug: bayes token 'IAN' => 0.000314436002337814
debug: bayes token 'NUMBER' => 0.000358427714856762
debug: bayes token 'UD:bbc.co.uk' => 0.000471516213847502
debug: bayes token 'H*r:sk:0IAK00A' => 0.000511893434823977
debug: bayes token 'UD:stm' => 0.00057173219978746
debug: bayes token 'news.bbc.co.uk' => 0.000639714625445898
debug: bayes token 'instability' => 0.000639714625445898
debug: bayes token 'UD:news.bbc.co.uk' => 0.000639714625445898
debug: bayes token 'newsbbccouk' => 0.000639714625445898
debug: bayes token 'H*r:Merak' => 0.000852614896988907
debug: bayes token '849' => 0.00114225053078556
debug: bayes token 'decree' => 0.00121995464852608
debug: bayes token '1,400' => 0.00127790973871734
debug: bayes token 'tobacco' => 0.00127790973871734
debug: bayes token 'THROUGH' => 0.00137595907928389
debug: bayes token 'H*RU:sk:mail.sa' => 0.00149030470914127
debug: bayes token 'seize' => 0.00162537764350453
debug: bayes token 'farming' => 0.00184879725085911
debug: bayes token 'invaded' => 0.00232900432900433
debug: bayes token 'relocate' => 0.00243438914027149
debug: bayes token 'tractors' => 0.0075774647887324
debug: bayes token 'programe' => 0.00881967213114754
debug: bayes token 'Robert' => 0.00898242781809268
debug: bayes token 'H*r:sk:mail.sa' => 0.0131219512195122
debug: bayes token 'TELEPHONE' => 0.0131219512195122
debug: bayes token 'reported' => 0.020717855380927
debug: bayes token 'robert' => 0.0230435753994182
debug: bayes token 'defying' => 0.0256190476190476
debug: bayes token 'Zimbabwean' => 0.967326875775477
debug: bayes token 'zimbabwean' => 0.967326875775477
debug: bayes token 'leadership' => 0.0365506903562349
debug: bayes token 'requesting' => 0.041283032262005
debug: bayes token 'H*r:[EMAIL PROTECTED]' => 0.958
debug: bayes token 'conducive' => 0.0489090909090909
debug: bayes token 'fax' => 0.0526481939583251
debug: bayes token 'CONTACT' => 0.0532709570465476
debug: bayes token 'UD:uk' => 0.0550826950193176
debug: bayes token 'started' => 0.0554349695291097
debug: bayes token 'foreigner' => 0.935030650701486
debug: bayes token 'H*Ad:D*com' => 0.0687425145128682
debug: bayes token 'ian' => 0.0737595699016877
debug: bayes token '133' => 0.924163973411938
debug: bayes token 'mugabe' => 0.923953341040042
debug: bayes token 'far' => 0.0765081581490758
debug: bayes token 'H*UA:Mail' => 0.0809198239113451
debug: bayes token 'through' => 0.0819616447705944
debug: bayes token 'H*x:Mail' => 0.0827118234155287
debug: bayes token 'importance' => 0.0842254278621433
debug: bayes token 'behalf' => 0.0855812743715005
debug: bayes token 'stopping' => 0.0867394028210582
debug: bayes token 'arrangements' => 0.0900352643927864
debug: bayes token 'UD:co.uk' => 0.0904687946905501
debug: bayes token 'Also' => 0.0905783982661814
debug: bayes token 'Zimbabwe' => 0.909168472418891
debug: bayes token 'zimbabwe' => 0.909168472418891
debug: bayes token 'reform' => 0.908339442469175
debug: bayes token 'understanding' => 0.0924641950495834
debug: bayes token 'cheers' => 0.095641615300942
debug: bayes token 'purpose' => 0.105034207333891
debug: bayes token '1400' => 0.105537213508803
debug: bayes token 'number' => 0.105840522750578
debug: bayes token 'H*RU:sk:0IAK00A' => 0.106799672202446
debug: bayes token 'proxy' => 0.113114103572482
debug: bayes token 'Sir' => 0.882023429730427
debug: bayes token 'profits' => 0.120764146821792
debug: bayes token 'again' => 0.123492746510655
debug: bayes token 'strict' => 0.123549403126786
debug: bayes token 'meet' => 0.12556227161196
debug: bayes token 'having' => 0.128795396930409
debug: bayes token 'CNN' => 0.864930121254322
debug: bayes token 'Mugabe' => 0.863326701961039
debug: bayes token 'introduce' => 0.138054044414987
debug: bayes token 'guaranteed' => 0.859059755777108
debug: bayes token 'telephone' => 0.144765522632496
debug: bayes token 'asked' => 0.146502626334225
debug: bayes token 'seek' => 0.853172987599368
debug: bayes token 'met' => 0.147818403733398
debug: bayes token 'period' => 0.150285292200346
debug: bayes: score = 3.57800525874197e-08

It looks like suspicious words like Zimbabwe are being overwhelmed by words like encourage, Ian, cheers, purpose... I could blow away my Bayes database and start over, but I suspect I'd just run into the same problem again. Any ideas?
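
To show what I mean by "overwhelmed", here is a rough Python sketch. It assumes the per-token probabilities are merged with Robinson/Fisher-style chi-squared combining, which as far as I understand is the general approach SpamAssassin takes; it is not SpamAssassin's actual code, and it ignores how tokens are selected or weighted. The names TOKENS, chi2q and combine are my own, and the probabilities are a small subset copied from the debug output in this message.

```python
import math

# A handful of per-token probabilities copied from the debug output.
TOKENS = {
    "encourage": 6.61665231828803e-05,
    "Ian": 0.0001289858547111,
    "naturally": 0.000215977519068647,
    "NUMBER": 0.000358427714856762,
    "cheers": 0.095641615300942,
    "purpose": 0.105034207333891,
    "Zimbabwean": 0.967326875775477,
    "foreigner": 0.935030650701486,
    "mugabe": 0.923953341040042,
    "Zimbabwe": 0.909168472418891,
    "Sir": 0.882023429730427,
}


def chi2q(x2, dof):
    """Upper-tail probability of a chi-squared variable; dof must be even."""
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def combine(probs):
    """Chi-squared combining: 0.0 means ham, 1.0 means spam (assumed scheme)."""
    n = len(probs)
    ln_p = sum(math.log(p) for p in probs)
    ln_q = sum(math.log(1.0 - p) for p in probs)
    spamminess = 1.0 - chi2q(-2.0 * ln_q, 2 * n)
    hamminess = 1.0 - chi2q(-2.0 * ln_p, 2 * n)
    return (spamminess - hamminess + 1.0) / 2.0


mixed = combine(list(TOKENS.values()))
spam_only = combine([p for p in TOKENS.values() if p > 0.5])
print(f"all sampled tokens: {mixed:.4f}")
print(f"spammy tokens only: {spam_only:.4f}")
```

With this subset, the spammy tokens alone combine to a value close to 1, but adding just the six hammy tokens pulls the result to the ham side of 0.5. The real message has far more very low-probability tokens than this sample, which would fit the near-zero score in the debug output.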