On 4/19/2013 12:12 PM, Axb wrote:
> On 04/19/2013 06:02 PM, Ben Johnson wrote:
>
>> Still stumped here...
>
> do a bayes sa-learn --backup
>
> switch to file based in SDBM format (which is fast)
>
> do a
>
> sa-learn --restore
>
> feed it a few thousand NEW spams
>
> see what happens
>
Thanks for the suggestion, Axb. Your help and time are much appreciated.

By "feed it a few thousand NEW spams", do you mean to scrap the training corpora that I've hand-sorted in favor of starting over? Or do you mean to clear the database and re-run the training script against the corpora?

If your thinking is that the token data may be "stale", then I will really be stumped. When I hand-classify 12 messages with a subject and body about a retractable garden hose as spam, I expect the 13th message about the same hose to score high on the Bayes tests. Is this an unreasonable expectation?

I commented out all of the DB-related lines in my SA configuration file (local.cf) and restarted amavisd-new. I also cleared the existing DB tokens (with "sa-learn --clear") after amavis had restarted, and then executed my normal training script against my ham and spam corpora.

I'll keep an eye on incoming messages to see if those that "slip through" and score below 4.0 demonstrate evidence of Bayes testing.

I am beginning to wonder if some kind of "corruption", for lack of a better term, had been introduced by using utf8 to store the token data (Benny Pedersen mentioned that Unicode is overkill, and he is probably right). Performance aside, could using utf8_bin (instead of ascii) introduce a problem for SA (despite no errors during "sa-learn" training or --restore commands)?

The strange thing is that Bayes seems to work fine most of the time. But as I've stated previously, almost all "obvious to a human" spam that scores below 4.0 lacks evidence of Bayes testing.
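For anyone wanting to check "evidence of Bayes testing" across a folder of delivered messages rather than by eye, here is a minimal sketch. It assumes the messages carry an X-Spam-Status header whose test list names any BAYES_NN rule that fired (possibly folded over several lines); the header name and the has_bayes_hit helper are assumptions for illustration, not something taken from this setup.

```shell
#!/bin/sh
# has_bayes_hit FILE
# Succeeds if the X-Spam-Status header of FILE (including folded
# continuation lines) names a BAYES_NN rule. Hypothetical helper; the
# header name is an assumption about how the scanner tags mail.
has_bayes_hit() {
    awk '
        /^[ \t\r]*$/ { exit }    # first blank line ends the header block
        /^[^ \t]/    { in_status = (tolower($0) ~ /^x-spam-status:/) }
        in_status && /BAYES_[0-9]+/ { found = 1; exit }
        END { exit (found ? 0 : 1) }
    ' "$1"
}

# Report every message file given on the command line.
for msg in "$@"; do
    if has_bayes_hit "$msg"; then
        echo "bayes:    $msg"
    else
        echo "no-bayes: $msg"
    fi
done
```

Run against the messages that scored below 4.0, the "no-bayes" lines would be the ones worth re-running through "spamassassin -D -t" by hand.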
Since switching back to a DBM Bayes setup, the results look pretty much as expected (wrapped), and this is the type of thing I expect to see on every message:

-----------------------------------------------------------
spamassassin -D -t < "/tmp/email.txt" 2>&1 | egrep '(bayes:|whitelist:|AWL)'

dbg: bayes: learner_new self=Mail::SpamAssassin::Plugin::Bayes=HASH(0x37520f0), bayes_store_module=Mail::SpamAssassin::BayesStore::DBM
dbg: bayes: learner_new: got store=Mail::SpamAssassin::BayesStore::DBM=HASH(0x2c52558)
dbg: bayes: tie-ing to DB file R/O /var/lib/amavis/.spamassassin/bayes_toks
dbg: bayes: tie-ing to DB file R/O /var/lib/amavis/.spamassassin/bayes_seen
dbg: bayes: found bayes db version 3
dbg: bayes: DB journal sync: last sync: 0
dbg: bayes: DB journal sync: last sync: 0
dbg: bayes: corpus size: nspam = 6203, nham = 2479
dbg: bayes: score = 5.55111512312578e-17
dbg: bayes: DB journal sync: last sync: 0
dbg: bayes: untie-ing
dbg: timing: total 2925 ms - init: 907 (31.0%), parse: 1.92 (0.1%),
  extract_message_metadata: 108 (3.7%), poll_dns_idle: 1040 (35.6%),
  get_uri_detail_list: 1.22 (0.0%), tests_pri_-1000: 19 (0.7%),
  compile_gen: 185 (6.3%), compile_eval: 19 (0.6%), tests_pri_-950: 5 (0.2%),
  tests_pri_-900: 5 (0.2%), tests_pri_-400: 32 (1.1%), check_bayes: 26 (0.9%),
  tests_pri_0: 836 (28.6%), dkim_load_modules: 27 (0.9%),
  check_dkim_signature: 1.23 (0.0%), check_dkim_adsp: 24 (0.8%),
  check_spf: 70 (2.4%), check_razor2: 202 (6.9%), check_pyzor: 135 (4.6%),
  tests_pri_500: 988 (33.8%)
-----------------------------------------------------------

I'll wait and see if I receive messages without Bayes results and report back.

Even if using DBM "works", I don't see this as a long-term solution -- only as a troubleshooting step. I would really like to keep my Bayes data in a MySQL or PostgreSQL database.

Thanks again for the help!

-Ben
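
P.S. The debug check can be wrapped in a small helper so that a missing Bayes result is obvious at a glance. This is only a sketch: bayes_score is a hypothetical function name, and it relies solely on the "bayes: score = N" debug line having the form shown in the output from this setup.

```shell
#!/bin/sh
# bayes_score
# Reads `spamassassin -D -t` debug output on stdin. Prints the value from
# each "bayes: score = N" line and succeeds; fails if no such line appears.
# Hypothetical helper; the line format is taken from the debug output above.
bayes_score() {
    awk '
        /bayes: score = / { print $NF; found = 1 }
        END { exit (found ? 0 : 1) }
    '
}
```

Usage would be along the lines of: spamassassin -D -t < "/tmp/email.txt" 2>&1 | bayes_score || echo "no Bayes score in debug output"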