On 4/19/2013 12:12 PM, Axb wrote:
> On 04/19/2013 06:02 PM, Ben Johnson wrote:
> 
>> Still stumped here...
> 
> do a bayes sa-learn --backup
> 
> switch to file based in SDBM format (which is fast)
> 
> do a
> 
> sa-learn --restore
> 
> feed it a few thousand NEW spams
> 
> see what happens
> 

Thanks for the suggestion, Axb. Your help and time are much appreciated.

By "feed it a few thousand NEW spams", do you mean to scrap the training
corpora that I've hand-sorted in favor of starting over? Or do you mean
to clear the database and re-run the training script against the corpora?

If your thinking is that the token data may be "stale", then I will
really be stumped. When I hand-classify 12 messages with a subject and
body about a retractable garden hose as spam, I expect the 13th message
about the same hose to score high on the Bayes tests. Is this an
unreasonable expectation?

I commented out all of the DB-related lines in my SA configuration file
(local.cf) and restarted amavisd-new.

I also cleared the existing DB tokens (with "sa-learn --clear") after
amavis had restarted, and then executed my normal training script
against my ham and spam corpora.
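
For anyone who wants to repeat the reset, the sequence above could be
sketched roughly as the shell function below. This is only a sketch,
not my actual training script: the function name, the SA_LEARN
override (handy for a dry run with "echo") and the backup file name are
made up for illustration, and it assumes it is run as the user that
owns the Bayes database.

```shell
# retrain_bayes SPAM_DIR HAM_DIR [BACKUP_FILE]
# Hypothetical helper: back up the Bayes data, clear it, then retrain
# from hand-sorted corpora. Set SA_LEARN=echo to preview the commands.
retrain_bayes() {
    rb_spam_dir=$1
    rb_ham_dir=$2
    rb_backup=${3:-bayes-backup.txt}    # assumed file name
    rb_sa=${SA_LEARN:-sa-learn}

    # Dump the current tokens first so they can be restored later
    # with: sa-learn --restore "$rb_backup"
    "$rb_sa" --backup > "$rb_backup" || return 1

    # Wipe the database, then learn both corpora from scratch
    "$rb_sa" --clear || return 1
    "$rb_sa" --spam "$rb_spam_dir" || return 1
    "$rb_sa" --ham "$rb_ham_dir" || return 1

    # Print the counters (nspam, nham, ntokens) as a sanity check
    "$rb_sa" --dump magic
}

# Example (paths are placeholders):
#   retrain_bayes /path/to/corpus/spam /path/to/corpus/ham
```

Taking the backup before "--clear" means the old token data can still
be restored if the retrained database turns out to behave worse.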

I'll keep an eye on incoming messages to see if those that "slip
through" and score below 4.0 demonstrate evidence of Bayes testing.
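
A quick way to check a delivered message for "evidence of Bayes
testing" might look like the sketch below. It assumes that the rule
names end up somewhere in the message headers (for example in
X-Spam-Status) and that a Bayes hit shows up as BAYES_ followed by
digits; the function name is made up for illustration.

```shell
# has_bayes_hit FILE
# Hypothetical helper: succeeds if the header block of FILE mentions a
# BAYES_NN rule. The body is ignored, so spam text that happens to
# contain "BAYES_99" does not count.
has_bayes_hit() {
    awk '
        /^$/            { exit }              # blank line ends the headers
        /BAYES_[0-9]+/  { found = 1; exit }   # rule name seen in a header
        END             { exit !found }       # status 0 only when found
    ' "$1"
}

# Example: list messages in a (placeholder) folder with no Bayes hit
#   for f in /path/to/missed-spam/*; do
#       has_bayes_hit "$f" || echo "no Bayes result: $f"
#   done
```

Scanning the whole header block instead of a single line also catches
a folded X-Spam-Status header where the tests are on a continuation
line.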

I am beginning to wonder if some kind of "corruption", for lack of a
better term, had been introduced by storing the token data in utf8
(Benny Pedersen mentioned that Unicode is overkill, and he is probably
right). Performance aside, could using the utf8_bin collation (instead
of an ascii one) introduce a problem for SA, despite no errors during
"sa-learn" training or --restore commands?

The strange thing is that Bayes seems to work fine most of the time. But
as I've stated previously, almost all "obvious to a human" spam that
scores below 4.0 lacks evidence of Bayes testing.

Since switching back to a DBM Bayes setup, the debug output looks pretty
much as expected (long lines are wrapped below), and this is the kind of
output I expect to see for every message:

-----------------------------------------------------------
spamassassin -D -t < "/tmp/email.txt" 2>&1 | egrep '(bayes:|whitelist:|AWL)'
dbg: bayes: learner_new
self=Mail::SpamAssassin::Plugin::Bayes=HASH(0x37520f0),
bayes_store_module=Mail::SpamAssassin::BayesStore::DBM
dbg: bayes: learner_new: got
store=Mail::SpamAssassin::BayesStore::DBM=HASH(0x2c52558)
dbg: bayes: tie-ing to DB file R/O /var/lib/amavis/.spamassassin/bayes_toks
dbg: bayes: tie-ing to DB file R/O /var/lib/amavis/.spamassassin/bayes_seen
dbg: bayes: found bayes db version 3
dbg: bayes: DB journal sync: last sync: 0
dbg: bayes: DB journal sync: last sync: 0
dbg: bayes: corpus size: nspam = 6203, nham = 2479
dbg: bayes: score = 5.55111512312578e-17
dbg: bayes: DB journal sync: last sync: 0
dbg: bayes: untie-ing
dbg: timing: total 2925 ms - init: 907 (31.0%), parse: 1.92 (0.1%),
extract_message_metadata: 108 (3.7%), poll_dns_idle: 1040 (35.6%),
get_uri_detail_list: 1.22 (0.0%), tests_pri_-1000: 19 (0.7%),
compile_gen: 185 (6.3%), compile_eval: 19 (0.6%), tests_pri_-950: 5
(0.2%), tests_pri_-900: 5 (0.2%), tests_pri_-400: 32 (1.1%),
check_bayes: 26 (0.9%), tests_pri_0: 836 (28.6%), dkim_load_modules: 27
(0.9%), check_dkim_signature: 1.23 (0.0%), check_dkim_adsp: 24 (0.8%),
check_spf: 70 (2.4%), check_razor2: 202 (6.9%), check_pyzor: 135 (4.6%),
tests_pri_500: 988 (33.8%)
-----------------------------------------------------------
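
One thing worth checking in output like this is the "corpus size"
line, since SA only applies Bayes once it has learned enough spam and
ham (200 of each with the default bayes_min_spam_num and
bayes_min_ham_num settings). A rough sketch of pulling those counts out
of the debug output follows; the function name is made up, and the
pattern is based only on the line format shown above.

```shell
# bayes_corpus_ok [MIN]
# Hypothetical helper: reads "spamassassin -D" output on stdin, prints
# "NSPAM NHAM", and succeeds only if both counts are at least MIN
# (default 200). Returns 2 if no "corpus size" line was seen.
bayes_corpus_ok() {
    bc_min=${1:-200}

    # Pull the two counters out of every matching debug line
    bc_counts=$(sed -n 's/.*bayes: corpus size: nspam = \([0-9][0-9]*\), nham = \([0-9][0-9]*\).*/\1 \2/p')
    [ -n "$bc_counts" ] || return 2

    # Word-split on purpose; only the first pair is used
    set -- $bc_counts
    echo "$1 $2"
    [ "$1" -ge "$bc_min" ] && [ "$2" -ge "$bc_min" ]
}

# Example:
#   spamassassin -D -t < /tmp/email.txt 2>&1 | bayes_corpus_ok
```

A return status of 2 (no "corpus size" line at all) would be the
interesting case here, since it would suggest the Bayes check never got
as far as reading the database for that message.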

I'll wait and see if I receive messages without Bayes results and report
back.

Even if using DBM "works", I don't see this as a long-term solution --
only as a troubleshooting step. I would really like to keep my Bayes
data in a MySQL or PostgreSQL database.

Thanks again for the help!

-Ben
