On 4/19/2013 11:42 AM, Alex wrote:
> Hi,
>
>> Is this normal? If so, what is the explanation for this behavior? I
>> have marked dozens of nearly-identical messages with the subject
>> "Garden hose expands up to three times its length" as SPAM (over the
>> course of several weeks), and yet SA reports "not enough usable
>> tokens found".
>
> If they are identical, I don't believe it will create new tokens, per se.
>
>> Is SA referring to the number of tokens in the message? Or the Bayes DB?

I should also mention that while training a message, use "--progress",
as such (assuming you're running it on an mbox or message that's in mbox
format):

# sa-learn --progress --spam --mbox mymboxfile

It will show you how many tokens have been learned during that run. It
might also be a good idea to add the token summary flag to your config:

add_header all Tok-Stat _TOKENSUMMARY_

If you run spamassassin on a message directly and add the -t option, it
will show you the number of different types of tokens found in the
message:

X-Spam-Tok-Stat: Tokens: new, 0; hammy, 6; neutral, 84; spammy, 36.

Regards,
Alex
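(For anyone who wants to script against that header: a minimal sketch of a parser for the Tok-Stat value. The helper name `parse_tok_stat` is hypothetical, not part of SpamAssassin; it just splits the category/count pairs shown in the example header above.)

```python
import re

def parse_tok_stat(value):
    """Parse an X-Spam-Tok-Stat value such as
    'Tokens: new, 0; hammy, 6; neutral, 84; spammy, 36.'
    into a dict mapping category -> count."""
    # Each category appears as "<word>, <digits>"; "Tokens:" itself
    # is not followed by a comma, so it is skipped by the pattern.
    return {m.group(1): int(m.group(2))
            for m in re.finditer(r"(\w+), (\d+)", value)}

counts = parse_tok_stat("Tokens: new, 0; hammy, 6; neutral, 84; spammy, 36.")
print(counts)  # {'new': 0, 'hammy': 6, 'neutral': 84, 'spammy': 36}
```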
Alex, thanks very much for the quick reply. I really appreciate it.

One can see from the output in my previous message (two messages back)
that the user is amavis (correct for my system), as well as the corpus
size and the token count:

dbg: bayes: corpus size: nspam = 6155, nham = 2342
dbg: bayes: tok_get_all: token count: 176
dbg: bayes: cannot use bayes on this message; not enough usable tokens found
bayes: not scoring message, returning undef

Now that I look at this output again, the "token count: 176" stands out.
That seems like a pretty low value. Is this the token count for the
entire Bayes DB, or only for the tokens that apply to the particular
message being fed to SA?

The "garden hose" messages are probably not *identical*, but they are
very similar, so it seems that each variant should have tokens to offer.

The concern I expressed around bug 6624 relates to Mark's comment, which
seems to imply that while SA will not insert a token twice, it *will*
increase the token "count". Here's an excerpt from Mark's comment from
that bug report:

"The effect of the bug with SpamAssassin is that tokens are only able to
be inserted once, but their counts cannot increase, leading to terrible
bayes results if the bug is not noticed. Also the conversion form db
fails, as reported by Dave."

Is it possible that training similar messages as SPAM is not having the
intended effect due to this bug in my version of SA?

My "bayes_vars" table looks like this (sorry for the wrapping, this is
the best I could do):

id  username  spam_count  ham_count  token_count  last_expire  last_atime_delta  last_expire_reduce  oldest_token_age  newest_token_age
1   amavis    6185        2427       120092       1366364379   8380417           14747               1357985848        1366386865

The SQL query

SELECT count( * ) FROM `bayes_token`

returns 120092, so the above value is accurate (that is, the
"token_count" value in the `bayes_vars` table matches the actual row
count in the `bayes_token` table).
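(To make the symptom described in bug 6624 concrete, here is a toy model, not SpamAssassin's actual code: a correct trainer increments a token's count on every training pass, while the buggy behavior Mark describes inserts the token once and never increases its count. Repeatedly training near-identical spam then leaves every token looking like it was seen only once.)

```python
from collections import Counter

def train(db, tokens, buggy=False):
    """Toy per-token spam-count update.
    With buggy=True, a token can be inserted once but its count
    never grows on later sightings (the bug-6624 symptom)."""
    for tok in tokens:
        if buggy and tok in db:
            continue        # insert-once: count frozen at 1
        db[tok] += 1        # normal path: count grows with each training

ok, broken = Counter(), Counter()
msg = ["garden", "hose", "expands"]
for _ in range(50):         # train 50 near-identical spams
    train(ok, msg)
    train(broken, msg, buggy=True)

print(ok["garden"], broken["garden"])  # 50 1
```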
Also, thanks for the other tips regarding the "token summary flag"
directive and the -t switch. I was actually using the -t switch to
produce the output that I pasted two messages back. So, it seems that
the "X-Spam-Tok-Stat" output is added only when the token count is high
enough to be useful.

Still stumped here...

-Ben