On 4/19/2013 11:42 AM, Alex wrote:
> Hi,
> 
>> Is this normal? If so, what is the explanation for this behavior? I have
> 
>         marked dozens of nearly-identical messages with the subject
>         "Garden hose expands up to three times its length" as SPAM
>         (over the course of several weeks), and yet SA reports "not
>         enough usable tokens found".
> 
> 
>     If they are identical, I don't believe it will create new tokens,
>     per se.
> 
>      
> 
>         Is SA referring to the number of tokens in the message? Or the
>         Bayes DB?
> 
> 
> I should also mention that while training a message, use "--progress",
> as such (assuming you're running it on an mbox or message that's in mbox
> format):
> 
> # sa-learn --progress --spam --mbox mymboxfile
> 
> It will show you how many tokens have been learned during that run. It
> might also be a good idea to add the token summary flag to your config:
> 
> add_header all Tok-Stat _TOKENSUMMARY_
> 
> If you run spamassassin on a message directly, and add the -t option, it
> will show you the number of different types of tokens found in the message:
> 
> X-Spam-Tok-Stat: Tokens: new, 0; hammy, 6; neutral, 84; spammy, 36.
> 
> Regards,
> Alex
> 

Alex, thanks very much for the quick reply. I really appreciate it.

One can see from the output in my previous message (two messages back)
that the user is amavis (correct for my system) and the corpus size, as
well as the token count:

dbg: bayes: corpus size: nspam = 6155, nham = 2342
dbg: bayes: tok_get_all: token count: 176
dbg: bayes: cannot use bayes on this message; not enough usable tokens found
bayes: not scoring message, returning undef

Now that I look at this output again, the "token count: 176" stands out.
That seems like a pretty low value. Is this the token count for the
entire Bayes DB, or only the count of tokens that apply to the
particular message being fed to SA?

The "garden hose" messages are probably not *identical*, but they are
very similar, so it seems that each variant should have tokens to offer.

The concern I expressed around bug 6624 relates to Mark's comment, which
seems to imply that, with this bug, SA inserts a token only once but
then fails to increment that token's "count" on later trainings. Here's
an excerpt from Mark's comment on that bug report:

"The effect of the bug with SpamAssassin is that tokens are only able
to be inserted once, but their counts cannot increase, leading to
terrible bayes results if the bug is not noticed. Also the conversion
from db fails, as reported by Dave."

Is it possible that training similar messages as SPAM is not having the
intended effect due to this bug in my version of SA?
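One way I could try to confirm this would be to watch a token's counts
directly in the database across two training runs (a rough sketch
against the standard SA SQL Bayes schema; the id = 1 below is the
bayes_vars id for my amavis user, per the table further down):

```sql
-- Note the counts on some recently-touched tokens, retrain one of the
-- "garden hose" messages with sa-learn --spam, then run this again.
-- If spam_count never moves on re-training, the bug 6624 behavior
-- (tokens inserted once, counts never incremented) may be in play.
SELECT HEX(token), spam_count, ham_count, atime
FROM bayes_token
WHERE id = 1
ORDER BY atime DESC
LIMIT 10;
```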

My "bayes_vars" table looks like this (shown one column per line to
avoid wrapping):

id:                  1
username:            amavis
spam_count:          6185
ham_count:           2427
token_count:         120092
last_expire:         1366364379
last_atime_delta:    8380417
last_expire_reduce:  14747
oldest_token_age:    1357985848
newest_token_age:    1366386865

The SQL query:

SELECT COUNT(*) FROM `bayes_token`;

returns 120092, so the value above is accurate (that is, the
"token_count" value in the `bayes_vars` table matches the actual row
count in the `bayes_token` table).
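For what it's worth, the same cross-check can be done in one statement
(a sketch; adjust the username for your own setup):

```sql
-- Compare the token_count recorded in bayes_vars against the actual
-- number of rows in bayes_token for the same user id.
SELECT v.token_count,
       (SELECT COUNT(*) FROM bayes_token t WHERE t.id = v.id) AS actual_rows
FROM bayes_vars v
WHERE v.username = 'amavis';
```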

Also, thanks for the other tips regarding the "token summary flag"
directive and the -t switch. I was actually using the -t switch to
produce the output that I pasted two messages back. So, it seems that
the "X-Spam-Tok-Stat" header is added only when the token count is high
enough to be useful.

Still stumped here...

-Ben
