Bayes db and training questions

Alex Sun, 11 Apr 2010 09:54:16 -0700

Hi,

I think my bayes database may not be correctly identifying ham and
spam. Messages that it should already have seen are not being scored
as definitively ham or spam, when I think they should be.


I've tried evaluating messages with '-D bayes', manually training on
large mboxes of ham and spam with frequently seen patterns, and
increasing the database size to the point where there isn't any
unnecessary expiry. I also have the autolearn thresholds set to
extremes (-2.0 and 16.0).

I have a few questions that I've been unable to answer in my research,
and hoped someone could help or point me in the right direction. I'm
still using sa-v3.2.5.

- Would it be better to strip the existing SA headers from the email
before manually learning, or does it not matter?
- How do you know when to use --forget rather than just --ham or
--spam, apart from the obvious case where you've knowingly learned a
message as the wrong type?
- How or when do you need to do --rebuilddb?
- Is there any way to throttle the IO during the learning process?
It's running on a pretty powerful box, but IO becomes a bottleneck,
and learning large mboxes takes a while and makes the box otherwise
unresponsive.
- How can I tell if the database has been trained incorrectly? Is
there perhaps some type of 'monitor' mode where I could point sa-learn
at an mbox and have it tell me which messages it considers mostly ham
or mostly spam, or the bayes score of each message?
- When it says "Learned tokens from XX messages" is there a way to see
how many tokens were learned from each message without having to split
the mbox into individual messages?
- I'm using SA through amavis. Although I have the X-Spam-Status:
header, I'd like to add another header specifically for bayes, but it
seems amavisd strips any add_header changes I make in my SA config.
Does anyone know how to either prevent amavisd from stripping the
header or how to modify amavisd to include this info?
- Could excessive load on the system cause the bayes process to
time out and be skipped? If so, how can I find out if that's happening?

I thought it might be helpful to include '--dump magic' here:

0.000          0          3          0  non-token data: bayes db version
0.000          0    1190999          0  non-token data: nspam
0.000          0     558109          0  non-token data: nham
0.000          0    3268221          0  non-token data: ntokens
0.000          0 1268881679          0  non-token data: oldest atime
0.000          0 1271004108          0  non-token data: newest atime
0.000          0 1271004276          0  non-token data: last journal sync atime
0.000          0 1270699541          0  non-token data: last expiry atime
0.000          0     345600          0  non-token data: last expire atime delta
0.000          0    1227601          0  non-token data: last expire reduction count
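
To make the figures above easier to read, here is a minimal Python
sketch that is not part of SpamAssassin. It assumes that the third
column of each "non-token data" line holds the value and that the
atime fields are Unix epoch seconds; the helper names parse_dump_magic
and summarize are hypothetical.

```python
from datetime import datetime, timezone

# The '--dump magic' output quoted above, with the wrapped last line rejoined.
DUMP = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0    1190999          0  non-token data: nspam
0.000          0     558109          0  non-token data: nham
0.000          0    3268221          0  non-token data: ntokens
0.000          0 1268881679          0  non-token data: oldest atime
0.000          0 1271004108          0  non-token data: newest atime
0.000          0 1271004276          0  non-token data: last journal sync atime
0.000          0 1270699541          0  non-token data: last expiry atime
0.000          0     345600          0  non-token data: last expire atime delta
0.000          0    1227601          0  non-token data: last expire reduction count
"""

ATIME_KEYS = (
    "oldest atime",
    "newest atime",
    "last journal sync atime",
    "last expiry atime",
)


def parse_dump_magic(text):
    """Map each 'non-token data' label to its integer value.

    Assumption: the value is the third whitespace-separated column.
    """
    values = {}
    for line in text.splitlines():
        if "non-token data:" not in line:
            continue
        fields, label = line.split("non-token data:", 1)
        values[label.strip()] = int(fields.split()[2])
    return values


def summarize(values):
    """Convert atime fields to UTC timestamps and compute the spam:ham ratio.

    Assumption: atime values are Unix epoch seconds.
    """
    summary = {}
    for key in ATIME_KEYS:
        stamp = datetime.fromtimestamp(values[key], tz=timezone.utc)
        summary[key] = stamp.isoformat()
    summary["spam:ham ratio"] = round(values["nspam"] / values["nham"], 2)
    return summary


if __name__ == "__main__":
    for key, value in summarize(parse_dump_magic(DUMP)).items():
        print(f"{key}: {value}")
```

Under those assumptions the script prints one line per atime field
plus the ratio of learned spam to learned ham messages, which may help
when judging whether the training set is heavily skewed.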

Thanks so much.
Best regards,
Alex
