Hi, I think I may have a problem with my bayes database: it isn't correctly identifying ham and spam. Messages that it should already have seen are not being scored as definitively ham or spam, when I think they should be.
I've tried evaluating messages with '-D bayes', manually training on large mboxes of ham and spam with frequent patterns, and increasing the database size to the point where there isn't any unnecessary expiry. I also have the autolearn thresholds set to extremes (-2.0 and 16.0). I have a few questions that I've been unable to answer in my research, and hoped someone could help or point me in the right direction. I'm still using sa-v3.2.5.

- Would it be better to strip the existing SA headers from the email before manually learning, or doesn't it matter?
- How do you know whether you should use --forget or just either --ham or --spam, outside of the obvious times when you've knowingly learned a message as the wrong type?
- How or when do you need to do --rebuilddb?
- Is there any way to throttle the IO during the learning process? It's running on a pretty powerful box, but IO becomes a bottleneck, and learning large mboxes takes a while and makes the box otherwise unresponsive.
- How can I tell if the database is trained incorrectly? Perhaps there is some type of 'monitor' mode where I could point sa-learn to an mbox and it would tell me which messages it thought were mostly ham and mostly spam, or the bayes score of each message?
- When it says "Learned tokens from XX messages", is there a way to see how many tokens came from each message without having to split the mbox up into individual messages?
- I'm using SA through amavis. Although I have the X-Spam-Status: header, I'd like to add another header specifically for bayes, but it seems amavisd strips any add_header changes I made to my SA config. Does anyone know how to either prevent amavisd from stripping the header or how to modify amavisd to include this info?
- Could excessive load on the system cause the bayes process to time out and be skipped? If so, how can I find out if that's happening?
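
To illustrate the kind of per-message report I mean in the 'monitor' question, here is a rough Python sketch. It does not run the classifier at all; it only reads the X-Spam-Status headers already present in an mbox and pulls out whichever BAYES_NN rule was recorded there. The function names bayes_rule and report are my own, and I'm assuming the rule list appears in that header either as tests=BAYES_99,... or in the bracketed tests=[BAYES_99=3.5, ...] form.

```python
import mailbox
import re
import sys

# Matches the stock SpamAssassin Bayes rule names (BAYES_00 ... BAYES_99).
BAYES_RE = re.compile(r"\bBAYES_\d{2,3}\b")


def bayes_rule(status_header):
    """Return the BAYES_NN rule named in an X-Spam-Status value, or None."""
    if not status_header:
        return None
    match = BAYES_RE.search(str(status_header))
    return match.group(0) if match else None


def report(mbox_path):
    """Yield (subject, rule) for every message in the mbox at mbox_path."""
    for msg in mailbox.mbox(mbox_path):
        subject = msg.get("Subject", "(no subject)")
        yield subject, bayes_rule(msg.get("X-Spam-Status"))


# Hypothetical usage: python bayes_report.py /path/to/some.mbox
if __name__ == "__main__" and len(sys.argv) > 1:
    for subject, rule in report(sys.argv[1]):
        print(f"{rule or 'no BAYES hit'}\t{subject}")
```

A message that shows 'no BAYES hit' here either has no X-Spam-Status header or was scanned without a bayes rule firing, which is the case I'm trying to spot.
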
I thought it might be helpful to include the output of '--dump magic' here:

0.000          0          3          0  non-token data: bayes db version
0.000          0    1190999          0  non-token data: nspam
0.000          0     558109          0  non-token data: nham
0.000          0    3268221          0  non-token data: ntokens
0.000          0 1268881679          0  non-token data: oldest atime
0.000          0 1271004108          0  non-token data: newest atime
0.000          0 1271004276          0  non-token data: last journal sync atime
0.000          0 1270699541          0  non-token data: last expiry atime
0.000          0     345600          0  non-token data: last expire atime delta
0.000          0    1227601          0  non-token data: last expire reduction count

Thanks so much.

Best regards,
Alex