Site Wide Bayes or Per User Bayes?

This is something I have been thinking about, and I thought I would share it to see what other people think...
Site wide bayes has one database. Per user bayes has one per user or domain (depending on how your server is configured). For example, if you have 40 users with a 10MB bayes database each, then you either have to read and write these to and from disk whenever an email comes in, or load all 400MB of data into memory.

1. Most users don't know how, aren't allowed, or can't be bothered to train bayes. In most cases SpamAssassin is left to auto-train bayes.

2. Most people would consider the same emails to be SPAM. Roughly 90% of what I think is spam would also be what you think is spam, with only a small percentage of emails that we disagree on.

3. The emails which we would disagree on would probably be newsletters and advertising emails from legitimate companies. Unwanted newsletters and advertising emails which people have deliberately (possibly due to stupidity) signed up to should not be trained as SPAM, but should be manually blacklisted if necessary.

4. Site wide bayes saves disk space and, more importantly, it saves significantly on disk IO or memory requirements.

5. A larger database leads to more accurate Bayesian identification - I am guessing this is right?

Do you agree or disagree with the five statements above?

Based on the five statements above I would suggest that:

Site wide bayes is as good as, if not slightly better than (due to a potentially larger single database), per user bayes when it comes to identifying SPAM emails.

On the other hand:

1. What I think of as HAM emails could be widely different from what you think of as HAM emails - if I were to sort your inbox by hand (without knowing you personally) I would probably delete some good emails by mistake while getting rid of the spam.

2. If a server has one customer who is a plumber and one who is an artist, site wide bayes would learn that emails containing the words "pipes" or "canvas" are good. The plumber will get emails with the word "canvas" in them tagged as BAYES_00, and vice versa.

3. If per user bayes is chosen then BAYES_00 will only fire on emails containing words which have occurred in emails which YOU have received in the past and which scored low enough to be autolearned.

4. If a HAM email is misclassified as SPAM then users are more likely to report this to their admin, or to train the filter themselves, than they are for SPAM emails which are not tagged. People will ignore a few spam slipping through but not false positives!

Do you agree or disagree with the four statements above?

Based on the four statements above I would suggest that:

Per user bayes is better than site wide bayes when it comes to correctly identifying HAM emails.

If my various assumptions are correct then perhaps there should be a third type of bayes to choose from in SpamAssassin? Namely one where:

SPAM tokens are stored on a server wide basis - can be a LARGE database if this helps
HAM tokens are stored on a per user basis - probably only needs a 1-2MB file per user.

Any comments?

PS. I am not up to coding anything like this myself so don't bother suggesting that I try it and report back!

--
View this message in context: http://www.nabble.com/Some-thoughts-on-Baysian-Setup...-tf4335489.html#a12347630
Sent from the SpamAssassin - Users mailing list archive at Nabble.com.
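
For anyone who wants something concrete to pick apart, here is a rough sketch of the "third type" idea: one shared SPAM token table for the whole server and a separate HAM token table per user. It is only an illustration in Python. The class and method names (HybridBayes, learn_spam, learn_ham, spam_probability) are made up for this example, the tokenizer is a plain whitespace split, and the scoring is a simple naive-Bayes log-odds sum with add-one smoothing and equal priors. None of this is taken from SpamAssassin's own code or storage format.

```python
import math
from collections import Counter


class HybridBayes:
    """Toy hybrid store (hypothetical): shared spam table, per-user ham tables."""

    def __init__(self):
        self.spam_tokens = Counter()   # one table shared by every user
        self.spam_messages = 0         # number of spam messages learned
        self.ham_tokens = {}           # user -> Counter of that user's ham tokens
        self.ham_messages = Counter()  # user -> number of ham messages learned

    @staticmethod
    def tokenize(text):
        # Assumption: lower-cased whitespace split, each token counted once per message.
        return set(text.lower().split())

    def learn_spam(self, text):
        self.spam_tokens.update(self.tokenize(text))
        self.spam_messages += 1

    def learn_ham(self, user, text):
        self.ham_tokens.setdefault(user, Counter()).update(self.tokenize(text))
        self.ham_messages[user] += 1

    def spam_probability(self, user, text):
        ham = self.ham_tokens.get(user, Counter())
        n_spam = self.spam_messages
        n_ham = self.ham_messages[user]
        log_odds = 0.0
        for token in self.tokenize(text):
            # Add-one smoothing so unseen tokens never give a zero probability.
            p_spam = (self.spam_tokens[token] + 1) / (n_spam + 2)
            p_ham = (ham[token] + 1) / (n_ham + 2)
            log_odds += math.log(p_spam / p_ham)
        # Clamp before exponentiating to avoid overflow on long messages.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))


store = HybridBayes()
store.learn_spam("cheap pills buy now")
store.learn_ham("plumber", "quote for new pipes in the kitchen")
store.learn_ham("artist", "canvas and paint order shipped")

message = "canvas order shipped"
print("artist :", store.spam_probability("artist", message))
print("plumber:", store.spam_probability("plumber", message))
```

In this toy run the same message scores lower for the artist than for the plumber, because only the artist's HAM table has seen "canvas". That is the plumber/artist effect described above, while every user still benefits from the single shared SPAM table.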