Some thoughts on Bayesian Setup...

OliverScott Mon, 27 Aug 2007 07:09:48 -0700

Site Wide Bayes or Per User Bayes?

This is something I have been thinking about and thought I would share to see
what other people think...


Site wide bayes has one database. Per User bayes has one per user or domain
(depending on how your server is configured). For example, if you have 40
users with a 10 MB bayes database each then you either have to read and write
these to and from disk whenever an email comes in, or load all 400 MB of data
into memory.
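
To make the arithmetic above concrete, here is a rough sketch in Python. The
figures (40 users, 10 MB each) are just the ones from my example, and the
function name is made up for illustration. It also assumes a site wide database
would be no bigger than a single per user one, which may well not hold.

```python
# Figures from the example above; both are illustrative assumptions.
USERS = 40
DB_SIZE_MB = 10  # assumed size of one Bayes database


def bayes_footprint_mb(databases: int, size_mb: int) -> int:
    """Total size in MB of all Bayes databases that must stay available."""
    return databases * size_mb


per_user_total = bayes_footprint_mb(USERS, DB_SIZE_MB)
site_wide_total = bayes_footprint_mb(1, DB_SIZE_MB)

print(per_user_total)   # 400
print(site_wide_total)  # 10
```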

1. Most users don't know how, aren't allowed, or can't be bothered to train
Bayes. In most cases SpamAssassin is left to auto-train bayes.

2. Most people would consider the same emails to be SPAM. 90% of what I
think is spam would also be what you think is spam, with only a small
percentage of emails that we disagree on.

3. The emails which we would disagree on would probably be newsletters and
advertising emails from legitimate companies. Unwanted newsletters and
advertising emails which people have deliberately (possibly due to
stupidity) signed up to should not be trained as SPAM, but should be
manually blacklisted if necessary.

4. Site wide bayes saves disk space and more importantly it saves
significantly on disk IO or memory requirements.

5. A larger database leads to more accurate Bayesian identification - I am
guessing this is right?

Do you agree or disagree with the five above statements?

Based on the five above statements I would suggest that:
Site wide bayes is as good as if not slightly better (due to a potentially
larger single database) than per user bayes when it comes to identifying
SPAM emails.

1. What I think of as HAM emails could be widely different from what you
think of as HAM emails - if I were to sort your inbox by hand (without
knowing you personally) I would probably delete some good emails by mistake
while getting rid of the spam.

2. If a server has one customer who is a plumber and one who is an artist,
site wide bayes would learn that emails containing the words pipes or canvas
are good. The plumber will get emails with the word canvas in them tagged as
bayes_00 and vice versa.

3. If per user bayes is chosen then bayes_00 will only fire on emails
containing words which have occurred in emails which YOU have received in
the past and which scored low enough to be autolearned. 

4. If a HAM email is misclassified as SPAM then users are more likely to
report this to their admin or to train the filter themselves, than for SPAM
emails which are not tagged. People will ignore a few spam slipping through
but not false positives!

Do you agree or disagree with the four above statements?

Based on the four above statements I would suggest that:
Per User bayes is better than Site Wide bayes when it comes to correctly
identifying HAM emails.
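
The plumber and artist case in statement 2 can be illustrated with a toy token
count in Python. This is only a sketch: the messages are invented, the helper
name is hypothetical, and plain whitespace splitting is nothing like the real
SpamAssassin tokenizer. It just shows how a shared HAM pool picks up words that
one particular user never receives.

```python
from collections import Counter


def count_ham_tokens(messages):
    """Count, per token, how many HAM messages contained it."""
    counts = Counter()
    for text in messages:
        # Use a set so a token counts once per message.
        counts.update(set(text.lower().split()))
    return counts


# Invented sample mail for two hypothetical customers.
plumber_ham = ["quote for copper pipes", "pipes delivered tomorrow"]
artist_ham = ["canvas order shipped", "new canvas and brushes"]

site_wide = count_ham_tokens(plumber_ham + artist_ham)
plumber_only = count_ham_tokens(plumber_ham)

print(site_wide["canvas"])     # 2: the shared pool treats "canvas" as HAM
print(plumber_only["canvas"])  # 0: the plumber's own HAM never mentions it
```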


If my various assumptions are correct then perhaps there should be a third
type of bayes to choose from in SpamAssassin? Namely one where:
SPAM tokens are stored on a server wide basis - can be a LARGE database if
this helps
HAM tokens are stored on a per user basis - probably only needs a 1-2 MB file
per user.
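
To show the shape of what I mean, here is a minimal Python sketch of such a
split store. Everything in it is hypothetical: the class and method names are
invented, the tokenizer is a plain whitespace split, and the per-token score is
a simple ratio of message rates. It is not how SpamAssassin stores tokens or
combines probabilities, only an illustration of shared SPAM counts alongside
per user HAM counts.

```python
from collections import Counter, defaultdict


class HybridTokenStore:
    """Toy store: SPAM tokens shared site wide, HAM tokens kept per user."""

    def __init__(self):
        self.spam_tokens = Counter()             # one shared SPAM table
        self.spam_messages = 0
        self.ham_tokens = defaultdict(Counter)   # one HAM table per user
        self.ham_messages = Counter()

    @staticmethod
    def _tokens(text):
        # Simplistic tokenizer, assumed purely for illustration.
        return set(text.lower().split())

    def learn_spam(self, text):
        self.spam_tokens.update(self._tokens(text))
        self.spam_messages += 1

    def learn_ham(self, user, text):
        self.ham_tokens[user].update(self._tokens(text))
        self.ham_messages[user] += 1

    def token_spam_probability(self, user, token):
        """Rough spamminess of a token as seen by one user."""
        spam_rate = self.spam_tokens[token] / max(self.spam_messages, 1)
        ham_rate = self.ham_tokens[user][token] / max(self.ham_messages[user], 1)
        if spam_rate + ham_rate == 0:
            return 0.5  # token never seen: no evidence either way
        return spam_rate / (spam_rate + ham_rate)


store = HybridTokenStore()
store.learn_spam("cheap canvas prints buy now")
store.learn_ham("artist", "canvas order shipped")
store.learn_ham("plumber", "pipes delivered tomorrow")

print(store.token_spam_probability("artist", "canvas"))   # 0.5
print(store.token_spam_probability("plumber", "canvas"))  # 1.0
```

With this layout the artist's own HAM offsets the shared SPAM evidence for
"canvas", while the plumber, who never receives that word in good mail, gets
no such offset.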

Any comments?

PS. I am not up to coding anything like this myself so don't bother
suggesting that I try it and report back!
-- 
View this message in context: 
http://www.nabble.com/Some-thoughts-on-Baysian-Setup...-tf4335489.html#a12347630
Sent from the SpamAssassin - Users mailing list archive at Nabble.com.
