On Thu, 08 Jan 2004 20:50:17 -0800, Chris Petersen <[EMAIL PROTECTED]> writes:
> > Is there a SpamAssassin fix for it or some test I can increase to
> > fix it??
>
> Check the "lots of random words" thread from today for a couple of
> (probably short-term) solutions... Other than that, I'm hoping that
> bayes will start picking up on some of the less common words that get
> tossed into the bunch.

Here's a formal model that should help SA exploit an effect of unknown tokens. The idea is that past a certain point in training, one should start to assume that the bayes model for ham is becoming accurate. That is, that an unknown word is an indicator of spam. So this means that we should adjust the probability assigned to unseen tokens upward as we obtain more samples of ham.

More formally, instead of:

    P(HAM | <never before seen token>) = .5, or an arbitrary constant

have it:

    P(HAM | <never before seen token>) = F(# ham messages analyzed)

F could be something like F(x) = 0.5 - 0.4 * min(x/1000, 1), which is a linear map from .5 down to .1 as the training set grows, but we can be a lot more accurate. We can directly compute what F should be using probability theory and a corpus of ham.

Take a corpus of ham, and choose a random subset of 10000 messages as S1, and a random subset of $N$ messages as S2. Scan both corpora for tokens. Then determine what fraction of tokens in corpus S1 do not occur in S2. Note that this is the probability that a new token is a hamsign. For a more accurate result, repeat this a few times on additional random subsets. Now plot this as $N$ changes, and curve-fit it to a function $G$. (I can help with this if someone more familiar with SA internals can produce the raw data.)

This is exactly the function $F$ we wanted above [1] --- both are the probability that a new token is a hamsign, as a function of ham training set size.

This model isn't a panacea, but it should provide solid guidance on how to tune the bayes parameter so that it does a good job of catching bayes poison.
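A rough sketch of the subset experiment described above, in Python. This is an illustration only: `tokenize` is a stand-in for SA's real tokenizer, `ham_corpus` is assumed to be a list of ham message bodies, and the subset sizes are parameters you'd scale up (e.g. a 10000-message holdout) before curve-fitting G.

```python
import random
import re

def tokenize(message):
    # Crude word tokenizer; SA's real tokenizer is far more elaborate.
    return set(re.findall(r"[A-Za-z0-9'$.-]+", message.lower()))

def estimate_unseen_fraction(ham_corpus, train_size, holdout_size=1000, trials=5):
    """Estimate P(a token from fresh ham was never seen in a ham
    training set of `train_size` messages), averaged over `trials`
    random splits.  Evaluating this over a range of train_size values
    gives the raw data points for the function G described above."""
    fractions = []
    for _ in range(trials):
        sample = random.sample(ham_corpus, train_size + holdout_size)
        train, holdout = sample[:train_size], sample[train_size:]
        seen = set().union(*(tokenize(m) for m in train)) if train else set()
        new = total = 0
        for m in holdout:
            for t in tokenize(m):
                total += 1
                new += t not in seen
        fractions.append(new / total if total else 0.0)
    return sum(fractions) / len(fractions)
```

Plotting `estimate_unseen_fraction(corpus, N)` against N and fitting a curve would yield the G (hence F) we want; the fraction should fall toward zero as N grows, which is exactly the "unseen token becomes a spamsign" effect.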
Would someone with a ham corpus do the above test and make the plot?

Scott

[1] This isn't quite true. The function $G$ computes
P(HAM | <never before seen token in *HAM*>), not
P(HAM | <never before seen token>). So I believe G would be biased
toward overestimating the hamminess of unseen tokens that occur only
in spam. I'm not concerned.