On Thu, 08 Jan 2004 20:50:17 -0800, Chris Petersen <[EMAIL PROTECTED]> writes:

> > Is there a 
> > SpamAssassin fix for it or some test I can increase to fix it??
> 
> Check the "lots of random words" thread from today for a couple of
> (probably short-term) solutions...  Other than that, I'm hoping that
> bayes will start picking up on some of the less common words that get
> tossed into the bunch.
> 

Here's a formal model that should help SA exploit an effect of unknown
tokens. The idea is that past a certain point in training, one should
start to assume that the Bayes model for ham is becoming accurate,
i.e., that an unknown word is an indicator of spam.

So this means that we should adjust the probability assigned to unseen
tokens upward as we obtain more samples of ham.

More formally, instead of:

  P(HAM | <never before seen token> ) = .5, or some arbitrary constant

have it:

  P(HAM | <never before seen token> ) =  F(# ham messages analyzed)

F could be something like F(x) = .5 - .4*min(x/1000, 1), which is a
linear map from .5 down to .1 as the training set grows, but we can be
a lot more accurate. Given a corpus of ham, we can compute directly
what F should be using probability theory.

Take a corpus of ham, choose a random subset of 10000 messages as S1,
and a random subset of $N$ messages as S2.

Scan both subsets for tokens, then determine what fraction of the
tokens in S1 do not occur in S2. Note that this is the probability
that a new token is a hamsign. For a more accurate result, repeat this
a few times on additional random subsets and average.
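The experiment above can be sketched like this (a toy harness, not SA
internals; `ham_tokenized` is assumed to be a list of already-tokenized
messages, each a list of token strings):

```python
import random

def unseen_fraction(ham_tokenized, s1_size, s2_size, trials=5, seed=0):
    """Estimate the fraction of tokens in a fresh ham sample S1 that
    never appear in a training sample S2 of `s2_size` messages,
    averaged over `trials` random draws.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        msgs = ham_tokenized[:]
        rng.shuffle(msgs)
        s1 = msgs[:s1_size]
        s2 = msgs[s1_size:s1_size + s2_size]
        seen = {tok for msg in s2 for tok in msg}      # training vocab
        s1_tokens = [tok for msg in s1 for tok in msg]
        unseen = sum(1 for tok in s1_tokens if tok not in seen)
        total += unseen / len(s1_tokens)
    return total / trials
```

Sweeping s2_size over a range of N values gives the data points to
plot.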

Now plot this fraction as $N$ changes, and curve-fit it to a function
$G$. (I can help with this if someone more familiar with SA internals
can produce the raw data.) This is exactly the function $F$ we wanted
above [1]--- both are the probability that a new token is a hamsign, as
a function of ham training set size.
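The curve fit could look like this; the power-law form G(N) = c * N^-b
is my assumption about the likely shape (new-vocabulary growth tends to
follow one), and the real data may suggest something else:

```python
import math

def fit_power_law(ns, fracs):
    """Least-squares fit of G(N) ~ c * N**(-b) via linear regression
    in log-log space (pure stdlib). Returns (c, b).
    """
    xs = [math.log(n) for n in ns]
    ys = [math.log(f) for f in fracs]
    k = len(xs)
    mx, my = sum(xs) / k, sum(ys) / k
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
            / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    return math.exp(intercept), -slope

# Usage: feed in the measured (N, unseen-fraction) pairs from the
# experiment, then use F(N) = c * N**(-b) as the unseen-token prior.
```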

This model isn't a panacea, but it should give solid guidance on how
to tune the Bayes parameter so that it does a good job of catching
Bayes poison.

Would someone with a ham corpus do the above test and make the plot?


Scott

[1] This isn't quite true. The function $G$ computes

  P(HAM | <never before seen token in *HAM*> ) 

not

  P(HAM | <never before seen token> )

So, I believe G would be biased toward overestimating the hamminess of
unseen tokens that occur only in spam. I'm not concerned.




_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
