Chris Hastie wrote:
The industry that I work in is currently having its concept of risk assessment
thoroughly shaken. The sorts of risk we deal with have three main, largely
independent factors. For years we've been assigning a value to each of these
factors, and then adding them up to come up with a figure representative of
relative risk.
Then along came some bright spark who knew a little bit about statistics. He
showed that we can estimate the risk of each of the three factors. Then he
pointed out that for someone to be injured, all three had to happen. And the
probability of a AND b AND c is the *product* of the three probabilities, not
the sum. It all makes sense. And, frighteningly, it gives quite different results
from the method we've used for years.
I suspect this has nothing to do with multiply vs add, but more with
better data representations and better algorithms.
So I'm thinking about writing myself a policy server for Postfix. I want to
consider different things, weight them and use a combination of factors to
decide whether or not to reject mail. Much like SA does. Thinking about how to
weight things, I realised that the same principles could be applied to spam.
Perhaps.
- log(P1*P2*P3) = log(P1) + log(P2) + log(P3)
so to multiply or to add is not the question:-)
- addition is more stable numerically, so even if your knowledge about
the problem is multiplicative, you'll want to convert it into additive
computations when possible (a short sketch follows this list).
- the background from linear/convex/... programming is easier to apply
to additive forms
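To make the log trick concrete, here is a minimal sketch in plain Python,
using the fictional RBL/HELO/SPF figures from the example below. It just
shows that summing logs and exponentiating gives the same answer as
multiplying the probabilities directly, while behaving better numerically
as the number of small factors grows.

import math

# Fictional per-test probabilities that a message is ham
# (RBL hit, odd HELO, SPF fail), as in the example below.
ham_probs = [0.05, 0.2, 0.4]

# Multiplying directly...
product = 1.0
for p in ham_probs:
    product *= p

# ...matches summing logs and exponentiating, which avoids underflow
# when many small probabilities are combined.
log_sum = sum(math.log(p) for p in ham_probs)

print(product)            # ~0.004
print(math.exp(log_sum))  # ~0.004 (equal, up to floating-point rounding)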
For example (fictional figures here), say 95% of mail from clients in a
particular RBL is spam. We could say, then, that such an item of mail has a
probability of 0.05 of being ham. 80% of mail from clients giving a particular
form of HELO is spam - probability of 0.2 that it is ham. 60% of SPF fails are
spam - probability of 0.4 that such a mail is ham.
Thus if a piece of mail has failed all three of these tests, the probability of
it being ham is 0.05 * 0.2 * 0.4 = 0.004, or 1/250. Or put another way, we can
be 99.6% sure it is spam.
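As a rough sketch of how that combination might look inside a policy server
(same fictional figures; the threshold, test names and helper function are
invented for illustration, and this still assumes the tests are independent):

REJECT_THRESHOLD = 0.01  # reject once P(ham) drops below 1% (invented value)

def ham_probability(test_results):
    """test_results maps each test that fired to the probability the
    mail is ham given that the test fired; unfired tests are omitted."""
    p = 1.0
    for prob in test_results.values():
        p *= prob
    return p

results = {
    "rbl_listed": 0.05,  # 95% of RBL-listed clients send spam
    "odd_helo":   0.20,  # 80% of this form of HELO is spam
    "spf_fail":   0.40,  # 60% of SPF failures are spam
}

p_ham = ham_probability(results)  # 0.05 * 0.2 * 0.4 = 0.004
action = "REJECT" if p_ham < REJECT_THRESHOLD else "DUNNO"
print(p_ham, action)              # ~0.004 REJECT

(DUNNO here stands for the Postfix action that makes no decision and leaves
the message to later restrictions.)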
Now I'm neither a statistician nor an expert in fighting spam, so I'm sure there
are some flaws in this idea somewhere. One of them is probably that the various
tests available are not statistically independent. But as a basic principle, is
there mileage in this, or should I stick with addition, or find another way of
weighting stuff altogether?
you might try a neural network (or friends) and check whether you can do
the computations fast enough (to avoid client timeout). then see if the
results are better than plain logic ("he's in an RBL, let's reject and
let him get out...").
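For what it's worth, a single logistic unit is about the simplest member of
that family, and it ties the two halves of this thread together: it adds up
weighted evidence and squashes the sum, which is just the additive (log-odds)
form of the multiplicative model sketched above. The weights and bias below
are invented; in practice they would be fitted to a labelled corpus.

import math

WEIGHTS = {"rbl_listed": 2.9, "odd_helo": 1.4, "spf_fail": 0.4}  # invented
BIAS = -2.0                                                      # invented

def spam_score(fired_tests):
    """fired_tests is the set of test names that matched the message;
    returns a spam score in (0, 1)."""
    z = BIAS + sum(WEIGHTS[t] for t in fired_tests)
    return 1.0 / (1.0 + math.exp(-z))  # logistic squashing

print(spam_score({"rbl_listed", "odd_helo", "spf_fail"}))  # ~0.94, spammy
print(spam_score(set()))                                   # ~0.12, not spammy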
I'm actually going to be away from my computer for the next ten days, so I
apologise if I don't promptly respond to your responses, but rest assured I
will read them with great interest when I get back...