Matt Sergeant wrote:

Sean Redmond said the following on 20/11/02 15:25:


> But this is where the personalization of the corpus is important,
> because *I* never get football related mail, so that makes it
> suspicious right there.


No, it doesn't. It puts it into the "unknown" category. I assume
SpamAssassin's implementation is using the same rules as spambayes,
which means unknown words get a probability of 0.5.

I think that proves my point :-)
Right. At least the first time.
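
A minimal sketch of why a 0.5 token is neutral, assuming the naive combining formula from Paul Graham's essay (spambayes and SpamAssassin may well combine token probabilities differently; the function name here is made up for illustration):

```python
from math import prod

def combine(probs):
    """Combine per-token spam probabilities, Graham-style.

    Assumes tokens are independent; this is an illustrative formula,
    not necessarily what SpamAssassin or spambayes actually does.
    """
    spam = prod(probs)                   # evidence for spam
    ham = prod(1.0 - p for p in probs)   # evidence for non-spam
    return spam / (spam + ham)

# A never-seen word ("football") gets 0.5 and leaves the result unchanged.
with_unknown = combine([0.97, 0.5])
without_unknown = combine([0.97])
print(with_unknown, without_unknown)
```

Under that assumption an unknown word neither raises nor lowers the verdict until it has been seen in training.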

> Disclaimers are so common I don't think they would be considered in
> the calculation, right?


Wrong. How do you delimit them? I see all sorts here at work. Some up to
150 lines, including at the top and at the bottom. There's no way
SpamAssassin could effectively ignore them.
They would be ignored because, as you trained the system, it would lose interest in the tokens contained in disclaimers.


> Plus, their pitch would be so buried in all the fluff that you
> wouldn't be able to find it unless they made the linuxy text very
> small or white-on-white or clear, and those html tags would then
> become *very* statistically significant.


Use one tag. Then it's a single 1.0 score. And my example already made
it white on white.

Maybe the weakness is in converting a probability to a score. Quoting Paul Graham again:

<quote>But the real advantage of the Bayesian approach, of course, is that you know what you're measuring. Feature-recognizing filters like SpamAssassin assign a spam "score" to email. The Bayesian approach assigns an actual probability. The problem with a "score" is that no one knows what it means. The user doesn't know what it means, but worse still, neither does the developer of the filter. How many points should an email get for having the word "sex" in it? A probability can of course be mistaken, but there is little ambiguity about what it means, or how evidence should be combined to calculate it. Based on my corpus, "sex" indicates a .97 probability of the containing email being a spam, whereas "sexy" indicates .99 probability. And Bayes' Rule, equally unambiguous, says that an email containing both words would, in the (unlikely) absence of any other evidence, have a 99.97% chance of being a spam.</quote>

So that one tag creating white text should not have a 1.0 score but a near-100% probability, especially in the absence of any tags to change the color of the background (although all those tags would collectively increase the probability). My interpretation of the above paragraph is not that "sex" and "sexy" should be 2 points on the way to a 5-point threshold. They should be an indication, apart from all other evidence, that the message is almost certainly spam. If they had to be converted to a score, then they should get 4.9985 points out of a five-point threshold.
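
To make the arithmetic concrete, here is a rough sketch of how Graham's two numbers combine and how the 4.9985 figure falls out. The function names and the linear probability-to-score scaling are my own assumptions for illustration, not anything SpamAssassin is known to do:

```python
from math import prod

def combine(probs):
    """Graham-style combination of independent per-token spam probabilities."""
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)

def as_score(probability, threshold=5.0):
    """Hypothetical linear mapping of a probability onto a score threshold."""
    return probability * threshold

# "sex" at .97 and "sexy" at .99, with no other evidence
p = combine([0.97, 0.99])
print(f"combined probability: {p:.4f}")

# Scaling the rounded 99.97% figure to a five-point threshold
print(f"score: {as_score(round(p, 4)):.4f}")
```

The point of the sketch is only that two strong tokens land just under the threshold ceiling instead of contributing a point each.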

Anyway, I should stop talking. I don't use the Bayesian stuff in our installation of SpamAssassin, so I'm basically guessing how it implements Graham's ideas.

I've been doing Bayesian filtering for about 10 months now. I've thought
about this a lot. I really don't think it's invincible, but that doesn't
mean it's ineffective.

Matt.
True enough. Nothing is foolproof, but every little bit helps.

Sean

--
Sean Redmond
BMA Information Systems
