Sean Redmond said the following on 20/11/02 16:16:
Matt Sergeant wrote:

Disclaimers are so common I don't think they would be considered in
the calculation, right?
Wrong. How do you delimit them? I see all sorts here at work. Some up to
150 lines, including at the top and at the bottom. There's no way
SpamAssassin could effectively ignore them.
They would be ignored because as you trained the system, it would lose interest in the tokens contained in disclaimers.
None of the classifiers I've seen discard tokens seen more than X times, which is what you're suggesting. Seeing the token more often simply makes it more interesting, not less.

Of course it may be an interesting experiment to do so, but I don't think anyone has done that yet.

Plus, their pitch would be so buried in all the fluff that you
wouldn't be able to find it unless they made the linuxy text very
small or white-on-white or clear, and those html tags would then
become *very* statistically significant.
Use one tag. Then it's a single 1.0 score. And my example already made
it white on white.
Maybe the weakness is in converting a probability to a score.
Argh. Lingo breakage. I meant probability. The way Bayes works is you get the probability for each token and combine them all. So you have something like:

1.0 => html_attr_style: bgcolor: white; foreground: white;
1.0 => nigeria
1.0 => million
# a few in the middle
0.0 => linux
0.0 => kernel
0.0 => trustix
0.0 => operating
0.0 => device
0.0 => everton
0.0 => unauthorised

etc.

The idea is to make the good tokens swamp out the bad ones. So when you combine these it's weighted towards ham, and not spam, because there were a few spammy tokens, but more hammy tokens, which swung the pendulum in favour of ham.

So while a single token might individually have a 100% spam probability, that doesn't mean *squat*, because Bayesian probability considers all the tokens combined.
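
To make the swamping effect concrete, here is a minimal sketch of one common way to combine per-token probabilities: the naive-Bayes-style product formula, P = S / (S + H), where S is the product of the token probabilities and H is the product of their complements. This is an assumption for illustration, not necessarily what SpamAssassin or any particular classifier does. The function name `combine` and the clamping bounds 0.01/0.99 are hypothetical; the clamping is there because raw 0.0 and 1.0 values would otherwise make the products degenerate.

```python
import math

def combine(probs, lo=0.01, hi=0.99):
    """Combine per-token spam probabilities into one message probability.

    Uses P = S / (S + H), with S = prod(p) and H = prod(1 - p),
    computed in log space. The lo/hi clamp is an assumed safeguard
    against exact 0.0 and 1.0 inputs.
    """
    clamped = [min(hi, max(lo, p)) for p in probs]
    log_spam = sum(math.log(p) for p in clamped)
    log_ham = sum(math.log(1.0 - p) for p in clamped)
    diff = log_ham - log_spam
    if diff > 700:  # exp() would overflow; the result is effectively 0
        return 0.0
    # S / (S + H) rewritten as 1 / (1 + H / S)
    return 1.0 / (1.0 + math.exp(diff))

# The token probabilities from the example above (middle ones omitted).
tokens = {
    "html_attr_style: bgcolor: white; foreground: white;": 1.0,
    "nigeria": 1.0,
    "million": 1.0,
    "linux": 0.0,
    "kernel": 0.0,
    "trustix": 0.0,
    "operating": 0.0,
    "device": 0.0,
    "everton": 0.0,
    "unauthorised": 0.0,
}

# Three spammy tokens against seven hammy ones: the result is far below 0.5.
print(combine(tokens.values()))
```

Under these assumptions, each hammy token cancels out one spammy token, so the four surplus hammy tokens decide the outcome and the message lands firmly on the ham side.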

Matt.



_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
