At 04:40 PM 11/20/2002 +0000, Matt Sergeant wrote:
Sean Redmond said the following on 20/11/02 16:16:
Matt Sergeant wrote:
Plus, their pitch would be so buried in all the fluff that you wouldn't be able to find it unless they made the linuxy text very small or white-on-white or clear, and those HTML tags would then become *very* statistically significant.
Use one tag. Then it's a single 1.0 score. And my example already made it white on white.
Maybe the weakness is in converting a probability to a score.
Argh. Lingo breakage. I meant probability. The way Bayes works is you get all the probabilities and combine them. So you have something like:
1.0 => html_attr_style: bgcolor: white; foreground: white;
1.0 => nigeria
1.0 => million
# a few in the middle
0.0 => linux
0.0 => kernel
0.0 => trustix
0.0 => operating
0.0 => device
0.0 => everton
0.0 => unauthorised
etc.
The idea is to make the good tokens swamp out the bad ones. So when you combine these it's weighted towards ham, and not spam, because there were a few spammy tokens, but more hammy tokens, which swung the pendulum in favour of ham.
So while one token might individually have a 100% probability, that doesn't mean *squat*, because the Bayesian calculation considers all the tokens combined.
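To make the combining step concrete, here is a rough Python sketch. It assumes the Graham-style formula P = S / (S + H), where S is the product of the token probabilities and H is the product of their complements; the thread does not say which formula any particular filter uses. The clamp bounds 0.01 and 0.99 and the name combine() are my own illustrative choices, not anything from SpamAssassin.

```python
import math

def combine(probs, lo=0.01, hi=0.99):
    """Combine per-token spam probabilities into one message probability.

    Assumed Graham-style rule: P = S / (S + H), with
    S = product(p) and H = product(1 - p).
    """
    # Clamp first (assumed bounds): a literal 0.0 or 1.0 would zero out
    # both products and make the ratio undefined.
    clamped = [min(hi, max(lo, p)) for p in probs]
    # Sum logs rather than multiplying, to avoid underflow on long messages.
    log_spam = sum(math.log(p) for p in clamped)
    log_ham = sum(math.log(1.0 - p) for p in clamped)
    diff = log_ham - log_spam
    # P = 1 / (1 + exp(diff)), evaluated so that exp() never overflows.
    if diff >= 0:
        e = math.exp(-diff)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(diff))

# Token probabilities from the list above (illustrative values only).
tokens = {
    "html_attr_style": 1.0, "nigeria": 1.0, "million": 1.0,
    "linux": 0.0, "kernel": 0.0, "trustix": 0.0, "operating": 0.0,
    "device": 0.0, "everton": 0.0, "unauthorised": 0.0,
}
score = combine(tokens.values())
print(f"combined spam probability: {score:.2e}")
```

Under these assumptions, three tokens at 1.0 against seven at 0.0 give a combined value on the order of 1e-8, so the message lands firmly on the ham side even though three of its tokens look like certain spam on their own.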
This is made even more likely because, in order to reduce false positives, many Bayesian implementations require a combined probability of 0.9 or so before declaring a message spam. So the linux/kernel/everton/etc. words don't need to bring the probability down very far to get a message through.
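As a rough illustration of the threshold point, the sketch below counts how many strongly hammy tokens it takes to pull a message under a 0.9 cutoff. It assumes the Graham-style combining rule P = S / (S + H) and token probabilities clamped to 0.01 and 0.99; the function names, the clamp values, and the exact 0.9 threshold are hypothetical choices for the example, not taken from any specific filter.

```python
def combined_probability(probs):
    """Assumed Graham-style combination: P = S / (S + H)."""
    spam = 1.0
    ham = 1.0
    for p in probs:
        spam *= p          # S = product of token spam probabilities
        ham *= 1.0 - p     # H = product of their complements
    return spam / (spam + ham)

def ham_tokens_needed(n_spam, threshold=0.9, spammy=0.99, hammy=0.01):
    """Count the hammy tokens needed to drop the message below threshold.

    spammy/hammy are assumed clamped per-token probabilities.
    """
    probs = [spammy] * n_spam
    added = 0
    # Bounded loop: with equal-strength tokens the answer is about n_spam.
    while combined_probability(probs) >= threshold and added < 1000:
        probs.append(hammy)
        added += 1
    return added

for n in (1, 3, 5):
    print(n, "spammy tokens ->", ham_tokens_needed(n), "hammy tokens to pass")
```

With equal-strength tokens, the hammy words only have to match the spammy ones in number: once they do, the combined value sits near 0.5, well under the cutoff, without ever having to approach 0.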
Chris
_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk