Sean Redmond said the following on 20/11/02 15:25:
Matt Sergeant wrote:


As I understand his explanation, only the most interesting tokens
are considered in calculating the likelihood that it's spam, so
watering down the body of the message should only make the
interesting things more interesting.
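
For anyone following along, here is a rough sketch of the "most interesting tokens" idea described above. It assumes a Graham-style combination, where the tokens farthest from a neutral 0.5 are kept and combined naive-Bayes fashion. The function name, the cutoff of 15, and all token probabilities are made up for illustration; this is not SpamAssassin's or spambayes's actual code.

```python
from math import prod


def spam_probability(token_probs, max_interesting=15):
    """Combine per-token spam probabilities using only the most extreme ones.

    token_probs: hypothetical P(spam | token) values, each strictly in (0, 1).
    max_interesting: how many tokens to keep (an assumed cutoff).
    """
    # "Interesting" means far from the neutral value 0.5, in either direction.
    interesting = sorted(token_probs, key=lambda p: abs(p - 0.5), reverse=True)
    interesting = interesting[:max_interesting]
    spam = prod(interesting)
    ham = prod(1.0 - p for p in interesting)
    return spam / (spam + ham)


# Hypothetical message: two strongly spammy tokens (the pitch and the URL).
pitch = [0.99, 0.95]

# Padding with mildly hammy filler only dilutes the score a little.
mild_filler = [0.4] * 50
diluted = spam_probability(pitch + mild_filler)

# Padding with tokens that are strongly hammy for this recipient crowds
# the spammy tokens out of the "interesting" set.
strong_filler = [0.01] * 20
crowded_out = spam_probability(pitch + strong_filler)

print(diluted, crowded_out)
```

With these made-up numbers, bland filler leaves the message looking like spam, while filler that matches the recipient's own ham pulls the score toward zero. Which of those two cases real spam padding lands in is the question being argued in this thread.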

But Graham's analysis is wrong here. If you want to defeat Bayesian
filters, the spam of the future will look like:

__BEGIN__


Hey there. Thought you should check out the following:
http://www.27meg.com/foo

(snip: linuxy, footbally, disclaimery content)


__END__

This covers a fairly broad section of people's training data (a linuxy
type mail, a football related mail, and a corporate legal disclaimer),
and so those things will be the "interesting" tokens from the ham corpus.


But this is where the personalization of the corpus is important, because *I* never get football-related mail, so that makes it suspicious right there.
No, it doesn't. It puts it into the "unknown" category. I assume SpamAssassin's implementation is using the same rules as spambayes, which means unknown words get a probability of 0.5.
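
A small sketch of why a 0.5 token is neutral rather than suspicious. It assumes a plain naive-Bayes product combination, which is a simplification and not the actual combining code of SpamAssassin or spambayes; the function name and probabilities are hypothetical.

```python
def combine(probs):
    """Naive-Bayes style combination of per-token spam probabilities."""
    spam = 1.0
    ham = 1.0
    for p in probs:
        spam *= p        # evidence for spam
        ham *= 1.0 - p   # evidence for ham
    return spam / (spam + ham)


# Hypothetical probabilities for tokens the filter has seen before.
known = [0.9, 0.2, 0.7]

# Ten never-seen tokens, each assigned the neutral 0.5.
with_unknowns = known + [0.5] * 10

# Each 0.5 multiplies both products by the same factor, so it cancels out.
print(combine(known), combine(with_unknowns))
```

Under this assumption the unknown words neither raise nor lower the score, so football text in a mailbox that never sees football contributes nothing either way.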

I think that proves my point :-)

Disclaimers are so common I don't think they would be considered in the calculation, right?
Wrong. How do you delimit them? I see all sorts here at work. Some up to 150 lines, including at the top and at the bottom. There's no way SpamAssassin could effectively ignore them.

Plus, their pitch would be so buried in all the fluff that you wouldn't be able to find it unless they made the linuxy text very small or white-on-white or clear, and those HTML tags would then become *very* statistically significant.
Use one tag. Then it's a single 1.0 score. And my example already made it white on white.

I've been doing Bayesian filtering for about 10 months now. I've thought about this a lot. I really don't think it's invincible, but that doesn't mean it's ineffective.

Matt.



_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
