Matt Sergeant wrote:

> Also, I understand his explanation: only the most interesting tokens
> are considered in calculating the likelihood that it's spam, so
> watering down the body of the message should only make the
> interesting things more interesting.


But Graham's analysis is wrong here. If you want to defeat Bayesian
filters, the spam of the future will look like:

__BEGIN__


Hey there. Thought you should check out the following:
http://www.27meg.com/foo

(snip: linuxy, footbally, disclaimery content)

__END__

This covers a fairly broad section of people's training data (a linuxy
type mail, a football related mail, and a corporate legal disclaimer),
and so those things will be the "interesting" tokens from the ham corpus.

But this is where the personalization of the corpus is important, because *I* never get football-related mail, so that makes it suspicious right there.

Disclaimers are so common in both corpora that I don't think they would figure in the calculation at all, right? Words that occur with roughly equal frequency in your spam and non-spam mail are just ignored, and you look to other tokens instead (unusual X-Mailers, for instance).
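To make that concrete, here's a rough sketch of the selection step in Python. The token probabilities are invented just to illustrate, and this isn't anyone's real corpus or Graham's actual code, though the combining formula at the end is the one from his essay:

    from math import prod

    # P(spam | token) as estimated from training data -- values made up
    token_probs = {
        "disclaimer":   0.50,  # equally common in ham and spam: no signal
        "confidential": 0.51,
        "linux":        0.20,
        "football":     0.35,
        "click":        0.95,
        "x-mailer-foo": 0.99,  # an unusual X-Mailer token, say
    }

    def interesting(tokens, n=3):
        # Rank by distance from the neutral 0.5; keep only the top n
        return sorted(tokens, key=lambda t: abs(token_probs[t] - 0.5),
                      reverse=True)[:n]

    def spam_score(tokens):
        # Graham's combining formula, over the interesting tokens only
        ps = [token_probs[t] for t in interesting(tokens)]
        return prod(ps) / (prod(ps) + prod(1 - p for p in ps))

    print(interesting(token_probs))  # ['x-mailer-foo', 'click', 'linux']
    print(spam_score(token_probs))   # ~0.998

Note what happens: the disclaimer words sit at 0.5 and never make the cut, and even with the "linux" token pulling toward ham, the score stays around .998. The dilution doesn't buy the spammer much.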

Plus, their pitch would be so buried in all the fluff that no one would ever find it unless they made the linuxy text very small, or white-on-white, or otherwise invisible, and those HTML tags would then become *very* statistically significant.
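For instance, a tokenizer that keeps markup as tokens turns the hiding trick itself into evidence (a rough sketch; the regex and the message are just for illustration):

    import re

    def tokenize(body):
        # Keep tag names and attribute=value pairs as tokens, so markup
        # like white-on-white colouring shows up in the statistics
        return re.findall(r'[\w"#=-]+', body)

    msg = 'Hey <font color="#ffffff">check this out</font>'
    print(tokenize(msg))
    # ['Hey', 'font', 'color="#ffffff"', 'check', 'this', 'out', 'font']

Once color="#ffffff" shows up only in your spam corpus, it ends up with a probability near .99 and becomes exactly the kind of interesting token the filter keys on.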

--
Sean Redmond
BMA Information Systems
