As a counterargument, what about HTML messages being abused to bypass Bayes when only the top N lines are examined? (Note: I think this is on the right track in principle, but I can see some resulting holes.)

A spammer could now bypass Bayes by opening the message with an HTML comment containing 200 words or 20 lines of ham, closing the comment, and then beginning the spam message.
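To make the hole concrete, here is a minimal sketch in Python. It assumes a hypothetical `top_lines` helper that stands in for "only look at the top N lines"; it is not SpamAssassin's actual tokenizer, and the message text is made up for illustration.

```python
def top_lines(body: str, n: int = 20) -> list[str]:
    """Hypothetical window: the first n non-empty lines of the raw body."""
    lines = [ln for ln in body.splitlines() if ln.strip()]
    return lines[:n]


# Twenty lines of innocuous text hidden inside an HTML comment,
# followed by the real pitch (illustrative content only).
ham_padding = "\n".join(f"meeting notes line {i}" for i in range(20))
message = (
    "<html><body><!--\n"
    + ham_padding
    + "\n-->\nBUY CHEAP PILLS NOW\n</body></html>"
)

window = top_lines(message, 20)

# The pitch sits just past the window, so a classifier limited to
# the window never sees it.
spam_seen = any("PILLS" in ln for ln in window)
print(len(window), spam_seen)
```

If comments are not removed before the window is taken, the classifier is trained and scored on the padding alone.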

If you strip HTML before Bayes, the same thing could be done by opening a white-on-white, tiny-font tag ahead of the bogus ham, adding lots of newlines, and then switching back to a readable color to begin the marketing. Once the MUA renders it as HTML, the final message will appear to have only a couple of blank lines at the top (because the font is small and HTML ignores the newlines), but the spam will miss Bayes entirely.
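A rough sketch of why stripping tags does not help, again in Python. It uses the standard library's `html.parser.HTMLParser` as a stand-in for whatever HTML stripping is done before Bayes (an assumption, not SpamAssassin's real code), and the `TextExtractor` and `strip_html` names are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes only; tags and comments are dropped."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def strip_html(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


# Invisible ham (white, tiny font) ahead of the visible pitch.
hidden_ham = "\n".join(f"quarterly report item {i}" for i in range(20))
message = (
    "<html><body>"
    '<font color="#ffffff" size="1">\n' + hidden_ham + "\n</font>"
    '<font color="#000000" size="3">\nBUY CHEAP PILLS NOW\n</font>'
    "</body></html>"
)

text = strip_html(message)
window = [ln for ln in text.splitlines() if ln.strip()][:20]

# The stripped text still begins with the hidden ham, so the first
# 20 lines contain none of the pitch.
spam_in_window = any("PILLS" in ln for ln in window)
print(len(window), spam_in_window)
```

The stripper discards the font tags, which are the only signal that the leading text is invisible to the reader, so the window fills with ham just as in the comment case.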

Of course, there are some scoring penalties for using white text and tiny fonts in the regular rules, but these might not hurt you as badly as Bayes would.

At 01:24 PM 11/20/2002 -0600, Bob Apthorpe wrote:
How much of a message does a human need to read before they classify it as
spam? And where in the message? Top? Middle? Bottom?

I'm guessing that the top 5-20 lines of the body will give a human enough
information to classify the message so limit the Bayesian analysis of the
body text to the top 20 lines or the first 200 words.

If you're trying to promote something, you need to get to the point of the
pitch very quickly. If we analyze a prominent subsection of the message,
we initially avoid analyzing any intentional noise added to 'ham up' the
message, assuming spammers put the false ham at the end of the message.

