At 01:24 PM 11/20/2002 -0600, Bob Apthorpe wrote:
Hi,
On Wed, 20 Nov 2002, Christopher Eykamp wrote:
> At 04:40 PM 11/20/2002 +0000, Matt Sergeant wrote:
> >Argh. Lingo breakage. I meant probability. The way bayes works is you get
> >all the probabilities and combine them. So you have something like:
> >
> >1.0 => html_attr_style: bgcolor: white; foreground: white;
> >1.0 => nigeria
> >1.0 => million
> ># a few in the middle
> >0.0 => linux
[snip]
> >etc.
> >
> >The idea is to make the good tokens swamp out the bad ones. So when you
> >combine these, it's weighted towards ham, and not spam, because there were
> >a few spammy tokens, but more hammy tokens, which swung the pendulum in
> >favour of ham.
> >
> >So while one token might individually have a 100% probability, that doesn't
> >mean *squat*, because Bayesian probability considers all the tokens combined.
>
> This is made even more likely because, in order to reduce false positives,
> many Bayesian implementations require a combined probability of 0.9 or so to
> declare a message spam. So the linux/kernel/everton/etc. words don't
> need to bring the probability down very far to get a message through.
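For illustration, the combining step being described can be sketched as a Graham-style naive Bayes combination. The 0.01/0.99 clamp and the token probabilities are my own assumptions, not SpamAssassin's actual implementation:

```python
def combine(probs):
    """Combine per-token spam probabilities into one message probability."""
    spamminess = 1.0
    hamminess = 1.0
    for p in probs:
        # Clamp so a single 0.0 or 1.0 token can't zero out the product
        # and dominate the result on its own.
        p = min(max(p, 0.01), 0.99)
        spamminess *= p
        hamminess *= 1.0 - p
    return spamminess / (spamminess + hamminess)

# Three very spammy tokens, swamped by a larger number of hammy ones:
print(combine([1.0, 1.0, 1.0] + [0.1] * 7))  # well below 0.5 -> ham
```

This is why one 100% token doesn't mean squat on its own: the hammy tokens on the other side of the ratio pull the combined score back down.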
How much of a message does a human need to read before they classify it as
spam? And where in the message? Top? Middle? Bottom?
I'm guessing that the top 5-20 lines of the body will give a human enough
information to classify the message, so limit the Bayesian analysis of the
body text to the top 20 lines or the first 200 words.
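A minimal sketch of that truncation step, using the 20-line and 200-word limits suggested above (the function name and word-splitting details are my own):

```python
def truncate_body(body, max_lines=20, max_words=200):
    """Keep only the top of the message body for the first-pass analysis."""
    top = body.splitlines()[:max_lines]
    words = " ".join(top).split()
    return " ".join(words[:max_words])
```

Whichever limit is hit first wins: a message of many short lines is cut at 20 lines, and a message of a few very long lines is cut at 200 words.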
If you're trying to promote something, you need to get to the point of the
pitch very quickly. If we analyze a prominent subsection of the message,
we initially avoid analyzing any intentional noise added to 'ham up' the
message, assuming spammers put the false ham at the end of the message.
Should the character of spam change as a result, we've pushed the
sales pitch deeper into the body of the message, beneath intentional noise
where the end user is less likely to recognize it or respond. This will
tend to make communication less effective, ultimately reducing whatever
benefit the sender gets from sending the spam.
This still doesn't change the results of Bayesian analysis of the header,
so with three probabilities (truncated message probability, full message
probability, and header probability), we should have at least two good
metrics for judging whether a message is spam.
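As a sketch, those three probabilities could be combined by simple agreement. The 0.9 cutoff echoes the conservative threshold mentioned earlier, and the two-of-three rule is just one plausible policy, not anything SpamAssassin does today:

```python
THRESHOLD = 0.9  # conservative cutoff, as discussed above

def is_spam(header_p, truncated_p, full_p, threshold=THRESHOLD):
    """Call a message spam only if at least two of the three scores agree."""
    votes = sum(p >= threshold for p in (header_p, truncated_p, full_p))
    return votes >= 2

print(is_spam(0.95, 0.97, 0.40))  # header and truncated body agree -> True
```

Padding the end of the body to drag down the full-message score doesn't help the spammer here, because the header and truncated-message scores still outvote it.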
I'm curious how well this would work in practice.
One added bonus: one should be able to reuse the analysis of the truncated
message as part of the full-message analysis. It should require only
additional storage, not a substantial amount of extra processing power,
so the cost of the additional analysis to the recipient should be small.
Economically, this is a good strategy; increasing the burden on spammers
without substantially increasing our own is a Good Thing...
I was thinking the same thing. Would a separate Bayesian analysis of different parts of the message be more effective? Evaluate the header, the first 5 lines, the first 10 lines, the first 25 lines, the whole body, the URLs, the HTML, capitalized words, word pairs/triplets, and the last 5 or 10 lines separately, then combine these separate analyses to generate an overall score. This might make it more difficult to bypass than a more naive analysis that looks at the entire message as a whole.
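A rough sketch of that idea, scoring each section separately and blending the results. The section names, the weights, and the weighted-average combiner are all my own illustrative choices; a real implementation would tune them and would likely use a proper Bayesian combination instead of an average:

```python
# Hypothetical per-section weights; only sections actually present in a
# given message contribute to the score.
WEIGHTS = {
    "header": 0.25,
    "first_5_lines": 0.20,
    "first_25_lines": 0.15,
    "urls": 0.20,
    "whole_body": 0.20,
}

def overall_score(section_probs, weights=WEIGHTS):
    """Weighted average of per-section Bayesian spam probabilities."""
    total = sum(weights[s] for s in section_probs)
    return sum(weights[s] * p for s, p in section_probs.items()) / total
```

The point is that a spammer who games the whole-body score by padding the end of the message still has to get past the header, URL, and top-of-message scores.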
Chris
_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk