Re: Trying to understand how bayes works.

RW Fri, 11 Dec 2015 06:58:49 -0800

On Thu, 10 Dec 2015 13:54:05 -0800
Marc Perkel wrote:

> Bayes breaks the message down into some sort of tokens and then does 
> statistics on those tokens as to tokens found in spam vs. tokens
> found in ham.
> 
> But what about combinations of tokens? I'm thinking that I'd like to 
> have something that says when it sees tokens X and Y and Z then
> that's spam even though X,Y,Z might be in ham when not combined.
> 
> Does bayes do that or is there anything that does?

In general, making arbitrary combinations is not practical. What some
filters do is make tokens out of word combinations within a sliding
window. This can be very useful for catching difficult spams that are
composed of common, neutral words, although in my experience it's a
little more prone to FPs than plain single-token Bayes.
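
A minimal sketch of the sliding-window idea, for illustration only. The
function name, the window size of 3, and the "first+second@gap" token
format are assumptions made up for this example; they are not how
Bogofilter or DSPAM actually tokenize.

```python
import re

def window_tokens(text, window=3):
    """Return single-word tokens plus word-pair tokens from a sliding window.

    Hypothetical helper: the token format and default window size are
    illustrative assumptions, not taken from any real filter.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    tokens = list(words)  # keep the ordinary single-word tokens
    for i, first in enumerate(words):
        # Pair each word with the next (window - 1) words that follow it.
        for gap in range(1, window):
            if i + gap < len(words):
                tokens.append(f"{first}+{words[i + gap]}@{gap}")
    return tokens

print(window_tokens("claim your free prize"))
```

Each pair token is then counted in spam and ham exactly like a single
word, so a pair such as "free+prize@1" can score as spammy even when
"free" and "prize" are individually neutral. The cost is a much larger
token database, since every word contributes several pair tokens.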

I use Bogofilter and DSPAM.

On Thu, 10 Dec 2015 21:28:44 -0800
Marc Perkel wrote:

> I'm thinking about incorporating Bogofilter but instead of feeding it 
> messages I'm thinking about feeding it the SpamAssassin results - the 
> rule names it hit + other data about the message and then let it
> score the rules. That's what I want to experiment with.

I thought of trying something like that myself, but my filtering became
practically perfect before I got around to it, so I never bothered. And
I think there are some problems with it.

The first is that FNs in SpamAssassin tend to come from a lack of
useful information rather than the scoring system failing to combine it
well.

The second is that most rules are either fairly neutral or strongly
spammy. There are few strong ham indicators to balance the rest. You
might be able to balance it with metadata and reputation information,
but the trick is to do it without getting a high FP rate on new senders.

If you did wish to take account of rule combinations, you'd really have
to do it yourself because sliding-window tokenization wouldn't do it
well.
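
One way to do it yourself, as a rough sketch: generate an explicit
token for every combination of rule names a message hit, and feed those
tokens to the classifier instead of relying on adjacency. This assumes
the rule hits are best treated as an unordered set, which would be one
reason a sliding window handles them poorly. The function name, the "&"
separator, and the max_size limit are hypothetical choices for this
example; the rule names are only sample input.

```python
from itertools import combinations

def rule_tokens(rules, max_size=2):
    """Return one token per rule and per combination of up to max_size rules.

    Hypothetical helper for illustration. Names are de-duplicated and
    sorted so the same set of hits always yields the same tokens,
    whatever order the rules were reported in.
    """
    names = sorted(set(rules))
    tokens = []
    for size in range(1, max_size + 1):
        for combo in combinations(names, size):
            tokens.append("&".join(combo))
    return tokens

hits = ["HTML_MESSAGE", "BAYES_50", "URIBL_BLACK"]  # sample rule hits
print(rule_tokens(hits))
```

The number of tokens grows quickly with the number of rules hit and with
max_size, so in practice the combination size would have to stay small.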
