On Thu, 10 Dec 2015 13:54:05 -0800
Marc Perkel wrote:

> Bayes breaks the message down into some sort of tokens and then does
> statistics on those tokens as to tokens found in spam vs. tokens
> found in ham.
>
> But what about combinations of tokens? I'm thinking that I'd like to
> have something that says when it sees tokens X and Y and Z then
> that's spam even though X,Y,Z might be in ham when not combined.
>
> Does bayes do that or is there anything that does?

In general making arbitrary combinations is not practical. What some
filters do is make tokens out of word combinations in a sliding window.
This can be very useful in catching difficult spams that are composed
of common neutral words, although in my experience it's a little more
prone to FPs than Bayes. I use Bogofilter and DSPAM.

On Thu, 10 Dec 2015 21:28:44 -0800
Marc Perkel wrote:

> I'm thinking about incorporating Bogofilter but instead of feeding it
> messages I'm thinking about feeding it the SpamAssassin results - the
> rule names it hit + other data about the message and then let it
> score the rules. That's what I want to experiment with.

I thought of trying something like that myself, but my filtering became
practically perfect before I got around to it, so I never bothered. And
I think there are some problems with it.

The first is that FNs in SpamAssassin tend to come from a lack of
useful information rather than the scoring system failing to combine it
well.

The second is that most rules are either fairly neutral or strongly
spammy. There are few strong ham indicators to balance the rest. You
might be able to balance it with metadata, and reputation information,
but the trick is to do it without getting a high FP rate on new senders.

If you did wish to take account of rule combinations, you'd really have
to do it yourself because sliding-window tokenization wouldn't do it
well.
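
To make the two ideas above concrete, here is a rough Python sketch. It is only an illustration, not how Bogofilter, DSPAM, or SpamAssassin tokenize internally. The function names, the `*` and `+` separators, and the window size of 3 are all made up for the example. The first function builds sliding-window pair tokens from the words of a message. The second builds explicit pair tokens from the set of rule names a message hit, which is one way of doing the combining yourself: the order of rule names in a report has no meaning, so pairing by adjacency in a window would presumably miss most combinations.

```python
from itertools import combinations


def window_tokens(words, window=3):
    """Return the single words plus ordered pairs of words that occur
    within `window` positions of each other (hypothetical token format)."""
    tokens = list(words)
    for i, first in enumerate(words):
        # Pair `first` with each of the next (window - 1) words.
        for second in words[i + 1 : i + window]:
            tokens.append(f"{first}*{second}")
    return tokens


def rule_pair_tokens(rules):
    """Return each distinct rule name plus every unordered pair of rule
    names, independent of the order in which the rules were reported."""
    names = sorted(set(rules))
    return names + [f"{a}+{b}" for a, b in combinations(names, 2)]


if __name__ == "__main__":
    print(window_tokens(["buy", "cheap", "pills", "now"]))
    # Rule names here are just sample input.
    print(rule_pair_tokens(["URIBL_BLACK", "HTML_MESSAGE", "BAYES_50"]))
```

The resulting tokens could then be fed to a Bayesian filter as if they were words. Note that the number of pair tokens grows quadratically with the number of rules hit, and going to triples would grow it further, which is the practicality problem mentioned at the top.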