On 09/27, Marc Perkel wrote:
> Here's the kind of thing I'm seeing. Spam talks about money - low
> score. Spam talks about Jesus - low score. Spam talks about money
> and Jesus and throws in a "dear someone" and it's spam. I'm hoping to
> detect combinations automatically.
You're not really talking about something Bayes does, but I've thought a little about doing something like it. People contributing mass-check data have access to everybody else's data (just the rule hit counts, not actual email contents), so I can do statistical analysis to find patterns like this.

The problem, which we come across over and over again, is not enough data. We barely get enough mass-check data to provide useful scores with the existing method, where you're basically only analyzing the frequency of individual rules. When you start analyzing frequencies of patterns, you need a lot more data.

So yeah, you could write a score generator that, instead of coming up with:

  test A = 0.3
  test B = 0.1
  test C = 4

comes up with optimal scores for all possible combinations:

  test A = 0
  test B = 0.1
  test C = 0.2
  test A+B = 6
  test A+C = 5.3
  test B+C = 2
  test A+B+C = -0.3

(Wouldn't that be fun?) But score generation requires a significant number of email samples for each test, and "A+B" ends up becoming an additional test, with far fewer samples. It causes exponential growth in the input data required. I might even have tried it and have code lying around somewhere. If only I had the data of a large email provider, accurately sorted into spam and non-spam.

Hell, once you're doing analysis of all the possible combinations of test hits, you hardly even have a use for scores; you can just reduce your results to "this combination is spam" and "this combination is not spam". Sexy.

Ooh, I can make the problem clearer. Currently, score generation won't trigger unless the mass-check corpus contains 150,000 hams (non-spams) and 150,000 spams. So say we need 300,000 hand-sorted emails to calculate scores. The 50_scores.cf file contains 913 rules, so, for a rough estimate, that works out to 300,000 / 913 = 328.6 emails per rule. Now how many combinations of rules can you come up with if you start with 913 rules?
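(As an aside: the arithmetic is easy to sanity-check with a few lines of Python. This is just a sketch, not part of any SpamAssassin tooling; it uses math.comb, which has been in the Python standard library since 3.8, and it reproduces the counts worked out below.)

```python
# Sanity check of the combination counts and email estimate.
from math import comb

rules = 913              # rules in 50_scores.cf, per the mail
emails_per_rule = 328.6  # 300,000 / 913, rounded as above

# "Combinations without repetition": C(n, k) = n! / (k! * (n - k)!)
for k in range(1, 5):
    print(f"{k} rules: {comb(rules, k):,}")
# 1 rules: 913
# 2 rules: 416,328
# 3 rules: 126,424,936
# 4 rules: 28,761,672,940

combos = sum(comb(rules, k) for k in range(1, 5))
emails = round(combos * emails_per_rule)
print(f"total: {combos:,} combinations -> {emails:,} emails")
```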
I don't remember offhand how to calculate it, but I can tell you it's freaking huge. Then multiply it by 328.6 to get the number of emails we'd need to calculate accurate scores for each combination. Of course it would probably be useful to track scores only for combinations of up to, say, 10 rules, which would significantly reduce the problem, but it would still be nasty.

Hmm, it doesn't look as bad as I thought. With 913 rules, the number of combinations is:

  4 rules: 28,761,672,940
  3 rules: 126,424,936
  2 rules: 416,328
  1 rule: 913 (yay, this step at least isn't horribly wrong)

So 28,761,672,940 + 126,424,936 + 416,328 + 913 = 28,888,515,117 possible combinations of 1 to 4 rules. Multiply by 328.6, and we need 9,492,766,067,446 emails, hand-sorted into ham and spam, to come up with accurate scores for those combinations (of just 1-4 tests). It looks like we're not even getting enough for score generation to work as is, and that's still 31 MILLION TIMES the minimum number of emails required by the current system. And that still doesn't address the problem of handling emails that hit more than 4 rules, although, in comparison, I think that one would be easy.

Somebody please show me where I'm wrong on the number of emails required, and how we can actually make this happen. Because that would be fun.

http://www.mathsisfun.com/combinatorics/combinations-permutations.html - Combinations without Repetition
http://ardoino.altervista.org/blog/index.php?id=48 - how to do factorials in bc.

-- 
"But do you have any idea how many SuperBalls you could buy if you
actually applied yourself in the world? Probably eleven, but you should
still try." - http://hyperboleandahalf.blogspot.com/
http://www.ChaosReigns.com