On 09/27, Marc Perkel wrote:
> Here's the kind of thing I'm seeing. Spam talks about money - low
> score. Spam talks about Jesus - low score. Spam talks about money
> and Jesus and throws in a "dear someone" and it's spam. I'm hoping to
> detect combinations automatically.
You're not really talking about something Bayes does, but I've thought a little about doing something like it. People contributing mass-check data have access to everybody else's data (just the rule hit counts, not actual email contents), so I can do statistical analysis to find patterns like this.

The problem, which we come across over and over again, is not enough data. We barely get enough mass-check data to provide useful scores with the existing method, where you're basically only analyzing the frequency of individual rules. When you start analyzing frequencies of patterns, you need a lot more data.

So yeah, you could write a score generator that, instead of coming up with:

  test A = 0.3
  test B = 0.1
  test C = 4

comes up with optimal scores for all possible combinations:

  test A = 0
  test B = 0.1
  test C = 0.2
  test A+B = 6
  test A+C = 5.3
  test B+C = 2
  test A+B+C = -0.3

(Wouldn't that be fun?) But score generation requires a significant number of email samples for each test, and "A+B" ends up becoming an additional test, with far fewer samples. It causes exponential growth in the input data required. I might even have tried it and have code lying around somewhere. If only I had the data of a large email provider, accurately sorted into spam and non-spam.

Hell, once you're doing analysis of all the possible combinations of test hits, you hardly even have a use for scores; you can just reduce your results to "this combination is spam" and "this combination is not spam". Sexy.

Ooh, I can make the problem clearer. Currently, score generation won't trigger unless the mass-check corpus contains 150,000 hams (non-spams) and 150,000 spams. So say we need 300,000 hand-sorted emails to calculate scores. The 50_scores.cf file contains 913 rules, so, for a rough estimate, that works out to 300,000 / 913 = 328.6 emails per rule. Now how many combinations of rules can you come up with if you start with 913 rules?
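(As an aside: the arithmetic is easy to sanity-check with a few lines of Python. This is just a sketch, not part of any SpamAssassin tooling; it uses math.comb, which has been in the Python standard library since 3.8, and it reproduces the counts worked out below.)

```python
# Sanity check of the combination counts and email estimate.
from math import comb

rules = 913              # rules in 50_scores.cf, per the mail
emails_per_rule = 328.6  # 300,000 / 913, rounded as above

# "Combinations without repetition": C(n, k) = n! / (k! * (n - k)!)
for k in range(1, 5):
    print(f"{k} rules: {comb(rules, k):,}")
# 1 rules: 913
# 2 rules: 416,328
# 3 rules: 126,424,936
# 4 rules: 28,761,672,940

combos = sum(comb(rules, k) for k in range(1, 5))
emails = round(combos * emails_per_rule)
print(f"total: {combos:,} combinations -> {emails:,} emails")
```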
I don't remember offhand how to calculate it, but I can tell you it's freaking huge. Then multiply it by 328.6 to get the number of emails we'd need to calculate accurate scores for each combination. Of course it would probably be useful to track scores only for combinations of up to, say, 10 rules, which would significantly reduce the problem, but it would still be nasty.

Hmm, it doesn't look as bad as I thought. With 913 rules, the number of combinations is:

  4 rules: 28,761,672,940
  3 rules: 126,424,936
  2 rules: 416,328
  1 rule: 913 (yay, this step at least isn't horribly wrong)

So 28,761,672,940 + 126,424,936 + 416,328 + 913 = 28,888,515,117 possible combinations of 1 to 4 rules. Multiply by 328.6, and we need 9,492,766,067,446 emails, hand-sorted into ham and spam, to come up with accurate scores for those combinations (of just 1-4 tests). It looks like we're not even getting enough for score generation to work as is, and that's still 31 MILLION TIMES the minimum number of emails required by the current system. And that still doesn't address the problem of handling emails that hit more than 4 rules, although, in comparison, I think that one would be easy.

Somebody please show me where I'm wrong on the number of emails required, and how we can actually make this happen. Because that would be fun.

http://www.mathsisfun.com/combinatorics/combinations-permutations.html - Combinations without Repetition
http://ardoino.altervista.org/blog/index.php?id=48 - how to do factorials in bc.

-- 
"But do you have any idea how many SuperBalls you could buy if you
actually applied yourself in the world? Probably eleven, but you should
still try." - http://hyperboleandahalf.blogspot.com/
http://www.ChaosReigns.com