On Feb 10, 2007, at 12:14, Miles Fidelman wrote:
Dan wrote:
I've developed a new approach to scoring that I want to 1) share with everyone and 2) make into a working system that's as accurate as what I've already built, but easier to use. First, the theory:

NEW ASSUMPTION
All messages are spam unless x,y,z score says they're ham.

NEW APPROACH
Block everything, then create rules that release what you do want. I.e., build tests that target the spam (keeping all the tests you've already built), then score the thousands of ways ham triggers on those tests.

It strikes me that the hardest part of this approach is the risk of filtering out too much ham. At least for me, it's more important to make sure that people reach me than to filter out all spam. If we take the approach that everything is to be filtered out, except x,y,z - then the risk of filtering out too much seems pretty high.

Actually, [unparalleled] accuracy is built into this approach. Currently, a ham gets caught and you either take out the rule that caught it or make a whitelist entry.

        Lots of ongoing work = little cumulative return

With Find the Ham, whitelisting is almost obsolete. When you find an FP, you make an exception for its specific profile (the permutation of tests/rules that caught the message), so that this specific assembly doesn't catch any more ham. The rule stays at full strength for every other permutation, and no whitelist is needed.
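
To make the profile idea concrete, here is a rough sketch in Python. It is not the actual system, only an illustration under some assumptions: the test names and their logic are made up, a "profile" is modeled as the unordered set of tests that fired on a message, and a message with a profile that has not been trained as ham (including the empty profile) is treated as spam.

```python
from typing import Callable, Dict, FrozenSet, Set

Test = Callable[[str], bool]

# Hypothetical tests, for illustration only; a real setup would reuse
# the spam-targeting tests that have already been built.
TESTS: Dict[str, Test] = {
    "HAS_URL": lambda msg: "http://" in msg or "https://" in msg,
    "ALL_CAPS_FIRST_LINE": lambda msg: msg.splitlines()[0].isupper() if msg else False,
    "MENTIONS_MONEY": lambda msg: "$" in msg,
}

# Profiles that training has marked as ham (assumption: exact-match exceptions).
ham_profiles: Set[FrozenSet[str]] = set()


def profile(msg: str) -> FrozenSet[str]:
    """Return the set of test names that fire on this message."""
    return frozenset(name for name, test in TESTS.items() if test(msg))


def classify(msg: str) -> str:
    """Everything is spam unless its exact profile was trained as ham."""
    return "ham" if profile(msg) in ham_profiles else "spam"


def report_false_positive(msg: str) -> None:
    """Training step: exempt only this message's exact profile."""
    ham_profiles.add(profile(msg))


if __name__ == "__main__":
    invoice = "Invoice attached\nPlease pay $40 at http://example.com/pay"
    shouting = "FREE MONEY\nGet $$$ at http://example.com"

    print(classify(invoice))        # blocked at first
    report_false_positive(invoice)  # exempt the HAS_URL + MENTIONS_MONEY profile
    print(classify(invoice))        # now released
    print(classify(shouting))       # different profile, still blocked
```

Note that the exception is keyed on the whole profile, so HAS_URL and MENTIONS_MONEY each keep their full effect in any other combination, which is the point of not whitelisting the sender or removing a rule.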

This training process is the best part of the whole approach. It begins with a huge number of FPs, but significant improvements take only a few weeks. After a few months (depending on the diversity of your ham), FPs are very rare.

        Little ongoing work = huge cumulative return


Dan
