On Feb 10, 2007, at 12:14, Miles Fidelman wrote:
Dan wrote:
I've developed a new approach to scoring that I want to 1) share
with everyone and 2) make into a working system that's as accurate
as what I've already built, but easier to use. First, the theory:
NEW ASSUMPTION
All messages are spam unless x,y,z score says they're ham.
NEW APPROACH
Block everything, then create rules to not catch what you do
want. That is, build tests that target the spam (keeping all the tests
you've already built), then score the thousands of ways ham
triggers on those tests.
It strikes me that the main risk with this approach is filtering
out too much ham. At least for me, it's more important to make
sure that people reach me, than to filter out all spam. If we take
the approach that everything is to be filtered out, except x,y,z -
then the risk of filtering out too much seems pretty high.
Actually, [unparalleled] accuracy is built into this approach.
Currently, a ham gets caught and you either take out the rule that
caught it or make a whitelist entry.
Lots of ongoing work = little cumulative return
With Find the Ham, whitelisting is almost obsolete. When you find an
FP, you make an exception for its specific profile (the exact
combination of tests/rules that caught the message), so that this
specific combination doesn't catch any more ham. Each rule stays at
full strength for every other combination, and no whitelist is needed.
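To make the profile idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not the actual Find the Ham implementation: the rule names and the functions profile, classify, and report_false_positive are all hypothetical, and a real system would work from scores rather than a bare set lookup.

```python
# Hypothetical sketch: rule names and helpers are illustrative only,
# not part of any real spam filter's API.

ham_profiles = set()  # exact combinations of rule hits known to be ham


def profile(hits):
    """Order-independent key for the set of rules a message triggered."""
    return frozenset(hits)


def classify(hits):
    """Treat every message as spam unless its exact profile is excepted."""
    return "ham" if profile(hits) in ham_profiles else "spam"


def report_false_positive(hits):
    """Except only this combination; each rule stays active elsewhere."""
    ham_profiles.add(profile(hits))


# A legitimate message that was caught by three (made-up) rules.
fp = ["HTML_ONLY", "ALL_CAPS_SUBJECT", "REMOTE_IMAGES"]

print(classify(fp))                                # caught before training
report_false_positive(fp)
print(classify(fp))                                # passes after the exception
print(classify(["HTML_ONLY", "ALL_CAPS_SUBJECT"]))  # other combinations still caught
```

The exception is keyed on the whole combination of rule hits, so no individual rule is weakened and no sender is whitelisted.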
This training process is the best part of the whole approach. It
begins with a high FP rate, but significant improvement takes only a
few weeks. After a few months (depending on the diversity of your
ham), FPs are very rare.
Little ongoing work = huge cumulative return
Dan