I attempted to calculate more useful scores for all of the SA tests based
on my own corpora, since individually tuned spam filters work better;
that's why per-user bayesian stuff exists.

I managed to reduce false negatives (spams getting past SA) by 84.6%
without causing any additional false positives (SA discarding legit email).

Unfortunately, those were the numbers for the corpora I was training on.

When I split the corpora in half randomly, recalculated scores based on one
half, and tested the results on the other half... I didn't even bother
looking into the false negatives.  The median false positive rate was
probably around 0.6%, which is 15 times the goal I've seen mentioned for SA
of 1 in 2500, or 0.04% (although I think that's the goal for success rate
on the training corpora).
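
For what it's worth, the split itself is trivial.  A minimal sketch in
Perl, assuming one message per file in a single corpus/ directory (the
real code presumably splits the spam and ham corpora separately):

    #!/usr/bin/perl
    # Sketch: shuffle a corpus and split it into two halves, one for
    # recalculating scores and one for checking the results.
    use strict;
    use warnings;
    use List::Util qw(shuffle);

    my @corpus = shuffle(glob("corpus/*"));   # one message per file
    my $half   = int(@corpus / 2);

    my @train = @corpus[0 .. $half - 1];      # recalculate scores on these
    my @test  = @corpus[$half .. $#corpus];   # measure FP/FN rates on these

    print "train: ", scalar(@train), " messages\n";
    print "test:  ", scalar(@test),  " messages\n";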

So I proved to myself what I already knew:  Retraining on very small
corpora gives bad results, because of all the examples it doesn't include.

The code is here:  http://www.chaosreigns.com/code/sarescore/

One run on my 1,795-email corpus takes 1.5 minutes on my Linode, in Perl.
That was quite a lot faster than anything I had previously achieved, as a
result of using per-test increments.  If a test's score was increased twice
in a row, or decreased twice in a row, its increment was increased.  If a
test's score was increased and then decreased, or decreased and then
increased, its increment was decreased.
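
In other words, the increment adapts based on whether consecutive score
changes agree in direction, a bit like Rprop.  A minimal sketch of the
idea (the variable names and the 1.2/0.5 factors are mine, not what
sarescore actually uses):

    #!/usr/bin/perl
    # Sketch of the per-test adaptive increment: if a score moves in the
    # same direction twice in a row, grow its step size; if it reverses
    # direction, shrink it.
    use strict;
    use warnings;

    my %score     = ( TEST_A => 1.0 );   # current score per test
    my %increment = ( TEST_A => 0.1 );   # current step size per test
    my %last_dir  = ( TEST_A => 0   );   # +1, -1, or 0 (no move yet)

    # $dir is +1 to raise a test's score, -1 to lower it, as decided by
    # whatever the error measure on the corpus says.
    sub adjust {
        my ($test, $dir) = @_;
        if ($last_dir{$test} == $dir) {
            $increment{$test} *= 1.2;     # same direction twice: speed up
        } elsif ($last_dir{$test} == -$dir) {
            $increment{$test} *= 0.5;     # direction reversed: slow down
        }
        $score{$test}    += $dir * $increment{$test};
        $last_dir{$test}  = $dir;
    }

    adjust('TEST_A', +1);
    adjust('TEST_A', +1);                 # increment grows
    adjust('TEST_A', -1);                 # increment shrinks
    printf "TEST_A score %.3f, increment %.3f\n",
        $score{TEST_A}, $increment{TEST_A};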

I'm curious how this compares to the genetic algorithm thingy used to
generate the normal scores for SA, but I haven't poked at it.


Actual distribution of percent non-spam correct in 245 runs:

$ cut -c1-5 < verify.log | sort | uniq -c | sort -n -k 2
      1  98.2
      1  98.3
      3  98.4
      4  98.6
      4  98.7
     12  98.8
     14  98.9
     20  99.0
     17  99.1
     19  99.3
     39  99.4
     31  99.5
     33  99.6
     19  99.7
     23  99.8
      5 100.0

Actually, now that I look at the garescorer results with "approximately
1 million" emails, this doesn't look so bad.  I'm fairly sure those are
only the percent correct on the corpora used for training, and I'd love to
know the results if the corpora were split in half and used for testing
like I did.

The 99.96% correct is only for set 3 (network + bayes), and I'm not doing
bayes.  Percent non-spam correct per test set:

98.88%: Set 0, no bayes and no network tests.
        (False positives: 1 in 89)
 
99.86%: Set 1, network but no bayes.
        (False positives: 1 in 714)
 
99.89%: Set 2, bayes but no network. 
        (False positives: 1 in 909)
 
99.96%: Set 3, network and bayes.
        (False positives: 1 in 2500)
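
Those "1 in N" figures are just 1/(1 - percent correct); a quick check:

    #!/usr/bin/perl
    # Derive the "1 in N" false positive figures from the percent-correct
    # numbers above: N = 1 / (1 - accuracy).
    use strict;
    use warnings;

    for my $pct (98.88, 99.86, 99.89, 99.96) {
        printf "%.2f%% correct -> 1 false positive in %.0f\n",
            $pct, 1 / (1 - $pct / 100);
    }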

Yeah, I'd really like to see what happens if you split the corpora in half,
train on half, and test on the other half.

Maybe vary the number of emails used in the training set, to come up with
accuracy on the held-out half as a function of the size of the training
corpus.
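
Something like this, purely a sketch of the experiment I have in mind
(rescore_and_test() is a placeholder, not a real sarescore function):

    #!/usr/bin/perl
    # Sketch: recalculate scores on training subsets of increasing size
    # and record accuracy on a fixed held-out half, to get accuracy as a
    # function of training corpus size.
    use strict;
    use warnings;
    use List::Util qw(shuffle);

    my @corpus = shuffle(glob("corpus/*"));
    my $half   = int(@corpus / 2);
    my @pool   = @corpus[0 .. $half - 1];      # draw training sets from here
    my @test   = @corpus[$half .. $#corpus];   # fixed held-out half

    for (my $n = 100; $n <= @pool; $n += 100) {
        my @train = @pool[0 .. $n - 1];
        my $acc   = rescore_and_test(\@train, \@test);
        printf "%d training messages -> %.2f%% non-spam correct\n", $n, $acc;
    }

    # Placeholder so the sketch runs; the real thing would run sarescore.
    sub rescore_and_test { return 0 }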

-- 
"If you are not paranoid... you may not be paying attention."
 - j...@creative-net.net, on an IDPA mailing list
http://www.ChaosReigns.com
