> ----- Original Message ----- 
 > From: "Theo Van Dinter" <[EMAIL PROTECTED]>
> To: "Bob Dickinson (BSL)" <[EMAIL PROTECTED]>
> Cc: <[EMAIL PROTECTED]>
> Sent: Thursday, August 28, 2003 7:36 AM
> Subject: Re: [SAtalk] How long should a GA run take?
>
> It's sort of complicated, which is why I'm hoping we work on score
> generation for 2.70.  First, the GA doesn't output scores for
> every rule.  I forget the ones it skips, but I have a script I use
> to "normalize" the scores to have the full list for the release.
> You then run rewrite-cf... to generate the new scores, and you put
> that in ../rules/50_scores.cf.  Then you run "runGA" with some argument
> (doesn't matter), and it'll generate the STATISTICS and such.  (note:
> that's based on the unseen validation logs, so it's likely to get slightly
> worse performance than the direct GA output).

Well, first off, the guy I work with (the Linux guy around here) got tired
of helping me, so I got this stuff working under Cygwin on my Windows box
and ran it all again from the top myself.  I got the same results, and it
seemed to run about twice as fast (with comparable hardware).  Go figure.

Anyway, the GA-set0.out reported:

    The Best Evaluation: 6.425832e+04.
    The Best String:

    # SUMMARY for threshold 5.0:
    # Correctly non-spam: 188787  37.16%  (99.80% of non-spam corpus)
    # Correctly spam:     291101  57.30%  (91.29% of spam corpus)
    # False positives:       387  0.08%  (0.20% of nonspam,  75597 weighted)
    # False negatives:     27769  5.47%  (8.71% of spam,  84951 weighted)
    # Average score for spam:  16.6    nonspam: 0.4
    # Average for false-pos:   5.8  false-neg: 3.1
    # TOTAL:              508044  100.00%
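
(For reference, the two percentage columns use different denominators: the
first is against the whole corpus, the parenthesized one against just the
ham or spam portion.  Checking against the raw counts above:

    FP:    387 / 508044            = 0.08% of corpus
           387 / (188787 + 387)    = 0.20% of ham
    FN:  27769 / 508044            = 5.47% of corpus
         27769 / (291101 + 27769)  = 8.71% of spam

I haven't tried to reproduce the "weighted" figures, which I assume are the
summed scores of the misclassified messages.)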

As expected.  But, as before, the validation produced much worse results:

    STATISTICS REPORT FOR SPAMASSASSIN RULESET

    Classification success on test corpora, at default threshold:

    # SUMMARY for threshold 5.0:
    # Correctly non-spam:  20764  36.78%  (98.79% of non-spam corpus)
    # Correctly spam:      32924  58.33%  (92.93% of spam corpus)
    # False positives:       255  0.45%  (1.21% of nonspam,  29228 weighted)
    # False negatives:      2505  4.44%  (7.07% of spam,   7837 weighted)
    # TCR: 9.372751  SpamRecall: 92.930%  SpamPrec: 99.231%  FP: 0.45%  FN: 4.44%
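
(I did check that the derived figures here are self-consistent with the raw
counts: they come out exactly if the FP cost factor in the TCR formula is 5,
which I assume is what the stats script uses:

    nspam      = 32924 + 2505 = 35429
    TCR        = nspam / (5*FP + FN) = 35429 / (5*255 + 2505) = 9.372751
    SpamRecall = 32924 / 35429         = 92.930%
    SpamPrec   = 32924 / (32924 + 255) = 99.231%

The same factor of 5 reproduces the TCR in the "seen" run below, so whatever
is wrong, it isn't the report arithmetic.)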

And just to make sure this wasn't some kind of statistical anomaly, I copied
the spam/ham files to the -validate files and I ran the "validation" again
(this time on the original "seen" messages used by the GA to generate the
scores).  And this is what I got:

    STATISTICS REPORT FOR SPAMASSASSIN RULESET

    Classification success on test corpora, at default threshold:

    # SUMMARY for threshold 5.0:
    # Correctly non-spam: 186620  36.73%  (98.65% of non-spam corpus)
    # Correctly spam:     296308  58.32%  (92.92% of spam corpus)
    # False positives:      2554  0.50%  (1.35% of nonspam, 281078 weighted)
    # False negatives:     22562  4.44%  (7.08% of spam,  70093 weighted)
    # TCR: 9.024963  SpamRecall: 92.924%  SpamPrec: 99.145%  FP: 0.50%  FN: 4.44%

So I got essentially the same horrible FP performance on the "seen" messages
(1.35% of ham) as on the "validation" messages (1.21%), both far worse than
the 0.20% in the GA's own summary.  That leads me to believe that there is
something wrong with my scores file, or that I have missed some step
required to make the validation work.

My exact steps were:

    runGA
    rewrite_cf_with_new_scores 0 ../rules/50_scores.cf GA-set0.scores > 50_scores.new
    mv 50_scores.new ../rules/50_scores.cf
    runGA foo

After the rewrite_cf... I manually verified that the scores in 50_scores.cf
matched the scores in GA-set0.scores.
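
(A sketch of the kind of check I mean -- this assumes both files carry plain
"score NAME value" lines, which I believe is true for 50_scores.cf but which
I haven't confirmed for the GA-set0.scores format:

    grep '^score ' ../rules/50_scores.cf | sort > cf.sorted
    grep '^score ' GA-set0.scores        | sort > ga.sorted
    diff cf.sorted ga.sorted && echo "scores match")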

I'm somewhat at a loss as to what to try next.  Any help would be greatly
appreciated.

Bob


