> ----- Original Message -----
> From: "Theo Van Dinter" <[EMAIL PROTECTED]>
> To: "Bob Dickinson (BSL)" <[EMAIL PROTECTED]>
> Cc: <[EMAIL PROTECTED]>
> Sent: Thursday, August 28, 2003 7:36 AM
> Subject: Re: [SAtalk] How long should a GA run take?
>
> It's sort of complicated, which is why I'm hoping we work on score
> generation for 2.70. First, the GA doesn't output scores for
> every rule. I forget the ones it skips, but I have a script I use
> to "normalize" the scores to have the full list for the release.
> You then run rewrite-cf... to generate the new scores, and you put
> that in ../rules/50_scores.cf. Then you run "runGA" with some argument
> (doesn't matter), and it'll generate the STATISTICS and such. (note:
> that's based on the unseen validation logs, so it's likely to get slightly
> worse performance than the direct GA output).
Well, first off, the guy I work with (the Linux guy around here) got tired
of helping me, so I got this stuff working under cygwin on my Windows box
and ran it all again from the top myself. I got the same results, and it
seemed to run about twice as fast (with comparable hardware). Go figure.
Anyway, the GA-set0.out reported:
The Best Evaluation: 6.425832e+04.
The Best String:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 188787 37.16% (99.80% of non-spam corpus)
# Correctly spam: 291101 57.30% (91.29% of spam corpus)
# False positives: 387 0.08% (0.20% of nonspam, 75597 weighted)
# False negatives: 27769 5.47% (8.71% of spam, 84951 weighted)
# Average score for spam: 16.6 nonspam: 0.4
# Average for false-pos: 5.8 false-neg: 3.1
# TOTAL: 508044 100.00%
As expected. But, as before, the validation produced much worse results:
STATISTICS REPORT FOR SPAMASSASSIN RULESET
Classification success on test corpora, at default threshold:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 20764 36.78% (98.79% of non-spam corpus)
# Correctly spam: 32924 58.33% (92.93% of spam corpus)
# False positives: 255 0.45% (1.21% of nonspam, 29228 weighted)
# False negatives: 2505 4.44% (7.07% of spam, 7837 weighted)
# TCR: 9.372751 SpamRecall: 92.930% SpamPrec: 99.231% FP: 0.45% FN: 4.44%
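For reference, the TCR line can be reproduced from the raw counts above. This is just a sanity-check sketch assuming the standard total-cost-ratio definition, TCR = N_spam / (lambda * FP + FN); the cost factor lambda = 5 is my inference from the numbers in the report, not something taken from the SpamAssassin source.

```python
# Sanity-check the TCR figure from the raw counts in the report above.
# ASSUMPTION: TCR = N_spam / (lambda * FP + FN) with lambda = 5
# (inferred from the reported numbers, not from the SpamAssassin code).

def tcr(n_spam_total, fp, fn, lam=5.0):
    """Total cost ratio: higher is better; weights FPs lam times an FN."""
    return n_spam_total / (lam * fp + fn)

n_spam = 32924 + 2505   # correctly-classified spam + false negatives
fp, fn = 255, 2505

print(round(tcr(n_spam, fp, fn), 6))  # -> 9.372751, matching the report
```

The same formula with the "seen"-corpus counts (318870 spam, 2554 FPs, 22562 FNs) gives 9.024963, matching the second report as well.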
And just to make sure this wasn't some kind of statistical anomaly, I copied
the spam/ham files to the -validate files and ran the "validation" again
(this time on the original "seen" messages the GA used to generate the
scores). This is what I got:
STATISTICS REPORT FOR SPAMASSASSIN RULESET
Classification success on test corpora, at default threshold:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 186620 36.73% (98.65% of non-spam corpus)
# Correctly spam: 296308 58.32% (92.92% of spam corpus)
# False positives: 2554 0.50% (1.35% of nonspam, 281078 weighted)
# False negatives: 22562 4.44% (7.08% of spam, 70093 weighted)
# TCR: 9.024963 SpamRecall: 92.924% SpamPrec: 99.145% FP: 0.50% FN: 4.44%
So I got pretty much exactly the same horrible FP performance with the
"seen" messages as I did with the "validation" messages. And this leads me
to believe that there is something wrong with my scores file, or maybe I
have missed some step required to make the validation work.
My exact steps were:
runGA
rewrite_cf_with_new_scores 0 ../rules/50_scores.cf GA-set0.scores > 50_scores.new
mv 50_scores.new ../rules/50_scores.cf
runGA foo
After the rewrite_cf... I manually verified that the scores in 50_scores.cf
matched the scores in GA-set0.scores.
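For what it's worth, that manual check can be scripted. The following is only a sketch: it assumes both GA-set0.scores and 50_scores.cf contain lines of the form "score RULE_NAME value", which may not match the real file formats exactly (50_scores.cf in particular can carry extra fields per rule), so the parsing would need adjusting.

```python
# Sketch of the manual verification step: compare per-rule scores
# between the GA output and the rewritten 50_scores.cf.
# ASSUMPTION: both files use "score RULE_NAME value" lines; adjust the
# regex if the actual formats differ.
import re

def read_scores(path):
    """Parse 'score RULE value' lines into a {rule: value} dict."""
    scores = {}
    with open(path) as f:
        for line in f:
            m = re.match(r'\s*score\s+(\S+)\s+(-?[\d.]+)', line)
            if m:
                scores[m.group(1)] = float(m.group(2))
    return scores

def diff_scores(ga_path, cf_path):
    """Return {rule: (ga_score, cf_score)} for every mismatched rule."""
    ga, cf = read_scores(ga_path), read_scores(cf_path)
    return {r: (ga.get(r), cf.get(r))
            for r in sorted(set(ga) | set(cf))
            if ga.get(r) != cf.get(r)}

# e.g. diff_scores('GA-set0.scores', '../rules/50_scores.cf')
# an empty dict would mean the two files agree rule-for-rule
```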
I'm somewhat at a loss as to what to try next. Any help would be greatly
appreciated.
Bob