> ----- Original Message -----
> From: "Theo Van Dinter" <[EMAIL PROTECTED]>
> To: "Bob Dickinson (BSL)" <[EMAIL PROTECTED]>
> Cc: <[EMAIL PROTECTED]>
> Sent: Thursday, August 28, 2003 7:36 AM
> Subject: Re: [SAtalk] How long should a GA run take?
>
> It's sort of complicated, which is why I'm hoping we work on score
> generation for 2.70. First, the GA doesn't output scores for every
> rule. I forget the ones it skips, but I have a script I use to
> "normalize" the scores to have the full list for the release. You
> then run rewrite-cf... to generate the new scores, and you put that
> in ../rules/50_scores.cf. Then you run "runGA" with some argument
> (it doesn't matter which), and it'll generate the STATISTICS and so
> on. (Note: those are based on the unseen validation logs, so they'll
> likely show slightly worse performance than the direct GA output.)
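Restating that recipe as a sketch, as I understand it (I'm assuming
score set 0 and the stock masses/ layout; the "normalize" step is
Theo's own script, so it's not shown):

    cd masses
    ./runGA                    # evolve scores; writes GA-set0.out, GA-set0.scores, etc.
    ./rewrite_cf_with_new_scores 0 ../rules/50_scores.cf GA-set0.scores > 50_scores.new
    mv 50_scores.new ../rules/50_scores.cf
    ./runGA anything           # with any argument, runGA just generates the
                               # STATISTICS report, against the unseen -validate logs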
Well, first off, the guy I work with (the Linux guy around here) got
tired of helping me, so I got this stuff working under Cygwin on my
Windows box and ran it all again from the top myself. I got the same
results, and it seemed to run about twice as fast (on comparable
hardware). Go figure.

Anyway, GA-set0.out reported:

The Best Evaluation: 6.425832e+04.
The Best String:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 188787  37.16%  (99.80% of non-spam corpus)
# Correctly spam:     291101  57.30%  (91.29% of spam corpus)
# False positives:       387   0.08%  (0.20% of nonspam, 75597 weighted)
# False negatives:     27769   5.47%  (8.71% of spam, 84951 weighted)
# Average score for spam: 16.6    nonspam: 0.4
# Average for false-pos:   5.8  false-neg: 3.1
# TOTAL:              508044 100.00%

As expected. But, as before, the validation produced much worse results:

STATISTICS REPORT FOR SPAMASSASSIN RULESET
Classification success on test corpora, at default threshold:

# SUMMARY for threshold 5.0:
# Correctly non-spam:  20764  36.78%  (98.79% of non-spam corpus)
# Correctly spam:      32924  58.33%  (92.93% of spam corpus)
# False positives:       255   0.45%  (1.21% of nonspam, 29228 weighted)
# False negatives:      2505   4.44%  (7.07% of spam, 7837 weighted)
# TCR: 9.372751  SpamRecall: 92.930%  SpamPrec: 99.231%  FP: 0.45%  FN: 4.44%

And just to make sure this wasn't some kind of statistical anomaly, I
copied the spam/ham log files over the corresponding -validate files
and ran the "validation" again, this time on the original "seen"
messages the GA had used to generate the scores. This is what I got:

STATISTICS REPORT FOR SPAMASSASSIN RULESET
Classification success on test corpora, at default threshold:

# SUMMARY for threshold 5.0:
# Correctly non-spam: 186620  36.73%  (98.65% of non-spam corpus)
# Correctly spam:     296308  58.32%  (92.92% of spam corpus)
# False positives:      2554   0.50%  (1.35% of nonspam, 281078 weighted)
# False negatives:     22562   4.44%  (7.08% of spam, 70093 weighted)
# TCR: 9.024963  SpamRecall: 92.924%  SpamPrec: 99.145%  FP: 0.50%  FN: 4.44%

So I got pretty much exactly the same horrible FP performance on the
"seen" messages as on the "validation" messages. That leads me to
believe there is something wrong with my scores file, or that I've
missed some step required to make the validation work.

My exact steps were:

    runGA
    rewrite_cf_with_new_scores 0 ../rules/50_scores.cf GA-set0.scores > 50_scores.new
    mv 50_scores.new ../rules/50_scores.cf
    runGA foo

After the rewrite_cf... step, I manually verified that the scores in
50_scores.cf matched the scores in GA-set0.scores.

I'm somewhat at a loss as to what to try next. Any help would be
greatly appreciated.

Bob
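P.S. For anyone checking the arithmetic: both TCR lines above are
consistent with the usual definition TCR = (total spam) / (5*FP + FN),
i.e. one false positive costed at lambda = 5 false negatives. A quick
check (bc is only doing the division; the lambda = 5 formula is my
inference from the reported numbers):

    $ echo 'scale=6; (32924 + 2505) / (5*255 + 2505)' | bc
    9.372751
    $ echo 'scale=6; (296308 + 22562) / (5*2554 + 22562)' | bc
    9.024963

So both reports are at least being scored the same way; the
discrepancy really is in the FP counts.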