Theo,

Thanks very much for your help. I think I'm almost there...
> ----- Original Message -----
> From: "Theo Van Dinter" <[EMAIL PROTECTED]>
> To: "Bob Dickinson (BSL)" <[EMAIL PROTECTED]>
> Cc: <[EMAIL PROTECTED]>
> Sent: Thursday, August 28, 2003 7:36 AM
> Subject: Re: [SAtalk] How long should a GA run take?
>
> On Wed, Aug 27, 2003 at 09:43:50PM -0700, Bob Dickinson (BSL) wrote:
>> We've read the readme's and looked at the code, but can't quite piece it
>> together. We tried a variety of things, including running
>> rewrite-cf-with-new-scores, then another runGA with a command line
>> argument to trigger the [GA Validation Results] part, but none of the
>> results seem correct (the recall number in STATISTICS.txt is much lower
>> than the one that the GA spit out, among other things).
>
> It's sort of complicated, which is why I'm hoping we work on score
> generation for 2.70. First, the GA doesn't output scores for every rule.
> I forget the ones it skips, but I have a script I use to "normalize" the
> scores to have the full list for the release. You then run rewrite-cf...
> to generate the new scores, and you put that in ../rules/50_scores.cf.
> Then you run "runGA" with some argument (doesn't matter), and it'll
> generate the STATISTICS and such. (Note: that's based on the unseen
> validation logs, so it's likely to get slightly worse performance than
> the direct GA output.)

OK, then I guess maybe I did understand it. I was mostly confused because it
just didn't look like the mechanism of generating freqs before the GA run and
then inserting that into STATISTICS would ever work. I guess if you want the
right scores in STATISTICS then you need to build freqs again after the
rewrite-cf... step and before the second runGA.

But my real problem is how much worse the validation results are. The
[GA Validation Results] pass is using the scores in 50_scores.cf (as I
understand it), which get updated from the GA scores when you run
rewrite-cf..., so we should be using the right scores (I verified that the
scores in 50_scores.cf match the scores output from the GA).

This is what the GA spit out at the end:

The Best Evaluation: 6.681078e+04.
The Best String:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 188787  37.16%  (99.80% of non-spam corpus)
# Correctly spam:     291013  57.28%  (91.26% of spam corpus)
# False positives:       387   0.08%  (0.20% of nonspam, 74890 weighted)
# False negatives:     27857   5.48%  (8.74% of spam, 82132 weighted)
# Average score for spam:  16.7   nonspam: 0.4
# Average for false-pos:    5.7   false-neg: 2.9
# TOTAL:              508044 100.00%

Given that we removed a couple hundred tests, that's about what I expected (a
similar FP rate to the RC set0, with an FN rate that's a couple of points
higher).

But here is what [GA Validation Results] spit out:

STATISTICS REPORT FOR SPAMASSASSIN RULESET

Classification success on test corpora, at default threshold:

# SUMMARY for threshold 5.0:
# Correctly non-spam:  20634  36.55%  (98.17% of non-spam corpus)
# Correctly spam:      33224  58.86%  (93.78% of spam corpus)
# False positives:       385   0.68%  (1.83% of nonspam, 43636 weighted)
# False negatives:      2205   3.91%  (6.22% of spam, 6907 weighted)
# TCR: 8.578450  SpamRecall: 93.776%  SpamPrec: 98.854%  FP: 0.68%  FN: 3.91%

I expected the final results (the second set, from STATISTICS) to be at least
in the ballpark of the first set (from the GA run), but they seem radically
different, the biggest factor being that the FP rate is roughly 9x worse: the
GA found only 387 FPs in a non-spam corpus of 189,174 messages, while the
validation pass found 385 FPs in only 21,019 non-spam messages.
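To rule out a simple reporting glitch, I re-derived the validation figures
from the raw counts in that report. This is just a quick sketch of my own in
Python, not one of the masses tools, and the TCR cost weight of 5 is an
assumption on my part (though it does reproduce the reported value exactly):

    # Re-derive the [GA Validation Results] figures from its raw counts.
    # Counts are from the summary above; lambda = 5 for the TCR is assumed.
    ok_ham, fp = 20634, 385        # correctly non-spam, false positives
    ok_spam, fn = 33224, 2205      # correctly spam, false negatives

    n_ham = ok_ham + fp            # non-spam validation corpus (21,019)
    n_spam = ok_spam + fn          # spam validation corpus (35,429)

    recall = ok_spam / n_spam              # fraction of spam caught
    precision = ok_spam / (ok_spam + fp)   # fraction of spam verdicts that were right
    tcr = n_spam / (5 * fp + fn)           # total cost ratio, assumed lambda = 5

    print(f"SpamRecall: {recall:.3%}  SpamPrec: {precision:.3%}  TCR: {tcr:.6f}")
    print(f"FP: {fp / n_ham:.2%} of non-spam   FN: {fn / n_spam:.2%} of spam")

That gives 93.776% / 98.854% / 8.578450 and 1.83% / 6.22%, matching the report,
so the validation numbers are at least internally consistent rather than an
accounting bug.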
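And a rough back-of-envelope check of whether sampling variation alone could
explain the FP gap, assuming the validation non-spam logs share the underlying
FP rate the GA corpus showed (again just my own Python sketch, not part of the
tools):

    import math

    # Non-spam counts from the two summaries above.
    train_ok, train_fp = 188787, 387   # GA (training) corpus
    valid_ok, valid_fp = 20634, 385    # unseen validation logs

    train_ham = train_ok + train_fp    # 189,174 non-spam messages
    valid_ham = valid_ok + valid_fp    # 21,019 non-spam messages

    p = train_fp / train_ham           # FP rate the GA reported (~0.20%)

    # If the validation logs had that same FP rate, the FP count would be
    # roughly binomial(valid_ham, p): mean n*p, stddev sqrt(n*p*(1-p)).
    expected = valid_ham * p
    stddev = math.sqrt(valid_ham * p * (1 - p))
    z = (valid_fp - expected) / stddev

    print(f"expected FPs in validation: {expected:.0f} +/- {stddev:.1f}")
    print(f"observed FPs: {valid_fp}  ({z:.0f} standard deviations out)")

That comes out to roughly 43 +/- 7 expected FPs against 385 observed, around
50 standard deviations away, so if the validation logs are just a held-out
slice of the same corpus this isn't the kind of gap a random split should
produce.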
This just doesn't seem possible (OK, it's statistics, so anything is possible,
but it doesn't seem likely). Any ideas as to what I might be doing wrong?

Bob