Theo,

Thanks very much for your help.  I think I'm almost there...

> ----- Original Message ----- 
> From: "Theo Van Dinter" <[EMAIL PROTECTED]>
> To: "Bob Dickinson (BSL)" <[EMAIL PROTECTED]>
> Cc: <[EMAIL PROTECTED]>
> Sent: Thursday, August 28, 2003 7:36 AM
> Subject: Re: [SAtalk] How long should a GA run take?
>
> On Wed, Aug 27, 2003 at 09:43:50PM -0700, Bob Dickinson (BSL) wrote:
>> We've read the readme's and looked at the code, but can't quite piece it
>> together.  We tried a variety of things, including running
>> rewrite-cf-with-new-scores, then another runGA with a command-line argument
>> to trigger the [GA Validation Results] part, but none of the results seem
>> correct (the recall number in STATISTICS.txt is much lower than the one that
>> the GA spit out, among other things).
>
> It's sort of complicated, which is why I'm hoping we work on score
> generation for 2.70.  First, the GA doesn't output scores for
> every rule.  I forget the ones it skips, but I have a script I use
> to "normalize" the scores to have the full list for the release.
> You then run rewrite-cf... to generate the new scores, and you put
> that in ../rules/50_scores.cf.  Then you run "runGA" with some argument
> (doesn't matter), and it'll generate the STATISTICS and such.  (note:
> that's based on the unseen validation logs, so it's likely to get slightly
> worse performance than the direct GA output).

OK, then I guess I did understand it after all.  I was mostly confused because
it didn't look like generating freqs before the GA and then inserting that
into STATISTICS could ever work.  I guess if you want the right scores in
STATISTICS, you need to rebuild freqs after the rewrite-cf... step and before
the second runGA.

But my real problem is how much worse the scores are.  [GA Validation
Results] uses the scores in 50_scores.cf (as I understand it), which are
updated from the GA scores when you run rewrite-cf..., so we should be using
the right scores (I verified that the scores in 50_scores.cf match the scores
output from the GA).
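
For reference, that verification can be sketched as a quick script comparing
the two score sets.  This is a minimal sketch, not the exact check I ran: it
assumes the usual "score RULE value" .cf line syntax, takes only the first
value per rule (real 50_scores.cf entries can carry four per-scoreset
values), and the sample texts below are placeholders:

```python
# Minimal sketch: compare rule scores between two .cf-style texts.
# Simplification: only the first value after the rule name is compared.
def parse_scores(text):
    scores = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "score":
            scores[parts[1]] = float(parts[2])  # first score value only
    return scores

# Placeholder contents standing in for the GA output and 50_scores.cf.
ga_out = """\
score RULE_A 2.5
score RULE_B 0.1
"""
rewritten = """\
# comment lines are skipped by the parser
score RULE_A 2.5
score RULE_B 0.1
"""
print(parse_scores(ga_out) == parse_scores(rewritten))  # → True
```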

This is what the GA spit out at the end:

     The Best Evaluation: 6.681078e+04.
     The Best String:

     # SUMMARY for threshold 5.0:
     # Correctly non-spam: 188787  37.16%  (99.80% of non-spam corpus)
     # Correctly spam:     291013  57.28%  (91.26% of spam corpus)
     # False positives:       387  0.08%  (0.20% of nonspam,  74890 weighted)
     # False negatives:     27857  5.48%  (8.74% of spam,  82132 weighted)
     # Average score for spam:  16.7    nonspam: 0.4
     # Average for false-pos:   5.7  false-neg: 2.9
     # TOTAL:              508044  100.00%
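
(As a sanity check, the percentages in that summary are internally
consistent; recomputing a few of them from the raw counts, pure arithmetic
on the figures above:)

```python
# Sanity-check the GA summary arithmetic from the raw counts.
ok_ham, ok_spam, fp, fn = 188787, 291013, 387, 27857
total = ok_ham + ok_spam + fp + fn
ham, spam = ok_ham + fp, ok_spam + fn            # sub-corpus sizes

assert total == 508044
print(f"{100*ok_ham/total:.2f}% of total")       # 37.16
print(f"{100*ok_ham/ham:.2f}% of non-spam")      # 99.80
print(f"{100*fp/ham:.2f}% of non-spam are FPs")  # 0.20
print(f"{100*fn/spam:.2f}% of spam are FNs")     # 8.74
```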

Given that we removed a couple hundred tests, that's about what I expected
(similar FP rate to the RC set0, with an FN rate that's higher by a couple
of points).

But here is what [GA Validation Results] spit out:

     STATISTICS REPORT FOR SPAMASSASSIN RULESET

     Classification success on test corpora, at default threshold:

     # SUMMARY for threshold 5.0:
     # Correctly non-spam:  20634  36.55%  (98.17% of non-spam corpus)
     # Correctly spam:      33224  58.86%  (93.78% of spam corpus)
     # False positives:       385  0.68%  (1.83% of nonspam,  43636 weighted)
     # False negatives:      2205  3.91%  (6.22% of spam,   6907 weighted)
     # TCR: 8.578450  SpamRecall: 93.776%  SpamPrec: 98.854%  FP: 0.68%  FN: 3.91%

I expected the final results (the second set, from STATISTICS) to be at
least in the ballpark of the first set (from the GA run), but these are
radically different, the biggest factor being that the FP rate is 9x worse.
The GA run found only 387 FP's in 188,787 messages, but the validation run
found 385 FP's in a set of only 20,634 messages.  That just doesn't seem
possible (OK, it's statistics, so anything is possible, but it doesn't seem
likely).
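
(To put a number on "doesn't seem likely": a standard two-proportion z-test
on those two FP rates, using only the counts quoted above, gives a z-score
in the tens, far too large to be sampling noise if both corpora had the same
underlying FP rate.  A sketch; the test choice is mine, not anything the GA
tools compute:)

```python
from math import sqrt

# FP rates: GA training summary vs. validation STATISTICS.
# Non-spam corpus size = correctly classified non-spam + false positives.
fp1, ham1 = 387, 188787 + 387   # GA run
fp2, ham2 = 385, 20634 + 385    # validation run

p1, p2 = fp1 / ham1, fp2 / ham2
p = (fp1 + fp2) / (ham1 + ham2)                   # pooled FP rate
se = sqrt(p * (1 - p) * (1 / ham1 + 1 / ham2))    # pooled standard error
z = (p2 - p1) / se
print(f"FP rates: {100*p1:.2f}% vs {100*p2:.2f}%, z = {z:.1f}")
```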

Any ideas as to what I might be doing wrong?

Bob


