Any chance of getting a run for rescoring of the SURBL lists?

--Chris (Perceptron is on my list of things to master.)

>-----Original Message-----
>From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
>Sent: Wednesday, September 29, 2004 1:26 PM
>To: Matt Kettler
>Cc: Chris Santerre; users@spamassassin.apache.org
>Subject: Re: Why such a low score? 
>
>
>-----BEGIN PGP SIGNED MESSAGE-----
>Hash: SHA1
>
>
>What Matt said ;)  the perceptron really hates FPs.
>
>Also, another feature of the perceptron is that, if two rules hit the same
>spams and the same hams, it'll spread the scores equally between those two
>rules.
>
>e.g.: if RULE_1 hits a certain set of spam, it may get a score of 3.0. But
>if RULE_1 and RULE_2 both hit the same subset, the score will be spread
>over the two, and each will get 1.5.
>
>If RULE_2 hit the same mails, but hit more hams, its score will be reduced
>and more score given to RULE_1.
>
>afaik...
>
>- --j.
>
>Matt Kettler writes:
>> At 10:55 AM 9/29/2004, Chris Santerre wrote:
>> >What was the reason WS got such a low score in SA 3.0??? .5 is a joke! Hell
>> >BigEvil was scored a 3 and no one complained, and it is the same data!! I
>> >don't understand. Did the mass check not go well?
>> 
>> Upon closer inspection, the WS mass-check went pretty well, but WS had the
>> greatest number of nonspam hits of all the SURBL lists. It also hit the
>> most spam, but the OB list hit nearly as much spam, and almost no nonspam.
>> 
>> Since the GA treats FP's as 100 times worse than FNs, the GA is going to
>> heavily bias the score of any overlapping spam hits to the one that has the
>> least nonspam hits. I suspect that in the spam cases, most of the WS hits
>> also hit either OB or SC, which have better FP ratios, and the scores
>> assigned reflect this.
>> 
>> Admittedly the amount of nonspam WS hit is small (0.4%), but that's over 6
>> times more nonspam than OB did, and 100 times more than SC did.
>> 
>> Thus WS got a lowish score not for being a bad rule, but for not doing as
>> well as its neighbors that catch the same spams.
>> 
>>  From STATISTICS-set1.txt
>> OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
>>   10.497  15.8904   0.0008    1.000   0.98    2.01  URIBL_AB_SURBL
>>   18.019  27.2741   0.0046    1.000   0.97    3.90  URIBL_SC_SURBL
>>   49.029  74.1861   0.0654    0.999   0.74    2.00  URIBL_OB_SURBL
>>   51.999  78.4712   0.4756    0.994   0.45    0.54  URIBL_WS_SURBL
>>    0.010   0.0146   0.0012    0.927   0.39    0.84  URIBL_PH_SURBL
>> 
>>  From STATISTICS-set3.txt:
>> OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
>>    7.022  14.4233   0.0061    1.000   0.95    4.26  URIBL_SC_SURBL
>>   30.471  62.5514   0.0632    0.999   0.74    3.21  URIBL_OB_SURBL
>>    2.950   6.0208   0.0385    0.994   0.73    0.42  URIBL_AB_SURBL
>>   33.807  68.9994   0.4494    0.994   0.47    1.46  URIBL_WS_SURBL
>>    0.019   0.0390   0.0008    0.981   0.44    2.00  URIBL_PH_SURBL
>> 
>> grep SURBL 50_scores.cf:
>> score URIBL_AB_SURBL 0 2.007 0 0.417
>> score URIBL_OB_SURBL 0 1.996 0 3.213
>> score URIBL_PH_SURBL 0 0.839 0 2.000
>> score URIBL_SC_SURBL 0 3.897 0 4.263
>> score URIBL_WS_SURBL 0 0.539 0 1.462
>-----BEGIN PGP SIGNATURE-----
>Version: GnuPG v1.2.4 (GNU/Linux)
>Comment: Exmh CVS
>
>iD8DBQFBWvA8QTcbUG5Y7woRAkoRAJ9SxRe1x/0wID7By10Cz8uZn8v2iQCfaYAb
>eVpsm+sn6fIrpwvhwwmHnmc=
>=RWRo
>-----END PGP SIGNATURE-----
>
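To make the score-spreading behaviour described in the quoted message a bit more concrete, here is a toy sketch. It is NOT the SpamAssassin perceptron: it is a plain least-squares gradient descent starting from all-zero scores, and the function name, the 3.0 spam target, and the message counts are all made up for illustration. It only shows why rules that hit exactly the same mails end up with equal shares, and why a rule that also hits ham loses score to its cleaner twin.

```python
def fit_scores(messages, steps=5000, lr=0.1):
    """Toy least-squares gradient descent, starting from all-zero scores.

    messages: list of (hits, target) pairs. hits is a tuple of 0/1 flags,
    one per rule; target is the total score we want that message to get.
    This is an illustrative stand-in, not SpamAssassin's actual optimizer.
    """
    n_rules = len(messages[0][0])
    scores = [0.0] * n_rules
    for _ in range(steps):
        grad = [0.0] * n_rules
        for hits, target in messages:
            # How far the current total score is from the target.
            error = sum(s * h for s, h in zip(scores, hits)) - target
            for i, h in enumerate(hits):
                grad[i] += error * h
        # Step against the mean gradient over all messages.
        scores = [s - lr * g / len(messages) for s, g in zip(scores, grad)]
    return scores


SPAM_TARGET = 3.0  # hypothetical total score wanted on spam
HAM_TARGET = 0.0   # hypothetical total score wanted on ham

# RULE_1 alone hits four spams.
alone = fit_scores([((1,), SPAM_TARGET)] * 4)

# RULE_1 and RULE_2 hit exactly the same four spams.
shared = fit_scores([((1, 1), SPAM_TARGET)] * 4)

# Same as above, but RULE_2 also hits two hams that RULE_1 does not.
uneven = fit_scores([((1, 1), SPAM_TARGET)] * 4 + [((0, 1), HAM_TARGET)] * 2)

for name, result in (("alone", alone), ("shared", shared), ("uneven", uneven)):
    print(name, [f"{s:.2f}" for s in result])
```

In this toy setup the lone rule settles at about 3.0, the two identical rules settle at about 1.5 each (they see identical gradients, so they can never drift apart), and in the last case nearly all of the score moves to RULE_1. The real rescoring run weighs FPs far more heavily and has many more rules in play, so the actual numbers will not be this clean.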

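As a quick sanity check on the "over 6 times" and "100 times" figures in the quoted message, the ratios can be recomputed from the HAM% column of the quoted STATISTICS-set1.txt table. The dictionary and helper names below are just illustrative; the numbers are copied from that table.

```python
# HAM% values copied from the quoted STATISTICS-set1.txt table.
ham_pct_set1 = {
    "URIBL_AB_SURBL": 0.0008,
    "URIBL_SC_SURBL": 0.0046,
    "URIBL_OB_SURBL": 0.0654,
    "URIBL_WS_SURBL": 0.4756,
    "URIBL_PH_SURBL": 0.0012,
}


def ham_ratio(table, rule, baseline):
    """How many times more ham `rule` hit than `baseline` (illustrative helper)."""
    return table[rule] / table[baseline]


ws_vs_ob = ham_ratio(ham_pct_set1, "URIBL_WS_SURBL", "URIBL_OB_SURBL")
ws_vs_sc = ham_ratio(ham_pct_set1, "URIBL_WS_SURBL", "URIBL_SC_SURBL")

print(f"WS vs OB: {ws_vs_ob:.1f}x")  # WS vs OB: 7.3x
print(f"WS vs SC: {ws_vs_sc:.1f}x")  # WS vs SC: 103.4x
```

So on set1, WS hit roughly 7 times as much ham as OB and roughly 100 times as much as SC, which matches the quoted explanation for the low WS score.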