Eric Kolve and I were looking at how to best set the default SpamCopURI scores for the various SURBL lists and at first we tried looking at the SpamAssassin 3.0 perceptron-generated scores as a possible guide:
> http://spamassassin.apache.org/full/3.0.x/dist/rules/50_scores.cf
>
> # The following block of scores were generated using the mass-checking
> # scripts, and a perceptron to determine the optimum scores which
> # resulted in minimum false positives or negatives.  The scores are
> # weighted to produce roughly 1 false positive in 2500 non-spam messages
> # using the default threshold of 5.0.
> score URIBL_AB_SURBL 0 2.007 0 0.417
> score URIBL_OB_SURBL 0 1.996 0 3.213
> score URIBL_PH_SURBL 0 0.839 0 2.000
> score URIBL_SC_SURBL 0 3.897 0 4.263
> score URIBL_WS_SURBL 0 0.539 0 1.462

I was trying to figure out what the different score columns meant,
to which Theo Van Dinter cited:

> $ perldoc Mail::SpamAssassin::Conf
> [...]
>   If four valid scores are listed, then the score that is used
>   depends on how SpamAssassin is being used. The first score is used
>   when both Bayes and network tests are disabled (score set 0). The
>   second score is used when Bayes is disabled, but network tests are
>   enabled (score set 1). The third score is used when Bayes is
>   enabled and network tests are disabled (score set 2). The fourth
>   score is used when Bayes is enabled and network tests are enabled
>   (score set 3).

We wondered if we could somehow use those scores with SpamCopURI
and were unable to come up with a good answer.

Theo suggested looking at Spam versus ham rates as a good way to
set scores, to which I mentioned:

> We have these test results from Justin from 25 June:
>
> OVERALL%     SPAM%      HAM%    S/O  RANK  SCORE  NAME
>   121405     22516     98889  0.185  0.00   0.00  (all messages)
>  100.000   18.5462   81.4538  0.185  0.00   0.00  (all messages as %)
>   13.453   70.3766    0.4925  0.993  1.00   1.00  SURBL_WS
>    3.807   20.3811    0.0334  0.998  0.50   1.00  SURBL_SC
>    2.650   14.2565    0.0071  1.000  0.50   1.00  SURBL_AB
>    0.019    0.0933    0.0020  0.979  0.50   1.00  SURBL_PH
>   12.624   67.6275    0.1001  0.999  0.50   1.00  SURBL_OB
>
> which shows a pretty high FP rate for WS, less for the others.
>
> Do you happen to have access to any more recent corpus check data
> like this?  Could be useful to have another snapshot for a more
> complete picture.

Which was followed up with more data and discussion:

> On Saturday, September 4, 2004, 10:13:11 PM, Theo Dinter wrote:
>> high spam + low ham is good from an FP standpoint, but having a
>> "significant" (for your definition thereof) ham hitrate means the
>> score shouldn't be too high.  My handwaving scores would be
>> something like:

[Theo's wild guess scores for Justin's June data: -- Jeff C.]

>> WS 1.2
>> SC 2.5
>> AB 3.5
>> OB 1.8

Theo then gave some of his own stats on a couple different corpora:

>>       OVERALL%     SPAM%      HAM%    S/O  RANK  SCORE  NAME
>>         416072    365031     51041  0.877  0.00   0.00  (all messages)
>>        100.000   87.7327   12.2673  0.877  0.00   0.00  (all messages as %)
>> set1    30.923   35.2466    0.0000  1.000  0.99   0.00  URIBL_SC_SURBL
>> set1    72.231   82.3273    0.0274  1.000  0.98   1.00  URIBL_OB_SURBL
>> set1    19.375   22.0847    0.0000  1.000  0.98   1.00  URIBL_AB_SURBL
>> set1    74.883   85.2939    0.4310  0.995  0.74   0.00  URIBL_WS_SURBL
>> set1     0.001    0.0000    0.0059  0.000  0.48   0.00  URIBL_PH_SURBL
>
>>       OVERALL%     SPAM%      HAM%    S/O  RANK  SCORE  NAME
>>         119215     67094     52121  0.563  0.00   0.00  (all messages)
>>        100.000   56.2798   43.7202  0.563  0.00   0.00  (all messages as %)
>> set3    39.217   69.6605    0.0288  1.000  0.98   1.00  URIBL_OB_SURBL
>> set3    10.340   18.3727    0.0000  1.000  0.97   0.00  URIBL_SC_SURBL
>> set3     5.998   10.6582    0.0000  1.000  0.94   1.00  URIBL_AB_SURBL
>> set3    42.730   75.5522    0.4797  0.994  0.73   0.00  URIBL_WS_SURBL
>> set3     0.008    0.0089    0.0058  0.608  0.49   0.00  URIBL_PH_SURBL
>
>> so for these results, I'd probably do something like:
>
>> WS 1.3
>> SC 4.0
>> AB 3.0
>> OB 2.2
>
>> since the hit rates and S/O are a bit higher for me, related to the
>> fact I ran more spam through than Justin did.

To which I added:

> Those final scores look like an excellent fit to the data to me.
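
For anyone wanting to sanity-check the S/O column in the tables
above: it appears to be SPAM% / (SPAM% + HAM%), i.e. the share of a
rule's percentage-normalized hits that land on spam. That definition
is my reading of the numbers, not something stated in the quoted
output. Here is a minimal Python sketch under that assumption; the
function name s_over_o is made up for illustration. Because the
quoted percentages are already rounded, very low-volume rules such
as PH may differ from the table in the third decimal.

```python
def s_over_o(spam_pct, ham_pct):
    """Assumed definition: share of a rule's normalized hits that are spam."""
    total = spam_pct + ham_pct
    if total == 0:
        return 0.0  # rule never fired; no ratio is defined, so report 0
    return spam_pct / total


# (SPAM%, HAM%) pairs copied from the set3 table quoted above
set3 = {
    "URIBL_OB_SURBL": (69.6605, 0.0288),
    "URIBL_SC_SURBL": (18.3727, 0.0000),
    "URIBL_AB_SURBL": (10.6582, 0.0000),
    "URIBL_WS_SURBL": (75.5522, 0.4797),
    "URIBL_PH_SURBL": (0.0089, 0.0058),
}

for name, (spam_pct, ham_pct) in set3.items():
    print(f"{name:16s} S/O = {s_over_o(spam_pct, ham_pct):.3f}")
```

Normalizing by percentages rather than raw counts keeps the ratio
comparable between corpora with very different spam/ham mixes, which
matters when comparing Justin's numbers with Theo's.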

I also noted:

> Also while the PH spam hit rate [from Justin's stats] is low,
> the data is of hand checked phishing scams, which deserve to be
> blocked due to their potential danger and damage.
>
> Therefore I would tend to give PH a medium-high score like
> 3 to 5.

So we'll probably adjust the default scores on SpamCopURI to
something like:

WS 1.3
SC 4.0
AB 3.0
OB 2.2
PH 4.5

and we recommend SpamCopURI users do likewise.  Please be sure to
use the latest version of SpamCopURI with multi.surbl.org:

  http://sourceforge.net/projects/spamcopuri/
  http://search.cpan.org/dist/Mail-SpamAssassin-SpamCopURI/

One thing that stood out for me is that the FP rate (ham%) for
ws.surbl.org is way too high, at about 0.45 to 0.5% across multiple
corpora.  That FP rate needs to be reduced for WS to be more fully
useful.  I think Chris or maybe Raymond suggested that they had a
way to reduce FPs in WS further.  If so, ***please*** try to apply
it.  We need to get the FPs to be much less than 0.5%.  The other
lists have FP rates 5 to 50 times lower.

Basically, the higher the FP rate, the less useful a list is.

Does anyone have other corpus stats to share, in particular
FP rates?

Jeff C.
-- 
Jeff Chan
mailto:[EMAIL PROTECTED]
http://www.surbl.org/