Eric Kolve and I were looking at how to best set the default SpamCopURI
scores for the various SURBL lists and at first we tried looking at the
SpamAssassin 3.0 perceptron-generated scores as a possible guide:

>   http://spamassassin.apache.org/full/3.0.x/dist/rules/50_scores.cf
> 
> # The following block of scores were generated using the mass-checking
> # scripts, and a perceptron to determine the optimum scores which
> # resulted in minimum false positives or negatives.  The scores are
> # weighted to produce roughly 1 false positive in 2500 non-spam messages
> # using the default threshold of 5.0.

> score URIBL_AB_SURBL 0 2.007 0 0.417
> score URIBL_OB_SURBL 0 1.996 0 3.213
> score URIBL_PH_SURBL 0 0.839 0 2.000
> score URIBL_SC_SURBL 0 3.897 0 4.263
> score URIBL_WS_SURBL 0 0.539 0 1.462

I was trying to figure out what the different score columns meant,
to which Theo Van Dinter cited:

> $ perldoc Mail::SpamAssassin::Conf
> [...]
>    If four valid scores are listed, then the score that is used
>    depends on how SpamAssassin is being used. The first score is used
>    when both Bayes and network tests are disabled (score set 0). The
>    second score is used when Bayes is disabled, but network tests are
>    enabled (score set 1). The third score is used when Bayes is
>    enabled and network tests are disabled (score set 2). The fourth
>    score is used when Bayes is enabled and network tests are enabled
>    (score set 3).
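
To make the score-set rule concrete, here is a minimal Python sketch of how one of the four columns gets selected. It is not SpamAssassin code: the function name `pick_score` and its arguments are made up for illustration, and it only handles the four-score form described in the perldoc excerpt.

```python
def pick_score(score_line, bayes_enabled, net_enabled):
    """Return the score that applies to a four-score `score` line.

    Score set numbering follows the perldoc excerpt:
      0 = no Bayes, no network tests    1 = no Bayes, network tests
      2 = Bayes, no network tests       3 = Bayes and network tests
    """
    parts = score_line.split()
    # Expect: "score", rule name, then exactly four numeric scores.
    if len(parts) != 6 or parts[0] != "score":
        raise ValueError("expected: score NAME s0 s1 s2 s3")
    scores = [float(x) for x in parts[2:]]
    score_set = (2 if bayes_enabled else 0) + (1 if net_enabled else 0)
    return scores[score_set]


if __name__ == "__main__":
    line = "score URIBL_SC_SURBL 0 3.897 0 4.263"
    for bayes in (False, True):
        for net in (False, True):
            print(bayes, net, pick_score(line, bayes, net))
```

The zeros in the first and third columns fall out of this naturally: those are the score sets with network tests disabled, and the SURBL rules are network lookups.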

We wondered if we could somehow use those scores with SpamCopURI
and were unable to come up with a good answer.

Theo suggested looking at spam versus ham hit rates as a good way to
set scores, to which I mentioned:

> We have these test results from Justin from 25 June:
> 
> OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
>  121405    22516    98889    0.185   0.00    0.00  (all messages)
> 100.000  18.5462  81.4538    0.185   0.00    0.00  (all messages as %)
>  13.453  70.3766   0.4925    0.993   1.00    1.00  SURBL_WS
>   3.807  20.3811   0.0334    0.998   0.50    1.00  SURBL_SC
>   2.650  14.2565   0.0071    1.000   0.50    1.00  SURBL_AB
>   0.019   0.0933   0.0020    0.979   0.50    1.00  SURBL_PH
>  12.624  67.6275   0.1001    0.999   0.50    1.00  SURBL_OB
> 
> which shows a pretty high FP rate for WS, less for the others.
> Do you happen to have access to any more recent corpus check data
> like this?  Could be useful to have another snapshot for a more
> complete picture.

Which was followed up with more data and discussion:

> On Saturday, September 4, 2004, 10:13:11 PM, Theo Dinter wrote:

>> high spam + low ham is good from an FP standpoint, but having a "significant"
>> (for your definition thereof) ham hitrate means the score shouldn't be too
>> high.  My handwaving scores would be something like:

[Theo's wild guess scores for Justin's June data:  -- Jeff C.]

>> WS      1.2
>> SC      2.5
>> AB      3.5
>> OB      1.8

Theo then gave some of his own stats on a couple of different corpora:

>>         OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
>>          416072   365031    51041    0.877   0.00    0.00  (all messages)
>>         100.000  87.7327  12.2673    0.877   0.00    0.00  (all messages as %)
>> set1     30.923  35.2466   0.0000    1.000   0.99    0.00  URIBL_SC_SURBL
>> set1     72.231  82.3273   0.0274    1.000   0.98    1.00  URIBL_OB_SURBL
>> set1     19.375  22.0847   0.0000    1.000   0.98    1.00  URIBL_AB_SURBL
>> set1     74.883  85.2939   0.4310    0.995   0.74    0.00  URIBL_WS_SURBL
>> set1      0.001   0.0000   0.0059    0.000   0.48    0.00  URIBL_PH_SURBL
> 
>>         OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
>>          119215    67094    52121    0.563   0.00    0.00  (all messages)
>>         100.000  56.2798  43.7202    0.563   0.00    0.00  (all messages as %)
>> set3     39.217  69.6605   0.0288    1.000   0.98    1.00  URIBL_OB_SURBL
>> set3     10.340  18.3727   0.0000    1.000   0.97    0.00  URIBL_SC_SURBL
>> set3      5.998  10.6582   0.0000    1.000   0.94    1.00  URIBL_AB_SURBL
>> set3     42.730  75.5522   0.4797    0.994   0.73    0.00  URIBL_WS_SURBL
>> set3      0.008   0.0089   0.0058    0.608   0.49    0.00  URIBL_PH_SURBL
> 
>> so for these results, I'd probably do something like:
> 
>> WS      1.3
>> SC      4.0
>> AB      3.0
>> OB      2.2
> 
>> since the hit rates and S/O are a bit higher for me, related to the fact I ran
>> more spam through than Justin did.

To which I added:

> Those final scores look like an excellent fit to the data to me.

and:

> Also while the PH spam hit rate [from Justin's stats] is low,
> the data is of hand checked phishing scams, which deserve to be
> blocked due to their potential danger and damage.
> 
> Therefore I would tend to give PH a medium-high score like
> 3 to 5.

So we'll probably adjust the default scores on SpamCopURI
to something like:

  WS      1.3
  SC      4.0
  AB      3.0
  OB      2.2
  PH      4.5

and we recommend SpamCopURI users do likewise.  Please be
sure to use the latest version of SpamCopURI with
multi.surbl.org:

  http://sourceforge.net/projects/spamcopuri/
  http://search.cpan.org/dist/Mail-SpamAssassin-SpamCopURI/
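
As a rough sanity check on the proposed scores above, the following Python sketch adds up the points for a message whose URIs hit one or more lists and compares the total with the default threshold of 5.0 mentioned in the 50_scores.cf comment. It assumes that hits on separate lists each fire their own rule and that the scores simply add; the names `PROPOSED`, `surbl_total` and `crosses_threshold` are made up for illustration.

```python
# Proposed default scores from this message, keyed by SURBL list.
PROPOSED = {"WS": 1.3, "SC": 4.0, "AB": 3.0, "OB": 2.2, "PH": 4.5}

# Default SpamAssassin threshold, per the 50_scores.cf comment quoted earlier.
THRESHOLD = 5.0


def surbl_total(lists_hit):
    """Sum the proposed scores for the lists a message's URIs appear on."""
    return sum(PROPOSED[name] for name in lists_hit)


def crosses_threshold(lists_hit, other_points=0.0):
    """True if the SURBL points plus points from other rules reach the threshold."""
    return surbl_total(lists_hit) + other_points >= THRESHOLD


if __name__ == "__main__":
    for hits in (["WS"], ["PH"], ["WS", "OB"], ["SC", "AB"]):
        print(hits, round(surbl_total(hits), 1), crosses_threshold(hits))
```

Under these assumptions no single list is enough to tag a message on its own, which seems appropriate given the nonzero ham hit rates. A WS plus OB hit still needs help from other rules, while SC plus AB crosses 5.0 by itself.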


One thing that stood out for me is that the FP rate (ham%) for
ws.surbl.org is way too high at about 0.45 to 0.5% across
multiple corpora.  That FP rate needs to be reduced for WS
to be more fully useful.

I think Chris or maybe Raymond suggested that they had a way to
reduce FPs in WS further.  If so, ***please*** try to apply it.
We need to get the FPs to be much less than 0.5%.  The other
lists have FP rates 5 to 50 times lower.

Basically the higher the FP rate, the less useful a list is.
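
To put numbers on that, here is a minimal Python sketch using the SPAM% and HAM% columns from Justin's June table. It assumes the S/O column for a rule is spam% / (spam% + ham%), which reproduces the S/O figures in that table to three decimals, and it restates HAM% as expected ham hits per 10,000 ham messages. The names `JUSTIN_JUNE`, `spam_over_overall` and `ham_hits_per_10k` are made up for illustration.

```python
# (SPAM%, HAM%) per list, copied from Justin's 25 June results.
JUSTIN_JUNE = {
    "SURBL_WS": (70.3766, 0.4925),
    "SURBL_SC": (20.3811, 0.0334),
    "SURBL_AB": (14.2565, 0.0071),
    "SURBL_PH": (0.0933, 0.0020),
    "SURBL_OB": (67.6275, 0.1001),
}


def spam_over_overall(spam_pct, ham_pct):
    """S/O under the assumption S/O = spam% / (spam% + ham%)."""
    total = spam_pct + ham_pct
    if total == 0:
        raise ValueError("rule never hit; S/O is undefined")
    return spam_pct / total


def ham_hits_per_10k(ham_pct):
    """Expected false positives per 10,000 ham messages."""
    return ham_pct * 100.0


if __name__ == "__main__":
    for name, (spam_pct, ham_pct) in JUSTIN_JUNE.items():
        s_o = spam_over_overall(spam_pct, ham_pct)
        fps = ham_hits_per_10k(ham_pct)
        print(f"{name}  S/O={s_o:.3f}  ham hits per 10k={fps:.2f}")
```

Seen this way, WS hits roughly 49 ham messages in every 10,000 on that corpus, against about 10 for OB and fewer for the rest.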

Does anyone have other corpus stats to share, in particular
FP rates?

Jeff C.
-- 
Jeff Chan
mailto:[EMAIL PROTECTED]
http://www.surbl.org/
