Back on DNSBL overlap

Adam Katz Mon, 16 Nov 2009 16:26:34 -0800

Warren reported:
>> SPAM%    HAM%    RANK RULE
>> 12.8342% 0.0021% 0.94 RCVD_IN_PSBL *
>> 12.3053% 0.0026% 0.94 RCVD_IN_XBL
>> 31.2499% 0.0827% 0.87 RCVD_IN_ANBREP_BL *2
>> 80.2578% 0.1485% 0.86 RCVD_IN_PBL
>> 27.1836% 0.1985% 0.79 RCVD_IN_SORBS_DUL
>> 19.8213% 0.1785% 0.79 RCVD_IN_SEMBLACK *
>> 90.9360% 0.3854% 0.77 RCVD_IN_BRBL_LASTEXT
>> 13.0564% 0.4838% 0.67 RCVD_IN_HOSTKARMA_BL *


Justin requested:
> any chance you could post the S/O ratios?  RANK is a bit "unportable",
> as it depends on other rules in the ruleset at the time the
> measurement takes place.

I agree with Warren in that S/O isn't as useful here.  Even SPAM% is
just a minimum threshold.  Of the standard stats that MassCheck
produces, HAM% is the best indicator for this specific problem.

That said, I think that a measure of independence from the other
RCVD_IN_* rules is an even better metric.  How many hits are unique
(among DNSBLs) to each DNSBL?
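
A minimal sketch of that "unique among DNSBLs" count, assuming the
per-message rule hits are available as sets of rule names.  The
function name, the input shape, and the sample data are all made up
for illustration; this is not how MassCheck itself reports overlap.

```python
from collections import Counter

def unique_hit_counts(messages, prefix="RCVD_IN_"):
    """Count, per DNSBL rule, the messages where it was the only DNSBL rule hit."""
    unique = Counter()
    for rules in messages:
        # Only rules with the DNSBL prefix take part in the comparison.
        dnsbl_hits = [r for r in rules if r.startswith(prefix)]
        if len(dnsbl_hits) == 1:
            unique[dnsbl_hits[0]] += 1
    return unique

# Hypothetical sample: each entry is the set of rules one spam hit.
sample = [
    {"RCVD_IN_PBL", "RCVD_IN_BRBL_LASTEXT"},
    {"RCVD_IN_BRBL_LASTEXT"},
    {"RCVD_IN_PBL", "HTML_MESSAGE"},
    {"RCVD_IN_SORBS_DUL", "RCVD_IN_PBL", "RCVD_IN_BRBL_LASTEXT"},
]

print(dict(unique_hit_counts(sample)))
```

Dividing each count by the corpus size would give a "unique SPAM%"
comparable to the figures in the table above.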

I do this kind of thing manually on occasion since the data is usually
at the bottom of the detail page for each rule, e.g. the last network
test (20091114-r836144-n, http://tinyurl.com/yfef2ef ) reveals:


86% of BRBL_LASTEXT appears in PBL
97% of PBL appears in BRBL_LASTEXT
The non-overlapping 3% of PBL means SPAM% = 2.4077

29% of BRBL_LASTEXT appears in SORBS_DUL
97% of SORBS_DUL appears in BRBL_LASTEXT
The non-overlapping 3% of SORBS_DUL means SPAM% = 0.8155

24% of BRBL_LASTEXT appears in SORBS_WEB
94% of SORBS_WEB appears in BRBL_LASTEXT
The non-overlapping 6% of SORBS_WEB means SPAM% = 1.4041

33% of BRBL_LASTEXT appears in ANBREP_BL
97% of ANBREP_BL appears in BRBL_LASTEXT
The non-overlapping 3% of ANBREP_BL means SPAM% = 0.9075

21% of BRBL_LASTEXT appears in SEMBLACK
96% of SEMBLACK appears in BRBL_LASTEXT
The non-overlapping 4% of SEMBLACK means SPAM% = 0.7129

20% of BRBL_LASTEXT appears in ANBREP_L3
98% of ANBREP_L3 appears in BRBL_LASTEXT
The non-overlapping 2% of ANBREP_L3 means SPAM% = 0.3725

(Fetched from other pages)
18% of BRBL_LASTEXT appears in BL_SPAMCOP_NET
95% of BL_SPAMCOP_NET appears in BRBL_LASTEXT
The non-overlapping 5% of BL_SPAMCOP_NET means SPAM% = 0.8711

33% of PBL appears in SORBS_DUL
97% of SORBS_DUL appears in PBL
The non-overlapping 3% of SORBS_DUL means SPAM% = 0.8155
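
The "non-overlapping" figures above are just the rule's SPAM% scaled
by the share of its hits that fall outside the other list.  A small
sketch of that arithmetic follows; the function name is made up, and
because the overlap percentages are rounded to whole numbers the
results are approximate.

```python
def non_overlapping_spam_pct(spam_pct, overlap_pct):
    """SPAM% left for a rule after removing hits shared with another rule.

    spam_pct    -- the rule's SPAM% on its own
    overlap_pct -- percent of the rule's hits that also hit the other rule
    """
    return spam_pct * (100 - overlap_pct) / 100

# PBL: 80.2578% of spam, 97% of which also appears in BRBL_LASTEXT.
print(round(non_overlapping_spam_pct(80.2578, 97), 4))

# SORBS_DUL: 27.1836% of spam, 97% of which also appears in BRBL_LASTEXT.
print(round(non_overlapping_spam_pct(27.1836, 97), 4))
```

With Warren's SPAM% values as input, this gives the 2.4077 and 0.8155
quoted for PBL and SORBS_DUL.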

I can't go a step further and figure out how much of SORBS_DUL is
hitting independently of /both/ BRBL and PBL, other than to say it's
between 0.0245 and 0.8155.  Removing the 27% overlap with SEMBLACK
reduces that to between 0.0179 and 0.8155, ...

This is a two-way street; rather than using BRBL + PBL + SEMBLACK to
reduce SORBS_DUL, we can reduce BRBL:  Removing just PBL from
BRBL_LASTEXT's impressive 91% match of the spam corpus reduces that to
12.7310.  Removing the rest of the aforementioned DNSBLs could reduce
that down to as low as 2.3853.
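
The lower bounds here come from chaining the pairwise overlaps, i.e.
multiplying the remaining share after each list is removed.  That
treats the overlaps as independent of one another, which is an
assumption and the reason these are only "as low as" figures.  A
sketch, with a made-up function name:

```python
def remaining_after(spam_pct, overlaps_pct):
    """Best-case SPAM% left after removing each overlap in turn.

    Assumes the listed overlaps are independent of one another, so the
    result is an optimistic floor rather than a measured value.
    """
    remaining = spam_pct
    for overlap in overlaps_pct:
        remaining *= (100 - overlap) / 100
    return remaining

brbl = 90.9360       # BRBL_LASTEXT SPAM%
sorbs_dul = 27.1836  # SORBS_DUL SPAM%

# BRBL_LASTEXT minus PBL, then minus the other lists quoted above
# (SORBS_DUL, SORBS_WEB, ANBREP_BL, SEMBLACK, ANBREP_L3, BL_SPAMCOP_NET).
print(round(remaining_after(brbl, [86]), 4))
print(round(remaining_after(brbl, [86, 29, 24, 33, 21, 20, 18]), 4))

# SORBS_DUL minus BRBL_LASTEXT and PBL, then minus SEMBLACK.
print(round(remaining_after(sorbs_dul, [97, 97]), 4))
print(round(remaining_after(sorbs_dul, [97, 97, 27]), 4))
```

To within rounding, these reproduce the 12.7310, 2.3853, 0.0245 and
0.0179 figures above.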


So this brings me right back to my older point:  DNSBLs catch the same
culprits even given completely independent spamtrap configurations.
While I've called this "incestuous" in the past, that term leads one
to the wrong conclusion.  They're not syndicating each other (and if
they are, either they need to be called out on it or they need to both
have permission and also advertise that fact), they just attract the
same spammers because they use the same methods.

My hypothesis, which anecdotal evidence from my own deployment
supports, is that the flaws are repeated as well.  Spammers that trigger spamtraps
on multiple DNSBLs (and URIBLs) may be sending from (or linking to)
servers that also deal with legitimate traffic.  This means that
thanks to these similar indexing techniques, DNSBL overlap from
spammers' abuse of a non-spam-exclusive server can single-handedly
mark a ham as spam.

My "solution" is to counter-intuitively *remove* points from messages
that hit too many DNSBLs.  They still net quite a positive score, but
that score is effectively capped at something not quite high enough to
kill a ham with DNSBLs alone.
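
To illustrate the idea only: the sketch below models an adjustor that
subtracts points for each DNSBL hit beyond a small allowance.  The
function name, the scores, the allowance and the 5.0 threshold are
all hypothetical values chosen for the example, not the ones in my
actual rules.

```python
def capped_dnsbl_score(hit_scores, allowed_hits=2, penalty_per_extra=1.25):
    """Sum the DNSBL scores, then take points back for each extra hit.

    hit_scores        -- scores of the DNSBL rules the message hit
    allowed_hits      -- hits that count in full (hypothetical value)
    penalty_per_extra -- points removed per hit beyond the allowance
    """
    total = sum(hit_scores)
    extra_hits = max(0, len(hit_scores) - allowed_hits)
    return total - extra_hits * penalty_per_extra

SPAM_THRESHOLD = 5.0  # hypothetical "kill" threshold

two_lists = [1.5, 1.5]
five_lists = [1.5] * 5

print(capped_dnsbl_score(two_lists))   # no adjustment applies
print(sum(five_lists))                 # unadjusted: over the threshold
print(capped_dnsbl_score(five_lists))  # adjusted: still positive, under it
```

Each additional list still adds a little, so heavy overlap keeps a
positive score, but DNSBLs alone no longer push a message past the
threshold.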

A more elegant version of this, which Karsten and I theorize might
even happen automatically (as scored by the GA) if I were to check my
adjustor into SVN, would be to reduce most of the points on the DNSBLs
and add them back with a meta rule containing a union of the DNSBL
rules (without a "multiple" tflag).
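
A rough model of that alternative, again with hypothetical numbers
and names: each DNSBL keeps only a small score of its own, and a
single union ("any of these hit") meta adds the bulk of the points
exactly once, no matter how many lists fire.

```python
def union_meta_score(hits, per_list_score=0.5, union_score=2.0):
    """Score DNSBL hits as a one-time union bonus plus a small per-list score.

    hits           -- names of the DNSBL rules the message hit
    per_list_score -- reduced score each DNSBL keeps (hypothetical)
    union_score    -- score of the union meta, applied once (hypothetical)
    """
    if not hits:
        return 0.0
    # The meta fires once if any DNSBL hit, mirroring a meta rule
    # without the "multiple" tflag.
    return union_score + per_list_score * len(hits)

one_hit = ["RCVD_IN_PBL"]
five_hits = ["RCVD_IN_PBL", "RCVD_IN_BRBL_LASTEXT", "RCVD_IN_SORBS_DUL",
             "RCVD_IN_SEMBLACK", "RCVD_IN_XBL"]

print(union_meta_score(one_hit))
print(union_meta_score(five_hits))
```

The first hit carries most of the weight and later hits add little,
which is the same shape the adjustor produces but without any
negative-scoring rule.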
