Warren reported:
>> SPAM%     HAM%     RANK  RULE
>> 12.8342%  0.0021%  0.94  RCVD_IN_PSBL *
>> 12.3053%  0.0026%  0.94  RCVD_IN_XBL
>> 31.2499%  0.0827%  0.87  RCVD_IN_ANBREP_BL *2
>> 80.2578%  0.1485%  0.86  RCVD_IN_PBL
>> 27.1836%  0.1985%  0.79  RCVD_IN_SORBS_DUL
>> 19.8213%  0.1785%  0.79  RCVD_IN_SEMBLACK *
>> 90.9360%  0.3854%  0.77  RCVD_IN_BRBL_LASTEXT
>> 13.0564%  0.4838%  0.67  RCVD_IN_HOSTKARMA_BL *

Justin requested:
> any chance you could post the S/O ratios? RANK is a bit "unportable",
> as it depends on other rules in the ruleset at the time the
> measurement takes place.

I agree with Warren that S/O isn't as useful here. Even SPAM% is just a minimum threshold. For this specific problem, HAM% is the best indicator among the standard stats that MassCheck produces.

That said, I think a measure of independence from the other RCVD_IN_* rules is an even better metric. How many hits are unique (among DNSBLs) to each DNSBL?

I do this kind of thing manually on occasion, since the data is usually at the bottom of the detail page for each rule. For example, the last network test (20091114-r836144-n, http://tinyurl.com/yfef2ef ) reveals:

  86% of BRBL_LASTEXT appears in PBL
  97% of PBL appears in BRBL_LASTEXT
  The non-overlapping 3% of PBL means its SPAM% is 2.4077

  29% of BRBL_LASTEXT appears in SORBS_DUL
  97% of SORBS_DUL appears in BRBL_LASTEXT
  The non-overlapping 3% of SORBS_DUL means SPAM% = 0.8155

  24% of BRBL_LASTEXT appears in SORBS_WEB
  94% of SORBS_WEB appears in BRBL_LASTEXT
  The non-overlapping 6% of SORBS_WEB means SPAM% = 1.4041

  33% of BRBL_LASTEXT appears in ANBREP_BL
  97% of ANBREP_BL appears in BRBL_LASTEXT
  The non-overlapping 3% of ANBREP_BL means SPAM% = 0.9075

  21% of BRBL_LASTEXT appears in SEMBLACK
  96% of SEMBLACK appears in BRBL_LASTEXT
  The non-overlapping 4% of SEMBLACK means SPAM% = 0.7129

  20% of BRBL_LASTEXT appears in ANBREP_L3
  98% of ANBREP_L3 appears in BRBL_LASTEXT
  The non-overlapping 2% of ANBREP_L3 means SPAM% = 0.3725

(Fetched from other pages)

  18% of BRBL_LASTEXT appears in BL_SPAMCOP_NET
  95% of BL_SPAMCOP_NET appears in BRBL_LASTEXT
  The non-overlapping 5% of BL_SPAMCOP_NET means SPAM% = 0.8711

  33% of PBL appears in RCVD_IN_SORBS_DUL
  97% of SORBS_DUL appears in PBL
  The non-overlapping 3% of SORBS_DUL means SPAM% = 0.8155

I can't go a step further and figure out how much of SORBS_DUL is hitting independently of /both/ BRBL and PBL, other than that it's between 0.0245 and 0.8155. Removing the 27% overlap with SEMBLACK reduces that to between 0.0179 and 0.8155, ...

This is a two-way street; rather than using BRBL + PBL + SEMBLACK to reduce SORBS_DUL, we can reduce BRBL: Removing just PBL from BRBL_LASTEXT's impressive 91% match of the spam corpus reduces that to 12.7310. Removing the rest of the aforementioned DNSBLs could reduce that to as low as 2.3853.

So this brings me right back to my older point: DNSBLs catch the same culprits even given completely independent spamtrap configurations. While I've called this "incestuous" in the past, that term leads one to the wrong conclusion. They're not syndicating each other (and if they are, either they need to be called out on it or they need to both have permission and advertise that fact); they just attract the same spammers because they use the same methods.

My hypothesis, which my own deployment supports anecdotally, is that the flaws are repeated as well. Spammers that trigger spamtraps on multiple DNSBLs (and URIBLs) may be sending from (or linking to) servers that also deal with legitimate traffic. This means that, thanks to these similar indexing techniques, DNSBL overlap caused by spammers' abuse of a non-spam-exclusive server can single-handedly mark a ham as spam.

My "solution" is to counter-intuitively *remove* points from messages that hit too many DNSBLs. They still net quite a positive score, but that score is effectively capped at something not quite high enough to kill a ham with DNSBLs alone.

A more elegant version of this, which Karsten and I theorize might even happen automatically (as scored by the GA) if I were to check my adjustor into SVN, would be to reduce most of the points on the DNSBLs and add them back with a meta rule containing a union of the DNSBL rules (without a "multiple" tflag).
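
For anyone who wants to reproduce the arithmetic, here is a minimal Python sketch. The function names (residual_spam_pct, capped_dnsbl_score) are made up for illustration, and the per-list scores and the cap value in the usage lines are hypothetical placeholders, not values from any ruleset. The first function scales a rule's SPAM% by the non-overlapping fraction of each other list; multiplying several such factors assumes the overlaps are independent, which is why it only gives the lower end of the ranges quoted above. The second models the net effect of the adjustor as a simple cap; the actual adjustor is a set of SpamAssassin rules not shown in this message.

```python
from math import prod


def residual_spam_pct(spam_pct, overlap_pcts):
    """SPAM% left over after discounting overlap with other DNSBLs.

    overlap_pcts holds whole-number percentages of this rule's hits that
    also appear in each other list. The factors are multiplied, which
    assumes independent overlaps, so with more than one overlap the result
    is a lower bound rather than an exact figure.
    """
    factors = [(100 - p) / 100 for p in overlap_pcts]
    return round(spam_pct * prod(factors), 4)


def capped_dnsbl_score(hit_scores, cap):
    """Sum the scores of the DNSBL rules that hit, limited to `cap`."""
    return min(sum(hit_scores), cap)


# PBL: 97% of its hits also appear in BRBL_LASTEXT.
pbl_unique = residual_spam_pct(80.2578, [97])

# SORBS_DUL lower bound: 97% overlap with BRBL, 97% with PBL, 27% with SEMBLACK.
sorbs_dul_floor = residual_spam_pct(27.1836, [97, 97, 27])

# BRBL_LASTEXT after removing PBL only, then all listed DNSBLs.
brbl_minus_pbl = residual_spam_pct(90.9360, [86])
brbl_floor = residual_spam_pct(90.9360, [86, 29, 24, 33, 21, 20, 18])

# Hypothetical scores and cap: three lists hit, but the total stays at 3.5.
total = capped_dnsbl_score([1.5, 1.0, 2.0], cap=3.5)
```

With the figures from Warren's table and the overlaps listed above, this gives the same 2.4077, 0.0179, 12.7310 and 2.3853 quoted in the message.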