Re: Back on DNSBL overlap

Warren Togami Mon, 16 Nov 2009 19:23:25 -0800

On 11/16/2009 07:26 PM, Adam Katz wrote:


My hypothesis, which I've anecdotally proven on my own deployment, is
that the flaws are repeated as well.  Spammers that trigger spamtraps
on multiple DNSBLs (and URIBLs) may be sending from (or linking to)
servers that also deal with legitimate traffic.  This means that
thanks to these similar indexing techniques, DNSBL overlap from
spammers' abuse of a non-spam-exclusive server can single-handedly
mark a ham as spam.

My "solution" is to counter-intuitively *remove* points from messages
that hit too many DNSBLs.  They still net quite a positive score, but
that score is effectively capped at something not quite high enough to
kill a ham with DNSBLs alone.
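
The capping idea above can be sketched in a few lines of Python. This is only an illustration, not the actual adjustor rule: the list names, the per-list scores, and the cap of 4.0 are hypothetical values picked for the example.

```python
def capped_dnsbl_score(hit_scores, cap):
    """Return (raw total, adjustor, net) for the DNSBL rules a message hit.

    The adjustor is a non-positive correction that removes whatever the
    overlapping lists contribute beyond `cap`.
    """
    raw = sum(hit_scores.values())
    adjustor = min(0.0, cap - raw)
    return raw, adjustor, raw + adjustor


# Hypothetical per-list scores, for illustration only.
hits = {"LIST_A": 2.0, "LIST_B": 1.5, "LIST_C": 1.5}
raw, adjustor, net = capped_dnsbl_score(hits, cap=4.0)
print(raw, adjustor, net)
```

With a cap below the spam threshold, a message listed everywhere still collects a solid positive score, but it needs at least one non-DNSBL rule to be marked as spam.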

A more elegant version of this, which Karsten and I theorize might
even happen automatically (as scored by the GA) if I were to check my
adjustor into SVN, would be to reduce most of the points on the DNSBLs
and add them back with a meta rule containing a union of the DNSBL
rules (without a "multiple" tflag).
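
A rough Python sketch of that union idea follows. It assumes each DNSBL keeps only a small per-list score and that a single shared bonus fires once when any list hits; the names and numbers are hypothetical and are not taken from any real rule set.

```python
def union_scored(hits, per_list, shared):
    """Score DNSBL hits as reduced per-list points plus one shared bonus.

    `shared` is awarded once if any list hits, mimicking a meta rule that is
    the union (logical OR) of the individual DNSBL rules.
    """
    total = sum(per_list[name] for name in hits)
    if hits:
        total += shared
    return total


# Hypothetical reduced per-list scores and shared bonus.
per_list = {"LIST_A": 0.5, "LIST_B": 0.5, "LIST_C": 0.5}
one_hit = union_scored({"LIST_A"}, per_list, shared=1.5)
three_hits = union_scored({"LIST_A", "LIST_B", "LIST_C"}, per_list, shared=1.5)
print(one_hit, three_hits)
```

Under these assumptions the first listing carries most of the weight, and each additional listing adds only its small per-list score.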

I think there is a lot of merit to this approach, and it might even be a
great idea. But I spoke with a machine learning expert and heard some
interesting things on this topic.

We held a small workshop yesterday in which she explained Logistic
Regression and how it might be applied to automated rescoring of
SpamAssassin's rules. The most intriguing aspect of her explanation was
the suggestion of using a logarithmic function in weight scoring. I
asked specifically about this issue of overlap (like BRBL_LASTEXT with
every other list) and she suggested this particular method of rescoring
wouldn't have an issue with overlap.
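
One way to see why overlap might not hurt such a rescorer is a toy experiment, sketched below in plain Python. It is not the planned implementation: the corpus is synthetic, the fit is a hand-rolled gradient descent rather than an existing library, and the two "lists" overlap perfectly, which real DNSBLs do not.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, steps=2000, lr=0.5):
    """Fit a bias and weights by batch gradient descent on the log-loss."""
    n_features = len(rows[0])
    bias, weights = 0.0, [0.0] * n_features
    for _ in range(steps):
        grad_b, grad_w = 0.0, [0.0] * n_features
        for x, y in zip(rows, labels):
            err = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            grad_b += err
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
        bias -= lr * grad_b / len(rows)
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
    return bias, weights


# Synthetic corpus: 50 listed senders (45 spam, 5 ham),
# 50 unlisted senders (10 spam, 40 ham).
listed = [1] * 50 + [0] * 50
labels = [1] * 45 + [0] * 5 + [1] * 10 + [0] * 40

_, single = fit_logistic([[x] for x in listed], labels)       # one list
_, doubled = fit_logistic([[x, x] for x in listed], labels)   # same list twice
print(single, doubled, sum(doubled))
```

In this setup the weight learned for the single list approaches the log-odds ratio ln(36), about 3.58, and the two duplicated lists split that same total between them instead of each earning the full weight. Partially overlapping lists would behave less cleanly, so this only illustrates the direction of the effect.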


I believe you mentioned logarithmic scoring in an earlier discussion?

It appears that we have a few very smart people interested in
implementing an alternative rescorer using Logistic Regression. We plan
on using an existing library for the bulk of this implementation.

I think we should proceed with our current generated scores for 3.3.0.
After that we can compare the effectiveness of different approaches
including your proposal.

Specifically on the issue of overlapping DNSBLs, there might be a few
possibilities:


* DNSBL overlap really is no problem with any method of scoring.

* DNSBL overlap is only a slight problem with any method of scoring,
  but if a host is blacklisted by more than one major DNSBL, they have
  serious issues they need to fix and we shouldn't try to work around
  them for their benefit.

* DNSBL overlap is a real problem, but logarithmic scoring avoids it as
  an issue.


rulesrc/sandbox/jm/20_bug_5984.cf:# score RCVD_IN_BRBL_LASTEXT 2.0

This apparently was set manually. It appears that spamassassin-3.2.x
was not scored when BRBL existed as a rule. Meanwhile our new GA scores
resulted in:


score RCVD_IN_BRBL_LASTEXT 0 1.644 0 1.449 # n=0 n=2

This is relatively modest. This combined with one other DNSBL alone
will not push it clearly above 5 points. I might suggest manually
adjusting down BRBL or PBL so it requires one additional tiny score to
push it over the edge. I'm personally comfortable enough to outright
reject mail from a Spamhaus listed host. Given this bias, it is
sufficiently cautious in my book to accept PBL + BRBL as insufficient.
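
The arithmetic behind "not clearly above 5 points" can be checked with a short Python sketch. It assumes the usual reading of the four-value score line (set 0: no Bayes, no network tests; set 1: no Bayes, network tests; set 2: Bayes, no network tests; set 3: Bayes and network tests). The helper name is made up for this example, and no score for PBL is assumed.

```python
def parse_score_line(line):
    """Parse a 'score NAME s0 s1 s2 s3' line, ignoring any trailing comment."""
    fields = line.split("#", 1)[0].split()
    if len(fields) != 6 or fields[0] != "score":
        raise ValueError("expected 'score NAME' followed by four values")
    return fields[1], [float(v) for v in fields[2:]]


THRESHOLD = 5.0  # the 5-point spam threshold referred to above

name, scores = parse_score_line(
    "score RCVD_IN_BRBL_LASTEXT 0 1.644 0 1.449 # n=0 n=2"
)

# Only score sets 1 and 3 have network tests (and so DNSBLs) enabled.
headroom = {i: THRESHOLD - scores[i] for i in (1, 3)}
for score_set, remaining in headroom.items():
    print(f"{name} set {score_set}: other rules must add {remaining:.3f} to reach 5.0")
```

A second DNSBL would therefore need to score about 3.4 to 3.6 on its own for the pair to reach the threshold without help from any other rule.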


Warren Togami
wtog...@redhat.com
