On 11/16/2009 07:26 PM, Adam Katz wrote:
My hypothesis, which I've anecdotally proven on my own deployment, is
that the flaws are repeated as well. Spammers that trigger spamtraps
on multiple DNSBLs (and URIBLs) may be sending from (or linking to)
servers that also deal with legitimate traffic. This means that
thanks to these similar indexing techniques, DNSBL overlap from
spammers' abuse of a non-spam-exclusive server can single-handedly
mark a ham as spam.
My "solution" is to counter-intuitively *remove* points from messages
that hit too many DNSBLs. They still net quite a positive score, but
that score is effectively capped at something not quite high enough to
kill a ham with DNSBLs alone.
A more elegant version of this, which Karsten and I theorize might
even happen automatically (as scored by the GA) if I were to check my
adjustor into SVN, would be to reduce most of the points on the DNSBLs
and add them back with a meta rule containing a union of the DNSBL
rules (without a "multiple" tflag).
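For illustration only, the two variants described above can be sketched
in Python. This is not the actual adjustor or any checked-in rule: the
list names, scores, cap, and union bonus below are all hypothetical.

```python
# Hypothetical per-list scores; not taken from any real ruleset.
SCORES = {"DNSBL_A": 2.0, "DNSBL_B": 1.5, "DNSBL_C": 1.8}


def capped_dnsbl_score(hits, scores, cap):
    """Variant 1: sum the DNSBL hits, then apply a negative adjustor
    so the DNSBL total never exceeds `cap`."""
    raw = sum(scores[name] for name in hits)
    net = min(raw, cap)
    adjustor = net - raw  # zero or negative
    return raw, adjustor, net


def union_dnsbl_score(hits, reduced_scores, union_bonus):
    """Variant 2: each list carries a reduced score, and a single
    meta-style bonus is added once if any list hit at all."""
    base = sum(reduced_scores[name] for name in hits)
    return base + (union_bonus if hits else 0.0)


if __name__ == "__main__":
    for hits in (["DNSBL_A"], ["DNSBL_A", "DNSBL_B", "DNSBL_C"]):
        raw, adjustor, net = capped_dnsbl_score(hits, SCORES, cap=4.0)
        print(hits, "raw:", raw, "adjustor:", adjustor, "net:", net)

    # Reduced scores are assumed to be half the originals here.
    reduced = {name: value / 2 for name, value in SCORES.items()}
    print(union_dnsbl_score(["DNSBL_A", "DNSBL_B"], reduced, union_bonus=1.0))
```

In both variants the first hit still carries most of the weight, and
each additional overlapping list adds less than its standalone score.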
I think there is a lot of merit to this approach, and it might even be a
great idea. But I spoke with a machine learning expert and heard some
interesting things on this topic.
We held a small workshop yesterday in which she explained Logistic
Regression and how it might be applied to automated rescoring of
spamassassin's rules. The most intriguing aspect of her explanation was
the suggestion of using a logarithmic function in weight scoring. I
asked specifically about this issue of overlap (like BRBL_LASTEXT with
every other list) and she suggested this particular method of rescoring
wouldn't have an issue with overlap.
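One way to read that claim, as a toy sketch and not the planned
implementation: fit a logistic regression to a made-up corpus, once with
a single DNSBL feature and once with the same feature duplicated to
mimic two fully overlapping lists. The corpus counts, learning rate, and
step count are assumptions chosen for illustration, and real lists
overlap only partially.

```python
import math


def fit_logistic(rows, n_features, lr=1.0, steps=5000):
    """Plain batch gradient descent for logistic regression.

    rows: list of (features, label, count) tuples.
    Returns the weights, with the bias as the last element.
    """
    w = [0.0] * (n_features + 1)
    total = sum(count for _, _, count in rows)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y, count in rows:
            xs = list(x) + [1.0]  # append the constant bias input
            z = sum(wi * xi for wi, xi in zip(w, xs))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted spam probability
            for i, xi in enumerate(xs):
                grad[i] += count * (p - y) * xi
        w = [wi - lr * gi / total for wi, gi in zip(w, grad)]
    return w


# Made-up corpus: (listed on DNSBL?, is spam?, number of messages)
CORPUS = [(1, 1, 80), (1, 0, 5), (0, 1, 20), (0, 0, 95)]

# One list versus two lists that always fire together.
single = fit_logistic([((x,), y, n) for x, y, n in CORPUS], 1)
double = fit_logistic([((x, x), y, n) for x, y, n in CORPUS], 2)

if __name__ == "__main__":
    print("single list weight:", single[0])
    print("duplicated list weights:", double[0], double[1])
    print("sum of duplicated weights:", double[0] + double[1])
```

In this toy setup the two duplicated features end up sharing the weight
that the single feature received, so a message hitting both lists is not
scored twice for the same evidence.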
I believe you mentioned logarithmic scoring in an earlier discussion?
It appears that we have a few very smart people interested in
implementing an alternative rescorer using Logistic Regression. We plan
on using an existing library for the bulk of this implementation.
I think we should proceed with our current generated scores for 3.3.0.
After that we can compare the effectiveness of different approaches
including your proposal.
Specifically on the issue of overlapping DNSBLs, there might be a few
possibilities:
* Overlapping DNSBLs are really no problem with any method of scoring.
* Overlapping DNSBLs are only a slight problem with any method of
scoring, but if a host is blacklisted by more than one major DNSBL
it has serious issues it needs to fix, and we shouldn't try to
work around them for its benefit.
* Overlapping DNSBLs are a real problem, but logarithmic scoring avoids
the issue.
rulesrc/sandbox/jm/20_bug_5984.cf:# score RCVD_IN_BRBL_LASTEXT 2.0
This apparently was set manually. It appears that BRBL did not yet exist
as a rule when spamassassin-3.2.x was scored. Meanwhile our new GA scores
resulted in:
score RCVD_IN_BRBL_LASTEXT 0 1.644 0 1.449 # n=0 n=2
This is relatively modest. Combined with one other DNSBL alone, it
will not push a message clearly above 5 points. I might suggest manually
adjusting down BRBL or PBL so that it takes one additional tiny score to
push a message over the edge. I'm personally comfortable enough to
outright reject mail from a Spamhaus-listed host. Given this bias,
treating PBL + BRBL alone as insufficient is cautious enough in my book.
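As a rough check of the arithmetic, the sketch below parses the GA score
line quoted above and reports how many more points a message needs to
reach the 5-point threshold. It assumes the four columns follow
SpamAssassin's usual score-set order (no net/no Bayes, net only, Bayes
only, net plus Bayes). The helper names are made up for this example.

```python
THRESHOLD = 5.0  # the 5-point spam threshold discussed above


def parse_score_line(line):
    """Split a 'score RULE s0 s1 s2 s3 # comment' line into the rule
    name and its four score-set values."""
    fields = line.split("#", 1)[0].split()
    return fields[1], [float(value) for value in fields[2:6]]


def headroom(score, threshold=THRESHOLD):
    """Points still needed from other rules to reach the threshold."""
    return threshold - score


RULE, SETS = parse_score_line(
    "score RCVD_IN_BRBL_LASTEXT 0 1.644 0 1.449 # n=0 n=2"
)

if __name__ == "__main__":
    # Only the network-enabled score sets (1 and 3) are non-zero here.
    for index in (1, 3):
        print(RULE, "set", index, "needs", round(headroom(SETS[index]), 3), "more")
```

Any second DNSBL scored below that remaining amount leaves the pair
under the threshold, which is the margin the manual adjustment is meant
to preserve.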
Warren Togami
wtog...@redhat.com