URIBL update

Jared Johnson Sat, 31 Jul 2010 21:00:51 -0700

I've made a bunch more changes to the uribl plugin locally; man, we
_really_ need to get some kind of svn-to-git thing going.  Or at least I
need to re-educate myself on git and start putting things in my github
again.  If I don't manage to do this by the time things have settled down
a bit and we have some production testing (soon), I'll submit a new
version to the list again; but hopefully by then I'll have found time to
get git going again and be able to point people to that.


I got permission from Dallas @ URIBL to use the datafeed data, but also
got his opinion on the matter which is that using tld_lists the way I am
is not going to gain much, and introduces the risk that a new spammer
'haven' could be missed entirely.  After talking with my team we're going
to go with a full TLD list right now, and perhaps later we'll collect our
own stats to verify Dallas is right about the tiny benefit (he probably
is).  tld_lists has been updated to reflect this, though if anyone feels
more bold than me and wants updates to the 'pruned' list let me know.

I modified the parse_mime plugin as discussed previously on the list; the
uribl plugin now calls isa_plugin('mime_parser') and does lazy parsing.

I'll probably remove 'semicolon munging detection'; as Devin said, if real
(current) data doesn't show it's being used, why bother?  I'd like to go
over a larger sampling of current data first though, which I plan to do
soon.

I've re-arranged the code slightly to allow not only the async plugin but
our own local plugin to easily take advantage of plugin inheritance to
avoid code duplication.  Our own plugin is now just 40 lines or so, thus
it gets to inherit the other 600 lines of uribl without any forking :)

There are some additional changes in the works that I'd like input on, if
anyone cares:

- We're finally getting more URIBL datafeeds.  I'd like to use this data
to verify how static TXT results are for each service and, if applicable,
generate templates for them in the same script that generates tld_lists.pl
(and probably rename that to fit its more general purpose).  So for
services that do indeed have very static TXT templates, we could
optionally skip TXT lookups and instead generate our own response (e.g.
links) without the cost of one more DNS query.  A couple of additional
brainstorms on this topic:

 - Dynamically generate the TXT template by going ahead and doing TXT
   lookups until the first one we get back for each service, at which
   point we cache the template and don't do any more.  (This would have to
   be re-done in every new child process)
 - Dynamically verify the validity of the statically-set template by doing
   TXT lookups until we get one and then checking.  Still has to be
   per-child-process.
 - If we do either of the above dynamic checks, or if we don't choose to do
   any TXT-avoiding magic at all, we ought to launch TXT requests in the
   callback that receives matches on A lookups, that way we do two queries
   per _hit_, rather than two queries per URI.
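
A rough sketch of the first and third ideas above, in Python rather than
the plugin's Perl, and purely illustrative: the lookup_a and lookup_txt
callables, the class name and the example service name are all
hypothetical stand-ins for real DNS queries, not anything in uribl or
qpsmtpd.  It assumes the TXT answer for a service differs only by the
listed domain.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

PLACEHOLDER = "\0DOMAIN\0"  # marks where the listed domain goes in a cached template


class TxtTemplateCache:
    """Per-process cache holding one TXT template per URIBL service (hypothetical)."""

    def __init__(self, lookup_txt: Callable[[str, str], Optional[str]]):
        self._lookup_txt = lookup_txt      # stand-in for a real TXT query
        self._templates: Dict[str, str] = {}
        self.txt_queries = 0               # counts TXT lookups actually issued

    def reason(self, service: str, domain: str) -> Optional[str]:
        template = self._templates.get(service)
        if template is None:
            self.txt_queries += 1
            txt = self._lookup_txt(service, domain)
            if txt is None or domain not in txt:
                return txt                 # nothing usable to build a template from
            template = txt.replace(domain, PLACEHOLDER)
            self._templates[service] = template
        return template.replace(PLACEHOLDER, domain)


def check_domains(
    domains: Iterable[str],
    services: Iterable[str],
    lookup_a: Callable[[str, str], bool],
    cache: TxtTemplateCache,
) -> List[Tuple[str, str, Optional[str]]]:
    """Ask for TXT data only after an A lookup hits, so clean URIs cost one query."""
    services = list(services)
    hits = []
    for domain in domains:
        for service in services:
            if lookup_a(service, domain):          # A lookup: is the domain listed?
                hits.append((service, domain, cache.reason(service, domain)))
    return hits


# Fake resolvers so the sketch runs without touching DNS.
LISTED = {"bad.example", "worse.example"}
fake_a = lambda service, domain: domain in LISTED
fake_txt = lambda service, domain: f"Blocked, see http://lookup.example/?d={domain}"

cache = TxtTemplateCache(fake_txt)
hits = check_domains(
    ["good.example", "bad.example", "worse.example"], ["multi.example"], fake_a, cache
)
```

In this toy run only the first hit triggers a TXT lookup; the second
reuses the cached template.  A real version would still have to rebuild
the cache in every child process, as noted above.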

I've already added the independent option to just turn off TXT queries,
for anyone who wants to save on DNS traffic at the cost of no longer
providing links on rejection.

- We're interested in optionally resolving URL shortening links (e.g.
tinyurl, bit.ly, etc.) using HTTP::Async

My boss is still deliberating whether the URL shortening resolution thing
would be contributed, or if we would consider it part of the 'special
sauce'.  I'm hopeful he'll be in favor of contributing it.
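
For the sake of discussion, the general shape of shortener resolution
might look something like the following Python sketch.  It is not our
implementation and does not use HTTP::Async; fetch_location is a
hypothetical callable standing in for an HTTP request that returns the
Location header (or None), and the host list and hop limit are
assumptions.

```python
from typing import Callable, Optional
from urllib.parse import urljoin, urlsplit

# Assumed list of shortener hosts; a real one would be configurable.
SHORTENER_HOSTS = {"tinyurl.com", "bit.ly"}


def resolve_short_url(
    url: str,
    fetch_location: Callable[[str], Optional[str]],
    max_hops: int = 5,
) -> str:
    """Follow redirects while the URL is on a known shortener host.

    Stops at the first non-shortener URL, on a redirect loop, when no
    Location is returned, or after max_hops requests.
    """
    seen = set()
    current = url
    for _ in range(max_hops):
        host = (urlsplit(current).hostname or "").lower()
        if host not in SHORTENER_HOSTS or current in seen:
            break
        seen.add(current)
        location = fetch_location(current)   # stand-in for a HEAD/GET request
        if not location:
            break
        current = urljoin(current, location)  # handles relative Location values
    return current


# Fake redirect table so the sketch runs without any network access.
REDIRECTS = {
    "http://bit.ly/abc": "http://tinyurl.com/xyz",
    "http://tinyurl.com/xyz": "http://spam.example/buy",
    "http://bit.ly/loop": "http://bit.ly/loop",
}
resolved = resolve_short_url("http://bit.ly/abc", REDIRECTS.get)
```

The resolved URL would then be fed to the normal URIBL lookups in place
of (or in addition to) the shortener's own domain.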

- We'd like to change the check_headers directive to take more args than
just true/false.  0 would still mean 'don't check headers', 1 would still
mean 'check all headers', but 'all' would also mean 'check all headers';
anything else would be interpreted as the header(s) to be checked
(comma-delimited if in list form).  So you could do check_headers =
'subject' or check_headers = 'subject,received'.  The default should
probably be off.  This is mainly because I noticed check_headers
automatically checks the Received header, which is even more interesting
when combined with the SBL-XBL service; basically, this plugin is now a
replacement for SA's "RCVD_IN_SBL" rule, etc.  This is probably a good
thing... as long as you actually wanted it :)  But you should be able to
avoid it if you like and just specify the headers you're interested in.
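
To make the proposed check_headers semantics concrete, here is a small
Python sketch of how the argument could be interpreted.  The function
names and return conventions are hypothetical, not the plugin's actual
code, and it assumes header names are matched case-insensitively.

```python
from typing import Dict, List, Optional, Union


def parse_check_headers(value) -> Optional[Union[str, List[str]]]:
    """Interpret a check_headers setting.

    Returns None for 'don't check headers', the string "all" for
    'check all headers', or a list of lower-cased header names.
    """
    if value is None or value is False:
        return None
    if value is True:
        return "all"
    if isinstance(value, (list, tuple)):
        names = [str(v) for v in value]
    else:
        text = str(value).strip()
        if text in ("", "0"):
            return None
        if text.lower() in ("1", "all"):
            return "all"
        names = text.split(",")
    cleaned = [n.strip().lower() for n in names if n.strip()]
    return cleaned or None


def headers_to_scan(headers: Dict[str, str], setting) -> Dict[str, str]:
    """Pick the headers whose values should be scanned for URIs."""
    wanted = parse_check_headers(setting)
    if wanted is None:
        return {}
    if wanted == "all":
        return dict(headers)
    return {k: v for k, v in headers.items() if k.lower() in wanted}


message_headers = {
    "Subject": "visit http://spam.example/",
    "Received": "from mail.example ([192.0.2.1])",
    "From": "someone@example.org",
}
only_subject = headers_to_scan(message_headers, "subject")
```

With check_headers = 'subject' the Received header is skipped, so the
plugin would not act as an implicit RCVD_IN_SBL-style check unless you
ask for that header explicitly.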

Any comments on these?  Do they sound worthwhile?  What should the
defaults be?

-Jared
