> On Sun, Jul 25, 2010 at 04:43:36AM -0500, Jared Johnson wrote:
>> The plugin has the following advantages over the original:
>
> Based on a single read-through of the code, I like most of it.  Some
> assorted observations, though:
>
> - Parsing out the MIME sections is worthwhile, though it's really something
>   that ought to be done in the core libs when plugins indicate they need it,
>   or in a utility plugin if that's practical.  This was something I meant to
>   bring up a while back -- we have quite a few different content-scanning
>   plugins, and if they're all obliged to invoke MIME::Parser repeatedly it'll
>   be even more costly than that module already is.  Not that I have time to
>   do it myself, though.  :P

Our internal plugin actually just uses $txn->notes('mime_body') instead of
creating a new $msg, because we do indeed have such a utility plugin.  That's
a bit short of having it handled transparently by the internals, but I could
at least include code in this plugin to use $txn->notes('mime_body') instead
of parsing whenever that note exists.  This would also let our internal
plugin be slightly less forked from whatever winds up upstream :)  Your idea
of doing this in the core sounds cool to me, but we're unlikely to get around
to implementing it ourselves any time soon, given that the system we already
have is working OK.  And unfortunately our utility plugin is really just one
aspect of a bigger plugin that wouldn't be useful to QP, and we have no way
of 'asking' that plugin for MIME parsing, since for our purposes it's
*always* necessary to parse MIME.
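
Roughly what I have in mind for the fallback, as a sketch only -- the
'mime_body' note name is just what our internal plugin happens to use, and
temp_dir() is assumed to be available to the plugin for cleanup-managed
spool space:

    sub get_mime_entity {
        my ($self, $txn) = @_;

        # Reuse an entity that a utility plugin has already parsed
        my $entity = $txn->notes('mime_body');
        return $entity if $entity;

        # Otherwise parse it ourselves and stash it for later plugins
        require MIME::Parser;
        my $parser = MIME::Parser->new;
        $parser->output_under($self->temp_dir());

        $txn->body_resetpos;
        my $raw = $txn->header->as_string . "\n";   # blank line before body
        while (defined(my $line = $txn->body_getline)) {
            $raw .= $line;
        }
        $entity = eval { $parser->parse_data($raw) } or return;

        $txn->notes(mime_body => $entity);
        return $entity;
    }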

> - For messages below a few hundred kb or so, consider
>   MIME::Parser->output_to_core(), so we can avoid the fs churn.  Annoyingly,
>   MIME::Parser doesn't provide a mechanism to only consider particular
>   mimetypes, or we could skip decoding non-text bodies.

The MIME::Parser docs are pretty insistent that using output_to_core is
detrimental to performance in general, and should only be bothered with
when there are actual disk constraints.  I've never tested this theory for
really small messages, but it's kept me away from trying it :)

If it's not very difficult, perhaps it could be optional via a plugin
argument, and later via the handy-dandy general mime-parsing-needed feature,
if that's ever written.
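
Something like this, say -- the 'output_to_core' argument name and the
byte-count cutoff are both made up for the sketch, and I'm assuming the
plugin args come in as key/value pairs:

    use MIME::Parser;

    sub register {
        my ($self, $qp, %args) = @_;
        # bytes; 0 means always spool to disk as we do now
        $self->{_core_cutoff} = $args{output_to_core} || 0;
    }

    sub new_parser {
        my ($self, $txn) = @_;
        my $parser = MIME::Parser->new;
        if ($self->{_core_cutoff} and $txn->data_size <= $self->{_core_cutoff}) {
            $parser->output_to_core(1);                # keep small messages in memory
        }
        else {
            $parser->output_under($self->temp_dir());  # spool to disk as usual
        }
        return $parser;
    }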

> - It looks like you're scanning inside all parts of multipart MIME, where
>   doing so only on text parts might be preferable.

No, in scan_body() we do:

return unless $part->effective_type ~~ [qw( text/plain text/html )];

I think that's what we want anyway...
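
For context, the effect is roughly the following -- a sketch rather than the
plugin's actual structure, using the parts_DFS, effective_type, and
bodyhandle methods from MIME::Entity:

    for my $part ($entity->parts_DFS) {
        next unless $part->effective_type ~~ [qw( text/plain text/html )];
        my $body = $part->bodyhandle or next;   # multipart containers have no body
        # ...scan $body->as_string for URIs...
    }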

> - The config really needs to live in a config file -- changing or expanding
>   the format is fine if needed, but config that lives in code can't really
>   be edited.  I'm not bothered by having per-list defaults in the plugin,
>   but one should be able to fully add a new list or alter an existing one
>   without changing Perl code.

I agree.  If you don't object to the spec I proposed in my reply to
Robert, I can get that changed soon.
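
The loading side would then be something like this -- 'uribl_zones' is just
a stand-in for whatever config file name we settle on, and the actual line
format would follow that spec:

    sub load_lists {
        my ($self) = @_;
        my %lists;
        for my $line ($self->qp->config('uribl_zones')) {
            next if $line =~ /^\s*(?:#|$)/;    # just in case: skip comments/blanks
            my ($zone, @options) = split ' ', $line;
            $lists{$zone} = \@options;
        }
        return \%lists;
    }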

> - In the unwinding of HTML entity escaping -- I suggest doing &amp; last.
>   "&amp;gt;" is "&gt;", not ">".

Good idea, thanks :)
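
So the unwinding ends up ordered roughly like this (a sketch; the entity
list here is abbreviated):

    # Decode the named/numeric entities first...
    $text =~ s/&lt;/</g;
    $text =~ s/&gt;/>/g;
    $text =~ s/&quot;/"/g;
    $text =~ s/&#(\d+);/chr($1)/ge;
    # ...and &amp; last, so "&amp;gt;" decodes to "&gt;" rather than ">"
    $text =~ s/&amp;/&/g;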

>> the SURBL two-level and three-level lists -- but the SURBL lists are
>> pruned to exclude stuff that we're pretty sure is never going to actually
>> get listed -- when was the last time you saw a spammer use .mil?
>> uribl.com never has, according to our datafeed :)
>
> We'll probably see compromised .gov and .mil sites hosting malware, much as
> we see in other domains.  They're not that useful for classical spammer
> purposes since they can't be registered in the usual fashion.

Ideally, we should only bother querying services for URIs that have some
chance of being listed on those services, regardless of their chance of
actually being used in spam.  Thus, if the datafeeds don't have any .mil
hosts, there's no point checking .mil domains.  However: (1) we're only
deriving our list from URIBL right now, and even as we expand we can never
claim to have derived our list from every service someone might use; (2)
even if I ran my tld_list-generating script every day against every service
and sent in an updated list as soon as something got listed under .mil,
there would still be the lag time to get a release out with the updated list
and then have people upgrade, during which there would be some misses.
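
In code, the goal amounts to something like this -- a sketch rather than the
plugin's actual lookup routine, using the three hashes that tld_lists.pl
provides:

    our (%firstlevel_tlds, %surbl_secondlevel_tlds, %surbl_thirdlevel_tlds);

    sub query_domain_for {
        my ($host) = @_;
        my @labels = split /\./, lc $host;

        # Don't bother querying at all if the TLD never shows up in any feed
        my $tld = $labels[-1];
        return unless $firstlevel_tlds{$tld};

        # Query the registrable portion: two labels normally, more if the
        # tail is on the SURBL two- or three-level lists (e.g. co.uk)
        my $levels = 2;
        $levels = 3 if @labels >= 3
            and $surbl_secondlevel_tlds{ join '.', @labels[-2 .. -1] };
        $levels = 4 if @labels >= 4
            and $surbl_thirdlevel_tlds{ join '.', @labels[-3 .. -1] };

        return if @labels < $levels;
        return join '.', @labels[-$levels .. -1];
    }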

I think that using 'pruned' lists is eventually going to cause some misses
somewhere.  For our own use it's worth the risk, since we control the update
timeline and don't have to worry about legal issues with using the data.  If
either of those is a big enough problem for QP in general, though, you could
work around it by using an un-pruned tld_lists.pl.  That would give
%firstlevel_tlds 263 entries instead of 128; %surbl_secondlevel_tlds would
have 3,801 instead of 2,426; and %surbl_thirdlevel_tlds would have 155
instead of 135.  If you're worried about the legal issue but *not* the
timeliness issue, you could still exclude a few obvious things like .mil and
.gov from the first-level TLD list and prune reserved .us domains
(http://www.nic.us/registrars/fcfs/dotus_reservedlist_v3.zip) from the
second- and third-level lists, leaving tld_lists still bloated but
eliminating most of the actually useless queries.  But note that when I
tried pruning things like .mil.* from the second-level lists, I found that
some of these actually *did* have hits, so I only pruned the ones that did
not.
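
That middle-ground prune could just be a post-processing step on the
generated lists, something like the following -- note I'm assuming bare-TLD
keys in %firstlevel_tlds and a one-domain-per-line unzipped reserved list,
which may not match the real data:

    # Drop the obviously-unregistrable first-level TLDs
    delete @firstlevel_tlds{qw( gov mil )};

    # Drop reserved .us domains from the deeper lists
    open my $fh, '<', 'dotus_reservedlist_v3.txt'   # assumed unzipped filename
        or die "reserved list: $!";
    while (my $domain = <$fh>) {
        $domain =~ s/\s+\z//;
        delete $surbl_secondlevel_tlds{lc $domain};
        delete $surbl_thirdlevel_tlds{lc $domain};
    }
    close $fh;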

Pick your poison :)

-Jared
