I've made a bunch more changes to the uribl plugin locally; man, we _really_ need to get some kind of svn-to-git thing going. Or at least I need to re-educate myself on git and start putting things in my github again. If I haven't managed that by the time things have settled down a bit and we've done some production testing (soon), I'll submit a new version to the list again; but hopefully by then I'll have found time to git git going again and can point people to that instead.

I got permission from Dallas @ URIBL to use the datafeed data, but also got his opinion on the matter, which is that using tld_lists the way I am is not going to gain much and introduces the risk that a new spammer 'haven' could be missed entirely. After talking with my team, we're going to go with a full TLD list for now, and perhaps later we'll collect our own stats to verify that Dallas is right about the tiny benefit (he probably is). tld_lists has been updated to reflect this, though if anyone feels bolder than me and wants updates to the 'pruned' list, let me know.

I modified the parse_mime plugin as discussed previously on the list; the uribl plugin now isa_plugin('mime_parser') and does lazy parsing.

I'll probably remove 'semicolon munging detection'; as Devin said, if real (current) data doesn't show it being used, why bother? I'd like to go over a larger sampling of current data first, though, which I plan to do soon.

I've re-arranged the code slightly so that not only the async plugin but also our own local plugin can easily use plugin inheritance to avoid code duplication. Our own plugin is now just 40 lines or so, and it inherits the other 600 lines of uribl without any forking :)

There are some additional changes in the works that I'm curious for input on, if anyone cares:

- We're finally getting more URIBL datafeeds. I'd like to use this data to verify how static the TXT results are for each service and, if applicable, generate templates for them in the same script that generates tld_lists.pl (and probably rename that script to fit its more general purpose). For services that do indeed have very static TXT templates, we could optionally skip TXT lookups and instead generate our own response (e.g. links) without the cost of one more DNS query.

  A couple of additional brainstorms on this topic:

  - Dynamically generate the TXT template by going ahead and doing TXT lookups until we get the first one back for each service, at which point we cache the template and don't do any more. (This would have to be re-done in every new child process.)
  - Dynamically verify the validity of the statically-set template by doing TXT lookups until we get one and then checking it. This still has to be per-child-process.
  - If we do either of the above dynamic checks, or if we choose not to do any TXT-avoiding magic at all, we ought to launch TXT requests in the callback that receives matches on A lookups; that way we do two queries per _hit_, rather than two queries per URI.

  I've already added the independent option to just turn off TXT queries, for anyone who wants to save on DNS traffic at the cost of not providing links on rejection.

- We're interested in optionally resolving URL shortening links (e.g. tinyurl, bit.ly) using HTTP::Async. My boss is still deliberating whether the URL shortening resolution thing would be contributed, or whether we would consider it part of the 'special sauce'. I'm hopeful he'll be in favor of contributing it.

- We'd like to change the check_headers directive to take more args than just true/false. 0 would still mean 'don't check headers' and 1 would still mean 'check all headers', but 'all' would also mean 'check all headers'; anything else would be interpreted as the header(s) to be checked (comma-delimited if in list form). So you could do check_headers = 'subject' or check_headers = 'subject,received'. The default should probably be off.

  This is mainly because I noticed check_headers automatically checks the Received header, which is even more interesting when combined with the SBL-XBL service; basically, this plugin is now a replacement for SA's "RCVD_IN_SBL" rule, etc. This is probably a good thing... as long as you actually wanted it :) But you should be able to avoid it if you like and just specify the headers you're interested in.

Any comments on these? Do they sound worthwhile? What should the defaults be?

-Jared
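
For illustration, the proposed check_headers semantics could be sketched roughly as below. This is Python rather than the plugin's Perl, purely to keep the example short. The function names (parse_check_headers, should_check) and the None / "all" / list return convention are made up for this example and are not part of the plugin. Treating an unset or empty value as "off" and matching header names case-insensitively are also assumptions.

```python
def parse_check_headers(value=None):
    """Interpret a check_headers setting.

    Returns None (don't check headers), the string "all" (check all
    headers), or a list of lowercase header names to check.
    """
    # Assumption: unset means off, following the suggestion that the
    # default should probably be off.
    if value is None:
        return None
    text = str(value).strip()
    # 0 still means "don't check headers"; an empty value is treated
    # the same way (assumption).
    if text in ("", "0"):
        return None
    # 1 still means "check all headers", and 'all' is a synonym for it.
    if text == "1" or text.lower() == "all":
        return "all"
    # Anything else is a comma-delimited list of header names.
    # Assumption: names are lowercased so matching is case-insensitive.
    return [name.strip().lower() for name in text.split(",") if name.strip()]


def should_check(header_name, setting):
    """Decide whether one header should be scanned under a parsed setting."""
    if setting is None:
        return False
    if setting == "all":
        return True
    return header_name.lower() in setting
```

With this reading, check_headers = 'subject' would scan Subject but leave Received alone, which is the opt-out described above.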