On Fri, 2014-05-30 at 22:33 +0200, Andreas Schulze wrote:
> I have to get an overview on http links in a specific mail stream. My
> plan is to use spamassassin as it could parse message body much better
> then I do :-)
> There is a plugin URIDNSBL that could fire dns queries for every url
> found. That's fine for me, as the url is then in my dnsserver log.

This does not necessarily get you all URIs. There are two limiting
factors:

(a) To lower the load on DNSBL operators and prevent unnecessary DNS
    queries, there is a list of domains frequently found in mail, which
    will never be blacklisted anyway. These are skipped. The option
    clear_uridnsbl_skip_domain can be used to clear the default skip
    list.

(b) To prevent excessive queries, the number of domains to look up is
    limited. You can set a higher value for uridnsbl_max_domains, if the
    default of 20 is not sufficient in your case.

Both these options are documented here:

  http://spamassassin.apache.org/doc/Mail_SpamAssassin_Plugin_URIDNSBL.html

Depending on what you actually want to extract from the messages, the
resulting DNS queries of the URIDNSBL plugin might not be sufficient.
URIDNSBL does NOT operate on actual, full URIs, but on their domains
only. No path information, and no hostname level.

If you need more information and detail, you'll have to write a custom
plugin, which has access to the complete, internal URI list.

> But I like to combine it with other properties of a message.
> Is ist possible to do something like this:
>
> if (subject =~ foo) {
>     uridnsbl URIBL_FOO foo.myzone. A
>     body URIBL_FOO eval:check_uridnsbl('URIBL_FOO')
> }

No, that is not possible.

However, you can achieve such logic with a custom plugin. In addition to
the internal URI list, a plugin can access which rules already matched.
For that, the rules used as a conditional must have been completed
already, that is, they must run at a lower priority value and must not
be asynchronous. Rules with lower priority values are run first.

The bulk of the regex based rules are run at default priority 0, which
also holds for custom header rules. By running your plugin at a higher
priority value, its action can depend on conditions encoded as plain
rules.

Depending on your environment and needs, a plugin might be overkill and
require too much effort.

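
Regarding the two limits (a) and (b) above, a minimal configuration
sketch might look like the following. It assumes the lines go into site
configuration such as local.cf, which is read after the stock rules that
populate the skip list. The value 100 is an arbitrary example, not a
recommendation.

```
# local.cf (sketch, assuming site config is read after the stock rules)

# (a) Clear the default skip list, so frequently seen domains are
#     looked up as well. Without arguments, the entire list is cleared.
clear_uridnsbl_skip_domain

# (b) Raise the per-message limit of domains to look up (default 20).
#     100 is an arbitrary example value.
uridnsbl_max_domains  100
```

Running spamassassin --lint afterwards should show whether the options
are accepted by your version.
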
If the corpus is sufficiently small, and you don't plan on running the
analysis frequently, you might get quick results out of a hack,
harvesting -D debug output.

  uri     __DUMP_URIS  m~https?://.+~
  tflags  __DUMP_URIS  multiple

That is a sub-rule, matching any http or https URI. Due to tflags
multiple, the debug output will list the matching part along with the
rule's name to grep for. (Note though that this does include various
internal versions, with path info stripped, etc. These duplicates need
to be filtered out.)

If you extract the URIs on a per-message basis, you can easily include
more custom rules and have your data harvesting script use them as
conditionals.


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}