On Fri, 2014-05-30 at 22:33 +0200, Andreas Schulze wrote:
> I have to get an overview on http links in a specific mail stream. My
> plan is to use spamassassin as it could parse message body much better
> than I do :-)
> There is a plugin URIDNSBL that could fire dns queries for every url
> found. That's fine for me, as the url is then in my dnsserver log.

This does not necessarily get you all URIs. There are two limiting
factors:

(a) To lower the load on DNSBL operators and prevent unnecessary DNS
queries, there is a list of domains frequently found in mail, which will
never be blacklisted anyway. These are skipped.

The option clear_uridnsbl_skip_domain can be used to clear the default
skip list.

(b) To prevent excessive queries, the number of domains to look up per
message is limited. You can set a higher value for uridnsbl_max_domains
if the default of 20 is not sufficient in your case.

Both these options are documented here:

  http://spamassassin.apache.org/doc/Mail_SpamAssassin_Plugin_URIDNSBL.html
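
As a rough sketch, both settings could go into your local.cf like this.
The value 100 is an arbitrary example, not a recommendation:

```
# Drop the default skip list, so frequently seen domains get queried too.
clear_uridnsbl_skip_domain

# Raise the limit on domains looked up (default is 20).
uridnsbl_max_domains 100
```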


Depending on what you actually want to extract from the messages, the
resulting DNS queries of the URIDNSBL plugin might not be sufficient.
URIDNSBL does NOT operate on actual, full URIs, but on their domains
only: no path information, and no hostname level.

If you need more information and detail, you'll have to write a custom
plugin, which has access to the complete, internal URI list.


> But I like to combine it with other properties of a message.
> Is it possible to do something like this:
> 
> if (subject =~ foo) {
>   uridnsbl    URIBL_FOO       foo.myzone. A
>   body                URIBL_FOO       eval:check_uridnsbl('URIBL_FOO')
> }

No, that is not possible.

However, you can achieve such logic with a custom plugin. In addition to
the internal URI list, a plugin can access which rules already matched.
For that, the rules used as conditionals must already have completed,
which means they run at a lower priority value than the plugin and are
not asynchronous.

The bulk of the regex-based rules run at the default priority 0, which
also holds for custom header rules. By running your plugin at a higher
priority value (rules run in increasing priority order), its action can
depend on conditions encoded as plain rules.
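
To illustrate the ordering only, the rule side might look roughly like
this. Both rule names and the eval function check_foo_uris() are
hypothetical: the function would have to be registered by your own
plugin, it is not a stock SpamAssassin function. The priority value 100
is an arbitrary choice above the default 0:

```
# Plain header sub-rule, runs at the default priority 0.
header    __SUBJ_FOO       Subject =~ /foo/

# Hypothetical eval rule from a custom plugin, run after the rule above.
body      __FOO_URI_CHECK  eval:check_foo_uris()
priority  __FOO_URI_CHECK  100
```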


Depending on your environment and needs, a plugin might be overkill and
require too much effort. If the corpus is sufficiently small, and you
don't plan on running the analysis frequently, you might get quick
results out of a hack, harvesting -D debug output.

  uri    __DUMP_URIS  m~https?://.+~
  tflags __DUMP_URIS  multiple

That is a sub-rule, matching any http or https URI. Due to tflags
multiple, the debug output will list each matching part along with the
rule's name to grep for. (Note, though, that this includes various
internal variants of each URI, with path info stripped, etc. These
duplicates need to be filtered out.)
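
A minimal sketch of such a harvesting script, in Python. It assumes the
debug lines look like the one shown in the comment, which is not
guaranteed across versions, so check your own -D output first. The
function name is made up, and the filter for internal variants is only
a heuristic: it drops a hit when a longer hit extends it by a path, so
a genuine link to the bare host would be dropped as well.

```python
import re

# Assumed shape of a debug line (verify against your own -D output):
#   dbg: rules: ran uri rule __DUMP_URIS ======> got hit: "http://example.com/a"
HIT_RE = re.compile(r'ran uri rule __DUMP_URIS =+> got hit: "(.*)"\s*$')


def harvest_uris(debug_lines):
    """Return unique URI hits in first-seen order, minus shorter variants."""
    seen = []
    for line in debug_lines:
        m = HIT_RE.search(line)
        if m and m.group(1) not in seen:
            seen.append(m.group(1))

    def is_variant(uri):
        # Heuristic: some other hit continues this one with a path component.
        base = uri.rstrip("/") + "/"
        return any(other != uri and other.startswith(base) for other in seen)

    return [uri for uri in seen if not is_variant(uri)]
```

Feed it the lines of the debug output of a single message, for example
the stderr of one spamassassin -D run saved to a file.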

If you extract the URIs on a per-message basis, you can easily include
more custom rules and have your data harvesting script use them as
conditionals.
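
For example, with a hypothetical sub-rule such as

  header __SUBJ_FOO  Subject =~ /foo/

the script could keep a message's URIs only if that rule hit, too. A
rough Python sketch, again assuming the debug line format shown in the
comment and one debug log per message (the function name is made up):

```python
import re

# Assumed shape of a debug line (verify against your own -D output):
#   dbg: rules: ran header rule __SUBJ_FOO ======> got hit: "foo"
HIT_RE = re.compile(r'ran (\w+) rule (\S+) =+> got hit: "(.*)"\s*$')


def uris_if_rule_hit(debug_lines, condition_rule):
    """Return the __DUMP_URIS hits of one message if condition_rule hit."""
    uris = []
    rules_hit = set()
    for line in debug_lines:
        m = HIT_RE.search(line)
        if not m:
            continue
        rule, match = m.group(2), m.group(3)
        rules_hit.add(rule)
        if rule == "__DUMP_URIS" and match not in uris:
            uris.append(match)
    # The conditional lives in the script, not in the SpamAssassin config.
    return uris if condition_rule in rules_hit else []
```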


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
