Magnus Bodin <[EMAIL PROTECTED]> said something to this effect on 07/03/2001:
> The extracting regular expressions though, has a lot left as an
> exercise to the reader.
> 
> I enclose here two functions that I use myself to extract URL:s
> from a message. This is not perfect and improvement suggestions
> are welcome.  Especially broken URL:s (broken on two lines)
> should probably intelligently be pasted together somehow.

Since we are using Perl anyway, use the URI::Find module. It takes
the text to be searched and a callback to run on each match.

  use URI::Find;
  my $msg = do { local $/; <>; };
  find_uris($msg, sub { print "$_[1]\n" });

This will produce, when run against a random message in my
mailspool (readmsg $MAIL 3 | perl -e '...'):

http://advogato.org/
http://slashdot.org/
http://digitalmass.boston.com/
www.google.com
http://www.boston.com

URI::Find requires the URI family of Perl modules, but it is very
complete and very reliable. It handles multiline URIs, and
(un|improperly)-quoted URIs as well.
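
For comparison, the URI::Find documentation also describes an
object-oriented interface: URI::Find->new takes the callback, and the
find method takes a reference to the text. A minimal sketch of the
same extraction in that style, assuming a reasonably recent URI::Find
is installed. The extract_uris helper and the sample text are
hypothetical, for illustration only:

```perl
use strict;
use warnings;
use URI::Find;

# Hypothetical helper: return the URIs found in $text, as originally written.
sub extract_uris {
    my ($text) = @_;
    my @found;
    my $finder = URI::Find->new(sub {
        my ($uri, $orig) = @_;   # URI object, original matched text
        push @found, $orig;
        return $orig;            # the return value replaces the match; keep it as-is
    });
    $finder->find(\$text);       # find() works on a reference to the text
    return @found;
}

# Made-up sample text standing in for a mail message.
print "$_\n" for extract_uris("News at http://slashdot.org/ and http://advogato.org/ today.\n");
```

Returning $orig from the callback leaves the text untouched, since
whatever the callback returns is substituted for the match.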

To make this sweeter, the callback to find_uris can do anything
you want, such as reformatting the URIs as HTML. The subroutine is
passed a URI::URL object as its first argument ($_[0]), which has
tons of methods; the original text of the match arrives as $_[1].
If I change the find_uris call above to read:

  find_uris($msg, sub { print $_[0]->abs, "\n" });

I get these results:

http://advogato.org/
http://slashdot.org/
http://digitalmass.boston.com/
http://www.google.com/
http://www.boston.com/

(Notice how the last two URIs are well formed, rather than being
in the malformed state the sender of the message wrote them in;
see above.)
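
As one illustration of a callback that does a bit more, here is a
sketch that lists each distinct URI only once, keyed on its canonical
form. It assumes the object-oriented URI::Find interface
(URI::Find->new plus find) and the canonical and as_string methods of
the URI classes. The unique_uris helper and the sample text are
hypothetical:

```perl
use strict;
use warnings;
use URI::Find;

# Hypothetical helper: list each distinct URI once, in canonical form.
sub unique_uris {
    my ($text) = @_;
    my (%seen, @unique);
    my $finder = URI::Find->new(sub {
        my ($uri, $orig) = @_;
        my $key = $uri->canonical->as_string;   # normalized form used as the dedup key
        push @unique, $key unless $seen{$key}++;
        return $orig;                           # leave the text itself unchanged
    });
    $finder->find(\$text);
    return @unique;
}

# Made-up sample text: the same site mentioned twice with different capitalization.
print "$_\n" for unique_uris("http://Slashdot.org/ and again http://slashdot.org/\n");
```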

Formatting these URLs as HTML is trivial:

  find_uris($msg, sub { printf qq(<a href="%s">%s</a>\n), $_[0]->abs, $_[1] });

This produces:

<a href="http://www.advogato.org/">http://www.advogato.org/</a>
<a href="http://slashdot.org/">http://slashdot.org/</a>
<a href="http://digitalmass.boston.com/">http://digitalmass.boston.com/</a>
<a href="http://www.google.com/">www.google.com</a>
<a href="http://www.boston.com/">http://www.boston.com</a>
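
One caveat with generating HTML this way: URIs often contain
characters such as & that should be escaped inside attributes and
text. A hedged sketch of the same idea with escaping added, again
assuming the object-oriented URI::Find interface. The escape_html and
uris_as_links helpers are hypothetical, as is the sample text:

```perl
use strict;
use warnings;
use URI::Find;

# Minimal HTML escaping for the characters that matter in text and attributes.
sub escape_html {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;    # must run first so later entities are not re-escaped
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    return $s;
}

# Hypothetical helper: one escaped <a> element per URI found in $text.
sub uris_as_links {
    my ($text) = @_;
    my @links;
    my $finder = URI::Find->new(sub {
        my ($uri, $orig) = @_;
        # "$uri" stringifies the URI object; $orig is the text as written.
        push @links, sprintf('<a href="%s">%s</a>',
                             escape_html("$uri"), escape_html($orig));
        return $orig;
    });
    $finder->find(\$text);
    return @links;
}

# Made-up sample text containing a URI with a query string.
print "$_\n" for uris_as_links("Search: http://www.google.com/search?q=perl&hl=en\n");
```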

Finally, as long as we're talking about procmail, we can write
this as a procmail recipe. The c flag runs the command on a copy
of the message, so normal delivery continues afterwards:

:0 c
* ^[EMAIL PROTECTED]
| perl -MURI::Find -le '{local$/;$f=<>}find_uris($f,sub{print($_[1])})' >> ~/urls

(darren)

-- 
Remember, UNIX spelled backwards is XINU.
