Inspired by "Filters that fight back", by Paul Graham http://www.paulgraham.com/ffb.html I found a reference to a short script that scans e-mail for URL's, and then turns around and automatically references the offending page. Well, I'm not interested in doing that at the moment, but I have enhanced the script (and fixed a bug) to make it do a decent job of extracting e-mails from an mbox. Here's the script, also attached.
#! /usr/bin/perl -w use lib '..', '.'; use Mail::Box::Manager 2.00; use LWP::Simple; use URI; use HTML::Entities; # Basic idea of using Mail::Box::Manager from: # http://radio.weblogs.com/0111823/2003/11/16.html#a373 # Open the folder my $mgr=Mail::Box::Manager->new; my $folder = $mgr->open ( $ARGV[0], extract => 'ALWAYS' # Take the body ); die "Cannot open '$ARGV[0]': $!\n" unless defined $folder; # Process all messages in this folder. my @messages = $folder->messages; foreach my $message (@messages) { my @match=($message->decoded =~ /\bhref="(http[^>"]*)">.*/gi ); foreach my $match(@match) { my $u = URLDecode(HTML::Entities::decode($match)); my $url = URI->new($u)->as_string; print $url,"\n"; my $content=""; # Try and fetch the page, no-op for now. # my $rc=getstore($match,$content); } } # Finish $folder->close; exit 0; # from: http://glennf.com/writing/hexadecimal.url.encoding.html sub URLDecode { my $theURL = $_[0]; $theURL =~ tr/+/ /; $theURL =~ s/%([a-fA-F0-9]{2,2})/chr(hex($1))/eg; $theURL =~ tr [\200-\377] [\000-\177]; # delete 8th bit $theURL =~ s/[^[:print:]]//g; # remove non-printables $theURL =~ s/\s//g; # remove white space $theURL =~ s/<!--(.|\n)*-->//g; return $theURL; } I ran it as follows, with good results: % extract_urls.pl spam-mail | sort -u where 'spam-mail' is an mbox.
extract_urls.pl
Description: Binary data