Michael W. Cocke wrote:
I was
told a while back that the best way to extract urls from emails was to
use code from SpamAssassin. Ok - Now, I need to do just that. Any
pointers? I've looked thru the code in SpamCopURI, but unless there
are some docs hidden somewhere I can't even figure out the entry
point. Are there some docs hidden somewhere (I hope!)?
Thanks!
Mike-
Here is a little something I use to extract URLs from messages. It
takes a message on STDIN, runs it through an empty instance of SA (no
rules, no configs loaded), and prints the URIs to STDOUT, one per line.
#!/usr/bin/perl
use strict;
use warnings;
use Mail::SpamAssassin;
use Mail::SpamAssassin::PerMsgStatus;

main();

# ----------------------------
sub main {
    # Slurp the whole message from STDIN (or from files named on the command line).
    my $msg = do { local $/; <> };
    $msg = '' unless defined $msg;
    print geturi(\$msg);
    exit;
}
# ----------------------------
sub geturi {
    my ($message) = @_;
    my $sa = create_saobj();
    $sa->init(0);    # 0 = do not read user preferences
    my $mail   = $sa->parse($$message);
    my $status = Mail::SpamAssassin::PerMsgStatus->new($sa, $mail);
    my %uri_list;
    foreach my $uri ($status->get_uri_list()) {
        # Skip schemes that are not web links.
        next if $uri =~ m/^(?:cid|mailto|javascript):/i;
        $uri_list{$uri} = 1;    # hash keys de-duplicate the list
    }
    # Release SA's per-message structures.
    $status->finish();
    $mail->finish();
    # One URI per line, with a trailing newline.
    return join("\n", keys %uri_list, "");
}
# ----------------------------
sub create_saobj {
    # No rules, no site config, no user prefs: SA is used only as a parser.
    my %setup_args = (
        rules_filename      => undef,
        site_rules_filename => undef,
        userprefs_filename  => undef,
        userstate_dir       => undef,
        local_tests_only    => 1,
        dont_copy_prefs     => 1,
    );
    return Mail::SpamAssassin->new(\%setup_args);
}
# ----------------------------
# EOF
# cat corpus/spam/canselon.com.html | perl parse_uri.pl
http://images.loveouroffers.com/general/8675_usub/USUB_101_b_02.gif
./unsubscribeOffers.html
http://images.loveouroffers.com/general/8675_usub/USUB_101_b_01.gif
http://images.loveouroffers.com/general/8675_usub/spacer.gif
list.html?clientid=12&em=&offerid=1&mailerid=1&emailid=0
http://list.html/?clientid=12&em=&offerid=1&mailerid=1&emailid=0
http://images.loveouroffers.com/general/8675_usub/USUB_101_b_03.jpg
http:///unsubscribeOffers.html
http://./unsubscribeOffers.html
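As the output above shows, the list includes relative links as well as variants with http:// prepended, some of which have no usable host. If you only want absolute http/https URLs, a filter along these lines might do as a starting point. This is just a sketch: filter_absolute_uris is a made-up helper, not part of SA, and the "looks like name.tld" host check is my own assumption about what counts as usable.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of SpamAssassin): keep only absolute
# http/https URIs whose host looks like at least two dot-separated labels.
sub filter_absolute_uris {
    my @uris = @_;
    my @keep;
    foreach my $uri (@uris) {
        # Require scheme://host, with a non-empty host part.
        next unless $uri =~ m{^https?://([^/?#:]+)}i;
        my $host = $1;
        # Reject hosts such as "." that are not label.label[...].
        next unless $host =~ /^[a-z0-9-]+(?:\.[a-z0-9-]+)+$/i;
        push @keep, $uri;
    }
    return @keep;
}

# A few lines taken from the sample output above.
my @sample = (
    'http://images.loveouroffers.com/general/8675_usub/spacer.gif',
    './unsubscribeOffers.html',
    'http:///unsubscribeOffers.html',
    'http://./unsubscribeOffers.html',
);
# Only the images.loveouroffers.com URL survives the filter.
print "$_\n" for filter_absolute_uris(@sample);
```

Note that a guess like http://list.html/?clientid=12... still passes, because list.html is syntactically a valid hostname. Catching that would need a check against a list of real TLDs, which this sketch does not attempt.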
Enjoy. Also, I only get digest copies from this list and don't check
them all, so please cc me if you want me to see it. :)
--
Dallas Engelken
[EMAIL PROTECTED]
http://uribl.com