Michael W. Cocke wrote:
I was told a while back that the best way to extract URLs from emails was to use code from SpamAssassin. OK, now I need to do just that. Any pointers? I've looked through the code in SpamCopURI, but unless there are some docs hidden somewhere I can't even figure out the entry point. Are there some docs hidden somewhere (I hope!)?

Thanks!

Mike-

Here is a little something I use to extract URLs from messages. It takes a message on STDIN, runs it through an empty instance of SA (no rules, no configs loaded), and prints the URIs it finds to STDOUT, one per line.

#!/usr/bin/perl

use strict;
use warnings;

use Mail::SpamAssassin;
use Mail::SpamAssassin::PerMsgStatus;

main();

# ----------------------------

sub main {
 # Read the whole message from STDIN (or from files named on the command line).
 my $msg = '';
 while (<>) { $msg .= $_; }
 my $data = geturi(\$msg);
 print $data;
 exit;
}

# ----------------------------

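sub geturi {
 my ($message) = shift;    # reference to the raw message text
 my $sa = create_saobj();
 $sa->init(0);             # 0 = do not load user preferences
 my $mail = $sa->parse($$message);
 my $msg = Mail::SpamAssassin::PerMsgStatus->new($sa, $mail);
 my @uris = $msg->get_uri_list();
 # De-duplicate via a hash, skipping schemes that are not web links.
 my %uri_list;
 foreach my $uri (@uris) {
   next if ($uri =~ m/^(cid|mailto|javascript):/i);
   $uri_list{$uri} = 1;
 }
 # Release the per-message objects now that the URIs are copied out.
 $msg->finish();
 $mail->finish();
 # The trailing "" gives the output a final newline.
 my $uris = join("\n", keys %uri_list, "");
 return $uris;
}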

# ----------------------------

# Build a bare SpamAssassin object: no rules, no site config, no user prefs.
sub create_saobj {
 my %setup_args = ( rules_filename => undef, site_rules_filename => undef,
                    userprefs_filename => undef, userstate_dir => undef,
                    local_tests_only => 1, dont_copy_prefs => 1
                  );
 my $sa = Mail::SpamAssassin->new(\%setup_args);
 return $sa;
}

# ----------------------------
# EOF



# cat corpus/spam/canselon.com.html | perl parse_uri.pl
http://images.loveouroffers.com/general/8675_usub/USUB_101_b_02.gif
./unsubscribeOffers.html
http://images.loveouroffers.com/general/8675_usub/USUB_101_b_01.gif
http://images.loveouroffers.com/general/8675_usub/spacer.gif
list.html?clientid=12&em=&offerid=1&mailerid=1&emailid=0
http://list.html/?clientid=12&em=&offerid=1&mailerid=1&emailid=0
http://images.loveouroffers.com/general/8675_usub/USUB_101_b_03.jpg
http:///unsubscribeOffers.html
http://./unsubscribeOffers.html
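
The sample output above mixes absolute URLs with relative paths and rewritten variants such as http:///unsubscribeOffers.html. If only absolute http/https URIs with a plausible hostname are wanted, a post-filter could look roughly like the sketch below. Note that filter_absolute and the file name filter_uri.pl are hypothetical names for illustration, not part of SpamAssassin, and the hostname pattern is a simple assumption (two or more dot-separated alphanumeric labels), not a full URI validator.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of SpamAssassin): keep only absolute
# http/https URIs whose host has at least two dot-separated labels.
sub filter_absolute {
    my @uris = @_;
    my $label = qr/[a-z0-9](?:[a-z0-9-]*[a-z0-9])?/i;
    return grep {
        m{^https?://$label(?:\.$label)+(?::\d+)?(?:[/?\#]|$)}i
    } @uris;
}

# When run as a script, filter one URI per line from STDIN to STDOUT.
unless (caller) {
    chomp(my @lines = <STDIN>);
    print "$_\n" for filter_absolute(@lines);
}
```

It could be chained onto the script above, for example: perl parse_uri.pl < message | perl filter_uri.pl. A pattern like this still lets through entries such as http://list.html/?clientid=12..., because a guessed hostname is indistinguishable from a real one by syntax alone.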


Enjoy. Also, I only get digest copies from this list and don't check them all, so please cc me if you want me to see your reply. :)

--
Dallas Engelken
[EMAIL PROTECTED]
http://uribl.com
