Fellow Spamfighters,

Chris Santerre's script [1] for extracting domain names from a spam corpus is a bit quick and dirty, so I wrote a Perl script that extracts URIs or domain names from the mail body using MIME::Parser. It pulls domain names out of text/* MIME parts, including encoded bodies and parts. The script currently reads a single message from STDIN and prints the domains it finds to STDOUT.
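Since the script handles only one message per run, a mail folder has to be cut into single messages first. Here is a minimal sketch of that step for the classic mbox layout, assuming every message starts with a "From " line that follows a blank line (or the start of the file). split_mbox is a hypothetical helper name and is not part of the posted script; the sample addresses are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# split_mbox: hypothetical helper, not part of the posted script.
# Takes the full text of an mbox file and returns a list of messages,
# each without its leading "From " separator line.
sub split_mbox {
    my ($text) = @_;
    my @messages;
    my $current;            # undef until the first separator is seen
    my $prev_blank = 1;     # start of file counts as "after a blank line"

    foreach my $line (split /^/m, $text) {
        if ($prev_blank && $line =~ /^From /) {
            # separator line: close the previous message, start a new one
            push @messages, $current if defined $current;
            $current = '';
        }
        elsif (defined $current) {
            $current .= $line;
        }
        $prev_blank = ($line =~ /^\r?\n?$/) ? 1 : 0;
    }
    push @messages, $current if defined $current;
    return @messages;
}

# Small demo with two made-up messages.
my $sample = join '',
    "From alice\@example.com Mon Jan  1 00:00:00 2004\n",
    "Subject: one\n",
    "\n",
    "http://spam.example.com/\n",
    "\n",
    "From bob\@example.com Mon Jan  1 00:01:00 2004\n",
    "Subject: two\n",
    "\n",
    "body two\n";

my @msgs = split_mbox($sample);
print scalar(@msgs), " messages\n";
```

Each returned string could then be piped to the script one at a time, or handed to MIME::Parser's parse_data method instead of parse. The blank-line check is only a heuristic: mbox writers normally quote body lines that begin with "From " as ">From ", and an unquoted one after a blank line would be mistaken for a separator.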
I plan to add parsing of whole mbox files and maybe other mail folder types. The results can be fed to reg2rule.pl [2] with -t uri (not rawbody!).

Yes, this is very, very alpha code. I am just posting it here to get some feedback from you.

Regards,
Alex

[1] http://www.merchantsoverseas.com/wwwroot/gorilla/evilrules.htm
[2] http://www.wot.no-ip.com/Projects/Blocklist/reg2rule.pl

---snip---
#!/usr/bin/perl

### modules
use strict;
use MIME::Parser;
use URI::Find;
use URI;
use vars qw(%hosts $entity $finder);

### vars
%hosts = ();
my $tmp_dir = "/tmp/spamcheck";
# make sure the parser's output directory exists
-d $tmp_dir or mkdir $tmp_dir or die "cannot create $tmp_dir: $!";

### mime parser and entity object
my $parser = new MIME::Parser;
$parser->output_dir($tmp_dir);
$entity = $parser->parse(\*STDIN);

### uri finder: count the host of every http/ftp URI found
$finder = URI::Find->new(sub {
    my ($found, $orig) = @_;
    my $uri = URI->new($found);
    $hosts{$uri->host}++ if $uri->scheme =~ /^(http|ftp)/;
    return $orig;    # leave the scanned text unchanged
});

### main
split_entity($entity);
print "$_\n" foreach keys %hosts;
exit;

### cleanup: remove the temporary files written by the parser
END { $entity->purge() if $entity; }

### sub land
sub split_entity {
    my $part = shift;
    my $num_parts = $part->parts;    # how many mime parts?
    if ($num_parts) {
        # container: recurse into every sub-part
        split_entity( $part->parts($_) ) foreach (0..$num_parts-1);
    }
    elsif ($part->effective_type =~ /^(message|text)\// && $part->bodyhandle) {
        # leaf with a decoded body: scan it for URIs
        my $body = $part->bodyhandle->as_string;
        $finder->find(\$body);
    }
}
###-fin-
---snap---

--
Alex Pleiner
zeitform Internet Dienste
Fraunhoferstrasse 5
64283 Darmstadt, Germany
http://www.zeitform.de
Tel.: +49 (0)6151 155-635
Fax: +49 (0)6151 155-634
mailto:[EMAIL PROTECTED]
GnuPG/PGP Key-ID: 0x613C21EA