Fellow Spamfighters,

Chris Santerre's script [1] for extracting domain names from a spam
corpus is a bit quick and dirty, so I wrote a Perl script that extracts
URIs or domain names from the mail body using MIME::Parser. It extracts
domain names from text/* MIME parts and from encoded bodies/parts. The
script currently reads a single message from STDIN and prints the domains
it finds to STDOUT.

I plan to add parsing of whole mbox files and maybe other mail folder
types.
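
A rough sketch of how the mbox splitting could look. It assumes the
classic mbox layout, where each message starts with a "From " separator
line at the top of the file or right after a blank line. The sub name
split_mbox and the sample data are made up for illustration; this is not
part of the script below.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split an mbox stream into a list of message strings.
# Assumption: a message starts at a "From " line that is either the
# first line of the file or directly follows a blank line.
sub split_mbox {
    my ($fh) = @_;
    my (@messages, $current);
    my $prev_blank = 1;    # start of file counts as "after a blank line"

    while (my $line = <$fh>) {
        if ($prev_blank && $line =~ /^From /) {
            # separator found: store the previous message, start a new one
            push @messages, $current if defined $current;
            $current = '';  # the separator line itself is dropped
        } elsif (defined $current) {
            $current .= $line;
        }
        $prev_blank = ($line =~ /^\s*$/) ? 1 : 0;
    }
    push @messages, $current if defined $current;
    return @messages;
}

# Small demo on an in-memory mbox (hypothetical sample data)
my $mbox = "From alice Mon Jan  1 00:00:00 2004\nSubject: one\n\nfirst body\n\n"
         . "From bob Mon Jan  1 00:00:01 2004\nSubject: two\n\nsecond body\n";
open my $fh, '<', \$mbox or die "cannot open in-memory mbox: $!\n";
my @msgs = split_mbox($fh);
print scalar(@msgs), " message(s) found\n";
```

Each returned string could then be handed to MIME::Parser's parse_data
method instead of reading a single message from STDIN. Body lines that
start with "From " without a preceding blank line are left alone.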

The results can be used for reg2rule.pl [2] with -t uri (not rawbody!).

Yes, this is very, very alpha code. I am only posting it here to get
some feedback from you.

Regards, 
Alex

[1] http://www.merchantsoverseas.com/wwwroot/gorilla/evilrules.htm
[2] http://www.wot.no-ip.com/Projects/Blocklist/reg2rule.pl

---snip---
#!/usr/bin/perl

### modules
use MIME::Parser;
use URI::Find;
use URI;

use strict;
use vars qw(%hosts $entity $finder);

### vars
%hosts = ();
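my $tmp_dir = "/tmp/spamcheck";
# make sure the parser's output directory exists before parsing
-d $tmp_dir or mkdir($tmp_dir, 0700) or die "cannot create $tmp_dir: $!\n";

### mime parser and entity object
my $parser = MIME::Parser->new;
$parser->output_dir($tmp_dir);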
$entity = $parser->parse(\*STDIN);

### uri finder
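$finder = URI::Find->new(sub {
       # the callback gets the URI object and the text as found in the body
       my ($uri, $orig) = @_;
       $uri = URI->new($uri);
       if ($uri->scheme =~ /^(http|ftp)/) {
         my $host = $uri->host;
         # host names are case-insensitive, so count them in lower case
         $hosts{lc $host}++ if defined $host && length $host;
       }
       return $orig; # return value replaces the URI in the text: keep it as is
});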

### main
split_entity($entity);
print $_, "\n" foreach keys %hosts;
exit;

### cleanup
END { $entity->purge() if $entity; } # remove temp files; skip if parsing failed

### sub land
sub split_entity {
  my $entity = shift;
  my $num_parts = $entity->parts; # how many mime parts?

  if ($num_parts) {
    # container: recurse into every sub-part
    split_entity( $entity->parts($_) )
      foreach (0..$num_parts-1);
  } elsif ($entity->effective_type =~ /^(message|text)\//) {
    # leaf part: scan the decoded body, if there is one
    my $body = $entity->bodyhandle or return;
    my $text = $body->as_string;
    $finder->find(\$text);
  }
}

###-fin-
---snap---


-- 
Alex Pleiner
zeitform Internet Dienste         Fraunhoferstrasse 5
                                  64283 Darmstadt, Germany
http://www.zeitform.de            Tel.: +49 (0)6151 155-635
mailto:[EMAIL PROTECTED]        Fax:  +49 (0)6151 155-634
GnuPG/PGP Key-ID: 0x613C21EA

