>     while ( m{ ( (?: [a-zA-Z0-9:./-]+ @ )?
>                  [a-zA-Z0-9][a-zA-Z0-9.;-]+\.$tld )
>                (?! \.?\w ) }gxo ) {
>         my $host = lc $1;
>         # Deal with inserted-semicolon munging, e.g. 'http://foo;.com'
>         if ( my @split = $host =~ /(.*?);(.*)/ ) {
>             my @h = split /\./, $split[0];
>             if ( $h[-1] =~ /^$tld$/ ) {

aaaaand I forgot to mention how $tld is defined.

my %firstlevel_tlds = map { $_ => undef } # 128 entries
    qw( ae aero am ar as at au az ba bd be bg bh biz br bs by bz ca cat
        cc ch cl cm cn co com cx cy cz de dk ec ee eg es eu fi fr gd ge gg
        gr gs hk hm hn hr hu id ie il im in info io ir is it je jp ke kg kr
        kz la li lt lv ma md me mk mn mobi ms mu mx my name ne net nl no np
        nu nz org pe ph pk pl pr pro pt py ro rs ru sa sc se sg si sk st su
        tc th tk tl to tp tr tt tv tw tz ua uk us uy uz vc ve vn ws za );

my $tld = my $tld_lc = join '|', keys %firstlevel_tlds;
$tld =~ s/(\w)/[$1\u$1]/g;
$tld = "(?:$tld)";

the list is derived from our URIBL datafeed, e.g. we're only interested in
tld's that actually hit some lists in the first place, therefore skipping
things like .mil which spammers don't seem to have the guts to use.  This
admittedly makes it not a *perfect* solution for munging, but close enough
IMO.  We don't (yet) have DBL or SURBL datafeeds, unfortunately, but the
SURBL two-level and three-level lists are derived from the interwebs and
not from spam traffic AFAICT, and DBL doesn't have a list because they
support wildcards.  If anyone has an SURBL and/or DBL datafeed and is
interested, I could provide a script to generate a list from those to see
if anything should be added.

The substitution business is to make the RE into a big fat ugly mess with
things like '[cC][oO][mM]', so that we can take out the /i, which makes
things about twice as fast. but it's not strictly necessary :)

-Jared

Reply via email to