> This seems to be a big improvement at least on the 3 million lines of
> random traffic i tested with, and it's a smaller patch:
[snip]

Well, it may have been an improvement over my own data, but a colleague
pointed out the following case:

check out spamsite.com;it's awesome!

And this didn't deal with Devin's example of http://spamsite.com;.net either.

At this point it's a bit beyond testing on the original plugin for me, but
here's what I came up with given my own modified hostname regex loop (the
minimal entity decoding and stripping is still needed):

    while ( m{ ( (?: [a-zA-Z0-9:./-]+ @ )?
                 [a-zA-Z0-9][a-zA-Z0-9.;-]+\.$tld )
               (?! \.?\w ) }gxo ) {
        my $host = lc $1;
        # Deal with inserted-semicolon munging, e.g. 'http://foo;.com'
        if ( my @split = $host =~ /(.*?);(.*)/ ) {
            my @h = split /\./, $split[0];
            if ( $h[-1] =~ /^$tld$/ ) {
                # 'foo.com;.net', 'foo.com;.net;.foo', 'foo.com;it's great'
                $host = $split[0];
            } else {
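                # 'foo;.com;.net', 'foo;.com': rejoin with the first label
                # after the semicolon.  Test the match itself, since a
                # failed match would leave a stale $1 behind.
                $host = $split[1] =~ /([^.;]+)/ ? "$split[0].$1" : $split[0];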
            }
        }

This works for the test cases I've come up with, at least, and doesn't
otherwise change processing from what _I_ had before for the other few
million lines I'm checking against.  It is not yet in production, though,
and the change alone has not been tested against the vanilla code.  If
anyone's interested, the loop continues, excluding email addresses while
still including user:p...@site:


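        # Drop everything up to and including the last '..'; skip the
        # remainder if it no longer contains a dot
        if ( $host =~ s/.*\.\.// ) {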
            next unless $host =~ /\./;
        }
        if ( $host =~ s/^\w{3,16}:\/+// ) {
            # 'http://realsite.com/cgi/em...@email.host.com'
            # 'http://user:p...@realsite.com'
            $host =~ s/\/.*//;
            $host =~ s/.*@//;
        } elsif ( $host =~ /(.*):/ ) {
            # 'user:p...@realsite.com'
            # 'mailto:em...@emailhost.com'
            next if $1 =~ /mailto/;
            $host =~ s/.*@//;
        } else {
            # 'realsite.com/cgi/em...@email.host.com'
            $host =~ s/\/.*//;
            next if $host =~ /@/;
        }

This email/userinfo handling is the main reason for the different RE above.
It also removes the need for two RE loops.
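
For anyone who wants to poke at this outside the plugin, here is a rough
standalone sketch that wraps the two snippets in one sub.  The sub name
extract_hosts, the three-entry $tld pattern and the sample lines are all
made up for illustration (the real TLD list is much longer), /o is dropped,
and the semicolon branch tests the match result rather than $1.  Treat it
as a way to try inputs, not as the production code.

```perl
use strict;
use warnings;

# ASSUMPTION: tiny stand-in TLD pattern; the real $tld is not shown here.
my $tld = qr/(?:com|net|org)/;

# Hypothetical wrapper: returns the hosts the loop would keep for one line.
sub extract_hosts {
    my ($line) = @_;
    my @hosts;
    while ( $line =~ m{ ( (?: [a-zA-Z0-9:./-]+ \@ )?
                          [a-zA-Z0-9][a-zA-Z0-9.;-]+\.$tld )
                        (?! \.?\w ) }gx ) {
        my $host = lc $1;
        # Inserted-semicolon munging, e.g. 'http://foo;.com'
        if ( my @split = $host =~ /(.*?);(.*)/ ) {
            my @h = split /\./, $split[0];
            if ( $h[-1] =~ /^$tld$/ ) {
                $host = $split[0];                  # 'foo.com;.net'
            } else {
                $host = $split[1] =~ /([^.;]+)/     # 'foo;.com'
                      ? "$split[0].$1" : $split[0];
            }
        }
        if ( $host =~ s/.*\.\.// ) {
            next unless $host =~ /\./;
        }
        if ( $host =~ s/^\w{3,16}:\/+// ) {
            # scheme present: drop any path, then any userinfo
            $host =~ s/\/.*//;
            $host =~ s/.*\@//;
        } elsif ( $host =~ /(.*):/ ) {
            # 'user:password@' is kept, 'mailto:' is skipped
            next if $1 =~ /mailto/;
            $host =~ s/.*\@//;
        } else {
            # no scheme or colon: drop any path, skip plain email addresses
            $host =~ s/\/.*//;
            next if $host =~ /\@/;
        }
        push @hosts, $host;
    }
    return @hosts;
}

# Made-up sample lines modelled on the cases discussed in this thread.
for my $line ( q{check out spamsite.com;it's awesome!},
               'http://spamsite.com;.net',
               'http://foo;.com',
               'mailto:bob@emailhost.com' ) {
    my @hosts = extract_hosts($line);
    print "$line => ", ( @hosts ? "@hosts" : '(none)' ), "\n";
}
```

Note that with this RE the first sample never reaches the semicolon branch
at all: the match already stops at 'spamsite.com' because ';it' is not
followed by a dot plus TLD.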

-Jared
