> This seems to be a big improvement at least on the 3 million lines of
> random traffic i tested with, and it's a smaller patch:
[snip]

Well, it may have been an improvement over my own data, but a colleague
pointed out the following case:

    check out spamsite.com;it's awesome!

And this didn't deal with Devin's example of http://spamsite.com;.net
either.

At this point it's a bit beyond testing on the original plugin for me,
but here's what I came up with given my own modified hostname RE loop
(the minimal entity decoding + stripping is still needed):

    while ( m{
        (
            (?: [a-zA-Z0-9:./-]+ @ )?
            [a-zA-Z0-9][a-zA-Z0-9.;-]+\.$tld
        )
        (?! \.?\w )
    }gxo ) {
        my $host = lc $1;

        # Deal with inserted-semicolon munging, e.g. 'http://foo;.com'
        if ( my @split = $host =~ /(.*?);(.*)/ ) {
            my @h = split /\./, $split[0];
            if ( $h[-1] =~ /^$tld$/ ) {
                # 'foo.com;.net', 'foo.com;.net;.foo', 'foo.com;it's great'
                $host = $split[0];
            }
            else {
                $split[1] =~ /([^.;]+)/;
                $split[1] = $1 if $1;
                # 'foo;.com;.net', 'foo;.com'
                $host = $1 ? "$split[0].$1" : $split[0];
            }
        }

This works at least for the test cases I've come up with, and doesn't
otherwise change processing from what _I_ had before for the other few
million lines I'm checking against, but it is not yet in production and
the change alone is not tested against the vanilla code.

If anyone's interested, the loop continues, dealing with excluding email
addresses while including user:p...@site:

        if ( $host =~ s/.*\.\.// ) {
            next unless $host =~ /\./;
        }

        if ( $host =~ s/^\w{3,16}:\/+// ) {
            # 'http://realsite.com/cgi/em...@email.host.com'
            # 'http://user:p...@realsite.com'
            $host =~ s/\/.*//;
            $host =~ s/.*@//;
        }
        elsif ( $host =~ /(.*):/ ) {
            # 'user:p...@realsite.com'
            # 'mailto:em...@emailhost.com'
            next if $1 =~ /mailto/;
            $host =~ s/.*@//;
        }
        else {
            # 'realsite.com/cgi/em...@email.host.com'
            $host =~ s/\/.*//;
            next if $host =~ /@/;
        }

This is the main reason for the different RE above. Also it obsoletes
the need for two RE loops.

-Jared
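
For anyone who wants to poke at the semicolon handling in isolation, a minimal standalone sketch might look like the following. It is not the plugin code: the three-entry TLD list and the helper name unmunge_host are stand-ins invented for illustration (the real $tld pattern isn't shown here), and it takes its captures in list context rather than reading $1 after the match.

```perl
use strict;
use warnings;

# ASSUMPTION: a tiny stand-in TLD list; the real $tld pattern is not shown.
my $tld = qr/(?:com|net|org)/;

# Hypothetical helper: undo inserted-semicolon munging on one lowercased
# candidate host, following the same two branches as the loop body.
sub unmunge_host {
    my ($host) = @_;

    # No semicolon, nothing to do.
    return $host unless $host =~ /(.*?);(.*)/;
    my ( $head, $tail ) = ( $1, $2 );

    # If the text before the first ';' already ends in a TLD, keep only that.
    # e.g. 'foo.com;.net', 'foo.com;.net;.foo'
    my @labels = split /\./, $head;
    return $head if @labels && $labels[-1] =~ /^$tld$/;

    # Otherwise glue on the first label found after the ';'.
    # e.g. 'foo;.com', 'foo;.com;.net'
    my ($next) = $tail =~ /([^.;]+)/;
    return defined $next ? "$head.$next" : $head;
}

for my $case ( 'foo;.com', 'foo;.com;.net', 'foo.com;.net', 'foo.com' ) {
    printf "%-16s => %s\n", $case, unmunge_host($case);
}
```

All four sample inputs come out as foo.com. Taking the capture in list context means a failed match yields undef instead of a leftover $1 from an earlier match, which is the one place this sketch deliberately differs from the loop body.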