>> It seems like all you have to do to get around the etc. problem is to
>> wait a little longer before applying the fixup -- allow the semicolon
>> to match in the hostname search and then strip it out.
>
> My bad.. I guess the plugin currently only fixes up '&#\d\d\d' encoding,
> not etc. maybe i'll work on that...

This seems to be a big improvement, at least on the 3 million lines of
random traffic I tested with, and it's a smaller patch:

--- uribl.orig  2010-07-23 17:06:10.894320796 -0500
+++ uribl       2010-07-23 19:57:39.304321519 -0500
@@ -289,7 +289,13 @@
         # Undo URI escape munging
         $l =~ s/[=%]([0-9A-Fa-f]{2,2})/chr(hex($1))/ge;
         # Undo HTML entity munging (e.g. in parameterized redirects)
-        $l =~ s/&#(\d{2,3});?/chr($1)/ge;
+        $l =~ s/&#(\d{2,4});?/chr($1)/ge;
+        # Un-encode a few common important named entities and discard the rest
+        $l =~ s/&nbsp;/ /go;
+        $l =~ s/&amp;/&/go;
+        $l =~ s/&gt;/>/go;
+        $l =~ s/&lt;/</go;
+        $l =~ s/&\w{2,6};//go;
         # Dodge inserted-semicolon munging
         $l =~ tr/;//d;
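
For anyone who wants to poke at the behaviour outside the plugin, here is a rough standalone sketch of the same substitution chain. It assumes the four named entities handled by the patch are &nbsp;, &amp;, &gt; and &lt;. The sub name unmunge_line and the sample hostnames are made up for illustration and are not part of uribl.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the plugin): runs one line of text
# through the same substitutions, in the same order, as the patched hunk.
sub unmunge_line {
    my ($l) = @_;

    # Undo URI escape munging (%XX) and quoted-printable style (=XX)
    $l =~ s/[=%]([0-9A-Fa-f]{2})/chr(hex($1))/ge;

    # Undo numeric HTML entities; the trailing semicolon is optional
    $l =~ s/&#(\d{2,4});?/chr($1)/ge;

    # Un-encode a few named entities (assumed set), then discard the rest
    $l =~ s/&nbsp;/ /g;
    $l =~ s/&amp;/&/g;
    $l =~ s/&gt;/>/g;
    $l =~ s/&lt;/</g;
    $l =~ s/&\w{2,6};//g;

    # Dodge inserted-semicolon munging: drop any semicolons still left
    $l =~ tr/;//d;

    return $l;
}

# Numeric entity plus &amp; in a query string
print unmunge_line('http://example&#46;com/?a=1&amp;b=2'), "\n";
# prints http://example.com/?a=1&b=2

# An unrecognised named entity splitting a hostname is simply dropped
print unmunge_line('http://exam&shy;ple.com/'), "\n";
# prints http://example.com/
```

The sketch leaves off the /o flag, which only affects patterns that interpolate variables and so makes no difference on these constant ones. Order matters here: the named-entity pass has to run before the tr/;//d line, otherwise the semicolons that terminate the entities are gone by the time the patterns look for them.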