>> It seems like all you have to do to get around the &nbsp; etc. problem
>> is to wait a little longer before applying the fixup -- allow the
>> semicolon to match in the hostname search and then strip it out.
>
> My bad.. I guess the plugin currently only fixes up '&#\d\d\d' encoding,
> not &nbsp; etc.  maybe i'll work on that...

This seems to be a big improvement, at least on the 3 million lines of
random traffic I tested with, and it's a smaller patch:

--- uribl.orig  2010-07-23 17:06:10.894320796 -0500
+++ uribl       2010-07-23 19:57:39.304321519 -0500
@@ -289,7 +289,13 @@
         # Undo URI escape munging
         $l =~ s/[=%]([0-9A-Fa-f]{2,2})/chr(hex($1))/ge;
         # Undo HTML entity munging (e.g. in parameterized redirects)
-        $l =~ s/&#(\d{2,3});?/chr($1)/ge;
+        $l =~ s/&#(\d{2,4});?/chr($1)/ge;
+        # Un-encode a few common important named entities and discard the rest
+        $l =~ s/&nbsp;/ /go;
+        $l =~ s/&amp;/&/go;
+        $l =~ s/&gt;/>/go;
+        $l =~ s/&lt;/</go;
+        $l =~ s/&\w{2,6};//go;
         # Dodge inserted-semicolon munging
         $l =~ tr/;//d;
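
For anyone who wants to try the effect outside the plugin, here is a rough
standalone sketch of the same substitutions. The sub name demunge() and the
sample URLs are made up for illustration and are not part of uribl; the
regexes are copied from the patched hunk, minus the /o flag, which does
nothing on patterns without interpolated variables.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the plugin): applies the substitutions
# from the patched hunk to a single string and returns the result.
sub demunge {
    my ($l) = @_;
    # Undo URI escape munging (=XX or %XX hex escapes)
    $l =~ s/[=%]([0-9A-Fa-f]{2,2})/chr(hex($1))/ge;
    # Undo numeric HTML entity munging, now up to 4 digits
    $l =~ s/&#(\d{2,4});?/chr($1)/ge;
    # Un-encode a few common named entities and discard the rest
    $l =~ s/&nbsp;/ /g;
    $l =~ s/&amp;/&/g;
    $l =~ s/&gt;/>/g;
    $l =~ s/&lt;/</g;
    $l =~ s/&\w{2,6};//g;
    # Dodge inserted-semicolon munging
    $l =~ tr/;//d;
    return $l;
}

# Made-up sample inputs, one per kind of munging
print demunge('http://www&#46;example&#46;com/'), "\n";    # numeric entity
print demunge('http://www%2Eexample%2Ecom/'), "\n";        # URI escape
print demunge('http://www.exam&shy;ple.com/'), "\n";       # discarded named entity
print demunge('http://www.example.com&nbsp;/path'), "\n";  # &nbsp; becomes a space
```

Note that the named-entity substitutions have to run before the tr/;//d
line, otherwise the trailing semicolons they match on are already gone.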


