Ah, I tried it in my own Firefox and the trick still works :)  It seems
like all you have to do to get around the &nbsp; etc. problem is to wait a
little longer before applying the fixup: allow the semicolon to match in
the hostname search and then strip it out of the matched hostname.  I've
attached a patch, untested (though the same change is tested on my
now-greatly-forked uribl plugin).  I'm sad to say I haven't been on GitHub
for a while, so it may be an age before I get around to submitting a patch
according to the standards, but if anybody else is interested in picking
it up, here it is :)
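
For anyone who wants to poke at the idea without applying the patch, here
is a minimal standalone sketch of "match with semicolons allowed, then
strip them from the match".  The sub name find_uri_hosts and the shortened
TLD list are made up for illustration; this is not the plugin code, just
the same shape as the full-URI regex in the patch.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the uribl plugin: find hostnames in full
# URIs, tolerating semicolons while matching and stripping them afterwards.
sub find_uri_hosts {
    my ($text) = @_;
    my @hosts;
    while ($text =~ m{
        \w{3,16}:/+                      # protocol
        (?:\S+@)?                        # user/pass
        (
         [a-zA-Z0-9][a-zA-Z0-9;\-.]+\.   # hostname, semicolons allowed
         (?:com|net|org)                 # tld, abbreviated for this sketch
        )
        }gix) {
        my $host = lc $1;
        $host =~ tr/;//d;                # strip semicolons from the match only
        push @hosts, $host;
    }
    return @hosts;
}

# prints "domain.com"
print join("\n", find_uri_hosts('see http://domain;.com/ now')), "\n";
```

Note that, as the quoted message below points out, a munge like
"http://domain.com;.com/foo" still comes out as "domain.com.com" with this
approach, so it is not a complete answer.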

Also, was it intended that the URI-finding sub should find RHS hostnames
in email addresses as well?  I'm pretty sure it does, although again, the
testing I'm doing is no longer on the official code.  It seems undesirable
to me; I'm currently working on avoiding such hits while still catching
things like http://citibank.com:foooo...@phishingsite.com
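
To make the distinction concrete, here is a rough sketch with two
stripped-down matchers.  The sub names, the shortened TLD list and the
"secret" userinfo string are all made up for illustration, and this only
mirrors the shape of the two regexes in the patch, not the official code.
A bare-hostname match picks up the RHS of an email address, while a match
that requires a scheme skips the address but still catches the host after
the userinfo.

```perl
use strict;
use warnings;

# Hypothetical: bare-hostname style match, no scheme required.
sub bare_hosts {
    my ($text) = @_;
    my @hosts;
    while ($text =~ m{
        ([a-zA-Z0-9][a-zA-Z0-9\-.]+\.   # hostname
         (?:com|net|org)                # tld, abbreviated for this sketch
        )(?!\w)
        }gix) {
        push @hosts, lc $1;
    }
    return @hosts;
}

# Hypothetical: full-URI style match, scheme required, userinfo skipped.
sub uri_hosts {
    my ($text) = @_;
    my @hosts;
    while ($text =~ m{
        \w{3,16}:/+                     # protocol
        (?:\S+@)?                       # user/pass
        ([a-zA-Z0-9][a-zA-Z0-9\-.]+\.   # hostname
         (?:com|net|org)                # tld, abbreviated for this sketch
        )
        }gix) {
        push @hosts, lc $1;
    }
    return @hosts;
}

# The bare matcher hits the email RHS: prints "example.com"
print join(",", bare_hosts('mail bob@example.com today')), "\n";

# The URI matcher skips the userinfo: prints "phishingsite.com"
print join(",", uri_hosts('http://citibank.com:secret@phishingsite.com/login')), "\n";
```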

I got approval from my boss yesterday to submit my updates to the plugin
to the ML, btw, which I'll do when it's a bit closer to finished, probably
the beginning of next week.  It's a huge change though, and notably adds
dependencies on MIME::Parser and Net::DNS::Async and completely changes
the config file format.  I don't really have approval to spend hours
creating digestible individual patches... but posting the whole plugin is
better than not posting any code, right? :)

-Jared

> On Wed, Jul 21, 2010 at 07:54:30AM -0500, Jared Johnson wrote:
>> Unlike the other bits of "dodge this sort of munging" operations,
>> examining my test results and asking uncle google has not made it clear
>> to me what "inserted-semicolon munging" really is.  Can anyone shed light
>> on
>
> My memory of this is fuzzy, but my SVN log indicates it was an attempt to
> deal with a style of munging that exploited different browser behavior on
> encountering semicolons in the hostname component of URLs.  At the time,
> the form I was considering was "http://domain;.com/".  Under Firefox, the
> semicolon is taken to mean the end of the hostname and the start of the
> path, thus triggering the implied-.com behavior and ending up at
> "http://domain.com/;.com/".  I forget what IE does/did, but given the
> IE/FF market share balance in 2005 when I added that feature, it was
> probably similar.
>
> In fact the problem is a lot more complex, because the five major browsers
> out there all deal differently with deviations from the norm in URLs, and
> spammers exploit those deviations to mislead parsers.  For the case I had
> in mind, stripping semicolons out would be the right thing, but it will be
> misled by the munge pattern "http://domain.com;.com/foo".  The uribl
> plugin deals with a lot of the munging tricks that were common back when
> it was written, but it's probably not comprehensive today and it's
> definitely susceptible to picking up bogus hostnames (e.g. the "nbsp;.net"
> behavior you note) based on what's left after the known munging tricks are
> unwound.
>
> http://code.google.com/p/browsersec/wiki/Part1#Uniform_Resource_Locators
> has a summary of some of the deviations, although it doesn't address this
> specific one.
>
> --
> Devin  \ aqua(at)devin.com, IRC:Requiem; http://www.devin.com
> Carraway \ 1024D/E9ABFCD2: 13E7 199E DD1E 65F0 8905 2E43 5395 CA0D E9AB FCD2
>
--- uribl.orig	2010-07-23 17:06:10.894320796 -0500
+++ uribl	2010-07-23 17:08:34.314345909 -0500
@@ -290,8 +290,6 @@
         $l =~ s/[=%]([0-9A-Fa-f]{2,2})/chr(hex($1))/ge;
         # Undo HTML entity munging (e.g. in parameterized redirects)
         $l =~ s/&#(\d{2,3});?/chr($1)/ge;
-        # Dodge inserted-semicolon munging
-        $l =~ tr/;//d;
 
         while ($l =~ m{
             \w{3,16}:/+            # protocol
@@ -339,7 +337,7 @@
         }
         while ($l =~ m{
             ((?:www\.)?                             # www?
-             [a-zA-Z0-9][a-zA-Z0-9\-.]+\.           # hostname
+             [a-zA-Z0-9][a-zA-Z0-9;\-.]+\.           # hostname
              (?:aero|arpa|asia|biz|cat|com|coop|    # tld
                 edu|gov|info|int|jobs|mil|mobi|
                 museum|name|net|org|pro|tel|travel|
@@ -347,6 +345,8 @@
             )(?!\w)        
             }gix) {
             my $host = lc $1;
+            # Dodge inserted-semicolon munging
+            $host =~ tr/;//d;
             my @host_domains = split /\./, $host;
             $self->log(LOGDEBUG, "uribl: matched 'www.' hostname $host");
 
@@ -372,7 +372,7 @@
             \w{3,16}:/+                 # protocol
             (?:\S+@)?                   # user/pass
             (
-	     [a-zA-Z0-9][a-zA-Z0-9\-.]+\.           # hostname
+	     [a-zA-Z0-9][a-zA-Z0-9;\-.]+\.           # hostname
 	     (?:aero|arpa|asia|biz|cat|com|coop|    # tld
                 edu|gov|info|int|jobs|mil|mobi|
                 museum|name|net|org|pro|tel|travel|
@@ -380,6 +380,8 @@
             )
             }gix) {
             my $host = lc $1;
+            # Dodge inserted-semicolon munging
+            $host =~ tr/;//d;
             my @host_domains = split /\./, $host;
             $self->log(LOGDEBUG, "uribl: matched full URI hostname $host");
 
