On Wed, Jul 21, 2010 at 07:54:30AM -0500, Jared Johnson wrote:
> Unlike the other bits of "dodge this sort of munging" operations,
> examining my test results and asking uncle google has not made it clear to
> me what "inserted-semicolon munging" really is.  Can anyone shed light on

My memory of this is fuzzy, but my SVN log indicates it was an attempt to deal
with a style of munging that exploited different browser behavior on
encountering semicolons in the hostname component of URLs.  At the time, the
form I was considering was "http://domain;.com/".  Under firefox, the
semicolon is taken to mean the end of the hostname and the start of the path,
thus triggering the implied-.com behavior and ending up at
"http://domain.com/;.com/".  I forget what IE does/did, but given the IE/FF
market share balance in 2005 when I added that feature, it was probably
similar.

In fact the problem is a lot more complex, because the five major browsers out
there all deal differently with deviations from the norm in URLs, and spammers
exploit those deviations to mislead parsers.  For the case I had in mind,
stripping semicolons out would be the right thing, but that approach would be
misled by the munge pattern "http://domain.com;.com/foo".  The uribl plugin
deals with a lot of the munging tricks that were common back when it was
written, but it's probably not comprehensive today and it's definitely
susceptible to picking up bogus hostnames (e.g. the "nbsp;.net" behavior you
note) based on what's left after the known munging tricks are unwound.
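
To make the two munge patterns concrete, here is a rough Python sketch.  It is
not taken from the uribl plugin; the function names are made up for
illustration, and the "implied .com" rule is only my approximation of the
browser behavior described above.

```python
import re

def host_strip_semicolons(url):
    """Naive unmunging: take the host part and delete every semicolon."""
    rest = re.sub(r"^[a-z]+://", "", url, flags=re.I)   # drop the scheme
    host = re.split(r"/", rest, maxsplit=1)[0]          # host ends at first '/'
    return host.replace(";", "")

def host_cut_at_semicolon(url):
    """Treat ';' as the end of the hostname (the Firefox-like reading),
    then apply an assumed implied '.com' if no dot is left in the host."""
    rest = re.sub(r"^[a-z]+://", "", url, flags=re.I)   # drop the scheme
    host = re.split(r"[/;]", rest, maxsplit=1)[0]       # host ends at '/' or ';'
    if "." not in host:
        host += ".com"                                  # hypothetical implied-.com rule
    return host

for url in ("http://domain;.com/", "http://domain.com;.com/foo"):
    print(url, host_strip_semicolons(url), host_cut_at_semicolon(url))
```

Stripping gets "domain.com" for the first URL but "domain.com.com" for the
second, while cutting at the semicolon gets "domain.com" for both.  Neither is
a general answer, since other browsers may read the same URL differently.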

http://code.google.com/p/browsersec/wiki/Part1#Uniform_Resource_Locators has a
summary of some of the deviations, although it doesn't address this specific
one.

-- 
Devin  \ aqua(at)devin.com, IRC:Requiem; http://www.devin.com
Carraway \ 1024D/E9ABFCD2: 13E7 199E DD1E 65F0 8905 2E43 5395 CA0D E9AB FCD2
