On Wed, Jul 21, 2010 at 07:54:30AM -0500, Jared Johnson wrote:
> Unlike the other bits of "dodge this sort of munging" operations,
> examining my test results and asking uncle google has not made it clear to
> me what "inserted-semicolon munging" really is. Can anyone shed light on

My memory of this is fuzzy, but my SVN log indicates it was an attempt to
deal with a style of munging that exploited different browser behavior on
encountering semicolons in the hostname component of URLs.

At the time, the form I was considering was "http://domain;.com/". Under
Firefox, the semicolon is taken to mean the end of the hostname and the
start of the path, thus triggering the implied-.com behavior and ending up
at "http://domain.com/;.com/". I forget what IE does/did, but given the
IE/FF market share balance in 2005 when I added that feature, it was
probably similar.

In fact the problem is a lot more complex, because the five major browsers
out there all deal differently with deviations from the norm in URLs, and
spammers exploit those deviations to mislead parsers. For the case I had
in mind, stripping semicolons out would be the right thing, but that
approach is itself misled by the munge pattern
"http://domain.com;.com/foo": the browser stops at the semicolon and goes
to domain.com, while the stripped form yields "domain.com.com".

The uribl plugin deals with a lot of the munging tricks that were common
back when it was written, but it's probably not comprehensive today and
it's definitely susceptible to picking up bogus hostnames (e.g. the
"nbsp;.net" behavior you note) based on what's left after the known
munging tricks are unwound.

http://code.google.com/p/browsersec/wiki/Part1#Uniform_Resource_Locators
has a summary of some of the deviations, although it doesn't address this
specific one.

-- 
Devin  \ aqua(at)devin.com, IRC:Requiem; http://www.devin.com
Carraway \ 1024D/E9ABFCD2: 13E7 199E DD1E 65F0 8905 2E43 5395 CA0D E9AB FCD2
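
For illustration only, here is a rough Python sketch of the two hostname
readings discussed in this message. It is not the uribl plugin's code and
not a faithful model of any browser. The function names
firefox_like_host and strip_semicolons_host are made up for this example,
and the "semicolon ends the hostname, then .com is implied" rule is an
assumption taken from the description of the 2005-era Firefox behavior.

```python
import re


def _after_scheme(url: str) -> str:
    """Return everything after '://' (or the whole string if absent)."""
    return url.split("://", 1)[-1]


def firefox_like_host(url: str) -> str:
    """Hypothetical model of the browser behavior described in the message.

    Assumption: ';' ends the hostname just like '/', '?' or '#', and a
    hostname without any dot gets '.com' appended.
    """
    host = re.split(r"[;/?#]", _after_scheme(url), maxsplit=1)[0]
    if "." not in host:
        host += ".com"  # assumed implied-.com fallback
    return host


def strip_semicolons_host(url: str) -> str:
    """Hypothetical de-munging step: drop ';' from the authority part."""
    authority = re.split(r"[/?#]", _after_scheme(url), maxsplit=1)[0]
    return authority.replace(";", "")


if __name__ == "__main__":
    for url in ("http://domain;.com/", "http://domain.com;.com/foo"):
        print(url, firefox_like_host(url), strip_semicolons_host(url))
```

Under these assumptions the two functions agree on "http://domain;.com/"
but disagree on "http://domain.com;.com/foo", which is the case where
stripping semicolons produces a hostname the browser would never visit.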