Shoot..I see that the href from the example was stripped once it is displayed on https://mail.gnome.org/archives/xml/2018-January/msg00010.html. Here is a gist that preserves formatting: https://gist.github.com/ptoomey3/4f684c7386229658b39b69756e262050.
On Fri, Jan 12, 2018 at 10:01 AM Patrick Toomey <patrick.too...@github.com> wrote: > While triaging a reported cross site scripting bug we were analyzing the > behavior of our HTML sanitization code and noticed that it was parsing an > input in an unexpected way. The sanitization library itself eventually > wraps Nokogiri, which is a relatively thin wrapper around libxml2. We > reached out to Kurt Seifriend and Daniel Veillard to let them know about > the observed behavior. There was discussion about whether the observed > behavior was was expected or not and eventually it was suggested that we > move the discussion to the libxml2 mailing list (the folks on that thread > reserved CVE-2016-3709 in case it is decided that this is an issue). > > In short https://github.com/GNOME/libxml2/commit/960f0e2 introduced > support for not URI escaping server side includes. However, this seems to > lead to the below behavior. > > -> libxml2 git:(bc5a5d65) ✗ cat test/HTML/ssiquote.html > <html><body><p><a > href='<!--"><script>alert(1)</script>-->'>test1</a></p></body></html> > -> libxml2 git:(bc5a5d65) ✗ make testHTML > -> libxml2 git:(bc5a5d65) ✗ ./testHTML test/HTML/ssiquote.html > <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" " > http://www.w3.org/TR/REC-html40/loose.dtd"> > <html><body><p><a > href="<!--"><script>alert(1)</script>-->">test1</a></p></body></html> > > Notice that the output results in a script tag added to the resulting > parsed output. Here is a small bit of Java/Xerces code to compare: > > import java.io.IOException; > import java.io.StringReader; > import java.io.StringWriter; > > import javax.xml.parsers.*; > import javax.xml.transform.OutputKeys; > import javax.xml.transform.Transformer; > import javax.xml.transform.TransformerException; > import javax.xml.transform.TransformerFactory; > import javax.xml.transform.dom.DOMSource; > import javax.xml.transform.stream.StreamResult; > > import org.w3c.dom.Document; > import org.w3c.dom.NamedNodeMap; > import org.w3c.dom.Node; > import org.w3c.dom.NodeList; > import org.xml.sax.InputSource; > import org.xml.sax.SAXException; > > public class XmlTester { > > public static void main(String[] args) throws > ParserConfigurationException, SAXException, IOException, > TransformerException { > String text = "<html><body><p><a > href='<!--\"><script>alert(1)</script>-->'>test1</a></p></body></html>"; > DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance(); > DocumentBuilder builder = factory.newDocumentBuilder(); > > // text contains the XML content > Document doc = builder.parse(new InputSource(new StringReader(text))); > System.out.println(getStringFromDocument(doc)); > } > > public static String getStringFromDocument(Document doc) throws > TransformerException { > DOMSource domSource = new DOMSource(doc); > StringWriter writer = new StringWriter(); > StreamResult result = new StreamResult(writer); > TransformerFactory tf = TransformerFactory.newInstance(); > Transformer transformer = tf.newTransformer(); > transformer.transform(domSource, result); > return writer.toString(); > } > } > > This Java code, with the same input, results in the following output: > > <html> > <body> > <p> > <a href="<!--%22><script>alert(1)</script>-->">test1</a> > </p> > </body> > </html> > > The attribute contents are quoted/escaped such that they don’t break out > of the attribute once it is parsed. This libxml2 behavior doesn’t apply to > all attributes. If we change the href to a class attribute there is no > issue. This likely makes sense since the above mentioned commit > specifically references not URI escaping. > > -> libxml2 git:(bc5a5d65) ✗ cat test/HTML/ssiquote.html > <html><body><p><a > class='<!--"><script>alert(1)</script>-->'>test1</a></p></body></html> > -> libxml2 git:(bc5a5d65) ✗ make testHTML > -> libxml2 git:(bc5a5d65) ✗ ./testHTML test/HTML/ssiquote.html > <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" " > http://www.w3.org/TR/REC-html40/loose.dtd"> > <html><body><p><a > class='<!--"><script>alert(1)</script>-->'>test1</a></p></body></html> > > So, I guess the question is, what do people think? I believe the argument > from Daniel was roughly that this would be expected behavior for server > side includes. However, this functionality seems to be in conflict with the > Xerces behavior and it also leads to a trivial way to cause new/unexpected > nodes to be introduced into the tree simply by parsing the document. > >
_______________________________________________ xml mailing list, project page http://xmlsoft.org/ xml@gnome.org https://mail.gnome.org/mailman/listinfo/xml