While triaging a reported cross site scripting bug we were analyzing the
behavior of our HTML sanitization code and noticed that it was parsing an
input in an unexpected way.  The sanitization library itself eventually
wraps Nokogiri, which is a relatively thin wrapper around libxml2. We
reached out to Kurt Seifriend and Daniel Veillard to let them know about
the observed behavior.  There was discussion about whether the observed
behavior was was expected or not and eventually it was suggested that we
move the discussion to the libxml2 mailing list (the folks on that thread
reserved CVE-2016-3709 in case it is decided that this is an issue).

In short https://github.com/GNOME/libxml2/commit/960f0e2 introduced support
for not URI escaping server side includes. However, this seems to lead to
the below behavior.

->  libxml2 git:(bc5a5d65) ✗ cat test/HTML/ssiquote.html
<html><body><p><a
href='&lt;!--"&gt;&lt;script&gt;alert(1)&lt;/script&gt;-->'>test1</a></p></body></html>
->  libxml2 git:(bc5a5d65) ✗ make testHTML
->  libxml2 git:(bc5a5d65) ✗ ./testHTML test/HTML/ssiquote.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "
http://www.w3.org/TR/REC-html40/loose.dtd";>
<html><body><p><a
href="<!--"><script>alert(1)</script>-->">test1</a></p></body></html>

Notice that the output results in a script tag added to the resulting
parsed output.  Here is a small bit of Java/Xerces code to compare:

import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.parsers.*;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class XmlTester {

public static void main(String[] args) throws ParserConfigurationException,
SAXException, IOException, TransformerException {
String text = "<html><body><p><a
href='&lt;!--\"&gt;&lt;script&gt;alert(1)&lt;/script&gt;-->'>test1</a></p></body></html>";
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();

// text contains the XML content
Document doc = builder.parse(new InputSource(new StringReader(text)));
System.out.println(getStringFromDocument(doc));
}

public static String getStringFromDocument(Document doc) throws
TransformerException {
DOMSource domSource = new DOMSource(doc);
StringWriter writer = new StringWriter();
StreamResult result = new StreamResult(writer);
TransformerFactory tf = TransformerFactory.newInstance();
Transformer transformer = tf.newTransformer();
transformer.transform(domSource, result);
return writer.toString();
}
}

This Java code, with the same input, results in the following output:

<html>
<body>
<p>
<a href="<!--%22><script>alert(1)</script>-->">test1</a>
</p>
</body>
</html>

The attribute contents are quoted/escaped such that  they don’t break out
of the attribute once it is parsed. This libxml2 behavior doesn’t apply to
all attributes. If we change the href to a class attribute there is no
issue. This likely makes sense since the above mentioned commit
specifically references not URI escaping.

->  libxml2 git:(bc5a5d65) ✗ cat test/HTML/ssiquote.html
<html><body><p><a
class='&lt;!--"&gt;&lt;script&gt;alert(1)&lt;/script&gt;-->'>test1</a></p></body></html>
->  libxml2 git:(bc5a5d65) ✗ make testHTML
->  libxml2 git:(bc5a5d65) ✗ ./testHTML test/HTML/ssiquote.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "
http://www.w3.org/TR/REC-html40/loose.dtd";>
<html><body><p><a
class='<!--"><script>alert(1)</script>-->'>test1</a></p></body></html>

So, I guess the question is, what do people think? I believe the argument
from Daniel was roughly that this would be expected behavior for server
side includes. However, this functionality seems to be in conflict with the
Xerces behavior and it also leads to a trivial way to cause new/unexpected
nodes to be introduced into the tree simply by parsing the document.
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
https://mail.gnome.org/mailman/listinfo/xml

Reply via email to