[issue17318] xml.sax and xml.dom fetch DTDs by default

Raynard Sandwick Wed, 27 Feb 2013 16:52:28 -0800

New submission from Raynard Sandwick:

Note that URIs in the following are only meant as links when in parentheses; 
otherwise, they are identifiers and mostly will not yield useful results. I 
have only worked with xml.sax in Python 2.6 and 2.7, so I cannot speak to its 
current state in later versions.


The condition described in Python issue #2124 
(http://bugs.python.org/issue2124) may yet be a defect, and is at the least a 
reasonably important enhancement, but apparently was not sufficiently 
specified, so I will attempt to clarify. As an aside, it is similar to a 
libxml2 issue on which I have also commented today 
(https://bugzilla.gnome.org/show_bug.cgi?id=162776), whose statement of issue 
actually contains what I would expect to be correct behavior if the toggling 
action were setting an option/feature rather than importing an additional 
module.

The most common case, and the reason w3c has been inundated with the described 
requests, is that every time any user anywhere uses xml.sax in its default form 
to parse an XHTML document containing a doctype declaration, a request is sent 
to www.w3.org for the contents of that DTD from the URI in its system 
identifier. This is not documented anywhere (which would be the primary reason 
to call this a defect), and is confusing because it has the effect of using the 
terms parser and validator (or "validating parser," whichever is the preferred 
name) interchangeably.

The w3c is largely to blame, since their own definition document for XML 
(http://www.w3.org/TR/REC-xml/#sec-external-ent) defines the DTD as a "special 
kind of external entity," and then goes on to say that XML processors *MAY* use 
any combination of pubid+sysid to find an alternative method of resolving the 
reference, but otherwise *MUST* use the URI.

However, this is only necessary when *validating* XML. The DTD is a "mostly 
useless, but required" (http://en.wikipedia.org/wiki/Document_Type_Declaration) 
entity in HTML5, e.g., but is not required in XML generally. Even when present, 
the only time a processor should consult the DTD is during validation, not 
parsing. If the default parser revealed by xml.sax is a validator rather than 
just a parser, that should be communicated clearly to the user. When we discuss 
a CSV parser, we expect it to accept lines separated by some character, each 
with columns separated by commas. We do not expect it to verify that certain 
values are found in certain columns of the first line unless we specify that it 
should. In specifying that it should, we have asked for a validator rather than 
a parser. This issue is related to the XML analogue of that distinction.

The most valid and important complaint in the referenced blog post is: "don't 
fetch stuff unless you actually need it," which is what xml.sax users may be 
unwittingly doing if validation is the default behavior. Further, if xml.sax 
were actually *not* conducting validation by default, there is no reason 
whatever to retrieve the DTD, since any external entity references can remain 
unresolved in well-formed XML prior to validation.

Note that the features, http://xml.org/sax/features/external-general-entities, 
.../external-parameter-entities, and .../validation have no specified defaults 
(http://www.saxproject.org/apidoc/org/xml/sax/package-summary.html#package_description).
 Making these enabled by default causes network-required side effects, which I 
would contend is improper: unless a user asks for network activity, none should 
occur. An implicit request for network activity, such as validation, should be 
fully and widely-visibly documented as a legitimate side effect.

The set of primary use cases for the xml.sax parsers certainly include 
validation, but users will often be unaware that it is the default, and more 
importantly be unaware that the parser will therefore request the DTD from its 
URI. While the feature, .../external-general-entities, partially solves the 
problem, it is not a full solution, because a well-formed XML document can 
contain external entities regardless of the location of DTD subsets. The w3c's 
description ("special kind of external entity") is important here - the DTD is 
special for a reason, and has its own tag/specifier as a result: resolving 
general external entities after intentionally omitting an external DTD subset 
is an acceptable use case, especially in a non-validating parser.

My proposal would be to enhance/fix xml.sax by doing the following:

1) allow toggling of external DTD subset loading via a feature such as 
http://apache.org/xml/features/nonvalidating/load-external-dtd 
(http://xerces.apache.org/xerces-j/features.html),
2) cause the feature, http://xml.org/sax/features/validation, to automatically 
enable the DTD loading feature as well, just as it does for the two currently 
implemented external entity features,
3) document the default behavior, specially noting that users can expect URIs 
to be resolved, across the network/internet if necessary, after either the DTD 
feature or the validation feature is toggled to the enabled state,

and in my opinion:

4) disable the DTD feature by default, so the xml.sax-uninitiated developer who 
arrives upon the module as a solution doesn't start testing/using it without 
realizing these requests will be sent.

Sufficient documentation could override #4, since there is a 
backward-compatibility issue, but I think the detriment to the w3c is enough 
reason to rethink it nevertheless. Catalogs are a nice solution as well when 
validation is needed, but when it is not needed, there is no reason to require 
the extra work of building a catalog (that can't be guaranteed to be writable 
in situ without sysadmin access) when it is essentially purposeless.

I am continuing to search for the entry point for "<!" (and would appreciate 
any pointers) but have resorted to subclassing ExpatParser (which, again, kills 
the nice abstraction xml.sax touts) and omitting external entites with public 
identifiers starting with "-//W3C//DTD " - this is not a general solution, as 
DTDs are not guaranteed to come from w3c, and a correct solution would apply 
the appropriate omissions to the full "<!DOCTYPE ...>" entity, either by 
leaving external subsets as unparsed and orphaned entities in the document, or 
(only as a secondary potential solution since internal subsets could still be 
present and would thus become broken) by ignoring it completely. It might not 
even be reasonable to consider the latter, though when parsing only and not 
validating it would be a correctly-working result, so if the former is 
unachievable, the latter would be a decent improvement for that situation.

As a final note, in case it's helpful: my approach to a fix has been to examine 
ways to treat the DOCTYPE declaration itself, but another approach would be to 
have EntityResolver.resolveEntity receive a declaration type alongside the 
public and system identifiers, and thus the DOCTYPE declaration type could 
receive appropriate treatment within the current framework quite easily.

----------
components: XML
messages: 183193
nosy: rsandwick3
priority: normal
severity: normal
status: open
title: xml.sax and xml.dom fetch DTDs by default
type: resource usage
versions: Python 2.6, Python 2.7

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue17318>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue17318] xml.sax and xml.dom fetch DTDs by default

Reply via email to