Fredrik Lundh wrote:
> Chris Spencer wrote:
>
> > If an XML parser reads in and then writes out a document without having
> > altered it, then the new document should be the same as the original.
>
> says who?

Good question. There is no One True Answer even within the XML standards.
It all boils down to how you define "the same". Which parts of the XML
document are meaningful content that needs to be preserved, and which ones
are mere encoding variations that may be omitted from the internal
representation?

Some relevant references which may be used as guidelines:

* http://www.w3.org/TR/xml-infoset
  The XML infoset defines 11 types of information items, including the
  document type declaration, notations and other features. It does not
  appear to be suitable for a lightweight API like ElementTree.

* http://www.w3.org/TR/xpath-datamodel
  The XPath data model uses a subset of the XML infoset with "only" seven
  node types.

* http://www.w3.org/TR/xml-c14n
  The canonical XML recommendation is meant to describe a process, but it
  also effectively defines a data model: anything preserved by the
  canonicalization process is part of the model, and anything not
  preserved is not part of the model.

In theory, this definition should be equivalent to the XPath data model,
since canonical XML is defined in terms of the XPath data model. In
practice, the XPath data model defines properties not required for
producing canonical XML (e.g. the unparsed entities associated with the
document node). I like this alternative "black box" definition because it
provides a simple touchstone for determining what is or isn't part of the
model.

I think it would be a good goal for ElementTree to aim for compliance with
the canonical XML data model. It's already quite close.

It's possible to use the canonical XML data model without being a
canonical XML processor, but it would be nice if parse() followed by
write() actually passed the canonical XML test vectors. It's the easiest
way to demonstrate compliance conclusively.

So what changes are required to make ElementTree canonical?

1. PI nodes are already supported for output. Need an option to preserve
   them on parsing.

2. Comment nodes are already supported for output. Need an option to
   preserve them on parsing (canonical XML also defines a "no comments"
   canonical form).

3. Preserve comments and PIs outside the root element (store them as
   children of the ElementTree object?)

4. Sorting of attributes by canonical order.

5. Minor formatting and spacing issues in opening tags.

oh, and one more thing...

6. Preserve namespace prefixes ;-)
   (see http://www.w3.org/TR/xml-c14n#NoNSPrefixRewriting for rationale)
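
To make the "black box" test concrete, here is a minimal sketch of comparing documents by their canonical form. It assumes Python 3.8 or later, where xml.etree.ElementTree provides canonicalize() and TreeBuilder accepts insert_comments and insert_pis; both postdate this discussion. Note that canonicalize() implements C14N 2.0 rather than the 1.0 recommendation linked above, and that same_under_c14n() and the sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET


def same_under_c14n(xml_a, xml_b, **options):
    """Hypothetical helper: True if both documents canonicalize identically.

    Keyword options (e.g. with_comments=True) are passed to ET.canonicalize().
    """
    return ET.canonicalize(xml_a, **options) == ET.canonicalize(xml_b, **options)


original = '<doc b="2" a="1"><!-- note --><item/></doc>'

# Same content, but different attribute order, quoting and empty-tag style.
reordered = "<doc a='1' b='2'><!-- note --><item></item></doc>"

# Round trip with the default parser, which does not keep comments in the tree.
default_round_trip = ET.tostring(ET.fromstring(original), encoding="unicode")

# Round trip with a parser that is asked to keep comments and PIs in the tree.
preserving_parser = ET.XMLParser(
    target=ET.TreeBuilder(insert_comments=True, insert_pis=True)
)
preserving_round_trip = ET.tostring(
    ET.fromstring(original, parser=preserving_parser), encoding="unicode"
)

# Encoding variations disappear under canonicalization.
print(same_under_c14n(original, reordered, with_comments=True))

# The default round trip is only "the same" in the no-comments canonical form.
print(same_under_c14n(original, default_round_trip))
print(same_under_c14n(original, default_round_trip, with_comments=True))

# Preserving comments on parsing makes the with-comments forms match as well.
print(same_under_c14n(original, preserving_round_trip, with_comments=True))
```

This only exercises comments inside the root element; comments and PIs outside the root (item 3) and prefix preservation (item 6) would need their own checks.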