> I'm new to xml mongering so forgive me if there's an obvious
> well-known answer to this. It's not real obvious from the library
> documentation I've looked at so far. Basically I have to munch on a
> bunch of xml files which contain character entities like &uacute;
> which are apparently nonstandard.
If they contain such things and do not contain a document type
definition, they are not well-formed XML files (i.e. they can't be
called "XML" in a meaningful sense). It would have been helpful if you
had given an example of such a document.

> Basically I want to know if there's a way to supply the regular parser
> (preferably xml.etree but I guess I can switch to another one if
> necessary) with some kind of entity table, and/or if the info is
> supposed to be found in the DTD or someplace like that. Right now I'm
> ignoring the DTD and simply figuring out the doc structure by
> eyeballing the xml files, maybe not a perfectly approved method but
> it seems to be what most people do.

If there is a document type declaration in the document, the best way
is to parse in a mode where the parser downloads the DTD and resolves
the entity references itself.

In SAX, you can put an EntityResolver into the parser and then return
a file-like object from resolveEntity. This can be used to avoid the
network download; the document type declaration would still have to be
present. Alternatively, you can implement a skippedEntity callback in
the SAX content handler.

In ElementTree, the XMLTreeBuilder has an attribute named "entity",
which is a dictionary used to map entity names in entity references to
their definitions. Whether you can make that parser download the DTD
itself, I don't know.

I've appended rough, untested sketches of all three approaches below.

Regards,
Martin
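First, the EntityResolver route. This is only a sketch of what I mean,
not something I've run against your files: the file name, the entity
list, and the replacement DTD content are placeholders, and the
document is assumed to start with something like
<!DOCTYPE doc SYSTEM "doc.dtd"> so the resolver gets asked for the
external subset.

import io
import xml.sax
from xml.sax.handler import (ContentHandler, EntityResolver,
                             feature_external_ges)

# Local stand-in for the external DTD: declare whatever entities the
# documents actually use (placeholder list).
LOCAL_DTD = b'<!ENTITY uacute "&#250;">'

class LocalDTDResolver(EntityResolver):
    def resolveEntity(self, publicId, systemId):
        # Give the parser a file-like object instead of letting it
        # fetch the DTD named in the document over the network.
        return io.BytesIO(LOCAL_DTD)

class TextDumper(ContentHandler):
    def characters(self, content):
        print(repr(content))

parser = xml.sax.make_parser()
parser.setContentHandler(TextDumper())
parser.setEntityResolver(LocalDTDResolver())
# Some Python versions leave external entity processing off by
# default; turn it on so resolveEntity is actually consulted.
parser.setFeature(feature_external_ges, True)
parser.parse("document.xml")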
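Second, the skippedEntity route, equally untested. ENTITY_TEXT and the
file name are made up, and whether skippedEntity is actually invoked
depends on the document carrying a document type declaration and on
how the parser's external-entity features are set; otherwise the
parser may treat the references as errors.

import sys
import xml.sax
from xml.sax.handler import ContentHandler

# Map entity names (without the & and ;) to the text they should
# become (placeholder values).
ENTITY_TEXT = {"uacute": "\u00fa", "eacute": "\u00e9"}

class EntityAwareHandler(ContentHandler):
    def characters(self, content):
        sys.stdout.write(content)

    def skippedEntity(self, name):
        # Called for entity references the parser did not expand;
        # substitute our own replacement text.
        sys.stdout.write(ENTITY_TEXT.get(name, ""))

xml.sax.parse("document.xml", EntityAwareHandler())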
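Third, the ElementTree variant, again just a sketch. XMLTreeBuilder is
named XMLParser in later ElementTree releases (use whichever your
version provides), the entity values and file name are placeholders,
and as far as I can tell the document still needs its document type
declaration for the entity table to be consulted.

from xml.etree import ElementTree

parser = ElementTree.XMLTreeBuilder()
# The "entity" dictionary maps entity names to their replacement text.
parser.entity["uacute"] = "\u00fa"
parser.entity["eacute"] = "\u00e9"

tree = ElementTree.parse("document.xml", parser=parser)
print(tree.getroot().tag)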