I have a log file that contains dumps of XML messages embedded in other output, for example:

... rubbish ///asd laksj aslf
<nif_DEBUG time="Fri, 16 May 2008 13:40:17, 330">
<?xml version="1.0" encoding="UTF-8"?>
<ns>
<PDQ Lang="fr-FR" ID="XM;1928">content</PDQ>
</ns>
</nif_DEBUG>
.. more junk ...
then more xml

This example is, of course, abridged.

I want to write a streaming filter that throws out all the junk and returns each complete XML message as a string. Ideally I also want to select which messages I am interested in. For example, the output from the above would be:

<?xml version="1.0" encoding="UTF-8"?>
<ns>
<PDQ Lang="fr-FR" ID="XM;1928">content</PDQ>
</ns>

There are two problems:

1. Clearing away junk that is nothing like XML.
2. Handling the <?xml declaration that lies inside the other XML tags.

The first I can handle relatively simply by reading through the string until I get what looks like a valid XML tag. I can then pass the rest on to an XML parser such as xml.sax. However, the parser then fails with:

XMLSyntaxError: XML declaration allowed only at the start of the document

I would like a more forgiving parser that reports bad XML through a callback, from which I can simply tell it to carry on. Bear in mind also that I will probably not have the end of the stream when I start processing.

All suggestions and pointers welcome.

Andrew
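
For concreteness, here is a minimal sketch of the kind of filter described above. It assumes that every message is wrapped in a nif_DEBUG element as in the sample, that the wrappers do not nest, and that the log arrives as chunks of text. The class name XmlMessageFilter, its feed method and the wanted parameter are names made up for illustration. It sidesteps the declaration problem by cutting the inner document out as a string instead of parsing the wrapper, so it is not the forgiving parser asked for, only a possible workaround.

```python
import re


class XmlMessageFilter:
    """Incrementally pull embedded XML documents out of a noisy log stream.

    Hypothetical helper: assumes each document sits inside a non-nested
    wrapper element (nif_DEBUG in the sample log).
    """

    def __init__(self, wrapper="nif_DEBUG", wanted=None):
        # Matches the complete opening tag of the wrapper, attributes included.
        self._open = re.compile(r"<%s\b[^>]*>" % re.escape(wrapper))
        self._close = "</%s>" % wrapper
        self._wanted = wanted  # optional predicate applied to each message string
        self._buf = ""

    def feed(self, chunk):
        """Add a chunk of log text and return the messages completed so far."""
        self._buf += chunk
        messages = []
        while True:
            start = self._open.search(self._buf)
            if start is None:
                # No complete opening tag yet. Keep only the text from the
                # last "<" onwards, since it may be a tag split across chunks.
                cut = self._buf.rfind("<")
                self._buf = self._buf[cut:] if cut != -1 else ""
                break
            end = self._buf.find(self._close, start.end())
            if end == -1:
                # Wrapper opened but not closed yet: drop the junk before it
                # and wait for more input.
                self._buf = self._buf[start.start():]
                break
            message = self._buf[start.end():end].strip()
            self._buf = self._buf[end + len(self._close):]
            if self._wanted is None or self._wanted(message):
                messages.append(message)
        return messages


if __name__ == "__main__":
    # Keep only French-language messages; the predicate is a plain substring test.
    flt = XmlMessageFilter(wanted=lambda msg: 'Lang="fr-FR"' in msg)
    for piece in ['junk <nif_DEBUG time="x"> <?xml version="1.0"?> <ns>',
                  '<PDQ Lang="fr-FR">content</PDQ></ns> </nif_DEBUG> more junk']:
        for msg in flt.feed(piece):
            print(msg)
```

Because feed can be called with whatever text is available, the end of the stream is not needed up front. With the sample layout, each returned string starts at the <?xml declaration, so it can be handed to an XML parser on its own, one document at a time.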