Re: [xml] Recovering from errors in an XML "stream"

Webb Scales Mon, 09 Sep 2019 19:41:58 -0700

On 9/7/19 12:37 AM, Liam R. E. Quin wrote:

On Fri, 2019-09-06 at 01:57 -0400, Webb Scales wrote:

The first issue is that the XML parser seems to balk entirely at the
fact that the document is preceded by a comment before the XML
declaration.  (I'm less than shocked, but it is kind of
disappointing.)

I'd be sad if it accepted it - it's not allowed.

Thanks for the BNF and the pointer to the specification. However, the fact remains that I don't control the text that I'm trying to parse, and I still need to parse it, even though it's not "well-formed".

The next issue is that the XML parser reports an error near the end
of the document, when it notices that the document is followed by an
XML declaration.  (I'm a little closer to shocked by this.)

Feed the parser XML without errors and this won't happen. Or are you
saying there are multiple documents in the same input stream?

I've got a stream of bytes; it contains text which is "XML-like". I would love to break it up into chunks which are well-formed (or otherwise acceptable) XML documents and then feed it to a LibXML2 function, but I need to do so without making too many assumptions about the input and without having to teach my code too much about XML (otherwise, there'd be no point using LibXML2).

As it happens, there are newlines between the documents, so I tweaked my custom I/O handler to return only up to the next newline. However, after receiving the text for a complete document, the TextReader still calls my handler /again/ and then issues an error because there is text after the closing tag for the root...if it hadn't made the extra call, it wouldn't have been prompted to fail like that!

the offending text doesn't appear
until after the closing tag for the root.)

isn't that the point?

The point is that the TextReader is (I thought...) supposed to return the nodes or elements /as they are parsed/...so why does it report errors in text that is well beyond the current node (which, in fact, it had to issue an /extra/ I/O request to get)??

Without that lookahead, I could have stopped the parse when it reached the end of the document, and started a /new/ reader for the next document. But, instead, the current reader consumes some of the text which belongs to the next document, and then goes into an endless cycle where it returns errors without advancing to the next node.
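
For illustration only, here is a rough sketch of the "one parser per document" idea described above, written in Python with the standard library rather than LibXML2. It assumes (this is an assumption, not something the stream guarantees) that every document begins with an XML declaration, so the text can be cut just before each "<?xml " and each chunk handed to a fresh parser. The names split_documents and parse_stream are hypothetical. A literal "<?xml " inside a comment or CDATA section would defeat the splitting.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: each document starts with an XML declaration ("<?xml ").
# The trailing \s keeps "<?xml-stylesheet" and similar from matching.
_DECL = re.compile(r"<\?xml\s")


def split_documents(text):
    """Yield one string per document, cutting just before each declaration."""
    starts = [m.start() for m in _DECL.finditer(text)]
    if not starts:
        # No declaration found: offer the whole input as a single candidate.
        if text.strip():
            yield text.strip()
        return
    # Text before the first declaration (e.g. a stray comment) is skipped.
    for begin, end in zip(starts, starts[1:] + [len(text)]):
        yield text[begin:end].strip()


def parse_stream(text):
    """Parse each chunk with its own parser; return (roots, errors)."""
    roots, errors = [], []
    for index, chunk in enumerate(split_documents(text)):
        try:
            roots.append(ET.fromstring(chunk))
        except ET.ParseError as exc:
            # A bad chunk is recorded and skipped; later chunks still parse.
            errors.append((index, str(exc)))
    return roots, errors
```

Because each chunk gets a fresh parser, an error in one document cannot leave the parser stuck for the ones that follow. The cost is that the splitting code has to know one small thing about XML, namely what a declaration looks like.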

Is there some other approach which is better for my situation than
the xmlTextReader?

XSLT 3 provides a streaming mode which does what it sounds like you
might need, but libxml supports only XSLT 1. However, it, too, needs
well-formed XML input without errors. There's also STX. Or use a SAX
parser and keep only what you need, but again you need well-formed
input. By the time you've written a program to fix the input, your
program might well be able to do what you need anyway, no??

Yes, I'm trying to avoid reinventing the wheel: if I write code which is able to transform my input into well-formed XML, I won't need LibXML to parse it for me.

I was hoping that there was a way to handle the errors encountered by the TextReader, recover from them, and continue with the parse, but it sounds like that's not practical.



            Webb

--

Webb Scales
Principal Software Architect
603-673-2306
www.ursasecure.com <https://www.ursasecure.com>
w...@ursasecure.com <mailto:w...@ursasecure.com>

_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
https://mail.gnome.org/mailman/listinfo/xml
