On 03/16/19 17:21, Tom Lane wrote:
> Chapman Flack <c...@anastigmatix.net> writes:
>> On 03/16/19 16:55, Tom Lane wrote:
>>> What do you think of the idea I just posted about parsing off the DOCTYPE
>>> thing for ourselves, and not letting libxml see it?
>
>> The principled way of doing that would be to pre-parse to find a DOCTYPE,
>> and if there is one, leave it there and parse the input as we do for
>> 'document'. Per XML, if there is a DOCTYPE, the document must satisfy
>> the 'document' syntax requirements, and per SQL/XML:2006-and-later,
>> 'content' is a proper superset of 'document', so if we were asked for
>> 'content' and can successfully parse it as 'document', we're good,
>> and if we see a DOCTYPE and yet it incurs a parse error as 'document',
>> well, that's what needed to happen.
>
> Hm, so, maybe just
>
> (1) always try to parse as document. If successful, we're done.
>
> (2) otherwise, if allowed by xmloption, try to parse using our
> current logic for the CONTENT case.

What I don't like about that is that (a) the input could be
arbitrarily long and complex to parse (not that you can't imagine
a database populated with lots of short little XML snippets, but
at the same time, a query could quite plausibly deal in yooge ones),
and (b), step (1) could fail at the last byte of the input, followed
by total reparsing as (2).

I think the safer structure is clearly that of the current patch,
modulo whether the "has a DOCTYPE" test is done by libxml itself
(with the assumptions you don't like) or by a pre-scan.

So the current structure is:

restart:
  asked for document?
    parse as document, or fail
  else asked for content:
    parse as content
    failed?
      because DOCTYPE? restart as if document
      else fail

and a pre-scan structure could be very similar:

restart:
  asked for document?
    parse as document, or fail
  else asked for content:
    pre-scan finds DOCTYPE?
      restart as if document
    else
      parse as content, or fail

The pre-scan is a simple linear search and will ordinarily say yes or no
within a couple dozen characters--you could *have* an input with 20k of
leading whitespace and comments, but it's hardly the norm. Just trying to
parse as 'document' first could easily parse a large fraction of the input
before discovering it's followed by something that can't follow a document
element.

Regards,
-Chap
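
For illustration only, here is a minimal sketch of the pre-scan structure
described above. It is written in Python rather than as backend C, and the
names has_doctype and choose_parse_mode are hypothetical, not taken from the
patch or from libxml. It assumes the only things that may precede a DOCTYPE
are whitespace, an XML declaration or processing instructions, and comments,
and it makes no attempt to validate any of them.

```python
def has_doctype(text: str) -> bool:
    """Linear pre-scan: True if a DOCTYPE precedes any element or text.

    Skips leading whitespace, comments, and processing instructions
    (which covers the XML declaration), then looks at what comes next.
    """
    i, n = 0, len(text)
    while i < n:
        if text[i] in " \t\r\n":
            i += 1
        elif text.startswith("<!--", i):
            end = text.find("-->", i + 4)
            if end < 0:
                return False  # unterminated comment; let the real parser complain
            i = end + 3
        elif text.startswith("<?", i):
            end = text.find("?>", i + 2)
            if end < 0:
                return False  # unterminated PI; let the real parser complain
            i = end + 2
        else:
            # First thing that is not skippable decides the answer.
            return text.startswith("<!DOCTYPE", i)
    return False


def choose_parse_mode(xmloption: str, text: str) -> str:
    """Mirror the 'restart as if document' step of the pre-scan structure."""
    if xmloption == "document" or has_doctype(text):
        return "document"
    return "content"
```

The scan stops at the first character that is not whitespace, a comment, or
a processing instruction, so for typical input it decides within the first
few dozen characters, and the input is then parsed only once in whichever
mode was chosen.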