Thank you both! I had glanced at that item in the commitfest but didn't notice it would fix this issue. I'll try to test and review it before the end of the month; that is much better than starting from scratch myself. At a quick glance the patch looks logical and should work for my use case.

Thanks,
Ryan Lambert

On Sat, Mar 16, 2019 at 4:33 PM Chapman Flack <c...@anastigmatix.net> wrote:

> On 03/16/19 17:21, Tom Lane wrote:
> > Chapman Flack <c...@anastigmatix.net> writes:
> >> On 03/16/19 16:55, Tom Lane wrote:
> >>> What do you think of the idea I just posted about parsing off the DOCTYPE
> >>> thing for ourselves, and not letting libxml see it?
> >
> >> The principled way of doing that would be to pre-parse to find a DOCTYPE,
> >> and if there is one, leave it there and parse the input as we do for
> >> 'document'. Per XML, if there is a DOCTYPE, the document must satisfy
> >> the 'document' syntax requirements, and per SQL/XML:2006-and-later,
> >> 'content' is a proper superset of 'document', so if we were asked for
> >> 'content' and can successfully parse it as 'document', we're good,
> >> and if we see a DOCTYPE and yet it incurs a parse error as 'document',
> >> well, that's what needed to happen.
> >
> > Hm, so, maybe just
> >
> > (1) always try to parse as document. If successful, we're done.
> >
> > (2) otherwise, if allowed by xmloption, try to parse using our
> > current logic for the CONTENT case.
>
> What I don't like about that is that (a) the input could be
> arbitrarily long and complex to parse (not that you can't imagine
> a database populated with lots of short little XML snippets, but
> at the same time, a query could quite plausibly deal in yooge ones),
> and (b), step (1) could fail at the last byte of the input, followed
> by total reparsing as (2).
>
> I think the safer structure is clearly that of the current patch,
> modulo whether the "has a DOCTYPE" test is done by libxml itself
> (with the assumptions you don't like) or by a pre-scan.
>
> So the current structure is:
>
> restart:
>   asked for document?
>     parse as document, or fail
>   else asked for content:
>     parse as content
>     failed?
>       because DOCTYPE?
>         restart as if document
>       else fail
>
> and a pre-scan structure could be very similar:
>
> restart:
>   asked for document?
>     parse as document, or fail
>   else asked for content:
>     pre-scan finds DOCTYPE?
>       restart as if document
>     else parse as content, or fail
>
> The pre-scan is a simple linear search and will ordinarily say yes or no
> within a couple dozen characters--you could *have* an input with 20k of
> leading whitespace and comments, but it's hardly the norm. Just trying to
> parse as 'document' first could easily parse a large fraction of the input
> before discovering it's followed by something that can't follow a document
> element.
>
> Regards,
> -Chap
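
For anyone following along, here is a rough Python sketch of how I read the pre-scan structure described above. It is not the patch's code (the real thing would be C in the backend), and the names has_doctype and choose_parse_mode are made up for illustration. It assumes the only things that may precede a DOCTYPE are whitespace, an XML declaration or other processing instructions, and comments, and it does not validate those constructs beyond finding their terminators.

```python
def has_doctype(text: str) -> bool:
    """Linear pre-scan: skip leading whitespace, comments, and
    processing instructions (including an XML declaration), then
    report whether the next thing is a DOCTYPE declaration."""
    i = 0
    n = len(text)
    while i < n:
        if text[i] in " \t\r\n":
            i += 1
        elif text.startswith("<!--", i):
            end = text.find("-->", i + 4)
            if end < 0:
                return False  # unterminated comment; let the real parser complain
            i = end + 3
        elif text.startswith("<?", i):
            end = text.find("?>", i + 2)
            if end < 0:
                return False  # unterminated PI; let the real parser complain
            i = end + 2
        else:
            # First significant construct: either a DOCTYPE or not.
            return text.startswith("<!DOCTYPE", i)
    return False


def choose_parse_mode(text: str, xmloption: str) -> str:
    """Mirror the 'restart' outline: 'document' stays 'document';
    'content' is upgraded to 'document' only if the pre-scan finds
    a DOCTYPE. xmloption is assumed to be 'document' or 'content'."""
    if xmloption == "document":
        return "document"
    return "document" if has_doctype(text) else "content"
```

The point of the sketch is only that the scan stops at the first significant construct, so it normally looks at a handful of characters and never triggers a full parse followed by a full reparse.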