Re: Fix XML handling with DOCTYPE

Chapman Flack Sat, 16 Mar 2019 15:33:48 -0700

On 03/16/19 17:21, Tom Lane wrote:
> Chapman Flack <[email protected]> writes:
>> On 03/16/19 16:55, Tom Lane wrote:
>>> What do you think of the idea I just posted about parsing off the DOCTYPE
>>> thing for ourselves, and not letting libxml see it?
> 
>> The principled way of doing that would be to pre-parse to find a DOCTYPE,
>> and if there is one, leave it there and parse the input as we do for
>> 'document'. Per XML, if there is a DOCTYPE, the document must satisfy
>> the 'document' syntax requirements, and per SQL/XML:2006-and-later,
>> 'content' is a proper superset of 'document', so if we were asked for
>> 'content' and can successfully parse it as 'document', we're good,
>> and if we see a DOCTYPE and yet it incurs a parse error as 'document',
>> well, that's what needed to happen.
> 
> Hm, so, maybe just
> 
> (1) always try to parse as document.  If successful, we're done.
> 
> (2) otherwise, if allowed by xmloption, try to parse using our
> current logic for the CONTENT case.


What I don't like about that is that (a) the input could be
arbitrarily long and complex to parse (not that you can't imagine
a database populated with lots of short little XML snippets, but
at the same time, a query could quite plausibly deal in yooge ones),
and (b), step (1) could fail at the last byte of the input, followed
by total reparsing as (2).

I think the safer structure is clearly that of the current patch,
modulo whether the "has a DOCTYPE" test is done by libxml itself
(with the assumptions you don't like) or by a pre-scan.

So the current structure is:

restart:
  asked for document?
    parse as document, or fail
  else asked for content:
    parse as content
    failed?
      because DOCTYPE? restart as if document
      else fail

and a pre-scan structure could be very similar:

restart:
  asked for document?
    parse as document, or fail
  else asked for content:
    pre-scan finds DOCTYPE?
      restart as if document
    else parse as content, or fail

The pre-scan is a simple linear search and will ordinarily say yes or no
within a couple dozen characters--you could *have* an input with 20k of
leading whitespace and comments, but it's hardly the norm. Just trying to
parse as 'document' first could easily parse a large fraction of the input
before discovering it's followed by something that can't follow a document
element.

Regards,
-Chap

Re: Fix XML handling with DOCTYPE

Reply via email to