Re: Fix XML handling with DOCTYPE

Ryan Lambert Sat, 16 Mar 2019 15:44:53 -0700

Thank you both!  I had glanced at that item in the commitfest but didn't
notice it would fix this issue.
I'll try to test/review this before the end of the month, much better than
starting from scratch myself.   A quick glance at the patch looks logical
and looks like it should work for my use case.


Thanks,

Ryan Lambert


On Sat, Mar 16, 2019 at 4:33 PM Chapman Flack <c...@anastigmatix.net> wrote:

> On 03/16/19 17:21, Tom Lane wrote:
> > Chapman Flack <c...@anastigmatix.net> writes:
> >> On 03/16/19 16:55, Tom Lane wrote:
> >>> What do you think of the idea I just posted about parsing off the
> DOCTYPE
> >>> thing for ourselves, and not letting libxml see it?
> >
> >> The principled way of doing that would be to pre-parse to find a
> DOCTYPE,
> >> and if there is one, leave it there and parse the input as we do for
> >> 'document'. Per XML, if there is a DOCTYPE, the document must satisfy
> >> the 'document' syntax requirements, and per SQL/XML:2006-and-later,
> >> 'content' is a proper superset of 'document', so if we were asked for
> >> 'content' and can successfully parse it as 'document', we're good,
> >> and if we see a DOCTYPE and yet it incurs a parse error as 'document',
> >> well, that's what needed to happen.
> >
> > Hm, so, maybe just
> >
> > (1) always try to parse as document.  If successful, we're done.
> >
> > (2) otherwise, if allowed by xmloption, try to parse using our
> > current logic for the CONTENT case.
>
> What I don't like about that is that (a) the input could be
> arbitrarily long and complex to parse (not that you can't imagine
> a database populated with lots of short little XML snippets, but
> at the same time, a query could quite plausibly deal in yooge ones),
> and (b), step (1) could fail at the last byte of the input, followed
> by total reparsing as (2).
>
> I think the safer structure is clearly that of the current patch,
> modulo whether the "has a DOCTYPE" test is done by libxml itself
> (with the assumptions you don't like) or by a pre-scan.
>
> So the current structure is:
>
> restart:
>   asked for document?
>     parse as document, or fail
>   else asked for content:
>     parse as content
>     failed?
>       because DOCTYPE? restart as if document
>       else fail
>
> and a pre-scan structure could be very similar:
>
> restart:
>   asked for document?
>     parse as document, or fail
>   else asked for content:
>     pre-scan finds DOCTYPE?
>       restart as if document
>     else parse as content, or fail
>
> The pre-scan is a simple linear search and will ordinarily say yes or no
> within a couple dozen characters--you could *have* an input with 20k of
> leading whitespace and comments, but it's hardly the norm. Just trying to
> parse as 'document' first could easily parse a large fraction of the input
> before discovering it's followed by something that can't follow a document
> element.
>
> Regards,
> -Chap
>

Re: Fix XML handling with DOCTYPE

Reply via email to