Re: [HACKERS] patch: function xmltable

Pavel Stehule Fri, 23 Sep 2016 01:31:12 -0700

2016-09-23 10:05 GMT+02:00 Craig Ringer <[email protected]>:

> On 22 September 2016 at 02:31, Pavel Stehule <[email protected]>
> wrote:
>
> > another small update - fix XMLPath parser - support multibytes characters
>
> I'm returning for another round of review.
>
> The code doesn't handle the 5 XML built-in entities correctly in
> text-typed output. It processes &apos; and &quot; but not &amp, &lt or
> &gt; . See added test. I have not fixed this, but I think it's clearly
> broken:
>
>
> + -- XML builtin entities
> + SELECT * FROM xmltable('/x/a' PASSING
> '<x><a><ent>&apos;</ent></a><a><ent>&quot;</ent></a><a><
> ent>&amp;</ent></a><a><ent>&lt;</ent></a><a><ent>&gt;</ent></a></x>'
> COLUMNS ent text);
> +   ent
> + -------
> +  '
> +  "
> +  &amp;
> +  &lt;
> +  &gt;
> + (5 rows)
>
> so I've adjusted the docs to claim that they're expanded. The code
> needs fixing to avoid entity-escaping when the output column type is
> not 'xml'.
>
>
> &apos; and &quot; entities in xml-typed output are expanded, not
> preserved. I don't know if this is intended but I suspect it is:
>
> + SELECT * FROM xmltable('/x/a' PASSING
> '<x><a><ent>&apos;</ent></a><a><ent>&quot;</ent></a><a><
> ent>&amp;</ent></a><a><ent>&lt;</ent></a><a><ent>&gt;</ent></a></x>'
> COLUMNS ent xml);
> +        ent
> + ------------------
> +  <ent>'</ent>
> +  <ent>"</ent>
> +  <ent>&amp;</ent>
> +  <ent>&lt;</ent>
> +  <ent>&gt;</ent>
> + (5 rows)
>
>
> For the docs changes relevant to the above search for "The five
> predefined XML entities". Adjust that bit of docs if I guessed wrong
> about the intended behaviour.
>
> The tests don't cover CDATA or PCDATA . I didn't try to add that, but
> they should.
>
>
> Did some docs copy-editing and integrated some examples. Explained how
> nested elements work, that multiple top level elements is an error,
> etc. Explained the time-of-evaluation stuff. Pointed out that you can
> refer to prior output columns in PATH and DEFAULT, since that's weird
> and unusual compared to normal SQL. Documented handling of multiple
> node matches, including the surprising results of somepath/text() on
> <somepath>x<!--blah-->y</somepath>. Documented handling of nested
> elements. Documented that xmltable works only on XML documents, not
> fragments/forests.
>
> Regarding evaluation time, it struck me that evaluating path
> expressions once per row means the xpath must be parsed and processed
> once per row. Isn't it desirable to store and re-use the preparsed
> xpath? I don't think this is a major problem, since we can later
> detect stable/immutable expressions including constants, evaluate only
> once in that case, and cache. It's just worth thinking about.
>
> The docs and tests don't seem to cover XML entities. What's the
> behaviour there? Core XML only defines one entity, but if a schema
> defines more how are they processed? The tests need to cover the
> predefined entities &quot; &amp; &apos; &lt; and &gt; at least.
>
> I have no idea whether the current code can fetch a DTD and use any
> <!ENTITY > declarations to expand entities, but I'm guessing not? If
> not, external DTDs, and internal DTDs with external entities should be
> documented as unsupported.
>
> It doesn't seem to cope with internal DTDs at all (libxml2 limitation?):
>
> SELECT * FROM xmltable('/' PASSING $XML$<?xml version="1.0"
> standalone="yes" ?>
> <!DOCTYPE foo [
>   <!ELEMENT foo (#PCDATA)>
>   <!ENTITY pg "PostgreSQL">
> ]>
> <foo>Hello &pg;.</foo>
> $XML$ COLUMNS foo text);
>
> + ERROR:  invalid XML content
> + LINE 1: SELECT * FROM xmltable('/' PASSING $XML$<?xml version="1.0" ...
> +                                            ^
> + DETAIL:  line 2: StartTag: invalid element name
> + <!DOCTYPE foo [
> +  ^
> + line 3: StartTag: invalid element name
> +   <!ELEMENT foo (#PCDATA)>
> +    ^
> + line 4: StartTag: invalid element name
> +   <!ENTITY pg "PostgreSQL">
> +    ^
> + line 6: Entity 'pg' not defined
> + <foo>Hello &pg;.</foo>
> +                ^
>
>
> libxml seems to support documents with internal DTDs:
>
> $ xmllint --valid /tmp/x
> <?xml version="1.0" standalone="yes"?>
> <!DOCTYPE foo [
> <!ELEMENT foo (#PCDATA)>
> <!ENTITY pg "PostgreSQL">
> ]>
> <foo>Hello &pg;.</foo>
>
>
> so presumably the issue lies in the xpath stuff? Note that it's not
> even ignoring the DTD and choking on the undefined entity, it's
> choking on the DTD its self.
>
>
> OK, code comments:
>
>
> In +ExecEvalTableExpr, shouldn't you be using PG_ENSURE_ERROR_CLEANUP
> instead of a PG_TRY() / PG_CATCH() block?
>
>
> I think the new way you handle the type stuff is much, much better,
> and with comments to explain too. Thanks very much.
>
>
> There's an oversight in tableexpr vs xmltable separation here:
>
> +        case T_TableExpr:
> +            *name = "xmltable";
> +            return 2;
>
> presumably you need to look at the node and decide what kind of table
> expression it is or just use a generic "tableexpr".
>
> Same problem here:
>
> +        case T_TableExpr:
> +            {
> +                TableExpr  *te = (TableExpr *) node;
> +
> +                /* c_expr shoud be closed in brackets */
> +                appendStringInfoString(buf, "XMLTABLE(");
>
>
This is correct, but not well commented - looks on XMLEXPR node - TableExpr
is a holder, but it is invisible for user. User running a XMLTABLE function
and should to see XMLTABLE. It will be more clean when we will support
JSON_TABLE function.



>
>
> I don't have the libxml knowledge or remaining brain to usefully
> evaluate the xpath and xml specifics in xpath.c today. It does strike
> me that the new xpath parser should probably live in its own file,
> though.
>

I'll try move it to separate file


>
> I think this is all a big improvement. Barring the notes above and my
> lack of review of the guts of the xml.c parts of it, I'm pretty happy
> with what I see now.
>

Thank you. I hope so all major issues are solved. Probably some XML
specific related issues are there - but I am happy, so you have well XML
knowledge and you will test a corner cases.

Regards

Pavel



>
>
> --
>  Craig Ringer                   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
>

Re: [HACKERS] patch: function xmltable

Reply via email to