Re: [HACKERS] patch: function xmltable

Craig Ringer Fri, 23 Sep 2016 01:06:07 -0700

On 22 September 2016 at 02:31, Pavel Stehule <[email protected]> wrote:


> another small update - fix XMLPath parser - support multibytes characters

I'm returning for another round of review.

The code doesn't handle the 5 XML built-in entities correctly in
text-typed output. It processes &apos; and &quot; but not &amp, &lt or
&gt; . See added test. I have not fixed this, but I think it's clearly
broken:


+ -- XML builtin entities
+ SELECT * FROM xmltable('/x/a' PASSING
'<x><a><ent>&apos;</ent></a><a><ent>&quot;</ent></a><a><ent>&amp;</ent></a><a><ent>&lt;</ent></a><a><ent>&gt;</ent></a></x>'
COLUMNS ent text);
+   ent
+ -------
+  '
+  "
+  &amp;
+  &lt;
+  &gt;
+ (5 rows)

so I've adjusted the docs to claim that they're expanded. The code
needs fixing to avoid entity-escaping when the output column type is
not 'xml'.


&apos; and &quot; entities in xml-typed output are expanded, not
preserved. I don't know if this is intended but I suspect it is:

+ SELECT * FROM xmltable('/x/a' PASSING
'<x><a><ent>&apos;</ent></a><a><ent>&quot;</ent></a><a><ent>&amp;</ent></a><a><ent>&lt;</ent></a><a><ent>&gt;</ent></a></x>'
COLUMNS ent xml);
+        ent
+ ------------------
+  <ent>'</ent>
+  <ent>"</ent>
+  <ent>&amp;</ent>
+  <ent>&lt;</ent>
+  <ent>&gt;</ent>
+ (5 rows)


For the docs changes relevant to the above search for "The five
predefined XML entities". Adjust that bit of docs if I guessed wrong
about the intended behaviour.

The tests don't cover CDATA or PCDATA . I didn't try to add that, but
they should.


Did some docs copy-editing and integrated some examples. Explained how
nested elements work, that multiple top level elements is an error,
etc. Explained the time-of-evaluation stuff. Pointed out that you can
refer to prior output columns in PATH and DEFAULT, since that's weird
and unusual compared to normal SQL. Documented handling of multiple
node matches, including the surprising results of somepath/text() on
<somepath>x<!--blah-->y</somepath>. Documented handling of nested
elements. Documented that xmltable works only on XML documents, not
fragments/forests.

Regarding evaluation time, it struck me that evaluating path
expressions once per row means the xpath must be parsed and processed
once per row. Isn't it desirable to store and re-use the preparsed
xpath? I don't think this is a major problem, since we can later
detect stable/immutable expressions including constants, evaluate only
once in that case, and cache. It's just worth thinking about.

The docs and tests don't seem to cover XML entities. What's the
behaviour there? Core XML only defines one entity, but if a schema
defines more how are they processed? The tests need to cover the
predefined entities &quot; &amp; &apos; &lt; and &gt; at least.

I have no idea whether the current code can fetch a DTD and use any
<!ENTITY > declarations to expand entities, but I'm guessing not? If
not, external DTDs, and internal DTDs with external entities should be
documented as unsupported.

It doesn't seem to cope with internal DTDs at all (libxml2 limitation?):

SELECT * FROM xmltable('/' PASSING $XML$<?xml version="1.0" standalone="yes" ?>
<!DOCTYPE foo [
  <!ELEMENT foo (#PCDATA)>
  <!ENTITY pg "PostgreSQL">
]>
<foo>Hello &pg;.</foo>
$XML$ COLUMNS foo text);

+ ERROR:  invalid XML content
+ LINE 1: SELECT * FROM xmltable('/' PASSING $XML$<?xml version="1.0" ...
+                                            ^
+ DETAIL:  line 2: StartTag: invalid element name
+ <!DOCTYPE foo [
+  ^
+ line 3: StartTag: invalid element name
+   <!ELEMENT foo (#PCDATA)>
+    ^
+ line 4: StartTag: invalid element name
+   <!ENTITY pg "PostgreSQL">
+    ^
+ line 6: Entity 'pg' not defined
+ <foo>Hello &pg;.</foo>
+                ^


libxml seems to support documents with internal DTDs:

$ xmllint --valid /tmp/x
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE foo [
<!ELEMENT foo (#PCDATA)>
<!ENTITY pg "PostgreSQL">
]>
<foo>Hello &pg;.</foo>


so presumably the issue lies in the xpath stuff? Note that it's not
even ignoring the DTD and choking on the undefined entity, it's
choking on the DTD its self.


OK, code comments:


In +ExecEvalTableExpr, shouldn't you be using PG_ENSURE_ERROR_CLEANUP
instead of a PG_TRY() / PG_CATCH() block?


I think the new way you handle the type stuff is much, much better,
and with comments to explain too. Thanks very much.


There's an oversight in tableexpr vs xmltable separation here:

+        case T_TableExpr:
+            *name = "xmltable";
+            return 2;

presumably you need to look at the node and decide what kind of table
expression it is or just use a generic "tableexpr".

Same problem here:

+        case T_TableExpr:
+            {
+                TableExpr  *te = (TableExpr *) node;
+
+                /* c_expr shoud be closed in brackets */
+                appendStringInfoString(buf, "XMLTABLE(");



I don't have the libxml knowledge or remaining brain to usefully
evaluate the xpath and xml specifics in xpath.c today. It does strike
me that the new xpath parser should probably live in its own file,
though.

I think this is all a big improvement. Barring the notes above and my
lack of review of the guts of the xml.c parts of it, I'm pretty happy
with what I see now.


-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] patch: function xmltable

Reply via email to