On 22 September 2016 at 02:31, Pavel Stehule <pavel.steh...@gmail.com> wrote:
> another small update - fix XMLPath parser - support multibytes characters I'm returning for another round of review. The code doesn't handle the 5 XML built-in entities correctly in text-typed output. It processes ' and " but not &, < or > . See added test. I have not fixed this, but I think it's clearly broken: + -- XML builtin entities + SELECT * FROM xmltable('/x/a' PASSING '<x><a><ent>'</ent></a><a><ent>"</ent></a><a><ent>&</ent></a><a><ent><</ent></a><a><ent>></ent></a></x>' COLUMNS ent text); + ent + ------- + ' + " + & + < + > + (5 rows) so I've adjusted the docs to claim that they're expanded. The code needs fixing to avoid entity-escaping when the output column type is not 'xml'. ' and " entities in xml-typed output are expanded, not preserved. I don't know if this is intended but I suspect it is: + SELECT * FROM xmltable('/x/a' PASSING '<x><a><ent>'</ent></a><a><ent>"</ent></a><a><ent>&</ent></a><a><ent><</ent></a><a><ent>></ent></a></x>' COLUMNS ent xml); + ent + ------------------ + <ent>'</ent> + <ent>"</ent> + <ent>&</ent> + <ent><</ent> + <ent>></ent> + (5 rows) For the docs changes relevant to the above search for "The five predefined XML entities". Adjust that bit of docs if I guessed wrong about the intended behaviour. The tests don't cover CDATA or PCDATA . I didn't try to add that, but they should. Did some docs copy-editing and integrated some examples. Explained how nested elements work, that multiple top level elements is an error, etc. Explained the time-of-evaluation stuff. Pointed out that you can refer to prior output columns in PATH and DEFAULT, since that's weird and unusual compared to normal SQL. Documented handling of multiple node matches, including the surprising results of somepath/text() on <somepath>x<!--blah-->y</somepath>. Documented handling of nested elements. Documented that xmltable works only on XML documents, not fragments/forests. Regarding evaluation time, it struck me that evaluating path expressions once per row means the xpath must be parsed and processed once per row. Isn't it desirable to store and re-use the preparsed xpath? I don't think this is a major problem, since we can later detect stable/immutable expressions including constants, evaluate only once in that case, and cache. It's just worth thinking about. The docs and tests don't seem to cover XML entities. What's the behaviour there? Core XML only defines one entity, but if a schema defines more how are they processed? The tests need to cover the predefined entities " & ' < and > at least. I have no idea whether the current code can fetch a DTD and use any <!ENTITY > declarations to expand entities, but I'm guessing not? If not, external DTDs, and internal DTDs with external entities should be documented as unsupported. It doesn't seem to cope with internal DTDs at all (libxml2 limitation?): SELECT * FROM xmltable('/' PASSING $XML$<?xml version="1.0" standalone="yes" ?> <!DOCTYPE foo [ <!ELEMENT foo (#PCDATA)> <!ENTITY pg "PostgreSQL"> ]> <foo>Hello &pg;.</foo> $XML$ COLUMNS foo text); + ERROR: invalid XML content + LINE 1: SELECT * FROM xmltable('/' PASSING $XML$<?xml version="1.0" ... + ^ + DETAIL: line 2: StartTag: invalid element name + <!DOCTYPE foo [ + ^ + line 3: StartTag: invalid element name + <!ELEMENT foo (#PCDATA)> + ^ + line 4: StartTag: invalid element name + <!ENTITY pg "PostgreSQL"> + ^ + line 6: Entity 'pg' not defined + <foo>Hello &pg;.</foo> + ^ libxml seems to support documents with internal DTDs: $ xmllint --valid /tmp/x <?xml version="1.0" standalone="yes"?> <!DOCTYPE foo [ <!ELEMENT foo (#PCDATA)> <!ENTITY pg "PostgreSQL"> ]> <foo>Hello &pg;.</foo> so presumably the issue lies in the xpath stuff? Note that it's not even ignoring the DTD and choking on the undefined entity, it's choking on the DTD its self. OK, code comments: In +ExecEvalTableExpr, shouldn't you be using PG_ENSURE_ERROR_CLEANUP instead of a PG_TRY() / PG_CATCH() block? I think the new way you handle the type stuff is much, much better, and with comments to explain too. Thanks very much. There's an oversight in tableexpr vs xmltable separation here: + case T_TableExpr: + *name = "xmltable"; + return 2; presumably you need to look at the node and decide what kind of table expression it is or just use a generic "tableexpr". Same problem here: + case T_TableExpr: + { + TableExpr *te = (TableExpr *) node; + + /* c_expr shoud be closed in brackets */ + appendStringInfoString(buf, "XMLTABLE("); I don't have the libxml knowledge or remaining brain to usefully evaluate the xpath and xml specifics in xpath.c today. It does strike me that the new xpath parser should probably live in its own file, though. I think this is all a big improvement. Barring the notes above and my lack of review of the guts of the xml.c parts of it, I'm pretty happy with what I see now. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers