2016-09-23 10:05 GMT+02:00 Craig Ringer <cr...@2ndquadrant.com>: > On 22 September 2016 at 02:31, Pavel Stehule <pavel.steh...@gmail.com> > wrote: > > > another small update - fix XMLPath parser - support multibytes characters > > I'm returning for another round of review. > > The code doesn't handle the 5 XML built-in entities correctly in > text-typed output. It processes ' and " but not &, < or > > . See added test. I have not fixed this, but I think it's clearly > broken: > > > + -- XML builtin entities > + SELECT * FROM xmltable('/x/a' PASSING > '<x><a><ent>'</ent></a><a><ent>"</ent></a><a>< > ent>&</ent></a><a><ent><</ent></a><a><ent>></ent></a></x>' > COLUMNS ent text); > + ent > + ------- > + ' > + " > + & > + < > + > > + (5 rows) > > so I've adjusted the docs to claim that they're expanded. The code > needs fixing to avoid entity-escaping when the output column type is > not 'xml'. > > > ' and " entities in xml-typed output are expanded, not > preserved. I don't know if this is intended but I suspect it is: > > + SELECT * FROM xmltable('/x/a' PASSING > '<x><a><ent>'</ent></a><a><ent>"</ent></a><a>< > ent>&</ent></a><a><ent><</ent></a><a><ent>></ent></a></x>' > COLUMNS ent xml); > + ent > + ------------------ > + <ent>'</ent> > + <ent>"</ent> > + <ent>&</ent> > + <ent><</ent> > + <ent>></ent> > + (5 rows) > > > For the docs changes relevant to the above search for "The five > predefined XML entities". Adjust that bit of docs if I guessed wrong > about the intended behaviour. > > The tests don't cover CDATA or PCDATA . I didn't try to add that, but > they should. > > > Did some docs copy-editing and integrated some examples. Explained how > nested elements work, that multiple top level elements is an error, > etc. Explained the time-of-evaluation stuff. Pointed out that you can > refer to prior output columns in PATH and DEFAULT, since that's weird > and unusual compared to normal SQL. Documented handling of multiple > node matches, including the surprising results of somepath/text() on > <somepath>x<!--blah-->y</somepath>. Documented handling of nested > elements. Documented that xmltable works only on XML documents, not > fragments/forests. > > Regarding evaluation time, it struck me that evaluating path > expressions once per row means the xpath must be parsed and processed > once per row. Isn't it desirable to store and re-use the preparsed > xpath? I don't think this is a major problem, since we can later > detect stable/immutable expressions including constants, evaluate only > once in that case, and cache. It's just worth thinking about. > > The docs and tests don't seem to cover XML entities. What's the > behaviour there? Core XML only defines one entity, but if a schema > defines more how are they processed? The tests need to cover the > predefined entities " & ' < and > at least. > > I have no idea whether the current code can fetch a DTD and use any > <!ENTITY > declarations to expand entities, but I'm guessing not? If > not, external DTDs, and internal DTDs with external entities should be > documented as unsupported. > > It doesn't seem to cope with internal DTDs at all (libxml2 limitation?): > > SELECT * FROM xmltable('/' PASSING $XML$<?xml version="1.0" > standalone="yes" ?> > <!DOCTYPE foo [ > <!ELEMENT foo (#PCDATA)> > <!ENTITY pg "PostgreSQL"> > ]> > <foo>Hello &pg;.</foo> > $XML$ COLUMNS foo text); > > + ERROR: invalid XML content > + LINE 1: SELECT * FROM xmltable('/' PASSING $XML$<?xml version="1.0" ... > + ^ > + DETAIL: line 2: StartTag: invalid element name > + <!DOCTYPE foo [ > + ^ > + line 3: StartTag: invalid element name > + <!ELEMENT foo (#PCDATA)> > + ^ > + line 4: StartTag: invalid element name > + <!ENTITY pg "PostgreSQL"> > + ^ > + line 6: Entity 'pg' not defined > + <foo>Hello &pg;.</foo> > + ^ > > > libxml seems to support documents with internal DTDs: > > $ xmllint --valid /tmp/x > <?xml version="1.0" standalone="yes"?> > <!DOCTYPE foo [ > <!ELEMENT foo (#PCDATA)> > <!ENTITY pg "PostgreSQL"> > ]> > <foo>Hello &pg;.</foo> > > > so presumably the issue lies in the xpath stuff? Note that it's not > even ignoring the DTD and choking on the undefined entity, it's > choking on the DTD its self. > > > OK, code comments: > > > In +ExecEvalTableExpr, shouldn't you be using PG_ENSURE_ERROR_CLEANUP > instead of a PG_TRY() / PG_CATCH() block? > > > I think the new way you handle the type stuff is much, much better, > and with comments to explain too. Thanks very much. > > > There's an oversight in tableexpr vs xmltable separation here: > > + case T_TableExpr: > + *name = "xmltable"; > + return 2; > > presumably you need to look at the node and decide what kind of table > expression it is or just use a generic "tableexpr". > > Same problem here: > > + case T_TableExpr: > + { > + TableExpr *te = (TableExpr *) node; > + > + /* c_expr shoud be closed in brackets */ > + appendStringInfoString(buf, "XMLTABLE("); > > This is correct, but not well commented - looks on XMLEXPR node - TableExpr is a holder, but it is invisible for user. User running a XMLTABLE function and should to see XMLTABLE. It will be more clean when we will support JSON_TABLE function.
> > > I don't have the libxml knowledge or remaining brain to usefully > evaluate the xpath and xml specifics in xpath.c today. It does strike > me that the new xpath parser should probably live in its own file, > though. > I'll try move it to separate file > > I think this is all a big improvement. Barring the notes above and my > lack of review of the guts of the xml.c parts of it, I'm pretty happy > with what I see now. > Thank you. I hope so all major issues are solved. Probably some XML specific related issues are there - but I am happy, so you have well XML knowledge and you will test a corner cases. Regards Pavel > > > -- > Craig Ringer http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Training & Services >