On Sun, Aug 20, 2017 at 08:46:03AM +0200, Pavel Stehule wrote: > 2017-08-20 4:17 GMT+02:00 Noah Misch <n...@leadboat.com>: > > On Sat, Aug 19, 2017 at 10:53:19PM +0200, Pavel Stehule wrote: > > > I am sending some POC - it does support XPATH and XMLTABLE for not UTF8 > > > server encoding. > > > > > > In this case, all strings should be converted to UTF8 before call libXML2 > > > functions, and result should be converted back from UTF8. > > > > Adding support for xpath in non-UTF8 databases is a v11 feature proposal. > > Please start a new thread for this, and add it to the open CommitFest. > > > > In this thread, would you provide the version of your patch that I > > described > > in my 2017-08-08 post to this thread? That's a back-patchable bug fix. > > > There are three issues: > > 1. processing 1byte encoding XMLs documents with encoding declaration - > should be fixed by ecoding_for_xmlCtxtReadMemory.patch. This patch is very > short and safe - can be apply immediately (there is regress tests)
We should fix that problem, yes. encoding_for_xmlCtxtReadMemory.patch is not the right fix; see below. > 2 encoding issues in XPath specification (and namespaces) - because > multibytes chars are not usually used in tag names, this issue hit minimum > users. > > 3. encoding issues in XPath and XMLTABLE results - this is bad issue - the > function XMLTABLE will not be functional on non UTF8 databases. Fortunately > - there are less users with this encoding, but probably should be apply as > fix in 10/11 Postgres. (2) and (3) are long-documented (since commit 14180f9, 2009-06) limitations in xpath functions. That's why I would treat an improvement as a new feature and not back-patch it. It is tempting to fix v10 so XMLTABLE is born without this limitation, but it is too late in the release cycle. Reviewing encoding_for_xmlCtxtReadMemory.patch: On Sat, Aug 19, 2017 at 09:13:50AM +0200, Pavel Stehule wrote: > - doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, NULL, > 0); > + doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, > + > pg_encoding_to_char(GetDatabaseEncoding()), 0); This assumes libxml2 understands every encoding name that pg_encoding_to_char() can return, but it does not. xpath-bugfix.patch was better. Here are the relevant parts of my review of that patch. On Mon, Aug 07, 2017 at 05:10:14PM -0700, Noah Misch wrote: > On Wed, Apr 05, 2017 at 10:53:39PM +0200, Pavel Stehule wrote: > > One possible fix - and similar technique is used more times in xml.c code > > .. xmlroot > > > + /* remove header */ > > + parse_xml_decl(string, &header_len, NULL, NULL, NULL); > > > - doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, NULL, 0); > > + doc = xmlCtxtReadMemory(ctxt, (char *) string + header_len, len - > I like this parse_xml_decl() approach better. Would you update > your patch to use it and to add the test case I give above? Again, would you make those two edits? Thanks, nm -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers