Re: [HACKERS] possible encoding issues with libxml2 functions

Noah Misch Sun, 20 Aug 2017 00:22:33 -0700

On Sun, Aug 20, 2017 at 08:46:03AM +0200, Pavel Stehule wrote:
> 2017-08-20 4:17 GMT+02:00 Noah Misch <n...@leadboat.com>:
> > On Sat, Aug 19, 2017 at 10:53:19PM +0200, Pavel Stehule wrote:
> > > I am sending some POC  - it does support XPATH and XMLTABLE for not UTF8
> > > server encoding.
> > >
> > > In this case, all strings should be converted to UTF8 before call libXML2
> > > functions, and result should be converted back from UTF8.
> >
> > Adding support for xpath in non-UTF8 databases is a v11 feature proposal.
> > Please start a new thread for this, and add it to the open CommitFest.
> >
> > In this thread, would you provide the version of your patch that I
> > described
> > in my 2017-08-08 post to this thread?  That's a back-patchable bug fix.
> 
> 
> There are three issues:
> 
> 1. processing 1byte encoding XMLs documents with encoding declaration -
> should be fixed by ecoding_for_xmlCtxtReadMemory.patch. This patch is very
> short and safe - can be apply immediately (there is regress tests)

We should fix that problem, yes.  encoding_for_xmlCtxtReadMemory.patch is not
the right fix; see below.

> 2 encoding issues in XPath specification (and  namespaces) - because
> multibytes chars are not usually used in tag names, this issue hit minimum
> users.
> 
> 3. encoding issues in XPath and XMLTABLE results - this is bad issue - the
> function XMLTABLE will not be functional on non UTF8 databases. Fortunately
> - there are less users with this encoding, but probably should be apply as
> fix in 10/11 Postgres.

(2) and (3) are long-documented (since commit 14180f9, 2009-06) limitations in
xpath functions.  That's why I would treat an improvement as a new feature and
not back-patch it.  It is tempting to fix v10 so XMLTABLE is born without this
limitation, but it is too late in the release cycle.

Reviewing encoding_for_xmlCtxtReadMemory.patch:

On Sat, Aug 19, 2017 at 09:13:50AM +0200, Pavel Stehule wrote:
> -             doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, NULL, 
> 0);
> +             doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL,
> +                                                             
> pg_encoding_to_char(GetDatabaseEncoding()), 0);

This assumes libxml2 understands every encoding name that
pg_encoding_to_char() can return, but it does not.  xpath-bugfix.patch was
better.  Here are the relevant parts of my review of that patch.

On Mon, Aug 07, 2017 at 05:10:14PM -0700, Noah Misch wrote:
> On Wed, Apr 05, 2017 at 10:53:39PM +0200, Pavel Stehule wrote:
> > One possible fix - and similar technique is used more times in xml.c code
> > .. xmlroot
> 
> > +   /* remove header */
> > +   parse_xml_decl(string, &header_len, NULL, NULL, NULL);
> 
> > -       doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, NULL, 0);
> > +       doc = xmlCtxtReadMemory(ctxt, (char *) string + header_len, len -

> I like this parse_xml_decl() approach better.  Would you update
> your patch to use it and to add the test case I give above?

Again, would you make those two edits?

Thanks,
nm

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] possible encoding issues with libxml2 functions

Reply via email to