Hi The current thread about JSON and the ensuing discussion about the XML types' behaviour in non-UTF8 databases made me try out how well XPATH() copes with that situation. The code, at least, looks suspicious - XPATH neither verifies that the server encoding is UTF-8, not does it pass the server encoding on to libxml's xpath functions.
So I created a database with encoding ISO-8859-1 (LATIN1), and did (which aclient encoding matching my terminal's settings) CREATE TABLE X (d XML); INSERT INTO X VALUES ('<r a="ä"/>'); i.e, I inserted the XML document <r a="ä"/>, but without using an entity reference for the german Umlaut-A. Then I attempted to extract the length of r's attribute "a" with the XPATH /r/@a, both with the XPath function string-length (which works now! yay!) and with postgres' LENGTH() function. SELECT (XPATH('string-length(/r/@a)', d))[1] AS xpath_length, LENGTH((XPATH('/r/@a', d))[1]::text) AS pg_length FROM X; The XPATH() function itself doesn't complain, but libxml does - it expects UTF-8 encoded data, and screams bloody murder when it encounters the ISO-8859-1-encoded Umlaut-A ERROR: could not parse XML document DETAIL: line 1: Input is not proper UTF-8, indicate encoding ! Bytes: 0xE4 0x22 0x2F 0x3E <r a="ä"/> That might seem fine on the surface - we did, after all, error out instead of producing potentially non-sensical results. However, libxml's ability to detect this error relies on it's ability to distinguish between UTF-8 and non-UTF-8 encoded strings. Which, of course, doesn't work in the general case. So for my next try, I deliberately set client_encoding to ISO-8859-1, even though my terminal uses UTF-8, removed all data from table X, and did INSERT INTO X VALUES ('<r a="ä"/>'); again. The effect is that is that X now contains ISO-8859-1 encoded data which *happens* to look like valid UTF-8. After changing the client_encoding back to UTF-8, the value we just inserted looks like that <r a="ä"/> Now I invoked the XPATH query from above again. SELECT (XPATH('string-length(/r/@a)', d))[1] AS xpath_length, LENGTH((XPATH('/r/@a', d))[1]::text) AS pg_length FROM X; As predicted, it doesn't raise an error this time, since libxml is unable to distinguish the ISO-8859-1 string '<r a="ä"/' from valid UTF-8. But the result is still wrongs, since the string-length() function counts 'ä' as just one character, when it reality it are of course contains two. xpath_length | pg_length --------------+----------- 1 | 2 The easiest way to fix this would be to make XPATH() flat-out refuse to do anything if the server encoding isn't UTF-8. But that seems a bit harsh - things actually do work correctly as long as the XML document contains only ASCII characters, and existing applications might depend on that. So what I think we should do is tell libxml that the encoding is ASCII if the server encoding isn't UTF-8. With that change, the query above produces ERROR: could not parse XML document DETAIL: encoder error which seems sane. Replacing the data in X with ASCII-only data makes the error go away, and the result is then correct also. DELETE FROM X; INSERT INTO X VALUES ('<r a="a"/>'); SELECT (XPATH('string-length(/r/@a)', d))[1] AS xpath_length, LENGTH((XPATH('/r/@a', d))[1]::text) AS pg_length FROM X; gives xpath_length | pg_length --------------+----------- 1 | 1 Proof-of-concept patch attached, but doesn't yet include documentation updates. Comments? Thoughts? Suggestions? best regards, Florian Pflug
xpath_nonutf8.patch
Description: Binary data
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers