[Resent with pgsql-hackers re-added to the recipient list. I presume you didn't remove it on purpose]
On Jul 23, 2011, at 18:11, Joey Adams wrote:
> On Sat, Jul 23, 2011 at 11:49 AM, Florian Pflug <f...@phlo.org> wrote:
>> So what I think we should do is tell libxml that the encoding is ASCII
>> if the server encoding isn't UTF-8. With that change, the query above
>> produces
>
> I haven't had time to digest this situation, but there is a function
> called pg_encoding_to_char for getting a string representation of the
> encoding. However, it might not produce a string that libxml
> understands in all cases.
>
> Would it be better to tell libxml the server encoding, whatever it may be?

Ultimately, yes. However, I figured that if it were as easy as translating
our encoding names to those of libxml, the current code would probably do
that instead of converting the XML to UTF-8 before validating it.
(Validation and XPATH processing use different code paths there!)

I'm also not aware of any actual complaints about XPATH's restriction to
UTF-8, and it's not a case that I personally care about, so I'm a bit
hesitant to put in the time and energy required to extend it to other
encodings. But once I had stumbled over this, I didn't want to ignore it
altogether, so I looked for a simple way to make the current behaviour
more bullet-proof.

The patch accomplishes that, I think, and without any major change in
behaviour. You only observe the difference if you indeed have non-UTF-8
XMLs which look like valid UTF-8.

> In the JSON encoding discussion, the last idea (the one I was planning
> to go with) was to allow non-ASCII characters in any server encoding
> (like ä in ISO-8859-1), but not allow non-ASCII escapes (like \u00E4)
> unless the server encoding is UTF-8.

Yeah, that's how I understood your proposal, and it seems sensible.

> I think your patch would more
> closely match the opposite: allow any escapes, but only allow ASCII
> text if the server encoding is not UTF-8.

Yeah, but only for XPATH().
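
To make the "non-UTF-8 XML which looks like valid UTF-8" case concrete,
here is a minimal Python sketch. It is only an illustration, not code from
the patch: the byte strings and the helper names looks_like_utf8 and
is_ascii_only are made up for this example, and LATIN1 stands in for an
arbitrary non-UTF-8 server encoding.

```python
# Illustration only: helper names and sample documents are hypothetical,
# and LATIN1 is just one example of a non-UTF-8 server encoding.

def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes happen to form a valid UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def is_ascii_only(data: bytes) -> bool:
    """Return True if every byte is in the 7-bit ASCII range."""
    return all(b < 0x80 for b in data)


# In LATIN1 the bytes 0xC3 0xA4 are two characters (U+00C3, U+00A4).
# Read as UTF-8, the same two bytes are a single character (U+00E4).
ambiguous = b"<t>\xc3\xa4</t>"

# A lone 0xE4 is a perfectly good LATIN1 character but is not valid UTF-8,
# so a UTF-8 parser would reject this document instead of misreading it.
unambiguous = b"<t>\xe4</t>"

# A document that sticks to ASCII plus a character reference means the
# same thing no matter which ASCII-compatible encoding is assumed.
ascii_with_reference = b"<t>&#228;</t>"

for doc in (ambiguous, unambiguous, ascii_with_reference):
    print(doc, looks_like_utf8(doc), is_ascii_only(doc))

# The silent misinterpretation: same bytes, two different texts.
print(ambiguous.decode("latin-1") == ambiguous.decode("utf-8"))
```

Only the first document is silently misread when LATIN1 bytes are handed
to a parser that assumes UTF-8. Declaring the input as ASCII would reject
both non-ASCII documents outright, while the third one is unaffected.
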
XML input validation uses a different code path, and seems to convert the
XML to UTF-8 before verifying its well-formedness with libxml (as you
already discovered previously).

The difference between JSON and XML here is that the XML type has to live
with libxml's idiosyncrasies and restrictions. If we could make libxml use
our encoding and text handling infrastructure, the UTF-8 restrictions
would probably not exist. But as it stands, libxml has its own machinery
for dealing with encodings...

I wonder, BTW, what happens if you attempt to store an XML document
containing a character not representable in Unicode. If the conversion to
UTF-8 simply replaces it with a placeholder, we'd be fine, since a mere
replacement cannot affect the well-formedness of an XML document. If OTOH
it raised an error, that'd be a bit unfortunate...

best regards,
Florian Pflug

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers