On Mon, Jul 4, 2011 at 10:22 PM, Joseph Adams <joeyadams3.14...@gmail.com> wrote:
> I'll try to submit a revised patch within the next couple days.
Sorry this is later than I said.  I addressed the issues covered in the review.  I also fixed a bug where "\u0022" would be decoded to a literal double quote, producing """, which is invalid JSON and caused an assertion failure.

However, I want to put this back into WIP for a number of reasons:

 * The current code accepts invalid surrogate pairs (e.g. "\uD800\uD800").  The problem with accepting them is that it would be inconsistent with PostgreSQL's Unicode support, and with the Unicode standard itself.  In my opinion: as long as the server encoding is universal (i.e. UTF-8), decoding a JSON-encoded string should not fail (barring data corruption and resource limitations).  The stricter validation I have in mind is sketched in the P.S. below.

 * I'd like to go ahead with the parser rewrite I mentioned earlier.  The new parser will be able to construct a parse tree when needed, and it won't use those overkill parsing macros.

 * I recently learned that not all supported server encodings can be converted to Unicode losslessly.  The current code, on output, converts non-ASCII characters to Unicode escapes under some circumstances (see the comment above json_need_to_escape_unicode).

I'm having a really hard time figuring out how the JSON module should handle non-Unicode character sets.  \uXXXX escapes in JSON literals can be used to encode characters not available in the server encoding.  On the other hand, the server encoding can encode characters not present in Unicode (see the third bullet point above).  This means JSON normalization and comparison (along with member lookup) are not possible in general.

Even if I assume server -> UTF-8 -> server transcoding is lossless, the situation is still ugly.  Normalization could be implemented a few ways:

 * Convert from server encoding to UTF-8, and convert \uXXXX escapes to UTF-8 characters.  This is space-efficient, but the resulting text would not be compatible with the server encoding (which may or may not matter).

 * Convert from server encoding to UTF-8, and convert all non-ASCII characters to \uXXXX escapes, resulting in pure ASCII.  This bloats the text by a factor of three in the worst case (a two-byte UTF-8 sequence grows to a six-byte \uXXXX escape, and a four-byte sequence grows to a twelve-byte surrogate pair).  This option is also sketched in the P.S. below.

 * Convert \uXXXX escapes to characters in the server encoding, but only where possible.  This would be extremely inefficient.

The parse tree (for functions that need it) will need to store JSON member names and strings either in UTF-8 or in normalized JSON (which could be the same thing).

Help and advice would be appreciated.

Thanks!
- Joey
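
P.S. Here is a minimal sketch of the stricter surrogate handling mentioned in the first bullet (just an illustration, not code from the attached patch): a high surrogate must be immediately followed by a low surrogate, the pair is combined into a single code point, and anything else is rejected instead of being passed through.

    #include <stdint.h>

    #define IS_HIGH_SURROGATE(u) ((u) >= 0xD800 && (u) <= 0xDBFF)
    #define IS_LOW_SURROGATE(u)  ((u) >= 0xDC00 && (u) <= 0xDFFF)

    /*
     * Combine the 16-bit units of one or two \uXXXX escapes into a code
     * point.  'hi' is the first unit; 'lo' is the unit from the following
     * \uXXXX escape, or -1 if there is no immediately following escape.
     * Returns the code point, or -1 for an unpaired or reversed surrogate
     * (e.g. "\uD800\uD800"), which the decoder should reject.
     */
    static int32_t
    combine_utf16(uint16_t hi, int32_t lo)
    {
        if (IS_HIGH_SURROGATE(hi))
        {
            /* A high surrogate must be followed by a low surrogate. */
            if (lo < 0 || !IS_LOW_SURROGATE(lo))
                return -1;
            return 0x10000 + ((int32_t) (hi - 0xD800) << 10) + (lo - 0xDC00);
        }

        /* A lone low surrogate is just as invalid as a lone high one. */
        if (IS_LOW_SURROGATE(hi))
            return -1;

        /* Ordinary BMP character; the second escape (if any) is not consumed. */
        return hi;
    }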
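
And here is a similarly rough sketch of the "convert everything to \uXXXX escapes" normalization option, which is where the worst-case factor-of-three bloat comes from.  Again, this is only an illustration; the hypothetical escape_codepoint() assumes the input text has already been converted to code points.

    #include <stdint.h>
    #include <stdio.h>

    /*
     * Emit one code point as ASCII: either the character itself, one \uXXXX
     * escape, or a surrogate pair for code points above U+FFFF.  'buf' needs
     * room for up to 12 bytes plus a terminating NUL.  (Real code would also
     * have to escape ASCII control characters, quotes, and backslashes.)
     */
    static int
    escape_codepoint(uint32_t cp, char *buf)
    {
        if (cp < 0x80)
            return sprintf(buf, "%c", (int) cp);
        if (cp <= 0xFFFF)
            return sprintf(buf, "\\u%04X", (unsigned int) cp);

        /* Supplementary plane: split into a UTF-16 surrogate pair. */
        cp -= 0x10000;
        return sprintf(buf, "\\u%04X\\u%04X",
                       (unsigned int) (0xD800 + (cp >> 10)),
                       (unsigned int) (0xDC00 + (cp & 0x3FF)));
    }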
json-contrib-rev1-20110714.patch.gz
Description: GNU Zip compressed data