On Fri, Jul 15, 2011 at 3:56 PM, Joey Adams <joeyadams3.14...@gmail.com> wrote:
> On Mon, Jul 4, 2011 at 10:22 PM, Joseph Adams
> <joeyadams3.14...@gmail.com> wrote:
>> I'll try to submit a revised patch within the next couple days.
>
> Sorry this is later than I said.
>
> I addressed the issues covered in the review.  I also fixed a bug
> where "\u0022" would become """, which is invalid JSON, causing an
> assertion failure.
>
> However, I want to put this back into WIP for a number of reasons:
>
>  * The current code accepts invalid surrogate pairs (e.g.
> "\uD800\uD800").  The problem with accepting them is that it would be
> inconsistent with PostgreSQL's Unicode support, and with the Unicode
> standard itself.  In my opinion: as long as the server encoding is
> universal (i.e. UTF-8), decoding a JSON-encoded string should not fail
> (barring data corruption and resource limitations).
>
>  * I'd like to go ahead with the parser rewrite I mentioned earlier.
> The new parser will be able to construct a parse tree when needed, and
> it won't use those overkill parsing macros.
>
>  * I recently learned that not all supported server encodings can be
> converted to Unicode losslessly.  The current code, on output,
> converts non-ASCII characters to Unicode escapes under some
> circumstances (see the comment above json_need_to_escape_unicode).
>
> I'm having a really hard time figuring out how the JSON module should
> handle non-Unicode character sets.  \uXXXX escapes in JSON literals
> can be used to encode characters not available in the server encoding.
> On the other hand, the server encoding can encode characters not
> present in Unicode (see the third bullet point above).  This means
> JSON normalization and comparison (along with member lookup) are not
> possible in general.
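To make sure we're on the same page about the surrogate-pair rule:
after decoding two consecutive \uXXXX escapes, a high surrogate
(U+D800..U+DBFF) must be followed by a low surrogate (U+DC00..U+DFFF),
and a lone low surrogate is never valid.  As a standalone sketch
(mine, not code from your patch):

#include <stdbool.h>

/* Validate one or two consecutive \uXXXX code units as UTF-16. */
static bool
json_surrogates_valid(unsigned int first, bool has_second,
                      unsigned int second)
{
    if (first >= 0xD800 && first <= 0xDBFF)
    {
        /*
         * High surrogate: must be followed by a low surrogate.  The
         * combined code point is then
         * 0x10000 + ((first - 0xD800) << 10) + (second - 0xDC00).
         */
        return has_second && second >= 0xDC00 && second <= 0xDFFF;
    }
    if (first >= 0xDC00 && first <= 0xDFFF)
        return false;           /* lone low surrogate */
    return true;                /* ordinary BMP code unit */
}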
I previously suggested that, instead of trying to implement JSON, you
should just try to implement
JSON-without-the-restriction-that-everything-must-be-UTF8.  Most
people are going to be using UTF-8 simply because it's the default,
and if you forget about transcoding then ISTM that this all becomes a
lot simpler.  We don't, in general, have the ability to support data
in multiple encodings inside PostgreSQL, and it seems to me that by
trying to invent a mechanism for making that work as part of this
patch, you are setting the bar for yourself awfully high.

One thing to think about here is that transcoding between UTF-8 and
the server encoding seems like the wrong thing all around.  After
all, the user does not want the data in the server encoding; they
want it in their chosen client encoding.  If you are transcoding
between UTF-8 and the server encoding, that suggests there's some
double-transcoding going on, which creates additional opportunities
for (1) inefficiency and (2) outright failure.  I'm guessing that's
because you're dealing with an interface that expects the internal
representation of the datum on one side and the server encoding on
the other side, which gets back to the point in the preceding
paragraph.  You'd probably need to revise that interface to make
this really work the way it should, and that might be more than you
want to get into.  At any rate, it's probably a separate project from
making JSON work.

If in spite of the above you're bent on continuing down your present
course, then it seems to me that you'd better make the on-disk
representation UTF-8, with all \uXXXX escapes converted to the
corresponding characters.  If you hit an invalid surrogate pair, or a
character that exists in the server encoding but not in UTF-8, it's
not a legal JSON object and you throw an error on input, just as you
would for mismatched braces or similar.  On output, you should
probably just use \uXXXX to represent any unrepresentable characters
- i.e. option 3 from your original list (there's a rough sketch of
what I mean at the end of this mail).  That may be slow, but I don't
think this case is worth devoting a lot of mental energy to.  Most
people are going to be using UTF-8 because that's the default, and
those who are not shouldn't expect a data format built around UTF-8
to work perfectly in their environment, especially if they insist on
using characters that are representable in only some of the encodings
they are using.

But, again, why not just forget about transcoding and define it as
"JSON, if you happen to be using UTF-8 as the server encoding, and
otherwise some variant of JSON that uses the server encoding as its
native format"?  It seems to me that that would be a heck of a lot
simpler and more reliable, and I'm not sure it's any less useful in
practice.
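In case it helps, here is roughly what I have in mind for the output
side of option 3 - just an illustrative sketch in plain C, with no
PostgreSQL internals, and it assumes the stored string is already
known-valid UTF-8:

#include <stdio.h>

/* Write s to out as a JSON string body, escaping non-ASCII as \uXXXX. */
static void
json_escape_utf8(const char *s, FILE *out)
{
    const unsigned char *p = (const unsigned char *) s;

    while (*p)
    {
        unsigned int cp;
        int         len;
        int         i;

        /* Decode one UTF-8 sequence (input assumed valid). */
        if (*p < 0x80)
        {
            cp = *p;
            len = 1;
        }
        else if ((*p & 0xE0) == 0xC0)
        {
            cp = *p & 0x1F;
            len = 2;
        }
        else if ((*p & 0xF0) == 0xE0)
        {
            cp = *p & 0x0F;
            len = 3;
        }
        else
        {
            cp = *p & 0x07;
            len = 4;
        }
        for (i = 1; i < len; i++)
            cp = (cp << 6) | (p[i] & 0x3F);
        p += len;

        if (cp == '"' || cp == '\\')
        {
            fputc('\\', out);
            fputc((int) cp, out);
        }
        else if (cp >= 0x20 && cp < 0x80)
            fputc((int) cp, out);        /* printable ASCII passes through */
        else if (cp <= 0xFFFF)
            fprintf(out, "\\u%04X", cp); /* control characters and BMP */
        else
        {
            /* Non-BMP: emit a UTF-16 surrogate pair. */
            cp -= 0x10000;
            fprintf(out, "\\u%04X\\u%04X",
                    0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF));
        }
    }
}

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company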