On Thu, Apr 16, 2009 at 10:54:16AM -0400, Tom Lane wrote:
> Sam Mason <s...@samason.me.uk> writes:
> > I'd never heard of UTF-16 surrogate pairs before this discussion and
> > hence didn't realise that it's valid to have a surrogate pair in place
> > of a single code point. The docs say that <D800 DF02> corresponds to
> > U+10302, Python would appear to follow my intuitions in that:
>
> >   ord(u'\uD800\uDF02')
>
> > results in an error instead of giving back 66306, as I'd expect. Is
> > this a bug in Python, my understanding, or something else?
>
> I might be wrong, but I think surrogate pairs are expressly forbidden in
> all representations other than UTF16/UCS2. We definitely forbid them
> when validating UTF-8 strings --- that's per an RFC recommendation.
> It sounds like Python is doing the same.

OK, that's good. I thought I was missing something. A minor point is
that in UCS2 each 16-bit value is exactly one character and characters
outside the BMP aren't supported, hence the need for UTF-16.

I've failed to keep up with the discussion, so I'm not sure where this
conversation has got to! Is the consensus for 8.4 to enable SQL2003-style
U&lit escaped literals if and only if standard_conforming_strings is
set? This seems easiest for client code, as it can use this setting
exclusively to know what to do with backslashes.

--
  Sam  http://samason.me.uk/

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
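
For reference, here is a minimal Python 3 sketch of the surrogate-pair arithmetic behind the <D800 DF02> to U+10302 mapping mentioned above. The helper name combine_surrogates is made up for illustration and is not part of Python or PostgreSQL. The last check assumes CPython's strict UTF-8 codec, which rejects encoded surrogates, in line with the validation behaviour Tom describes.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into a single code point.

    Hypothetical helper, for illustration only.
    """
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: 0x%04X" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: 0x%04X" % low)
    # Each surrogate carries 10 bits; the pair addresses U+10000 and above.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD800, 0xDF02)
print(hex(code_point), code_point)  # 0x10302 66306

# Decoding the same two 16-bit units as UTF-16 yields one character.
decoded = bytes.fromhex("D800DF02").decode("utf-16-be")
print(len(decoded), ord(decoded))  # 1 66306

# A lone surrogate encoded as UTF-8 bytes is rejected by the strict codec.
try:
    b"\xed\xa0\x80".decode("utf-8")
    surrogate_rejected = False
except UnicodeDecodeError:
    surrogate_rejected = True
print(surrogate_rejected)  # True
```

In other words, the pair is only meaningful as a UTF-16 encoding detail; once decoded, the string holds the single code point U+10302, and ord() on a two-element string of separate surrogates fails because it is not one character.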