I figured out that I need to use the CHR() function to enter supplementary Unicode characters (code points above FFFF, i.e. planes 1 - 16), e.g.
insert into unicode(id, string) values(100, CHR(128120)); -- an emoji character, https://unicodelookup.com/#0x1f478/1
insert into unicode(id, string) values(101, CHR(128121)); -- another emoji
insert into unicode(id, string) values(102, CHR(119071)); -- musical symbol g clef ottava alta
insert into unicode(id, string) values(103, CHR(155648)); -- a very infrequently used Chinese character
insert into unicode(id, string) values(104, CHR(155649)); -- another very infrequently used Chinese character

The parameters are the decimal representation of the code point values, e.g. 128120 is the decimal value of 0x1F478. The four-digit format U&'\03B1' only works for characters in the range 0000 - FFFF.

When entered with CHR(), PostgreSQL gets their char_length() right, and so does the substring() function.

Thank you all for the help.

James

On Wed, Dec 21, 2016 at 8:31 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Vick Khera <vi...@khera.org> writes:
> > On Wed, Dec 21, 2016 at 2:56 AM, Kyotaro HORIGUCHI <
> > horiguchi.kyot...@lab.ntt.co.jp> wrote:
> >> A PostgreSQL database with encoding=UTF8 just accepts the whole
> >> range of Unicode, regardless that a character is defined for the
> >> code or not.
>
> > Interesting... when I converted my application and database to utf8
> > encoding, I discovered that Postgres is picky about UTF-8. Specifically
> > the UTF-8 byte sequence 0xed 0xa0 0x8d, which maps to Unicode code point 0xd80d.
> > This looks like a proper character but in fact is not a defined character
> > code point.
>
> Well, we're picky to the extent that RFC 3629 tells us to be picky:
> http://www.faqs.org/rfcs/rfc3629.html
>
> The case you mention is rejected because it would be half of a UTF16
> "surrogate pair", which should not be used in any Unicode representation
> other than UTF16; if we allowed it then there would be more than one way
> to represent the same Unicode code point, which is undesirable for a lot
> of reasons.
>
> > So I think when you present an actual string of UTF8 encoded characters,
> > Postgres does refuse unknown characters. However, as you observe, inserting
> > the unicode code point directly does not produce an error:
>
> > insert into unicode(id, string) values(1, U&'\d80d');
> > INSERT 0 1
>
> Hm.  I think that's a bug.  The lexer does know that \d80d is half of a
> surrogate pair, and it expects the second half to come next.  If you
> supply something that isn't the second half of a surrogate pair, you
> get an error as expected:
>
> u8=# insert into unicode(id, string) values(1, U&'\d80dfoo');
> ERROR:  invalid Unicode surrogate pair at or near "foo'"
> LINE 1: insert into unicode(id, string) values(1, U&'\d80dfoo');
>                                                   ^
>
> But it looks like if you just end the string after the first half of a
> surrogate, it just drops the character without complaint.  Notice that
> what got inserted was a zero-length string, not U+D80D:
>
> u8=# select *, length(string) from unicode;
>  id | string | length
> ----+--------+--------
>   1 |        | 0
> (1 row)
>
> I'd have expected a syntax error along the line of "incomplete Unicode
> surrogate pair".  Peter, I think this was your code to begin with ---
> was it intentional to not raise error here, or is that an oversight?
>
>                         regards, tom lane
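
P.S. For anyone finding this thread later, here is a minimal, self-contained sketch of the approach described at the top of this message. It assumes a UTF8-encoded database; the original post only shows the column names of the unicode table, so the column types below are assumed:

create table unicode (id integer primary key, string text);  -- assumed definition

insert into unicode(id, string) values(100, CHR(128120));  -- U+1F478, decimal 128120

-- char_length() and substring() count characters, not bytes, so the
-- supplementary-plane character above counts as a single character:
select id, string, char_length(string), substring(string from 1 for 1)
  from unicode
 where id = 100;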