On Thu, Jan 19, 2023 at 1:20, Bruce Momjian <br...@momjian.us> wrote:
> Just my luck, I had to dig into a two-"character" emoji that came to me
> as part of a Google Calendar entry --- here it is:
>
>         👩🏼‍⚕️🩺
>
>                               libc
>     Unicode    UTF8           len
>     U+1F469    f0 9f 91 a9    2    woman
>     U+1F3FC    f0 9f 8f bc    2    emoji modifier fitzpatrick type-3 (skin tone)
>     U+200D     e2 80 8d       0    zero width joiner (ZWJ)
>     U+2695     e2 9a 95       1    staff with snake
>     U+FE0F     ef b8 8f       0    variation selector-16 (VS16) (previous character as emoji)
>     U+1FA7A    f0 9f a9 ba    2    stethoscope
>
> Now, in Debian 11 character apps like vi, I see:
>
>     a woman(2) - a black box(2) - a staff with snake(1) - a stethoscope(2)
>
> Display widths are in parentheses.  I also see '<200d>' in blue.
>
> In current Firefox, I see a woman with a stethoscope around her neck,
> and then a stethoscope.  Copying the Unicode string above into a browser
> URL bar should show you the same thing, though it might be too small to
> see.
>
> For those looking for details on how these should be handled, see this
> for an explanation of grapheme clusters that use things like skin tone
> modifiers and zero-width joiners:
>
>     https://tonsky.me/blog/emoji/
>
> These comments explain the confusion of the term character:
>
>     https://stackoverflow.com/questions/27331819/whats-the-difference-between-a-character-a-code-point-a-glyph-and-a-grapheme
>
> and I think this comment summarizes it well:
>
>     https://github.com/kovidgoyal/kitty/issues/3998#issuecomment-914807237
>
>     This is by design. wcwidth() is utterly broken. Any terminal or
>     terminal application that uses it is also utterly broken. Forget
>     about emoji wcwidth() doesn't even work with combining characters,
>     zero width joiners, flags, and a whole bunch of other things.
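
The figures in the table above can be cross-checked with a short script. This is only a sketch: the names SEQUENCE, LIBC_WIDTHS, describe() and summarize() are made up for illustration, and the per-code-point widths are copied from the "libc len" column quoted above rather than measured. It shows that adding up one width per code point gives 7, whereas a renderer that treats the sequence as two grapheme clusters would show 4 columns.

```python
# Code points taken from the table quoted above.
SEQUENCE = "\U0001F469\U0001F3FC\u200D\u2695\uFE0F\U0001FA7A"

# Per-code-point display widths, copied from the "libc len" column above.
# These are the reported values; this script does not measure them.
LIBC_WIDTHS = {
    0x1F469: 2,  # woman
    0x1F3FC: 2,  # skin tone modifier
    0x200D: 0,   # zero width joiner
    0x2695: 1,   # staff with snake
    0xFE0F: 0,   # variation selector-16
    0x1FA7A: 2,  # stethoscope
}


def describe(text):
    """Return (code point, UTF-8 bytes in hex, width) for each code point."""
    rows = []
    for ch in text:
        utf8 = ch.encode("utf-8")
        rows.append((f"U+{ord(ch):04X}", utf8.hex(" "), LIBC_WIDTHS[ord(ch)]))
    return rows


def summarize(text):
    """Count code points and UTF-8 bytes, and sum the per-code-point widths."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "summed_width": sum(width for _, _, width in describe(text)),
    }


if __name__ == "__main__":
    for row in describe(SEQUENCE):
        print(*row)
    print(summarize(SEQUENCE))
```

The code point and byte counts are what character_length() and octet_length() report in the original message; the summed width is the naive per-code-point total, which is the approach the quoted kitty comment objects to.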
>
> I decided to see how Postgres, without ICU, handles it:
>
>         show lc_ctype;
>           lc_ctype
>         -------------
>          en_US.UTF-8
>
>         select octet_length('👩🏼‍⚕️🩺');
>          octet_length
>         --------------
>                    21
>
>         select character_length('👩🏼‍⚕️🩺');
>          character_length
>         ------------------
>                         6
>
> The octet_length() is verified as correct by counting the UTF8 bytes
> above.  I think character_length() is correct if we consider the number
> of Unicode characters, display and non-display.
>
> I then started looking at how Postgres computes and uses _display_
> width.  The display width, when properly processed like by Firefox, is 4
> (two double-wide displayed characters.)  Based on the libc display
> lengths above and incorrect displayed character lengths in Debian 11, it
> would be 7.
>
> libpq has PQdsplen(), which calls pg_encoding_dsplen(), which then calls
> the per-encoding width function stored in pg_wchar_table.dsplen --- for
> UTF8, the function is pg_utf_dsplen().
>
> There is no SQL API for display length, but PQdsplen() can be exercised
> on a whole string by calling pg_wcswidth() from the gdb debugger:
>
>         pg_wcswidth(const char *pwcs, size_t len, int encoding)
>         UTF8 encoding == 6
>
>         (gdb) print (int)pg_wcswidth("abcd", 4, 6)
>         $8 = 4
>         (gdb) print (int)pg_wcswidth("👩🏼‍⚕️🩺", 21, 6)
>         $9 = 7
>
> Here is the psql output:
>
>         SELECT octet_length('👩🏼‍⚕️🩺'), '👩🏼‍⚕️🩺', character_length('👩🏼‍⚕️🩺');
>          octet_length | ?column? | character_length
>         --------------+----------+------------------
>                    21 | 👩🏼‍⚕️🩺  |                6
>
> More often called from psql are pg_wcssize() and pg_wcsformat(), which
> also call PQdsplen().
>
> I think the question is whether we want to report a string width that
> assumes the display doesn't understand the more complex UTF8
> controls/"characters" listed above.
>
> tsearch has p_isspecial(), which calls pg_dsplen(), which also uses
> pg_wchar_table.dsplen.
> p_isspecial() also has a small table of what it
> calls "strange_letter".
>
> Here is a report about Unicode variation selector and combining
> characters from May, 2022:
>
>         https://www.postgresql.org/message-id/flat/013f01d873bb%24ff5f64b0%24fe1e2e10%24%40ndensan.co.jp
>
> Is this something people want improved?
>

Surely it should be fixed. Unfortunately, none of the terminals I can use
support it. So at this moment it may be premature to fix it, because the
visual form will still be broken.

Regards

Pavel

> --
>   Bruce Momjian  <br...@momjian.us>        https://momjian.us
>   EDB                                      https://enterprisedb.com
>
> Embrace your flaws.  They make you human, rather than perfect,
> which you will never be.