Re: Unicode grapheme clusters

Pavel Stehule Thu, 19 Jan 2023 05:45:51 -0800

On Thu, 19 Jan 2023 at 1:20, Bruce Momjian <br...@momjian.us> wrote:


> Just my luck, I had to dig into a two-"character" emoji that came to me
> as part of a Google Calendar entry --- here it is:
>
>         👩🏼‍⚕️🩺
>
>                               libc
>         Unicode     UTF8      len
>         U+1F469  f0 9f 91 a9   2   woman
>         U+1F3FC  f0 9f 8f bc   2   emoji modifier fitzpatrick type-3 (skin tone)
>         U+200D   e2 80 8d      0   zero width joiner (ZWJ)
>         U+2695   e2 9a 95      1   staff with snake
>         U+FE0F   ef b8 8f      0   variation selector-16 (VS16) (previous character as emoji)
>         U+1FA7A  f0 9f a9 ba   2   stethoscope
>
> Now, in Debian 11 character apps like vi, I see:
>
>   a woman(2) - a black box(2) - a staff with snake(1) - a stethoscope(2)
>
> Display widths are in parentheses.  I also see '<200d>' in blue.
>
> In current Firefox, I see a woman with a stethoscope around her neck,
> and then a stethoscope.  Copying the Unicode string above into a browser
> URL bar should show you the same thing, though it might be too small to
> see.
>
> For those looking for details on how these should be handled, see this
> for an explanation of grapheme clusters that use things like skin tone
> modifiers and zero-width joiners:
>
>         https://tonsky.me/blog/emoji/
>
> These comments explain the confusion of the term character:
>
>
> https://stackoverflow.com/questions/27331819/whats-the-difference-between-a-character-a-code-point-a-glyph-and-a-grapheme
>
> and I think this comment summarizes it well:
>
>
> https://github.com/kovidgoyal/kitty/issues/3998#issuecomment-914807237
>
>         This is by design. wcwidth() is utterly broken. Any terminal or
>         terminal application that uses it is also utterly broken. Forget
>         about emoji wcwidth() doesn't even work with combining characters,
>         zero width joiners, flags, and a whole bunch of other things.
>
> I decided to see how Postgres, without ICU, handles it:
>
>         show lc_ctype;
>           lc_ctype
>         -------------
>          en_US.UTF-8
>
>         select octet_length('👩🏼‍⚕️🩺');
>          octet_length
>         --------------
>                    21
>
>         select character_length('👩🏼‍⚕️🩺');
>          character_length
>         ------------------
>                         6
>
> The octet_length() is verified as correct by counting the UTF8 bytes
> above.  I think character_length() is correct if we consider the number
> of Unicode characters, display and non-display.
>
> I then started looking at how Postgres computes and uses _display_
> width.  The display width, when processed properly, as Firefox does, is 4
> (two double-wide displayed characters).  Based on the libc display
> lengths above and the incorrect displayed character lengths in Debian 11,
> it would be 7.
>
> libpq has PQdsplen(), which calls pg_encoding_dsplen(), which then calls
> the per-encoding width function stored in pg_wchar_table.dsplen --- for
> UTF8, the function is pg_utf_dsplen().
>
> There is no SQL API for display length, but the function PQdsplen()
> relies on can be exercised on a whole string by calling pg_wcswidth()
> from the gdb debugger:
>
>         pg_wcswidth(const char *pwcs, size_t len, int encoding)
>         UTF8 encoding == 6
>
>         (gdb) print (int)pg_wcswidth("abcd", 4, 6)
>         $8 = 4
>         (gdb) print (int)pg_wcswidth("👩🏼‍⚕️🩺", 21, 6)
>         $9 = 7
>
> Here is the psql output:
>
>         SELECT octet_length('👩🏼‍⚕️🩺'), '👩🏼‍⚕️🩺', character_length('👩🏼‍⚕️🩺');
>          octet_length | ?column? | character_length
>         --------------+----------+------------------
>                    21 | 👩🏼‍⚕️🩺  |                6
>
> More often called from psql are pg_wcssize() and pg_wcsformat(), both of
> which also call PQdsplen().
>
> I think the question is whether we want to report a string width that
> assumes the display doesn't understand the more complex UTF8
> controls/"characters" listed above.
>
> tsearch has p_isspecial(), which calls pg_dsplen(), which also uses
> pg_wchar_table.dsplen.  p_isspecial() also has a small table of what it
> calls "strange_letter".
>
> Here is a report about Unicode variation selector and combining
> characters from May, 2022:
>
>
> https://www.postgresql.org/message-id/flat/013f01d873bb%24ff5f64b0%24fe1e2e10%24%40ndensan.co.jp
>
> Is this something people want improved?
>

It certainly should be fixed. Unfortunately, none of the terminals I can use
support it. So at this moment it may be premature to fix it, because the
visual form will still be broken.
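
To make the gap between the two width models concrete, here is a minimal
Python sketch. It is only an illustration under stated assumptions:
naive_width() is a hypothetical per-code-point approximation (it is not
libc's wcwidth() and not pg_utf_dsplen()), and clusters() is a deliberately
simplified grouping rule, not the full Unicode grapheme cluster algorithm.
It relies on the Unicode tables shipped with Python 3.8 or later.

```python
import unicodedata

ZWJ = "\u200d"    # zero width joiner
VS16 = "\ufe0f"   # variation selector-16

# The string from the quoted message, spelled out as code points.
s = "\U0001F469\U0001F3FC\u200D\u2695\uFE0F\U0001FA7A"

def naive_width(ch):
    """Hypothetical wcwidth()-style width of ONE code point (assumption)."""
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0                      # combining marks and format controls
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2                      # wide and fullwidth code points
    return 1

def is_extender(ch):
    """Simplified rule: does this code point attach to the previous one?"""
    cp = ord(ch)
    return (
        ch == ZWJ
        or 0xFE00 <= cp <= 0xFE0F     # variation selectors
        or 0x1F3FB <= cp <= 0x1F3FF   # emoji skin tone modifiers
        or unicodedata.category(ch) in ("Mn", "Me")
    )

def clusters(text):
    """Group code points into approximate grapheme clusters (assumption:
    anything following a ZWJ joins the current cluster)."""
    out = []
    for ch in text:
        if out and (is_extender(ch) or out[-1][-1] == ZWJ):
            out[-1] += ch
        else:
            out.append(ch)
    return out

def cluster_width(cluster):
    """Width of a whole cluster: decided by its first code point, or 2 if
    VS16 requests emoji presentation."""
    if VS16 in cluster:
        return 2
    return naive_width(cluster[0])

octets = len(s.encode("utf-8"))                           # like octet_length()
codepoints = len(s)                                       # like character_length()
naive_total = sum(naive_width(c) for c in s)              # per-code-point sum
cluster_total = sum(cluster_width(c) for c in clusters(s))

print(octets, codepoints, naive_total, cluster_total)
```

The per-code-point sum corresponds to what a terminal without cluster
support would need, while the cluster-based sum corresponds to what a
renderer like Firefox shows. Which of the two psql should assume is exactly
the open question.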

Regards

Pavel


> --
>   Bruce Momjian  <br...@momjian.us>        https://momjian.us
>   EDB                                      https://enterprisedb.com
>
> Embrace your flaws.  They make you human, rather than perfect,
> which you will never be.
>
>
>
