On Fri, Sep 2, 2022 at 12:17 PM Kyotaro Horiguchi
<horikyota....@gmail.com> wrote:
>
> At Thu, 01 Sep 2022 18:22:06 +0900 (JST), Kyotaro Horiguchi
> <horikyota....@gmail.com> wrote in
> > At Thu, 1 Sep 2022 15:00:38 +0700, John Naylor
> > <john.nay...@enterprisedb.com> wrote in
> > > UnicodeData.txt has this:
> > >
> > > 200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;
> > > 200C;ZERO WIDTH NON-JOINER;Cf;0;BN;;;;;N;;;;;
> > > 200D;ZERO WIDTH JOINER;Cf;0;BN;;;;;N;;;;;
> > > 200E;LEFT-TO-RIGHT MARK;Cf;0;L;;;;;N;;;;;
> > > 200F;RIGHT-TO-LEFT MARK;Cf;0;R;;;;;N;;;;;
> > >
> > > So maybe we need to take Cf characters in this file into account, in
> > > addition to Me and Mn (combining characters).
> >
> > Including them into unicode_combining_table.h actually worked, but I'm
> > not sure it is valid to include Cf's among Mn/Me's..

Looking at the definition, Cf means "other, format" category, "Format
character that affects the layout of text or the operation of text
processes, but is not normally rendered". [1]

> UnicodeData.txt
> 174:00AD;SOFT HYPHEN;Cf;0;BN;;;;;N;;;;;
>
> Soft-hyphen seems like not zero-width.. usually...

I gather it only appears at line breaks, which I doubt we want to handle.

> 0600;ARABIC NUMBER SIGN;Cf;0;AN;;;;;N;;;;;
> 110BD;KAITHI NUMBER SIGN;Cf;0;L;;;;;N;;;;;
>
> Mmm. These looks like not zero-width?

There are glyphs, but there is something special about the first one:

select U&'\0600';

Looks like this in psql (substituting 'X' to avoid systemic differences):

+----------+
| ?column? |
+----------+
| X |
+----------+
(1 row)

Copy from psql to vim or nano:

+----------+
| ?column? |
+----------+
| X |
+----------+
(1 row)

...so it does mess up the border the same way. The second
(U&'\+0110bd') doesn't render for me.

> However, it seems like basically a win if we include "Cf"s to the
> "combining" table?

There seems to be a case for that. If we did include those, we should
rename the table to match.

I found this old document from 2002 on "default ignorable" characters
that normally have no visible glyph:

https://unicode.org/L2/L2002/02368-default-ignorable.html

If there is any doubt about including all of Cf, we could also just
add a branch in wchar.c to hard-code the 200B-200F range.

-- 
John Naylor
EDB: http://www.enterprisedb.com
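
For illustration only, the hard-coded branch mentioned above might look
roughly like the following. This is a minimal standalone sketch, not code
from wchar.c: the names is_zero_width_format() and sketch_display_width()
are hypothetical, and a real width function would also consult the
combining-character and wide-character tables before falling back to
width 1.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not taken from wchar.c): true for the five
 * Cf code points U+200B..U+200F listed from UnicodeData.txt above.
 */
static bool
is_zero_width_format(uint32_t ucs)
{
	return ucs >= 0x200B && ucs <= 0x200F;
}

/*
 * Hypothetical stand-in for a display-width function.  It only
 * demonstrates the extra branch: the hard-coded range gets width 0,
 * and everything else is assumed to be width 1 for simplicity.
 */
static int
sketch_display_width(uint32_t ucs)
{
	if (is_zero_width_format(ucs))
		return 0;

	/* Real code would check combining and wide tables here. */
	return 1;
}
```

With this sketch, sketch_display_width(0x200B) gives 0 while the
neighbouring code points U+200A and U+2010 keep width 1, so the other
Cf characters (soft hyphen, ARABIC NUMBER SIGN, and so on) would be
unaffected.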