On Sun, Sep 29, 2019 at 3:38 AM Alvaro Herrera <alvhe...@2ndquadrant.com> wrote:
>
> The UTF8 bits looks reasonable to me.  I guess the other part of that
> question is whether we support any other multibyte encoding that
> supports combining characters.  Maybe for cases other than UTF8 we can
> test for 0-width chars (using pg_encoding_dsplen() perhaps?) and drive
> the upper/lower decision off that?  (For the UTF8 case, I don't know if
> Juanjo's proposal is better than pg_encoding_dsplen.  Both seem to boil
> down to a bsearch, though unicode_norm.c's table seems much larger than
> wchar.c's).
>

Using pg_encoding_dsplen() looks like the way to go. The normalization
logic included in ucs_wcwidth() already does what is needed to avoid the
issue, so there is no need to use unicode_norm_table.h.

UTF8 is the only multibyte encoding that can return a 0-width dsplen, so
this approach would also work for all the other encodings that do not
use combining characters.

Please find attached a patch with this approach.

Regards,

Juan José Santamaría Flecha

diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index f7175df..8af62d2 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -1947,7 +1947,10 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
 			wchar_t    *workspace;
 			size_t		curr_char;
 			size_t		result_size;
+			int			encoding;
+			char		wdsplen[MAX_MULTIBYTE_CHAR_LEN];
 
+			encoding = GetDatabaseEncoding();
 			/* Overflow paranoia */
 			if ((nbytes + 1) > (INT_MAX / sizeof(wchar_t)))
 				ereport(ERROR,
@@ -1968,7 +1971,9 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
 						workspace[curr_char] = towlower_l(workspace[curr_char], mylocale->info.lt);
 					else
 						workspace[curr_char] = towupper_l(workspace[curr_char], mylocale->info.lt);
-					wasalnum = iswalnum_l(workspace[curr_char], mylocale->info.lt);
+					wchar2char(wdsplen, &workspace[curr_char], MAX_MULTIBYTE_CHAR_LEN, mylocale);
+					if (pg_encoding_dsplen(encoding, wdsplen) != 0)
+						wasalnum = iswalnum_l(workspace[curr_char], mylocale->info.lt);
 				}
 				else
 #endif
@@ -1977,7 +1982,9 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
 						workspace[curr_char] = towlower(workspace[curr_char]);
 					else
 						workspace[curr_char] = towupper(workspace[curr_char]);
-					wasalnum = iswalnum(workspace[curr_char]);
+					wchar2char(wdsplen, &workspace[curr_char], MAX_MULTIBYTE_CHAR_LEN, mylocale);
+					if (pg_encoding_dsplen(encoding, wdsplen) != 0)
+						wasalnum = iswalnum(workspace[curr_char]);
 				}
 			}
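
For readers without a PostgreSQL tree at hand, the idea behind the patch can be sketched as a small standalone C fragment. This is only an illustration under stated assumptions, not the backend code: sketch_dsplen(), sketch_isalnum() and sketch_initcap() are hypothetical names, the string is held as an array of code points rather than wchar_t, only ASCII letters are case-mapped, and the zero-width test covers just the Combining Diacritical Marks block (U+0300..U+036F) instead of the full table consulted by ucs_wcwidth(). The point it shows is the same one the patch makes: a zero-width character must not reset the "previous character was alphanumeric" state.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for pg_encoding_dsplen(): returns 0 for code points
 * in the Combining Diacritical Marks block, 1 for everything else.  The real
 * function consults a much larger table.
 */
static int
sketch_dsplen(uint32_t cp)
{
	return (cp >= 0x0300 && cp <= 0x036F) ? 0 : 1;
}

/* ASCII-only alphanumeric test; an assumption made to keep the sketch small */
static bool
sketch_isalnum(uint32_t cp)
{
	return cp < 0x80 && isalnum((int) cp) != 0;
}

/*
 * Initcap over an array of code points, in place.  A zero-width character
 * leaves wasalnum untouched, so the letter following a combining mark is
 * still treated as being inside the current word.
 */
static void
sketch_initcap(uint32_t *s, size_t n)
{
	bool		wasalnum = false;
	size_t		i;

	assert(s != NULL || n == 0);

	for (i = 0; i < n; i++)
	{
		/* only ASCII is case-mapped in this sketch */
		if (s[i] < 0x80)
			s[i] = (uint32_t) (wasalnum ? tolower((int) s[i])
							   : toupper((int) s[i]));

		/* skip the state update for zero-width (combining) characters */
		if (sketch_dsplen(s[i]) != 0)
			wasalnum = sketch_isalnum(s[i]);
	}
}
```

With the decomposed input 'e', U+0301, 'l', 'l', 'a', the sketch uppercases only the leading 'e'. Without the sketch_dsplen() guard, U+0301 would clear wasalnum and the following 'l' would be uppercased as well, which is the behaviour the patch is meant to avoid.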