	Jeff Davis wrote:

> Even if it's not a collatable type, it should use the database
> collation rather than going straight to libc. Again, is that something
> that can ever be fixed or are we just stuck with libc semantics for
> full text search permanently, even if you initialize the cluster with a
> different provider?
ISTM that what backend/tsearch/wparser_def.c needs is comparable to
what backend/regex/regc_pg_locale.c already does with the
PG_Locale_Strategy and the pg_wc_isxxxx functions.

Looking at the git history, the current invocations of is[w]digit(),
is[w]alpha()... in the FTS parser were modernized a bit by
ed87e1980706 (2017), but essentially this code dates back to the
original integration of FTS in core by 140d4ebcb46e (2007).

These calls are made through the p_is##type macro-expanded functions:

  /*
   * In C locale with a multibyte encoding, any non-ASCII symbol is considered
   * an alpha character, but not a member of other char classes.
   */
  p_iswhat(alnum, 1)
  p_iswhat(alpha, 1)
  p_iswhat(digit, 0)
  p_iswhat(lower, 0)
  p_iswhat(print, 0)
  p_iswhat(punct, 0)
  p_iswhat(space, 0)
  p_iswhat(upper, 0)
  p_iswhat(xdigit, 0)

That's why, in a database with the builtin or ICU provider and
lc_ctype=C, the FTS parser is not Unicode-aware.

I may be missing something, but I don't see a technical reason why this
code could not be taught to call the equivalent functions of the
current collation provider, following the same principles as the
regex code.

Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
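P.S. To make the idea a bit more concrete, here is a rough standalone sketch of strategy-based dispatch for character classification, in the spirit of what the regex code does. All names in it (SketchStrategy, sketch_wc_isalpha, and so on) are made up for illustration and are not existing PostgreSQL identifiers. It only has a C branch and a libc branch, and the C branch uses plain ASCII rules, so it does not reproduce the "non-ASCII is alpha" special case quoted above. A real patch would add branches for the builtin and ICU providers.

```c
#include <wchar.h>
#include <wctype.h>

/* Hypothetical names throughout: this is an illustration, not PostgreSQL code. */
typedef enum
{
	SKETCH_STRATEGY_C,			/* C locale: ASCII-only rules */
	SKETCH_STRATEGY_LIBC		/* defer to libc's wide-character functions */
	/* a real patch would add builtin and ICU strategies here */
} SketchStrategy;

/* Would be chosen once, from the database collation provider. */
static SketchStrategy sketch_strategy = SKETCH_STRATEGY_C;

/* Is c alphabetic, according to the currently selected strategy? */
static int
sketch_wc_isalpha(wint_t c)
{
	switch (sketch_strategy)
	{
		case SKETCH_STRATEGY_C:
			return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
		case SKETCH_STRATEGY_LIBC:
			return iswalpha(c) != 0;
	}
	return 0;					/* unknown strategy: not alphabetic */
}

/* Is c a decimal digit, according to the currently selected strategy? */
static int
sketch_wc_isdigit(wint_t c)
{
	switch (sketch_strategy)
	{
		case SKETCH_STRATEGY_C:
			return (c >= '0' && c <= '9');
		case SKETCH_STRATEGY_LIBC:
			return iswdigit(c) != 0;
	}
	return 0;					/* unknown strategy: not a digit */
}
```

The point is only that the parser would call one wrapper per character class, and the choice of provider would be made in a single place instead of going straight to libc at every call site.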