Jeff Davis wrote:

> Even if it's not a collatable type, it should use the database
> collation rather than going straight to libc. Again, is that something
> that can ever be fixed or are we just stuck with libc semantics for
> full text search permanently, even if you initialize the cluster with a
> different provider?

ISTM that what backend/tsearch/wparser_def.c needs is comparable
to what backend/regex/regc_pg_locale.c already does with the
PG_Locale_Strategy, and the pg_wc_isxxxx functions.

Looking at git history, the current invocations of is[w]digit(),
is[w]alpha()... in the FTS parser were modernized a bit by
ed87e1980706 (2017), but essentially this code dates back to the
original integration of FTS in core by 140d4ebcb46e (2007). These
calls are made through the p_is##type macro-expanded functions:

/*
 * In C locale with a multibyte encoding, any non-ASCII symbol is considered
 * an alpha character, but not a member of other char classes.
 */
p_iswhat(alnum, 1)
p_iswhat(alpha, 1)
p_iswhat(digit, 0)
p_iswhat(lower, 0)
p_iswhat(print, 0)
p_iswhat(punct, 0)
p_iswhat(space, 0)
p_iswhat(upper, 0)
p_iswhat(xdigit, 0)

That's why, in a database with the builtin or ICU provider and
lc_ctype=C, the FTS parser is not Unicode-aware: these functions only
ever consult libc, never the database's provider. I may be missing
something, but I don't see a technical reason why this code could not
be taught to call the equivalent functions of the current collation
provider, following the same principles as the regex code.
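
To make the idea concrete, here is a rough standalone sketch of that
kind of dispatch. It is not the regex code and not a proposed patch:
SketchStrategy, sketch_isalpha() and sketch_isdigit() are made-up
names, and the "Unicode" branch uses a toy table (Latin-1 letters and
Arabic-Indic digits only) as a stand-in for whatever the builtin or
ICU provider would really answer.

```c
#include <ctype.h>
#include <stdbool.h>

/* Hypothetical stand-in for a locale strategy; names are illustrative. */
typedef enum
{
	SKETCH_STRATEGY_C,			/* C locale: ASCII-only rules */
	SKETCH_STRATEGY_UNICODE		/* builtin/ICU-like provider */
} SketchStrategy;

/*
 * Toy replacement for a provider's Unicode tables (an assumption made
 * for this sketch): ASCII letters plus the Latin-1 Supplement letters.
 * U+00D7 (multiplication sign) and U+00F7 (division sign) are symbols.
 */
static bool
sketch_unicode_isalpha(unsigned int c)
{
	if (c <= 0x7f)
		return isalpha((int) c) != 0;
	return (c >= 0xC0 && c <= 0xFF && c != 0xD7 && c != 0xF7);
}

/* Toy table again: ASCII digits plus Arabic-Indic digits U+0660..U+0669. */
static bool
sketch_unicode_isdigit(unsigned int c)
{
	if (c <= 0x7f)
		return isdigit((int) c) != 0;
	return (c >= 0x0660 && c <= 0x0669);
}

/* Classify a code point as alphabetic according to the chosen strategy. */
bool
sketch_isalpha(SketchStrategy strategy, unsigned int c)
{
	switch (strategy)
	{
		case SKETCH_STRATEGY_C:
			/* Rule from the comment quoted above: non-ASCII is alpha. */
			if (c > 0x7f)
				return true;
			return isalpha((int) c) != 0;
		case SKETCH_STRATEGY_UNICODE:
			return sketch_unicode_isalpha(c);
	}
	return false;
}

/* Classify a code point as a digit according to the chosen strategy. */
bool
sketch_isdigit(SketchStrategy strategy, unsigned int c)
{
	switch (strategy)
	{
		case SKETCH_STRATEGY_C:
			/* Rule from the comment quoted above: non-ASCII is not a digit. */
			if (c > 0x7f)
				return false;
			return isdigit((int) c) != 0;
		case SKETCH_STRATEGY_UNICODE:
			return sketch_unicode_isdigit(c);
	}
	return false;
}
```

With this sketch, U+00D7 is "alpha" under SKETCH_STRATEGY_C but not
under SKETCH_STRATEGY_UNICODE, and U+0661 is a digit only under the
latter. That is the kind of difference a provider-aware parser would
expose.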


Best regards,
-- 
Daniel Vérité 
https://postgresql.verite.pro/

