Re: Initcap works differently with different locale providers

Oleg Tselebrovskiy Sun, 03 Aug 2025 22:31:15 -0700

Jeff Davis wrote at 2025-07-31 02:58:

Apologies for the late reply to the review.

First, it doesn't mention the "builtin" provider, which uses the same
word break rules as libc.


I completely forgot about the builtin provider in the first patch, my bad.

Second, word boundaries can be complex, and I'm wondering if we should
not be so precise about what ICU does or doesn't do. For instance, ICU
has options like U_TITLECASE_ADJUST_TO_CASED,
U_TITLECASE_NO_BREAK_ADJUSTMENT, etc., and I'm not sure exactly
which one of those we use.


While [1] describes the default word boundary rules and could be useful
as a starting point, I agree that in reality it is probably more
complicated. I couldn't find any place in non-test code where
U_TITLECASE_ADJUST_TO_CASED and the like are set, but
U_TITLECASE_ADJUST_TO_CASED was used as a default prior to ICU 60,
so initcap() will also behave differently depending on the ICU version.

I'd prefer that we try to explain that INITCAP() is intended for
convenient display, and the specific result should not be relied upon
(at least for ICU; maybe for all providers). If you want specific word
boundary rules, write your own function.


The first patch just adds this warning about not relying on the exact
result of initcap(). The second one is the same, but removes the "what is
a word" part, since it could be moot: we recommend writing custom
functions, so understanding what a word is isn't strictly needed. I'm still
on the fence about which patch is better, though.
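
To make the "write your own function" advice concrete, here is a minimal
sketch in Python (not PostgreSQL code; the name initcap_alnum is
hypothetical). It assumes the desired rule is "a word is a maximal run of
alphanumeric characters", which is roughly what the patch text describes
for the libc and builtin providers:

```python
import re

# Assumption for illustration: a "word" is a maximal run of Unicode
# letters and digits. \w also matches "_", so it is excluded explicitly.
_WORD = re.compile(r"[^\W_]+")


def initcap_alnum(text: str) -> str:
    """Upper-case the first character of each word, lower-case the rest."""

    def fix(match: re.Match) -> str:
        word = match.group(0)
        return word[0].upper() + word[1:].lower()

    # Everything between words (spaces, punctuation) is left untouched.
    return _WORD.sub(fix, text)


print(initcap_alnum("hi THOMAS"))
```

Under this rule an apostrophe ends a word, so "o'neil" turns into "O'Neil".
A word breaker that follows [1] would normally treat "o'neil" as a single
word, so an ICU-based result can differ, which is the kind of discrepancy
the proposed warning is about.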

Thoughts?

[1]: https://www.unicode.org/reports/tr29/#Word_Boundaries

Regards, Oleg Tselebrovskiy

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 74a16af04ad..8a44e0ae593 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -3148,12 +3148,19 @@ SELECT NOT(ROW(table.*) IS NOT NULL) FROM TABLE; -- detect at least one null in
        </para>
        <para>
         Converts the first letter of each word to upper case and the
-        rest to lower case. When using the <literal>libc</literal> locale
-        provider, words are sequences of alphanumeric characters separated
-        by non-alphanumeric characters; when using the ICU locale provider,
-        words are separated according to
+        rest to lower case. When using the <literal>libc</literal> or
+        <literal>builtin</literal> locale provider, words are sequences
+        of alphanumeric characters separated by non-alphanumeric characters;
+        when using the ICU locale provider, words are separated according to
         <ulink url="https://www.unicode.org/reports/tr29/#Word_Boundaries">Unicode Standard Annex #29</ulink>.
        </para>
+       <para>
+        This function is primarily used for convenient
+        display, and the specific result should not be relied upon because of
+        the differences between locale providers and between different
+        ICU versions. If specific word boundary rules are desired,
+        it is recommended to write a custom function.
+       </para>
        <para>
         <literal>initcap('hi THOMAS')</literal>
         <returnvalue>Hi Thomas</returnvalue>

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 74a16af04ad..c071d6df366 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -3148,11 +3148,14 @@ SELECT NOT(ROW(table.*) IS NOT NULL) FROM TABLE; -- detect at least one null in
        </para>
        <para>
         Converts the first letter of each word to upper case and the
-        rest to lower case. When using the <literal>libc</literal> locale
-        provider, words are sequences of alphanumeric characters separated
-        by non-alphanumeric characters; when using the ICU locale provider,
-        words are separated according to
-        <ulink url="https://www.unicode.org/reports/tr29/#Word_Boundaries">Unicode Standard Annex #29</ulink>.
+        rest to lower case.
+       </para>
+       <para>
+        This function is primarily used for convenient
+        display, and the specific result should not be relied upon because of
+        the differences between locale providers and between different
+        ICU versions. If specific word boundary rules are desired,
+        it is recommended to write a custom function.
        </para>
        <para>
         <literal>initcap('hi THOMAS')</literal>
