Jeff Davis wrote at 2025-07-31 02:58:
Apologies for the late reply to the review.
> First, it doesn't mention the "builtin" provider, which uses the same
> word break rules as libc.
I completely forgot about the builtin provider in the first patch, my bad.
> Second, word boundaries can be complex, and I'm wondering if we should
> not be so precise about what ICU does or doesn't do. For instance, ICU
> has options like U_TITLECASE_ADJUST_TO_CASED,
> U_TITLECASE_NO_BREAK_ADJUSTMENT, etc., and I'm not sure exactly
> which one of those we use.
While [1] describes the default word boundary rules and could be useful
as a starting point, I agree that in reality it is probably more
complicated. I couldn't find any place where
U_TITLECASE_ADJUST_TO_CASED and the like are set in non-test code, but
U_TITLECASE_ADJUST_TO_CASED was the default prior to ICU 60,
so initcap() will also behave differently depending on the ICU version.
> I'd prefer that we try to explain that INITCAP() is intended for
> convenient display, and the specific result should not be relied upon
> (at least for ICU; maybe for all providers). If you want specific word
> boundary rules, write your own function.
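To make the "write your own function" suggestion concrete, here is a
rough sketch of the simple rule the current text gives for libc (words
are runs of alphanumeric characters separated by non-alphanumeric
ones). It is Python rather than PostgreSQL code, the name initcap_alnum
is made up for illustration, and Python's str.isalnum() is only assumed
to approximate a provider's character classification, so results can
differ from any real provider:

```python
def initcap_alnum(text: str) -> str:
    """Upper-case the first character of each alphanumeric run and
    lower-case the rest; other characters pass through unchanged."""
    out = []
    in_word = False  # True while we are inside an alphanumeric run
    for ch in text:
        if ch.isalnum():
            # First character of a run is upper-cased, the rest lower-cased.
            out.append(ch.lower() if in_word else ch.upper())
            in_word = True
        else:
            # Any non-alphanumeric character ends the current word.
            out.append(ch)
            in_word = False
    return "".join(out)


print(initcap_alnum("hi THOMAS"))  # Hi Thomas
print(initcap_alnum("o'neil"))     # O'Neil
```

The second call shows why the exact result is hard to promise: under
this rule the apostrophe ends a word, whereas a rule set based on [1]
may treat the apostrophe differently.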
The first patch just adds a warning about not relying on the exact
result of initcap(). The second one is the same, but also removes the
"what is a word" part, since it could be moot: we recommend writing a
custom function, so understanding what a word is isn't strictly needed.
I'm still on the fence about which patch is better, though.
Thoughts?
[1]: https://www.unicode.org/reports/tr29/#Word_Boundaries
Regards, Oleg Tselebrovskiy
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 74a16af04ad..8a44e0ae593 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -3148,12 +3148,19 @@ SELECT NOT(ROW(table.*) IS NOT NULL) FROM TABLE; -- detect at least one null in
</para>
<para>
Converts the first letter of each word to upper case and the
- rest to lower case. When using the <literal>libc</literal> locale
- provider, words are sequences of alphanumeric characters separated
- by non-alphanumeric characters; when using the ICU locale provider,
- words are separated according to
+ rest to lower case. When using the <literal>libc</literal> or
+ <literal>builtin</literal> locale provider, words are sequences
+ of alphanumeric characters separated by non-alphanumeric characters;
+ when using the ICU locale provider, words are separated according to
<ulink url="https://www.unicode.org/reports/tr29/#Word_Boundaries">Unicode Standard Annex #29</ulink>.
</para>
+ <para>
+ This function is primarily used for convenient
+ display, and the specific result should not be relied upon because of
+ the differences between locale providers and between different
+ ICU versions. If specific word boundary rules are desired,
+ it is recommended to write a custom function.
+ </para>
<para>
<literal>initcap('hi THOMAS')</literal>
<returnvalue>Hi Thomas</returnvalue>
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 74a16af04ad..c071d6df366 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -3148,11 +3148,14 @@ SELECT NOT(ROW(table.*) IS NOT NULL) FROM TABLE; -- detect at least one null in
</para>
<para>
Converts the first letter of each word to upper case and the
- rest to lower case. When using the <literal>libc</literal> locale
- provider, words are sequences of alphanumeric characters separated
- by non-alphanumeric characters; when using the ICU locale provider,
- words are separated according to
- <ulink url="https://www.unicode.org/reports/tr29/#Word_Boundaries">Unicode Standard Annex #29</ulink>.
+ rest to lower case.
+ </para>
+ <para>
+ This function is primarily used for convenient
+ display, and the specific result should not be relied upon because of
+ the differences between locale providers and between different
+ ICU versions. If specific word boundary rules are desired,
+ it is recommended to write a custom function.
</para>
<para>
<literal>initcap('hi THOMAS')</literal>