Initcap works differently with different locale providers

Oleg Tselebrovskiy Wed, 25 Sep 2024 08:13:52 -0700

Greetings, everyone!

One of our clients has found a difference in behaviour of initcapfunction when

using different locale providers, shown below

postgres=# create database test_db_1 locale_provider=iculocale="ru_RU.UTF-8" template=template0;

        NOTICE:  using standard form "ru-RU" for ICU locale "ru_RU.UTF-8"
        CREATE DATABASE
        postgres=# \c test_db_1;
        You are now connected to database "test_db_1" as user "postgres".
        test_db_1=# select initcap('ЧиЮ А.Ю.');
        initcap
        ----------
        Чию А.ю.
        (1 row)
        test_db_1=# select initcap('joHn d.e.');
        initcap
        -----------
        John D.e.
        (1 row)

postgres=# create database test_db_2 locale_provider=libclocale="ru_RU.UTF-8" template=template0;

        CREATE DATABASE
        postgres=# \c test_db_2
        You are now connected to database "test_db_2" as user "postgres".
        test_db_2=# select initcap('ЧиЮ А.Ю.');
        initcap
        ----------
        Чию А.Ю.
        (1 row)
        test_db_2=# select initcap('joHn d.e.');
        initcap
        -----------
        John D.E.
        (1 row)

And an easier reproduction (should work for REL_12_STABLE and up)

        postgres=# SELECT initcap('first.second' COLLATE "en-x-icu");
        initcap
        --------------
        First.second
        (1 row)
        postgres=# SELECT initcap('first.second' COLLATE "en_US");
        initcap
        --------------
        First.Second
        (1 row)

This behaviour is reproducible on REL_12_STABLE and up to master

I don't believe that this is an erroneous behaviour, just a differingone, hence

just a documentation change proposition

I suggest adding a clarification that this function works differentlywith libcand ICU providers because there is a difference in what a "word" isbetween them


In libc a word is a sequence of alphanumeric characters, separated by

non-alphanumeric characters (as it is written in documentation rightnow)

In ICU words are divided according to Unicode® Standard Annex #29 [1]

Similar issue was briefly discussed in [2]

The suggested documentation patch is attached (versions forREL_13_STABLE+ and

for REL_12_STABLE only)

[1]: https://www.unicode.org/reports/tr29/#Word_Boundaries

[2]:https://www.postgresql.org/message-id/CAEwbS1R8pwhRkwRo3XsPt24ErBNtFWuReAZhVPJwA3oqo148tA%40mail.gmail.com


Oleg Tselebrovskiy, Postgres Professional

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 1bde4091ca6..3ce5ad1d1f1 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -3100,8 +3100,11 @@ SELECT NOT(ROW(table.*) IS NOT NULL) FROM TABLE; -- detect at least one null in
        </para>
        <para>
         Converts the first letter of each word to upper case and the
-        rest to lower case. Words are sequences of alphanumeric
-        characters separated by non-alphanumeric characters.
+        rest to lower case. When using the <literal>libc</literal> locale
+        provider, words are sequences of alphanumeric characters separated
+        by non-alphanumeric characters; when using the ICU locale provider,
+        words are separated according to
+        <ulink url="https://www.unicode.org/reports/tr29/#Word_Boundaries";>UnicodeÂ® Standard Annex #29</ulink>.
        </para>
        <para>
         <literal>initcap('hi THOMAS')</literal>

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 487bb103637..1cd281dd90b 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -1932,8 +1932,11 @@
        <entry><type>text</type></entry>
        <entry>
         Convert the first letter of each word to upper case and the
-        rest to lower case. Words are sequences of alphanumeric
-        characters separated by non-alphanumeric characters.
+        rest to lower case. When using the <literal>libc</literal> locale
+        provider, words are sequences of alphanumeric characters separated
+		by non-alphanumeric characters; when using the ICU locale provider,
+		words are separated according to
+		<ulink url="https://www.unicode.org/reports/tr29/#Word_Boundaries";>UnicodeÂ® Standard Annex #29</ulink>.
        </entry>
        <entry><literal>initcap('hi THOMAS')</literal></entry>
        <entry><literal>Hi Thomas</literal></entry>

Initcap works differently with different locale providers

Reply via email to