Greetings, everyone!
One of our clients has found a difference in behaviour of initcap
function when
using different locale providers, shown below
postgres=# create database test_db_1 locale_provider=icu
locale="ru_RU.UTF-8" template=template0;
NOTICE: using standard form "ru-RU" for ICU locale "ru_RU.UTF-8"
CREATE DATABASE
postgres=# \c test_db_1;
You are now connected to database "test_db_1" as user "postgres".
test_db_1=# select initcap('ЧиЮ А.Ю.');
initcap
----------
Чию А.ю.
(1 row)
test_db_1=# select initcap('joHn d.e.');
initcap
-----------
John D.e.
(1 row)
postgres=# create database test_db_2 locale_provider=libc
locale="ru_RU.UTF-8" template=template0;
CREATE DATABASE
postgres=# \c test_db_2
You are now connected to database "test_db_2" as user "postgres".
test_db_2=# select initcap('ЧиЮ А.Ю.');
initcap
----------
Чию А.Ю.
(1 row)
test_db_2=# select initcap('joHn d.e.');
initcap
-----------
John D.E.
(1 row)
And an easier reproduction (should work for REL_12_STABLE and up)
postgres=# SELECT initcap('first.second' COLLATE "en-x-icu");
initcap
--------------
First.second
(1 row)
postgres=# SELECT initcap('first.second' COLLATE "en_US");
initcap
--------------
First.Second
(1 row)
This behaviour is reproducible on REL_12_STABLE and up to master
I don't believe that this is an erroneous behaviour, just a differing
one, hence
just a documentation change proposition
I suggest adding a clarification that this function works differently
with libc
and ICU providers because there is a difference in what a "word" is
between them
In libc a word is a sequence of alphanumeric characters, separated by
non-alphanumeric characters (as it is written in documentation right
now)
In ICU words are divided according to Unicode® Standard Annex #29 [1]
Similar issue was briefly discussed in [2]
The suggested documentation patch is attached (versions for
REL_13_STABLE+ and
for REL_12_STABLE only)
[1]: https://www.unicode.org/reports/tr29/#Word_Boundaries
[2]:
https://www.postgresql.org/message-id/CAEwbS1R8pwhRkwRo3XsPt24ErBNtFWuReAZhVPJwA3oqo148tA%40mail.gmail.com
Oleg Tselebrovskiy, Postgres Professional
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 1bde4091ca6..3ce5ad1d1f1 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -3100,8 +3100,11 @@ SELECT NOT(ROW(table.*) IS NOT NULL) FROM TABLE; -- detect at least one null in
</para>
<para>
Converts the first letter of each word to upper case and the
- rest to lower case. Words are sequences of alphanumeric
- characters separated by non-alphanumeric characters.
+ rest to lower case. When using the <literal>libc</literal> locale
+ provider, words are sequences of alphanumeric characters separated
+ by non-alphanumeric characters; when using the ICU locale provider,
+ words are separated according to
+ <ulink url="https://www.unicode.org/reports/tr29/#Word_Boundaries">Unicode® Standard Annex #29</ulink>.
</para>
<para>
<literal>initcap('hi THOMAS')</literal>
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 487bb103637..1cd281dd90b 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -1932,8 +1932,11 @@
<entry><type>text</type></entry>
<entry>
Convert the first letter of each word to upper case and the
- rest to lower case. Words are sequences of alphanumeric
- characters separated by non-alphanumeric characters.
+ rest to lower case. When using the <literal>libc</literal> locale
+ provider, words are sequences of alphanumeric characters separated
+ by non-alphanumeric characters; when using the ICU locale provider,
+ words are separated according to
+ <ulink url="https://www.unicode.org/reports/tr29/#Word_Boundaries">Unicode® Standard Annex #29</ulink>.
</entry>
<entry><literal>initcap('hi THOMAS')</literal></entry>
<entry><literal>Hi Thomas</literal></entry>