Hi, Oleg!
On 25 Sep 2024, at 18:13, Oleg Tselebrovskiy <o.tselebrovs...@postgrespro.ru> wrote:
Greetings, everyone!
One of our clients has found a difference in the behaviour of the initcap function when using different locale providers, as shown below:

postgres=# create database test_db_1 locale_provider=icu locale="ru_RU.UTF-8" template=template0;
NOTICE:  using standard form "ru-RU" for ICU locale "ru_RU.UTF-8"
CREATE DATABASE
postgres=# \c test_db_1;
You are now connected to database "test_db_1" as user "postgres".
test_db_1=# select initcap('ЧиЮ А.Ю.');
 initcap
----------
 Чию А.ю.
(1 row)

test_db_1=# select initcap('joHn d.e.');
  initcap
-----------
 John D.e.
(1 row)

postgres=# create database test_db_2 locale_provider=libc locale="ru_RU.UTF-8" template=template0;
CREATE DATABASE
postgres=# \c test_db_2
You are now connected to database "test_db_2" as user "postgres".
test_db_2=# select initcap('ЧиЮ А.Ю.');
 initcap
----------
 Чию А.Ю.
(1 row)

test_db_2=# select initcap('joHn d.e.');
  initcap
-----------
 John D.E.
(1 row)

And an easier reproduction (should work for REL_12_STABLE and up):

postgres=# SELECT initcap('first.second' COLLATE "en-x-icu");
   initcap
--------------
 First.second
(1 row)

postgres=# SELECT initcap('first.second' COLLATE "en_US");
   initcap
--------------
 First.Second
(1 row)

This behaviour is reproducible on REL_12_STABLE and up to master
I don't believe that this is erroneous behaviour, just differing behaviour, hence I propose only a documentation change.
I suggest adding a clarification that this function works differently with the libc and ICU providers, because the two define a "word" differently.
In libc, a word is a sequence of alphanumeric characters, separated by non-alphanumeric characters (as it is written in the documentation right now).
In ICU, words are divided according to Unicode® Standard Annex #29 [1].
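
To make the two definitions of a "word" concrete, here is a minimal Python sketch that reproduces the outputs shown above. It is not how PostgreSQL, libc, or ICU implement initcap. The function names and the MID_CHARS set are hypothetical, and the second function approximates only one part of Annex #29 (rules WB6/WB7, under which a period or apostrophe between two letters does not end a word).

```python
# Illustrative sketch only; names are hypothetical, not PostgreSQL or ICU code.

# Assumption: a small subset of the characters that UAX #29 allows inside a
# word when they sit between two letters (full stop, apostrophes).
MID_CHARS = {".", "'", "\u2019"}


def initcap_alnum(text: str) -> str:
    """libc-style rule: a word is a run of alphanumeric characters."""
    out = []
    prev_alnum = False
    for ch in text:
        # Uppercase the first character of each alphanumeric run,
        # lowercase the rest of the run.
        out.append(ch.lower() if prev_alnum else ch.upper())
        prev_alnum = ch.isalnum()
    return "".join(out)


def initcap_uax29_sketch(text: str) -> str:
    """Rough approximation of UAX #29 word boundaries (WB6/WB7 only)."""
    out = []
    in_word = False
    last = len(text) - 1
    for i, ch in enumerate(text):
        if ch.isalnum():
            out.append(ch.lower() if in_word else ch.upper())
            in_word = True
        elif (in_word and ch in MID_CHARS and 0 < i < last
              and text[i - 1].isalpha() and text[i + 1].isalpha()):
            # A mid-word character between two letters does not end the word.
            out.append(ch)
        else:
            out.append(ch)
            in_word = False
    return "".join(out)


if __name__ == "__main__":
    for sample in ("first.second", "joHn d.e.", "ЧиЮ А.Ю."):
        print(initcap_alnum(sample), "|", initcap_uax29_sketch(sample))
```

The only difference between the two functions is whether a period between letters ends the current word, which is the distinction the proposed documentation change describes.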
A similar issue was briefly discussed in [2].
The suggested documentation patch is attached (versions for REL_13_STABLE+ and for REL_12_STABLE only)
[1]: https://www.unicode.org/reports/tr29/#Word_Boundaries
[2]: https://www.postgresql.org/message-id/CAEwbS1R8pwhRkwRo3XsPt24ErBNtFWuReAZhVPJwA3oqo148tA%40mail.gmail.com

Oleg Tselebrovskiy, Postgres Professional
<v1-0001-string-functions.patch>
<v1-0002-string-functions-REL_12.patch>

I can confirm that initcap works with libc and libicu as you stated. The documentation patch looks good to me. I’ve written a commit message. The REL_12_STABLE branch is not relevant anymore as it’s out of support. I’m going to push this if there are no objections.

------
Regards,
Alexander Korotkov
Supabase

v2-0001-Clarify-documentation-for-the-initcap-function.patch