Inconsistent string comparison using modified ICU collations

Oleg Tselebrovskiy Wed, 22 Jan 2025 01:03:57 -0800

Greetings, everyone!

I've discovered a bug with string comparison using modified ICU collations.

A direct comparison and sorting the same values give different results.


The easiest way to reproduce is the following:

postgres=# create collation "en-US-u-kr-latn-digit-x-icu" (provider = icu, locale = 'en-US-u-kr-latn-digit');
CREATE COLLATION
postgres=# select ('a' < '0' collate "en-US-u-kr-latn-digit-x-icu");
 ?column?
----------
 f
(1 row)

postgres=# select * from (values ('0'),('a')) t(x) order by x collate "en-US-u-kr-latn-digit-x-icu";
 x
---
 a
 0
(2 rows)

Why this happens:

In the first example (a simple comparison), the function varstr_cmp is called, and it uses the ucol_strcoll[UTF8] function to compare the two strings. That function seems to ignore the reordering of character groups in the collation.

In the second example (sorting values), the function varstr_abbrev_convert is called, and somewhere deep inside it uses ucol_getSortKey/ucol_nextSortKeyPart to transform the source string into a sort key. This transformation does take the reordering of character groups into account.
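
To make the disagreement between the two code paths concrete, here is a toy Python model. It does not call ICU or PostgreSQL: direct_less_than and toy_sort_key are hypothetical stand-ins, the first for a comparison that ignores reordering and the second for a sort key that ranks letters before digits.

```python
def direct_less_than(a, b):
    # Hypothetical stand-in for the direct comparison path:
    # plain code point order, so digits come before letters.
    return a < b


def toy_sort_key(s):
    # Hypothetical stand-in for the sort-key path with "kr-latn-digit":
    # letters get rank 0, everything else rank 1.
    return [(0 if ch.isalpha() else 1, ch) for ch in s]


values = ["0", "a"]

# Direct comparison says 'a' is NOT less than '0' ...
compare_result = direct_less_than("a", "0")

# ... yet sorting by key places 'a' before '0'.
sorted_values = sorted(values, key=toy_sort_key)

print(compare_result)  # False
print(sorted_values)   # ['a', '0']
```

The two results mirror the "f" from the comparison query and the a, 0 order from the ORDER BY query above.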

Another way to reproduce this behaviour is to create a table, insert data into it, and create a btree index on it. Some queries then fail to return rows that are definitely in the table (if you force postgres to use a seqscan, the query works):

postgres=# create collation "en-US-u-kr-latn-digit-x-icu" (provider = icu, locale = 'en-US-u-kr-latn-digit');
create table test (col text COLLATE "en-US-u-kr-latn-digit-x-icu");
insert into test values ('a'), ('0');
create index test_idx ON test USING btree (col);
set enable_seqscan = off;
select * from test where col = 'a';
select * from test where col = '0';
CREATE COLLATION
CREATE TABLE
INSERT 0 2
CREATE INDEX
SET
 col
-----
(0 rows)

 col
-----
(0 rows)

This happens because selecting a single row triggers a binary search over the index. Rows in the index are stored according to the modified collation (letters before digits), since creating the index triggers a sort of the rows and, therefore, the use of ucol_getSortKey/ucol_nextSortKeyPart. But the WHERE filter uses ucol_strcoll[UTF8], which doesn't take the modified collation into account.
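
As a rough illustration of the lookup failure described above, here is a toy Python model. It is not PostgreSQL's btree code and does not call ICU: reordered_key and lookup are hypothetical helpers. reordered_key stands in for the sort-key ordering (letters before digits), and Python's default string comparison stands in for the comparison that ignores the reordering.

```python
from bisect import bisect_left


def reordered_key(s):
    # Hypothetical sort key: letters rank before digits.
    return [(0 if ch.isalpha() else 1, ch) for ch in s]


def lookup(sorted_rows, probe):
    # Binary search using default string comparison (digits before letters).
    i = bisect_left(sorted_rows, probe)
    return i < len(sorted_rows) and sorted_rows[i] == probe


rows = ["a", "0"]

# "Index" built with the reordered key: ['a', '0'].
index = sorted(rows, key=reordered_key)

# Probing with a comparison that disagrees with the build order misses both rows.
found_a = lookup(index, "a")
found_0 = lookup(index, "0")

# For contrast, an index built and probed with the same ordering finds both.
consistent_index = sorted(rows)
found_a_consistent = lookup(consistent_index, "a")
found_0_consistent = lookup(consistent_index, "0")

print(index, found_a, found_0)
print(consistent_index, found_a_consistent, found_0_consistent)
```

In this toy model both probes miss, which matches the two empty results above. The binary search assumes the data is sorted by the same ordering it compares with, so it narrows to the wrong position.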

This can be reproduced from REL_13_STABLE up to the current master
(41084409f635453efce03f1114880189b4f6ce4c)

I've opened an issue in ICU Jira[1] where I have reproduced this behaviour using minimal C code.

To compose the collation name I read and used an article by Peter Eisentraut on ICU collation settings[2].

Unfortunately, I don't have a proposed solution for this issue, but I thought it was important to highlight it.

Oleg Tselebrovskiy, Postgres Pro

[1] https://unicode-org.atlassian.net/browse/ICU-23016

[2] http://peter.eisentraut.org/blog/2023/05/16/overview-of-icu-collation-settings
