Support LIKE with nondeterministic collations

Peter Eisentraut Sun, 28 Apr 2024 23:45:43 -0700

This patch adds support for using LIKE with nondeterministic collations.So you can do things such as


    col LIKE 'foo%' COLLATE case_insensitive

This currently results in a "not supported" error. The reason for thatis that when I first developed support for nondeterministic collations,I didn't know what the semantics of that should be, especially sincewith nondeterministic collations, strings of different lengths could beequal, and then dropped the issue for a while.

After further research, the SQL standard's definition of the LIKEpredicate actually provides a clear definition of the semantics: Thepattern is partitioned into substrings at wildcard characters (so'foo%bar' is partitioned into 'foo', '%', 'bar') and then then wholepredicate matches if a match can be found for each partition under theapplicable collation (so for 'foo%bar' we look to partition the inputstring into s1 || s2 || s3 such that s1 = 'foo', s2 is anything, and s3= 'bar'.) The only difference to deterministic collations is that fordeterministic collations we can optimize this by matching by character,but for nondeterministic collations we have to go by substring.

From 3f6b584a0f34cabecac69f3cfd663dadfd34f464 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <pe...@eisentraut.org>
Date: Mon, 29 Apr 2024 07:58:20 +0200
Subject: [PATCH v1] Support LIKE with nondeterministic collations

This allows for example using LIKE with case-insensitive collations.
There was previously no internal implementation of this, so it was met
with a not-supported error.  This adds the internal implementation and
removes the error.

Unlike with deterministic collations, the LIKE matching cannot go
character by character but has to go substring by substring.  For
example, if we are matching against LIKE 'foo%bar', we can't start by
looking for an 'f', then an 'o', but instead with have to find
something that matches 'foo'.  This is because the collation could
consider substrings of different lengths to be equal.  This is all
internal to MatchText() in like_match.c.

The changes in GenericMatchText() in like.c just pass through the
locale information to MatchText(), which was previously not needed.
This matches exactly Generic_Text_IC_like() below.

Note that ILIKE is not affected.  It's unclear whether ILIKE makes
sense under nondeterministic collations.

This also updates match_pattern_prefix() in like_support.c to support
optimizing the case of an exact pattern with nondeterministic
collations.  This was already alluded to in the previous code.
---
 doc/src/sgml/charset.sgml                     |   2 +-
 doc/src/sgml/func.sgml                        |   7 +-
 src/backend/utils/adt/like.c                  |  30 +++--
 src/backend/utils/adt/like_match.c            | 118 ++++++++++++++++++
 src/backend/utils/adt/like_support.c          |  29 ++---
 .../regress/expected/collate.icu.utf8.out     |  81 ++++++++++--
 src/test/regress/sql/collate.icu.utf8.sql     |  23 +++-
 7 files changed, 250 insertions(+), 40 deletions(-)

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index daf671e6205..ec7ffb360dd 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -1197,7 +1197,7 @@ <title>Nondeterministic Collations</title>
      to a performance penalty.  Note, in particular, that B-tree cannot use
      deduplication with indexes that use a nondeterministic collation.  Also,
      certain operations are not possible with nondeterministic collations,
-     such as pattern matching operations.  Therefore, they should be used
+     such as some pattern matching operations.  Therefore, they should be used
      only in cases where they are specifically wanted.
     </para>
 
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 1928de57623..6181cc64712 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -5374,9 +5374,10 @@ <title>Pattern Matching</title>
    </caution>
 
    <para>
-    The pattern matching operators of all three kinds do not support
-    nondeterministic collations.  If required, apply a different collation to
-    the expression to work around this limitation.
+    <function>SIMILAR TO</function> and <acronym>POSIX</acronym>-style regular
+    expressions do not support nondeterministic collations.  If required, use
+    <function>LIKE</function> or apply a different collation to the expression
+    to work around this limitation.
    </para>
 
   <sect2 id="functions-like">
diff --git a/src/backend/utils/adt/like.c b/src/backend/utils/adt/like.c
index 57ead66b5aa..bbbe6c09d18 100644
--- a/src/backend/utils/adt/like.c
+++ b/src/backend/utils/adt/like.c
@@ -149,22 +149,32 @@ SB_lower_char(unsigned char c, pg_locale_t locale, bool 
locale_is_c)
 static inline int
 GenericMatchText(const char *s, int slen, const char *p, int plen, Oid 
collation)
 {
-       if (collation && !lc_ctype_is_c(collation))
-       {
-               pg_locale_t locale = pg_newlocale_from_collation(collation);
+       pg_locale_t locale = 0;
+       bool            locale_is_c = false;
 
-               if (!pg_locale_deterministic(locale))
-                       ereport(ERROR,
-                                       (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-                                        errmsg("nondeterministic collations 
are not supported for LIKE")));
+       if (!OidIsValid(collation))
+       {
+               /*
+                * This typically means that the parser could not resolve a 
conflict
+                * of implicit collations, so report it that way.
+                */
+               ereport(ERROR,
+                               (errcode(ERRCODE_INDETERMINATE_COLLATION),
+                                errmsg("could not determine which collation to 
use for LIKE"),
+                                errhint("Use the COLLATE clause to set the 
collation explicitly.")));
        }
 
+       if (lc_ctype_is_c(collation))
+               locale_is_c = true;
+       else
+               locale = pg_newlocale_from_collation(collation);
+
        if (pg_database_encoding_max_length() == 1)
-               return SB_MatchText(s, slen, p, plen, 0, true);
+               return SB_MatchText(s, slen, p, plen, locale, locale_is_c);
        else if (GetDatabaseEncoding() == PG_UTF8)
-               return UTF8_MatchText(s, slen, p, plen, 0, true);
+               return UTF8_MatchText(s, slen, p, plen, locale, locale_is_c);
        else
-               return MB_MatchText(s, slen, p, plen, 0, true);
+               return MB_MatchText(s, slen, p, plen, locale, locale_is_c);
 }
 
 static inline int
diff --git a/src/backend/utils/adt/like_match.c 
b/src/backend/utils/adt/like_match.c
index f2990edff7e..e282a50b3ba 100644
--- a/src/backend/utils/adt/like_match.c
+++ b/src/backend/utils/adt/like_match.c
@@ -198,6 +198,124 @@ MatchText(const char *t, int tlen, const char *p, int 
plen,
                        NextByte(p, plen);
                        continue;
                }
+               else if (locale && !locale->deterministic)
+               {
+                       /*
+                        * For nondeterministic locales, we find the next 
substring of the
+                        * pattern that does not contain wildcards and try to 
find a
+                        * matching substring in the text.  Crucially, we 
cannot do this
+                        * character by character, as in the normal case, but 
must do it
+                        * substring by substring, partitioned by the wildcard 
characters.
+                        */
+                       const char *p1;
+                       size_t          p1len;
+                       const char *t1;
+                       size_t          t1len;
+                       bool            found_escape;
+                       const char *subpat;
+                       size_t          subpatlen;
+                       char       *buf = NULL;
+
+                       /*
+                        * Determine next substring of pattern without 
wildcards.  p is
+                        * the start of the subpattern, p1 is one past the last 
byte. Also
+                        * track if we found an escape character.
+                        */
+                       p1 = p;
+                       p1len = plen;
+                       found_escape = false;
+                       while (p1len > 0)
+                       {
+                               if (*p1 == '\\')
+                               {
+                                       found_escape = true;
+                                       NextByte(p1, p1len);
+                               }
+                               else if (*p1 == '_' || *p1 == '%')
+                                       break;
+                               NextByte(p1, p1len);
+                       }
+
+                       /*
+                        * If we found an escape character, then make an 
unescaped copy of
+                        * the subpattern.
+                        */
+                       if (found_escape)
+                       {
+                               char       *b;
+
+                               b = buf = palloc(p1 - p);
+                               for (const char *c = p; c < p1; c++)
+                               {
+                                       if (*c == '\\')
+                                               ;
+                                       else
+                                               *(b++) = *c;
+                               }
+
+                               subpat = buf;
+                               subpatlen = b - buf;
+                       }
+                       else
+                       {
+                               subpat = p;
+                               subpatlen = p1 - p;
+                       }
+
+                       /*
+                        * Now build a substring of the text and try to match 
it against
+                        * the subpattern.  t is the start of the text, t1 is 
one past the
+                        * last byte.  We start with a zero-length string.
+                        */
+                       t1 = t;
+                       t1len = tlen;
+                       for (;;)
+                       {
+                               int                     cmp;
+
+                               cmp = pg_strncoll(subpat, subpatlen, t, (t1 - 
t), locale);
+
+                               /*
+                                * If we found a match, we have to test if the 
rest of pattern
+                                * can match against the rest of the string.  
Otherwise we
+                                * have to continue here try matching with a 
longer substring.
+                                * (This is similar to the recursion for the 
'%' wildcard
+                                * above.)
+                                *
+                                * Note that we can't just wind forward p and t 
and continue
+                                * with the main loop.  This would fail for 
example with
+                                *
+                                * U&'\0061\0308bc' LIKE U&'\00E4_c' COLLATE 
ignore_accents
+                                *
+                                * You'd find that t=\0061 matches p=\00E4, but 
then the rest
+                                * won't match; but t=\0061\0308 also matches 
p=\00E4, and
+                                * then the rest will match.
+                                */
+                               if (cmp == 0)
+                               {
+                                       int                     matched = 
MatchText(t1, t1len, p1, p1len, locale, locale_is_c);
+
+                                       if (matched != LIKE_FALSE)
+                                       {
+                                               if (buf)
+                                                       pfree(buf);
+                                               return matched; /* TRUE or 
ABORT */
+                                       }
+                               }
+
+                               /*
+                                * Didn't match.  If we used up the whole text, 
then the match
+                                * fails.  Otherwise, try again with a longer 
substring.
+                                */
+                               if (t1len == 0)
+                                       return LIKE_FALSE;
+                               else
+                                       NextChar(t1, t1len);
+                       }
+                       if (buf)
+                               pfree(buf);
+                       continue;
+               }
                else if (GETCHAR(*p) != GETCHAR(*t))
                {
                        /* non-wildcard pattern char fails to match text char */
diff --git a/src/backend/utils/adt/like_support.c 
b/src/backend/utils/adt/like_support.c
index 2635050861f..3c691a5cc95 100644
--- a/src/backend/utils/adt/like_support.c
+++ b/src/backend/utils/adt/like_support.c
@@ -272,22 +272,6 @@ match_pattern_prefix(Node *leftop,
                return NIL;
        patt = (Const *) rightop;
 
-       /*
-        * Not supported if the expression collation is nondeterministic.  The
-        * optimized equality or prefix tests use bytewise comparisons, which is
-        * not consistent with nondeterministic collations.  The actual
-        * pattern-matching implementation functions will later error out that
-        * pattern-matching is not supported with nondeterministic collations. 
(We
-        * could also error out here, but by doing it later we get more precise
-        * error messages.)  (It should be possible to support at least
-        * Pattern_Prefix_Exact, but no point as long as the actual
-        * pattern-matching implementations don't support it.)
-        *
-        * expr_coll is not set for a non-collation-aware data type such as 
bytea.
-        */
-       if (expr_coll && !get_collation_isdeterministic(expr_coll))
-               return NIL;
-
        /*
         * Try to extract a fixed prefix from the pattern.
         */
@@ -404,6 +388,8 @@ match_pattern_prefix(Node *leftop,
        {
                if (!op_in_opfamily(eqopr, opfamily))
                        return NIL;
+               if (indexcollation != expr_coll)
+                       return NIL;
                expr = make_opclause(eqopr, BOOLOID, false,
                                                         (Expr *) leftop, (Expr 
*) prefix,
                                                         InvalidOid, 
indexcollation);
@@ -411,6 +397,17 @@ match_pattern_prefix(Node *leftop,
                return result;
        }
 
+       /*
+        * Anything other than Pattern_Prefix_Exact is not supported if the
+        * expression collation is nondeterministic.  The optimized equality or
+        * prefix tests use bytewise comparisons, which is not consistent with
+        * nondeterministic collations.
+        *
+        * expr_coll is not set for a non-collation-aware data type such as 
bytea.
+        */
+       if (expr_coll && !get_collation_isdeterministic(expr_coll))
+               return NIL;
+
        /*
         * Otherwise, we have a nonempty required prefix of the values.  Some
         * opclasses support prefix checks directly, otherwise we'll try to
diff --git a/src/test/regress/expected/collate.icu.utf8.out 
b/src/test/regress/expected/collate.icu.utf8.out
index 4b8c8f143f3..6a4509aeb46 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1272,6 +1272,30 @@ CREATE COLLATION ctest_det (provider = icu, locale = '', 
deterministic = true);
 NOTICE:  using standard form "und" for ICU locale ""
 CREATE COLLATION ctest_nondet (provider = icu, locale = '', deterministic = 
false);
 NOTICE:  using standard form "und" for ICU locale ""
+SELECT 'abc' LIKE 'abc' COLLATE ctest_det;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_det;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT 'abc' LIKE 'abc' COLLATE ctest_nondet;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_nondet;
+ ?column? 
+----------
+ t
+(1 row)
+
 CREATE TABLE test6 (a int, b text);
 -- same string in different normal forms
 INSERT INTO test6 VALUES (1, U&'\00E4bc');
@@ -1296,6 +1320,19 @@ SELECT * FROM test6 WHERE b = 'äbc' COLLATE ctest_nondet;
  2 | äbc
 (2 rows)
 
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_det;
+ a |  b  
+---+-----
+ 1 | äbc
+(1 row)
+
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_nondet;
+ a |  b  
+---+-----
+ 1 | äbc
+ 2 | äbc
+(2 rows)
+
 -- same with arrays
 CREATE TABLE test6a (a int, b text[]);
 INSERT INTO test6a VALUES (1, ARRAY[U&'\00E4bc']);
@@ -1512,7 +1549,12 @@ SELECT x FROM test3ci WHERE x <> 'abc';
 (2 rows)
 
 SELECT x FROM test3ci WHERE x LIKE 'a%';
-ERROR:  nondeterministic collations are not supported for LIKE
+  x  
+-----
+ abc
+ ABC
+(2 rows)
+
 SELECT x FROM test3ci WHERE x ILIKE 'a%';
 ERROR:  nondeterministic collations are not supported for ILIKE
 SELECT x FROM test3ci WHERE x SIMILAR TO 'a%';
@@ -1630,7 +1672,12 @@ SELECT x FROM test3bpci WHERE x <> 'abc';
 (2 rows)
 
 SELECT x FROM test3bpci WHERE x LIKE 'a%';
-ERROR:  nondeterministic collations are not supported for LIKE
+  x  
+-----
+ abc
+ ABC
+(2 rows)
+
 SELECT x FROM test3bpci WHERE x ILIKE 'a%';
 ERROR:  nondeterministic collations are not supported for ILIKE
 SELECT x FROM test3bpci WHERE x SIMILAR TO 'a%';
@@ -1727,7 +1774,7 @@ SELECT string_to_array('ABCDEFGHI'::char(9) COLLATE 
case_insensitive, NULL, 'b')
 -- This tests the issue described in match_pattern_prefix().  In the
 -- absence of that check, the case_insensitive tests below would
 -- return no rows where they should logically return one.
-CREATE TABLE test4c (x text COLLATE "C");
+CREATE TABLE test4c (x text COLLATE case_insensitive);
 INSERT INTO test4c VALUES ('abc');
 CREATE INDEX ON test4c (x);
 SET enable_seqscan = off;
@@ -1741,10 +1788,18 @@ SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE 
case_sensitive;  -- ok, no rows
 ---
 (0 rows)
 
-SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive;  -- error
-ERROR:  nondeterministic collations are not supported for LIKE
-SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive;  -- error
-ERROR:  nondeterministic collations are not supported for LIKE
+SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive;  -- ok
+  x  
+-----
+ abc
+(1 row)
+
+SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive;  -- ok
+  x  
+-----
+ abc
+(1 row)
+
 RESET enable_seqscan;
 -- Unicode special case: different variants of Greek lower case sigma.
 -- A naive implementation like citext that just does lower(x) =
@@ -1838,6 +1893,18 @@ SELECT * FROM test4 WHERE b = 'Cote' COLLATE 
case_insensitive;
  1 | cote
 (1 row)
 
+-- This is a tricky one.  A naive implementation would first test
+-- \00E4 matches \0061, which is true under ignore_accents, but then
+-- the rest of the string won't match anymore.  Therefore, the
+-- algorithm has to test whether the rest of the string matches, and
+-- if not try matching \00E4 against a longer substring like
+-- \0061\0308, which will then work out.
+SELECT U&'\0061\0308bc' LIKE U&'\00E4_c' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
 -- foreign keys (should use collation of primary key)
 -- PK is case-sensitive, FK is case-insensitive
 CREATE TABLE test10pk (x text COLLATE case_sensitive PRIMARY KEY);
diff --git a/src/test/regress/sql/collate.icu.utf8.sql 
b/src/test/regress/sql/collate.icu.utf8.sql
index 80f28a97d78..481a995c998 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -514,6 +514,12 @@ CREATE COLLATION testcoll_rulesx (provider = icu, locale = 
'', rules = '!!wrong!
 CREATE COLLATION ctest_det (provider = icu, locale = '', deterministic = true);
 CREATE COLLATION ctest_nondet (provider = icu, locale = '', deterministic = 
false);
 
+SELECT 'abc' LIKE 'abc' COLLATE ctest_det;
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_det;
+
+SELECT 'abc' LIKE 'abc' COLLATE ctest_nondet;
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_nondet;
+
 CREATE TABLE test6 (a int, b text);
 -- same string in different normal forms
 INSERT INTO test6 VALUES (1, U&'\00E4bc');
@@ -522,6 +528,9 @@ CREATE TABLE test6 (a int, b text);
 SELECT * FROM test6 WHERE b = 'äbc' COLLATE ctest_det;
 SELECT * FROM test6 WHERE b = 'äbc' COLLATE ctest_nondet;
 
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_det;
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_nondet;
+
 -- same with arrays
 CREATE TABLE test6a (a int, b text[]);
 INSERT INTO test6a VALUES (1, ARRAY[U&'\00E4bc']);
@@ -637,14 +646,14 @@ CREATE UNIQUE INDEX ON test3bpci (x);  -- error
 -- This tests the issue described in match_pattern_prefix().  In the
 -- absence of that check, the case_insensitive tests below would
 -- return no rows where they should logically return one.
-CREATE TABLE test4c (x text COLLATE "C");
+CREATE TABLE test4c (x text COLLATE case_insensitive);
 INSERT INTO test4c VALUES ('abc');
 CREATE INDEX ON test4c (x);
 SET enable_seqscan = off;
 SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_sensitive;  -- ok, no rows
 SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_sensitive;  -- ok, no 
rows
-SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive;  -- error
-SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive;  -- error
+SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive;  -- ok
+SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive;  -- ok
 RESET enable_seqscan;
 
 -- Unicode special case: different variants of Greek lower case sigma.
@@ -687,6 +696,14 @@ CREATE TABLE test4 (a int, b text);
 SELECT * FROM test4 WHERE b = 'Cote' COLLATE ignore_accents;  -- still 
case-sensitive
 SELECT * FROM test4 WHERE b = 'Cote' COLLATE case_insensitive;
 
+-- This is a tricky one.  A naive implementation would first test
+-- \00E4 matches \0061, which is true under ignore_accents, but then
+-- the rest of the string won't match anymore.  Therefore, the
+-- algorithm has to test whether the rest of the string matches, and
+-- if not try matching \00E4 against a longer substring like
+-- \0061\0308, which will then work out.
+SELECT U&'\0061\0308bc' LIKE U&'\00E4_c' COLLATE ignore_accents;
+
 -- foreign keys (should use collation of primary key)
 
 -- PK is case-sensitive, FK is case-insensitive

base-commit: 5c9f35fc48ea99e59300a267e090e3eafd1b3b0e
-- 
2.44.0

Support LIKE with nondeterministic collations

Reply via email to