On Mon, Dec 2, 2024, at 6:09 AM, Peter Eisentraut wrote: > On 26.08.24 08:09, Peter Eisentraut wrote: > > This patch allows using text position search functions with > > nondeterministic collations. These functions are > > > > - position, strpos > > - replace > > - split_part > > - string_to_array > > - string_to_table > > > > which all use common internal infrastructure. > > > Some exploratory testing could be useful here. The present test > > coverage was already quite helpful during development, but there is > > always the possibility that something was overlooked.
I took a look at this patch. * Most callers will require "greedy" semantics, meaning that we need * to find the longest such substring, not the shortest. For callers * don't don't need greedy semantics, we can finish on the first s/don't don't/that don't/ ? - Assert(len1 > 0); Assert(len2 > 0); Is there a reason to remove this assert? * (With nondeterministic collations, the search was already * multibyte-aware, so we don't need this.) s/was/is/ The commit title could be changed to reflect that you are adding support for multiple functions. The POSITION gives the impression that it is only for the position() function. Something like Support position search functions with nondeterministic collations I did a couple of tests (some are shown below) and I didn't find issues. postgres=# CREATE COLLATION coll_det (provider = icu, locale = '', deterministic = true); NOTICE: using standard form "und" for ICU locale "" CREATE COLLATION postgres=# CREATE COLLATION coll_noaccents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes', deterministic = false); NOTICE: using standard form "und-u-kc-ks-level1" for ICU locale "@colStrength=primary;colCaseLevel=yes" CREATE COLLATION postgres=# \dO List of collations Schema | Name | Provider | Collate | Ctype | Locale | ICU Rules | Deterministic? --------+----------------+----------+---------+-------+--------------------+-----------+---------------- public | coll_det | icu | | | und | | yes public | coll_noaccents | icu | | | und-u-kc-ks-level1 | | no public | coll_nondet | icu | | | und | | no (3 rows) postgres=# SELECT *, replace(nome COLLATE coll_noaccents, 'Sao', 'Saint') FROM municipios WHERE nome ~ 'São L' ORDER BY 2; codigo | nome | uf | replace ---------+---------------------------------+----+---------------------------------- 3201803 | Divino de São Lourenço | ES | Divino de Saint Lourenço 4318705 | São Leopoldo | RS | Saint Leopoldo 3163706 | São Lourenço | MG | Saint Lourenço 2613701 | São Lourenço da Mata | PE | Saint Lourenço da Mata 3549953 | São Lourenço da Serra | SP | Saint Lourenço da Serra 4216909 | São Lourenço do Oeste | SC | Saint Lourenço do Oeste 2210359 | São Lourenço do Piauí | PI | Saint Lourenço do Piauí 4318804 | São Lourenço do Sul | RS | Saint Lourenço do Sul 4217006 | São Ludgero | SC | Saint Ludgero 2111300 | São Luís | MA | Saint Luís 5220108 | São Luís de Montes Belos | GO | Saint Luís de Montes Belos 2312601 | São Luís do Curu | CE | Saint Luís do Curu 3550001 | São Luís do Paraitinga | SP | Saint Luís do Paraitinga 2210375 | São Luis do Piauí | PI | Saint Luis do Piauí 2708501 | São Luís do Quitunde | AL | Saint Luís do Quitunde 2111409 | São Luís Gonzaga do Maranhão | MA | Saint Luís Gonzaga do Maranhão 1400605 | São Luiz | RR | Saint Luiz 5220157 | São Luíz do Norte | GO | Saint Luíz do Norte 4318903 | São Luiz Gonzaga | RS | Saint Luiz Gonzaga (19 rows) postgres=# select *, strpos(nome collate coll_noaccents, 'ce') from municipios where nome ~ 'Luc'; codigo | nome | uf | strpos ---------+--------------------+----+-------- 2406908 | Lucrécia | RN | 0 2508604 | Lucena | PB | 3 3527405 | Lucélia | SP | 3 3527504 | Lucianópolis | SP | 0 4315008 | Porto Lucena | RS | 9 4315149 | Presidente Lucena | RS | 14 5105259 | Lucas do Rio Verde | MT | 0 5105309 | Luciara | MT | 0 (8 rows) postgres=# select *, strpos(nome collate coll_det, 'ce') from municipios where nome ~ 'Luc'; codigo | nome | uf | strpos ---------+--------------------+----+-------- 2406908 | Lucrécia | RN | 0 2508604 | Lucena | PB | 3 3527405 | Lucélia | SP | 0 3527504 | Lucianópolis | SP | 0 4315008 | Porto Lucena | RS | 9 4315149 | Presidente Lucena | RS | 14 5105259 | Lucas do Rio Verde | MT | 0 5105309 | Luciara | MT | 0 (8 rows) postgres=# select *, split_part(nome COLLATE coll_noaccents, 'land', 1) from municipios where nome ~ 'ndia' limit 10; codigo | nome | uf | split_part ---------+---------------------------+----+------------ 1100148 | Nova Brasilândia D'Oeste | RO | Nova Brasi 1100601 | Cacaulândia | RO | Cacau 1200013 | Acrelândia | AC | Acre 1200252 | Epitaciolândia | AC | Epitacio 1400407 | Normandia | RR | Normandia 1504455 | Medicilândia | PA | Medici 1505437 | Ourilândia do Norte | PA | Ouri 1507953 | Tailândia | PA | Tai 1700251 | Abreulândia | TO | Abreu 1703008 | Babaçulândia | TO | Babaçu (10 rows) postgres=# select *, split_part(nome COLLATE coll_det, 'land', 1) from municipios where nome ~ 'ndia' limit 10; codigo | nome | uf | split_part ---------+---------------------------+----+--------------------------- 1100148 | Nova Brasilândia D'Oeste | RO | Nova Brasilândia D'Oeste 1100601 | Cacaulândia | RO | Cacaulândia 1200013 | Acrelândia | AC | Acrelândia 1200252 | Epitaciolândia | AC | Epitaciolândia 1400407 | Normandia | RR | Normandia 1504455 | Medicilândia | PA | Medicilândia 1505437 | Ourilândia do Norte | PA | Ourilândia do Norte 1507953 | Tailândia | PA | Tailândia 1700251 | Abreulândia | TO | Abreulândia 1703008 | Babaçulândia | TO | Babaçulândia (10 rows) -- Euler Taveira EDB https://www.enterprisedb.com/