Re: Support POSITION with nondeterministic collations

Euler Taveira Tue, 11 Feb 2025 20:32:34 -0800

On Mon, Dec 2, 2024, at 6:09 AM, Peter Eisentraut wrote:
> On 26.08.24 08:09, Peter Eisentraut wrote:
> > This patch allows using text position search functions with 
> > nondeterministic collations.  These functions are
> > 
> > - position, strpos
> > - replace
> > - split_part
> > - string_to_array
> > - string_to_table
> > 
> > which all use common internal infrastructure.
> 
> > Some exploratory testing could be useful here.  The present test 
> > coverage was already quite helpful during development, but there is 
> > always the possibility that something was overlooked.


I took a look at this patch.

         * Most callers will require "greedy" semantics, meaning that we need
         * to find the longest such substring, not the shortest.  For callers
         * don't don't need greedy semantics, we can finish on the first

s/don't don't/that don't/ ?

-   Assert(len1 > 0);
    Assert(len2 > 0);

Is there a reason to remove this assert?

     * (With nondeterministic collations, the search was already
     * multibyte-aware, so we don't need this.)

s/was/is/

The commit title could be changed to reflect that you are adding support for
multiple functions. The POSITION gives the impression that it is only for the
position() function. Something like

    Support position search functions with nondeterministic collations

I did a couple of tests (some are shown below) and I didn't find issues.

postgres=# CREATE COLLATION coll_det (provider = icu, locale = '', 
deterministic = true);
NOTICE:  using standard form "und" for ICU locale ""
CREATE COLLATION
postgres=# CREATE COLLATION coll_noaccents (provider = icu, locale = 
'@colStrength=primary;colCaseLevel=yes', deterministic = false);
NOTICE:  using standard form "und-u-kc-ks-level1" for ICU locale 
"@colStrength=primary;colCaseLevel=yes"
CREATE COLLATION
postgres=# \dO
                                           List of collations
Schema |      Name      | Provider | Collate | Ctype |       Locale       | ICU 
Rules | Deterministic? 
--------+----------------+----------+---------+-------+--------------------+-----------+----------------
public | coll_det       | icu      |         |       | und                |     
      | yes
public | coll_noaccents | icu      |         |       | und-u-kc-ks-level1 |     
      | no
public | coll_nondet    | icu      |         |       | und                |     
      | no
(3 rows)

postgres=# SELECT *, replace(nome COLLATE coll_noaccents, 'Sao', 'Saint') FROM 
municipios WHERE nome ~ 'São L' ORDER BY 2;
codigo  |              nome               | uf |             replace            
  
---------+---------------------------------+----+----------------------------------
3201803 | Divino de São Lourenço        | ES | Divino de Saint Lourenço
4318705 | São Leopoldo                   | RS | Saint Leopoldo
3163706 | São Lourenço                  | MG | Saint Lourenço
2613701 | São Lourenço da Mata          | PE | Saint Lourenço da Mata
3549953 | São Lourenço da Serra         | SP | Saint Lourenço da Serra
4216909 | São Lourenço do Oeste         | SC | Saint Lourenço do Oeste
2210359 | São Lourenço do Piauí        | PI | Saint Lourenço do Piauí
4318804 | São Lourenço do Sul           | RS | Saint Lourenço do Sul
4217006 | São Ludgero                    | SC | Saint Ludgero
2111300 | São Luís                      | MA | Saint Luís
5220108 | São Luís de Montes Belos      | GO | Saint Luís de Montes Belos
2312601 | São Luís do Curu              | CE | Saint Luís do Curu
3550001 | São Luís do Paraitinga        | SP | Saint Luís do Paraitinga
2210375 | São Luis do Piauí             | PI | Saint Luis do Piauí
2708501 | São Luís do Quitunde          | AL | Saint Luís do Quitunde
2111409 | São Luís Gonzaga do Maranhão | MA | Saint Luís Gonzaga do Maranhão
1400605 | São Luiz                       | RR | Saint Luiz
5220157 | São Luíz do Norte             | GO | Saint Luíz do Norte
4318903 | São Luiz Gonzaga               | RS | Saint Luiz Gonzaga
(19 rows)

postgres=# select *, strpos(nome collate coll_noaccents, 'ce') from municipios 
where nome ~ 'Luc';
codigo  |        nome        | uf | strpos 
---------+--------------------+----+--------
2406908 | Lucrécia          | RN |      0
2508604 | Lucena             | PB |      3
3527405 | Lucélia           | SP |      3
3527504 | Lucianópolis      | SP |      0
4315008 | Porto Lucena       | RS |      9
4315149 | Presidente Lucena  | RS |     14
5105259 | Lucas do Rio Verde | MT |      0
5105309 | Luciara            | MT |      0
(8 rows)

postgres=# select *, strpos(nome collate coll_det, 'ce') from municipios where 
nome ~ 'Luc';
codigo  |        nome        | uf | strpos 
---------+--------------------+----+--------
2406908 | Lucrécia          | RN |      0
2508604 | Lucena             | PB |      3
3527405 | Lucélia           | SP |      0
3527504 | Lucianópolis      | SP |      0
4315008 | Porto Lucena       | RS |      9
4315149 | Presidente Lucena  | RS |     14
5105259 | Lucas do Rio Verde | MT |      0
5105309 | Luciara            | MT |      0
(8 rows)

postgres=# select *, split_part(nome COLLATE coll_noaccents, 'land', 1) from 
municipios where nome ~ 'ndia' limit 10;
codigo  |           nome            | uf | split_part 
---------+---------------------------+----+------------
1100148 | Nova Brasilândia D'Oeste | RO | Nova Brasi
1100601 | Cacaulândia              | RO | Cacau
1200013 | Acrelândia               | AC | Acre
1200252 | Epitaciolândia           | AC | Epitacio
1400407 | Normandia                 | RR | Normandia
1504455 | Medicilândia             | PA | Medici
1505437 | Ourilândia do Norte      | PA | Ouri
1507953 | Tailândia                | PA | Tai
1700251 | Abreulândia              | TO | Abreu
1703008 | Babaçulândia            | TO | Babaçu
(10 rows)

postgres=# select *, split_part(nome COLLATE coll_det, 'land', 1) from 
municipios where nome ~ 'ndia' limit 10;
codigo  |           nome            | uf |        split_part         
---------+---------------------------+----+---------------------------
1100148 | Nova Brasilândia D'Oeste | RO | Nova Brasilândia D'Oeste
1100601 | Cacaulândia              | RO | Cacaulândia
1200013 | Acrelândia               | AC | Acrelândia
1200252 | Epitaciolândia           | AC | Epitaciolândia
1400407 | Normandia                 | RR | Normandia
1504455 | Medicilândia             | PA | Medicilândia
1505437 | Ourilândia do Norte      | PA | Ourilândia do Norte
1507953 | Tailândia                | PA | Tailândia
1700251 | Abreulândia              | TO | Abreulândia
1703008 | Babaçulândia            | TO | Babaçulândia
(10 rows)


--
Euler Taveira
EDB   https://www.enterprisedb.com/

Re: Support POSITION with nondeterministic collations

Reply via email to