Re: general purpose array_sort

Junwang Zhao Sat, 09 Nov 2024 06:45:56 -0800

Hi Michael,

On Fri, Nov 8, 2024 at 8:52 AM Michael Paquier <mich...@paquier.xyz> wrote:
>
> On Thu, Nov 07, 2024 at 10:06:04PM +0800, Junwang Zhao wrote:
> > I'm afraid this can not be achieved in my current implementation, a simple
> > case is:
> >
> > SELECT array_sort('{foo,bar,null,CCC,Abc,bbc}'::text[]);
> > {Abc,bar,bbc,CCC,foo,NULL}
> > SELECT array_sort('{foo,bar,null,CCC,Abc,bbc}'::text[] COLLATE "C");
> > {Abc,CCC,bar,bbc,foo,NULL}
> >
> > SELECT array_sort(a COLLATE "C") FROM (VALUES
> > ('{foo,bar,null,CCC,Abc,bbc}'::text[] COLLATE "C"),
> > ('{foo,bar,null,CCC,Abc,bbc}'::text[])) v(a);
> > {Abc,CCC,bar,bbc,foo,NULL}
> > {Abc,CCC,bar,bbc,foo,NULL}
> >
> > Maybe add some documents to specify this?
>
> So, if I use that:
> CREATE COLLATION case_sensitive (provider = icu, locale = '');
> =# SELECT array_sort('{Abc,CCC,bar,bbc,foo,NULL}'::text[]
>          COLLATE "case_sensitive");
>         array_sort
> ----------------------------
>  {Abc,bar,bbc,CCC,foo,NULL}
> (1 row)
> =# SELECT array_sort('{Abc,CCC,bar,bbc,foo,NULL}'::text[]
>          COLLATE "C");
>         array_sort
> ----------------------------
>  {Abc,CCC,bar,bbc,foo,NULL}
> (1 row)
>
> What takes priority is the collation defined with the array_sort,
> which is fine:
> =# SELECT array_sort(a collate "case_sensitive") FROM
>      (VALUES ('{foo,bar,null,CCC,Abc,bbc}'::text[]),
>              ('{foo,bar,null,CCC,Abc,bbc}'::text[] COLLATE "C" )) v(a);
>         array_sort
> ----------------------------
>  {Abc,bar,bbc,CCC,foo,NULL}
>  {Abc,bar,bbc,CCC,foo,NULL}
> (2 rows)
> =# SELECT array_sort(a collate "C") FROM
>      (VALUES ('{foo,bar,null,CCC,Abc,bbc}'::text[]),
>              ('{foo,bar,null,CCC,Abc,bbc}'::text[] COLLATE "case_sensitive" 
> )) v(a);
>         array_sort
> ----------------------------
>  {Abc,CCC,bar,bbc,foo,NULL}
>  {Abc,CCC,bar,bbc,foo,NULL}
> (2 rows)
>
> The case where the collation is defined in the set of values is a bit
> more troubling to me, as it depends on what the values want to be
> applied, still that's OK because the collation applied is the one
> coming from the set of values:
> =# SELECT array_sort(a) FROM
>      (VALUES ('{foo,bar,null,CCC,Abc,bbc}'::text[]),
>              ('{foo,bar,null,CCC,Abc,bbc}'::text[] COLLATE "case_sensitive" 
> )) v(a);
>          array_sort
> ----------------------------
>  {Abc,bar,bbc,CCC,foo,NULL}
>  {Abc,bar,bbc,CCC,foo,NULL}
> (2 rows)
> =# SELECT array_sort(a) FROM
>      (VALUES ('{foo,bar,null,CCC,Abc,bbc}'::text[]),
>              ('{foo,bar,null,CCC,Abc,bbc}'::text[] COLLATE "C" )) v(a);
>          array_sort
> ----------------------------
>  {Abc,CCC,bar,bbc,foo,NULL}
>  {Abc,CCC,bar,bbc,foo,NULL}
> (2 rows)
>
> I am wondering if there are more fancy cases where the saved cache
> could force a state that would lead to puzzling results, say with
> different collations that should be applied.  I'd recommend to
> research that more, to reflect that in the docs and to add tests that
> show what we should expect in these cases within 0001 because this
> new function is mimicking in the context of a function execution
> multiple query clauses where restrictions are applied when analyzing
> the query, close to the parser.
>
> For example, UNION and UNION ALL require a common collation when
> processing a set of expressions related to them, which would be OK.
> Perhaps I lack some imagination to be able to break things.
> --
> Michael


While trying to come up with more test cases, it comes to me if the
PG_GET_COLLATION() has already done the work to give array_sort
the right collation oid? I did not pass the typentry->typcollation but
PG_GET_COLLATION()  to tuplesort_begin_datum.

I tried:

CREATE COLLATION case_sensitive (provider = icu, locale = '');

create table t1(a int, b text[] COLLATE "C");
create table t2(a int, b text[] COLLATE "case_sensitive");

insert into t1 values (1, '{foo,bar,null,CCC,Abc,bbc}'::text[]);
insert into t2 values (2, '{foo,bar,null,CCC,Abc,bbc}'::text[]);

select array_sort(b) from t1;
select array_sort(b) from t2;

Set breakpoint to see the collation oid, they all give the expected result.

For the following cases:

SELECT array_sort(a) FROM
     (VALUES ('{foo,bar,null,CCC,Abc,bbc}'::text[]),
              ('{foo,bar,null,CCC,Abc,bbc}'::text[] COLLATE
"case_sensitive" )) v(a);

WITH t AS (select '{foo,bar,null,CCC,Abc,bbc}'::text[] a UNION ALL
select '{foo,bar,null,CCC,Abc,bbc}'::text[] COLLATE "case_sensitive"
a) SELECT array_sort(a)  from t;

The collation seems to have been decided in select_common_collation of
the transform phase.

For:

WITH t AS (select '{foo,bar,null,CCC,Abc,bbc}'::text[] COLLATE "C" a
UNION ALL select
     '{foo,bar,null,CCC,Abc,bbc}'::text[] COLLATE "case_sensitive"a)
SELECT array_sort(a) from t;
ERROR:  collation mismatch between explicit collations "C" and "case_sensitive"
LINE 2:      '{foo,bar,null,CCC,Abc,bbc}'::text[] COLLATE "case_sens...

merge_collation_state gives out an ERROR since its explicit collation.
But for implicit collation,
select_common_collation sets InvalidOid to context.collation, so the
following works:

select b from t1 UNION ALL select b from t2;

But since the context has the InvalidOid as collation,
PG_GET_COLLATION() in arrary_sort
got InvalidOid, the following errors:

WITH t3 AS (select b from t1 UNION ALL select b from t2) select
array_sort(b) from t3;
ERROR:  could not determine which collation to use for string comparison
HINT:  Use the COLLATE clause to set the collation explicitly.

The error message comes from tuplesort_begin_datum's call stack, we
can do explicit COLLATE to make it work:

WITH t3 AS (select b from t1 UNION ALL select b from t2) select
array_sort(b collate "C") from t3;

Based on the above analysis, I think it's ok to pass PG_GET_COLLATION()
to tuplesort_begin_datum.

PFA v14 with Robert's comment addressed.

-- 
Regards
Junwang Zhao

v14-0001-general-purpose-array_sort.patch
Description: Binary data

v14-0002-support-sort-order-and-nullsfirst-flag.patch
Description: Binary data

Re: general purpose array_sort

Reply via email to