watford-ep opened a new issue, #18327:
URL: https://github.com/apache/datafusion/issues/18327

   ### Describe the bug
   
   Sometimes the `SortPreservingMerge` sanity check rejects valid ORDER BY with 
CASE expressions. This doesn't universally happen, but seems to happen only the 
columns are from joins (debug output reformatted for readability):
   ```
   Plan: [
   SortPreservingMergeExec: [CASE WHEN typtype@3 = b OR typtype@3 = e OR 
typtype@3 = p THEN 0 WHEN typtype@3 = r THEN 1 END ASC NULLS LAST]
     SortExec: expr=[CASE WHEN typtype@3 = b OR typtype@3 = e OR typtype@3 = p 
THEN 0 WHEN typtype@3 = r THEN 1 ELSE 999 END ASC NULLS LAST], 
preserve_partitioning=[true]
       ProjectionExec: expr=[nspname@3 as nspname, oid@0 as oid, typname@1 as 
typname, typtype@2 as typtype]
         CoalesceBatchesExec: target_batch_size=8192
           HashJoinExec: mode=CollectLeft, join_type=Inner, 
on=[(typnamespace@2, oid@0)], projection=[oid@0, typname@1, typtype@3, 
nspname@5]
             CoalesceBatchesExec: target_batch_size=8192
               FilterExec: Use typtype@3 IN (SET) ([Literal { value: Utf8("b"), 
field: Field { name: "lit", data_type: Utf8, nullable: false, dict_id: 0, 
dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8("r"), field: 
Field { name: "lit", data_type: Utf8, nullable: false, dict_id: 0, 
dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8("m"), field: 
Field { name: "lit", data_type: Utf8, nullable: false, dict_id: 0, 
dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8("e"), field: 
Field { name: "lit", data_type: Utf8, nullable: false, dict_id: 0, 
dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8("d"), field: 
Field { name: "lit", data_type: Utf8, nullable: false, dict_id: 0, 
dict_is_ordered: false, metadata: {} } }])
                 DataSourceExec: partitions=1, partition_sizes=[1]
             RepartitionExec: partitioning=RoundRobinBatch(12), 
input_partitions=1
               StreamingTableExec: partition_sizes=1, projection=[oid, nspname]
   ] does not satisfy order requirements:
       [CASE WHEN typtype@3 = b OR typtype@3 = e OR typtype@3 = p THEN 0 WHEN 
typtype@3 = r THEN 1 END ASC NULLS LAST].
   Child-0 order:
       [[CASE WHEN typtype@3 = b OR typtype@3 = e OR typtype@3 = p THEN 0 WHEN 
typtype@3 = r THEN 1 END ASC NULLS LAST]]
   ```
   I think this shows that CASE expressions lose their ability to be compared 
for equality. Projecting the CASE and ordering on the projection does not fail.
   
   I found this using a tool which wraps [datafusion around the postgres wire 
protocol](https://github.com/datafusion-contrib/datafusion-postgres) so that I 
could use our Iceberg data lake from a BI tool which supports Postgres.
   
   ### To Reproduce
   
   I did some bisect'ing on [a query issued by npgsql at 
startup](https://github.com/npgsql/npgsql/issues/4545) and found this minimal 
reproduction:
   ```sql
   SELECT ns.nspname, typ.oid, typ.typname, typ.typtype
   FROM pg_type AS typ JOIN pg_namespace AS ns ON (ns.oid = typ.typnamespace)
   WHERE typ.typtype IN ('b','r','m','e','d')
   ORDER BY CASE WHEN typ.typtype IN ('b','e','p') THEN 0 WHEN typ.typtype = 
'r' THEN 1 END
   ```
   
   I initially thought a missing `ELSE` clause was the culprit, but it does not 
change anything:
   ```sql
   SELECT ns.nspname, typ.oid, typ.typname, typ.typtype
   FROM pg_type AS typ JOIN pg_namespace AS ns ON (ns.oid = typ.typnamespace)
   WHERE typ.typtype IN ('b','r','m','e','d')
   ORDER BY CASE WHEN typ.typtype IN ('b','e','p') THEN 0 WHEN typ.typtype = 
'r' THEN 1 ELSE 999 END
   ```
   
   Using: 
https://github.com/watford-ep/datafusion-postgres/tree/fix-revealbi-startup-sql
   ```sh
   $ cargo test --test revealbi_minimal_repro
   ```
   
   ### Expected behavior
   
   This query executes because it contains a valid ORDER BY which is stable 
across partitions, with or without the ELSE clause.
   
   ### Additional context
   
   This succeeds:
   ```sql
   SELECT ns.nspname, typ.oid, typ.typname, typ.typtype
       FROM pg_type AS typ JOIN pg_namespace AS ns ON (ns.oid = 
typ.typnamespace)
       WHERE typ.typtype IN ('b','r','m','e','d')
       ORDER BY CASE WHEN COALESCE(typ.typtype, '') IN ('b','e','p') THEN 0 
WHEN COALESCE(typ.typtype, '') = 'r' THEN 1 END
   ```
   But applying the same fix to the original query does not.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to