[I] ParallelizeSorts, a subrule of EnforceSorting optimizer, should not remove necessary coalesce. [datafusion]

via GitHub Sat, 15 Feb 2025 22:52:55 -0800


wiedld opened a new issue, #14691:
URL: https://github.com/apache/datafusion/issues/14691


   ### Describe the bug
   
   During the EnforceSorting optimizer run, a valid plan may be turned into an 
invalid one by the removal of a necessary coalesce. The result is a 
planning-time failure in the SanityChecker, reported as `does not satisfy 
distribution requirements: HashPartitioned[[a@0]]). Child-0 output partitioning: 
UnknownPartitioning(2)`.
   
   
   We start with a valid input plan:
   ```
   "SortExec: expr=[a@0 ASC], preserve_partitioning=[false]",
   "  AggregateExec: mode=SinglePartitioned, gby=[a@0 as a1], aggr=[]",
   "    CoalescePartitionsExec",
   "      ProjectionExec: expr=[a@0 as a, b@1 as value]",
   "        UnionExec",
   "          DataSourceExec: file_groups={1 group: [[x]]}, projection=[a, b, 
c, d, e], file_type=parquet",
   "          DataSourceExec: file_groups={1 group: [[x]]}, projection=[a, b, 
c, d, e], file_type=parquet"
   ```
   
   The coalesce is then removed, which makes the plan invalid:
   ```
   "SortPreservingMergeExec: [a@0 ASC]",
   "  SortExec: expr=[a@0 ASC], preserve_partitioning=[true]",
   "    AggregateExec: mode=SinglePartitioned, gby=[a@0 as a1], aggr=[]",
   "      ProjectionExec: expr=[a@0 as a, b@1 as value]",
   "        UnionExec",
   "          DataSourceExec: file_groups={1 group: [[x]]}, projection=[a, b, 
c, d, e], file_type=parquet",
   "          DataSourceExec: file_groups={1 group: [[x]]}, projection=[a, b, 
c, d, e], file_type=parquet",
   ```
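
To illustrate why the second plan fails the check, here is a minimal sketch of the distribution test, written as a toy model. The `Partitioning` and `Distribution` enums and the `satisfies` method below are simplified stand-ins invented for this illustration, not DataFusion's actual types or API. The rule that a single partition trivially satisfies a hash requirement is an assumption of the model.

```rust
// Toy model only: these types are hypothetical stand-ins, NOT DataFusion's API.

/// How a node's output rows are spread across partitions.
#[derive(Debug, Clone, PartialEq)]
enum Partitioning {
    /// `n` partitions with no known partitioning key.
    Unknown(usize),
    /// Hash-partitioned on the given columns into `n` partitions.
    Hash(Vec<String>, usize),
}

/// What a parent node requires of its child's output.
#[derive(Debug, Clone, PartialEq)]
enum Distribution {
    SinglePartition,
    HashPartitioned(Vec<String>),
}

impl Partitioning {
    fn partition_count(&self) -> usize {
        match self {
            Partitioning::Unknown(n) | Partitioning::Hash(_, n) => *n,
        }
    }

    /// Returns true if this output partitioning meets the requirement.
    fn satisfies(&self, required: &Distribution) -> bool {
        match required {
            Distribution::SinglePartition => self.partition_count() == 1,
            Distribution::HashPartitioned(keys) => {
                // Assumed rule: with one partition, all rows for any key
                // are already co-located, so the requirement holds.
                if self.partition_count() == 1 {
                    return true;
                }
                matches!(self, Partitioning::Hash(cols, _) if cols == keys)
            }
        }
    }
}

fn main() {
    // Requirement of the SinglePartitioned aggregate in the plans above.
    let required = Distribution::HashPartitioned(vec!["a@0".to_string()]);

    // With the coalesce in place, the aggregate's child has one partition.
    let with_coalesce = Partitioning::Unknown(1);
    // Without it, the aggregate sees the union's two unkeyed partitions.
    let without_coalesce = Partitioning::Unknown(2);

    assert!(with_coalesce.satisfies(&required));
    assert!(!without_coalesce.satisfies(&required));
    assert!(with_coalesce.satisfies(&Distribution::SinglePartition));
}
```

In this model the coalesce is what lets the `SinglePartitioned` aggregate's requirement hold. Once it is removed, the child reports two partitions with no known key, which matches the `UnknownPartitioning(2)` in the error message.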
   
   ### To Reproduce
   
   A test case demonstrates this: 
https://github.com/apache/datafusion/pull/14637/commits/670eff35bce04efdc163ce7823437691aa9f29f6
   
   ### Expected behavior
   
   EnforceSorting should not take a valid plan and make it invalid, which then 
fails the planning sanity check.
   
   ### Additional context
   
   We already have a proposed solution: 
https://github.com/apache/datafusion/pull/14637
   
   While debugging, I did a minor refactor to `parallelize_sorts` and its helper 
`remove_bottleneck_in_subplan`. The reason for the refactor ([also summarized 
here](https://github.com/apache/datafusion/pull/14637#discussion_r1957023902)) 
was that I noticed a pattern of several necessary nodes being removed -- and 
then added back later. I elected to simplify the code (IMO) by tightening up 
how we build the `PlanWithCorrespondingCoalescePartitions`, so that we 
correctly identify which nodes should be removed in the first place, instead of 
removing them and then adding them back. The refactor is isolated in this commit: 
https://github.com/apache/datafusion/pull/14637/commits/0661ed7e8934e7f2a711416b85cbafde2a7b99e2


