gene-bordegaray commented on issue #19849:
URL: https://github.com/apache/datafusion/issues/19849#issuecomment-3762169592

   > * partial `AggregateExec` declared an incorrect output partitioning (not 
new)
   > * 
   Yes, in the aggregate here: 
   
   
https://github.com/apache/datafusion/blob/1ab7e4184e391d1950fd2cfb341c4e79a5aa9bf1/datafusion/physical-plan/src/aggregates/mod.rs#L876
 
   
   it incorrectly assumed that the first stage aggregation would not change the 
output partitioning. This is not true in the case of group sets due to the NULL 
extension like the example I went into above.
   
   > * prior to [feat: hash partitioning satisfies subset 
#19304](https://github.com/apache/datafusion/pull/19304), this bug was hidden 
because the plans always repartitioned the output anyways.
   
   Yes, because they were not an exact match after the rollup, it would always 
repartition, thus never saw this issue.
   
   > * After  [feat: hash partitioning satisfies subset 
#19304](https://github.com/apache/datafusion/pull/19304), the bug caused 
incorrect answers because a RepartitioningExec was (incorrectly) removed based 
on the incorrect output partitioning
   
   Yes, we now assumed we were partitioned correctly but was not in this case.
   
   **Example:**
   1. Setup - We are Hash partitioned on (A, B)
   P0: (A=1, B=10, val=5)
   P1: (A=2, B=10, val=7)
   
   2. Rollup where A becomes NULL
   P0:  
   - (A=1, B=10, gid=0) sum=5
   - (A=NULL, B=10, gid=1) sum=5
   P1:
   - (A=2, B=10, gid=0) sum=7
   - (A=NULL, B=10, gid=1) sum=7   <-- same key as P0’s rollup!
   
   3. With vs Without Hash Satisfaction
   *Without Subset Satisfaction (No bug):*
   - The final aggregation now reuires a repartition on (A, B, gid) because we 
do not have an exact match!
   - Thus, (A=NULL, B=10, gid=1) keys land in the same partition
   - Correct results
   
   *With Subset Satisfaction (No bug):*
   - The final aggregation does not require a repartition on (A, B, gid) 
because our input was partitioned on (A, B) and this satisfies
   - Thus, (A=NULL, B=10, gid=1) rows remain in different partitions although 
they have the say key
   - Incrorrect results
   
   cc: @alamb @NGA-TRAN 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to