peter-toth commented on PR #54330:
URL: https://github.com/apache/spark/pull/54330#issuecomment-3992828703

   > 1. `isGrouped` recomputes on every `.copy()`
   
   Good idea! Fixed in 
https://github.com/apache/spark/pull/54330/commits/853e6b72fcdb7e528f2a394c7d658091facbf9f1
 and 
https://github.com/apache/spark/pull/54330/commits/94054cb7c9029c0aa5120ffe5b7283aeebb725b2.
   
   > 2. `toGrouped` may not preserve sorted order
   
   Ok, done in 
https://github.com/apache/spark/pull/54330/commits/f5baf7670b6cba639b9b032586d161d7462f8efc.
   
   > 3. `GroupPartitionsExec.outputPartitioning` applies `transform` without 
verifying invariant
   > 4. `GroupPartitionsExec` has no `doCanonicalize` override
   > 6 . `equals`/`hashCode` on `KeyedPartitioning` allocates wrappers per key 
on every call
   
   I think these 3 comes down to how we store partitions keys in 
`KeyedPartitioning`. In 
https://github.com/apache/spark/pull/54330/commits/174fe90dbf27cbff2523d804ba8f2087bfd936e4
 I changed it to `partitionKeys: Seq[InternalRowComparableWrapper]`, which made 
the code much cleaner at many places, simplified `equals`/`hashCode`, made 
`doCanonicalize` unnecessary. Interestingly, it also helped to identify a few 
fields that should be transient because `InternalRowComparableWrapper` is not 
serializable, but IMO we shoudn't send the keys (maybe not even any 
`Partitioning`s) to executors.
   
   > 5. `GroupPartitionsExec` doesn't handle columnar execution
   
   Good catch! I totally forgot about that. Added in 
https://github.com/apache/spark/pull/54330/commits/1b6bb2988a4ae9fca5f3050bf999be43a7ffd0ca.
   
   > 7. Dead variable in `ShuffleExchangeExec`
   
   Fixed in 
https://github.com/apache/spark/pull/54330/commits/fa432914e4981ea15fbe9d3463c2adcc16d05593.
   
   > 8. `applyGroupPartitions` doesn't handle `GroupPartitionsExec` nested 
under unary nodes
   
   Added documentation in 
https://github.com/apache/spark/pull/54330/commits/b43017d1736b20f20c9d27135ec2259d1fedc2ab.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to