Kontinuation opened a new pull request, #1568:
URL: https://github.com/apache/datafusion-comet/pull/1568

   ## Which issue does this PR close?
   
   Closes #1567.
   
   ## Rationale for this change
   
   Shuffle file deletion is handled by the 
[`unregisterShuffle`](https://github.com/apache/datafusion-comet/blob/0.7.0/spark/src/main/scala/org/apache/spark/sql/comet/execution/shuffle/CometShuffleManager.scala#L214-L221)
 method of the shuffle manager. `CometShuffleManager` uses a map, 
[`taskIdMapsForShuffle`](https://github.com/apache/datafusion-comet/blob/0.7.0/spark/src/main/scala/org/apache/spark/sql/comet/execution/shuffle/CometShuffleManager.scala#L58-L61),
 to keep track of which shuffle files to delete for a given task. This map is 
only updated when 
[`getWriter`](https://github.com/apache/datafusion-comet/blob/0.7.0/spark/src/main/scala/org/apache/spark/sql/comet/execution/shuffle/CometShuffleManager.scala#L175-L184)
 is called.
   
   In JVM shuffle mode, shuffle writers are obtained by calling the `getWriter` 
method of `CometShuffleManager`, which updates the map to record which shuffle 
file was created for a task, so `unregisterShuffle` works correctly in 
this case. However, native mode uses a [custom shuffle write 
processor](https://github.com/apache/datafusion-comet/blob/0.7.0/spark/src/main/scala/org/apache/spark/sql/comet/execution/shuffle/CometShuffleExchangeExec.scala#L236)
 that does not call the `getWriter` method when performing shuffle 
writes. As a result, shuffle files written in native mode are not tracked by 
`taskIdMapsForShuffle` and are not deleted when `unregisterShuffle` is called.
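   
   The tracking gap described above can be illustrated with a minimal, self-contained sketch. This is a toy model and not the actual Comet code: `ToyShuffleManager`, `filesOnDisk`, and `writeBypassingGetWriter` are hypothetical names, and shuffle files are represented as `(shuffleId, mapTaskId)` pairs instead of real files. Only the names `taskIdMapsForShuffle`, `getWriter`, and `unregisterShuffle` are taken from the description.

```scala
import scala.collection.mutable

// Hypothetical toy model of the tracking logic; not the real CometShuffleManager.
final class ToyShuffleManager {
  // shuffleId -> ids of the map tasks whose files the manager knows about
  private val taskIdMapsForShuffle = mutable.Map.empty[Int, mutable.Set[Long]]

  // (shuffleId, mapTaskId) pairs standing in for shuffle files on disk
  val filesOnDisk: mutable.Set[(Int, Long)] = mutable.Set.empty[(Int, Long)]

  // Tracked path: the task is recorded before a writer is handed out.
  // The returned function stands in for the writer's write call.
  def getWriter(shuffleId: Int, mapTaskId: Long): () => Unit = {
    taskIdMapsForShuffle.getOrElseUpdate(shuffleId, mutable.Set.empty[Long]) += mapTaskId
    () => { filesOnDisk += ((shuffleId, mapTaskId)); () }
  }

  // Untracked path: a file is written without going through getWriter,
  // so the manager never learns about it.
  def writeBypassingGetWriter(shuffleId: Int, mapTaskId: Long): Unit = {
    filesOnDisk += ((shuffleId, mapTaskId))
    ()
  }

  // Deletes only the files of tasks that were recorded in the map.
  def unregisterShuffle(shuffleId: Int): Unit =
    taskIdMapsForShuffle.remove(shuffleId).foreach { taskIds =>
      taskIds.foreach(id => filesOnDisk -= ((shuffleId, id)))
    }
}
```

   In this model, a file written through `getWriter(0, 1L)` is removed by `unregisterShuffle(0)`, while a file written through `writeBypassingGetWriter(0, 2L)` is left behind, which mirrors the leak reported for native mode.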
   
   ## What changes are included in this PR?
   
   This PR refactors the native shuffle writer to implement 
`org.apache.spark.shuffle.ShuffleWriter`, so that all shuffle writers are 
created by calling the `getWriter` method of `CometShuffleManager`. 
`CometShuffleManager` can now keep track of the shuffle files of all tasks 
and delete them on unregistration.
   
   ## How are these changes tested?
   
   Added a unit test.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

