Kontinuation opened a new pull request, #1568: URL: https://github.com/apache/datafusion-comet/pull/1568
## Which issue does this PR close?

Closes #1567.

## Rationale for this change

Shuffle file deletion is handled by the [`unregisterShuffle`](https://github.com/apache/datafusion-comet/blob/0.7.0/spark/src/main/scala/org/apache/spark/sql/comet/execution/shuffle/CometShuffleManager.scala#L214-L221) method of the shuffle manager. `CometShuffleManager` uses a map, [`taskIdMapsForShuffle`](https://github.com/apache/datafusion-comet/blob/0.7.0/spark/src/main/scala/org/apache/spark/sql/comet/execution/shuffle/CometShuffleManager.scala#L58-L61), to keep track of which shuffle files to delete for a given task. This map is only updated when [`getWriter`](https://github.com/apache/datafusion-comet/blob/0.7.0/spark/src/main/scala/org/apache/spark/sql/comet/execution/shuffle/CometShuffleManager.scala#L175-L184) is called.

In JVM shuffle mode, shuffle writers are obtained by calling the `getWriter` method of `CometShuffleManager`, so the map records which shuffle file was created for each task and `unregisterShuffle` works correctly.

In native mode, however, we use a [custom shuffle write processor](https://github.com/apache/datafusion-comet/blob/0.7.0/spark/src/main/scala/org/apache/spark/sql/comet/execution/shuffle/CometShuffleExchangeExec.scala#L236) that does not call `getWriter` when performing shuffle writes. Shuffle files written in native mode are therefore not tracked by `taskIdMapsForShuffle` and are not deleted when `unregisterShuffle` is called.

## What changes are included in this PR?

This PR refactors the native shuffle writer to implement `org.apache.spark.shuffle.ShuffleWriter` and makes all shuffle writers be created through the `getWriter` method of `CometShuffleManager`. `CometShuffleManager` can now keep track of the shuffle files of all tasks and delete them on unregistration.

## How are these changes tested?

Added a unit test.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org
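
The tracking gap described under "Rationale for this change" can be illustrated with a toy model. This is only a sketch under stated assumptions, not Comet's actual code: `ToyShuffleManager`, `files`, and `writeUntracked` are hypothetical names, the "files" are entries in an in-memory set, and the method signatures are simplified. Only the names `getWriter`, `unregisterShuffle`, and `taskIdMapsForShuffle` come from the description above.

```scala
import scala.collection.mutable

// Toy stand-in for a shuffle manager. NOT Comet's implementation;
// single-threaded and in-memory, for illustration only.
class ToyShuffleManager {
  // shuffleId -> map task ids whose files should be deleted on unregistration
  private val taskIdMapsForShuffle = mutable.Map.empty[Int, mutable.Set[Long]]

  // Hypothetical "disk": each (shuffleId, mapTaskId) pair represents one shuffle file
  val files: mutable.Set[(Int, Long)] = mutable.Set.empty

  // Obtaining a writer is the only place where a task gets recorded.
  // The returned function plays the role of the writer's write call.
  def getWriter(shuffleId: Int, mapTaskId: Long): () => Unit = {
    taskIdMapsForShuffle.getOrElseUpdate(shuffleId, mutable.Set.empty[Long]) += mapTaskId
    () => files += ((shuffleId, mapTaskId))
  }

  // Hypothetical write path that bypasses getWriter, so nothing is recorded
  def writeUntracked(shuffleId: Int, mapTaskId: Long): Unit =
    files += ((shuffleId, mapTaskId))

  // Deletes only the files of tasks that were recorded for this shuffle
  def unregisterShuffle(shuffleId: Int): Unit =
    taskIdMapsForShuffle.remove(shuffleId).foreach { ids =>
      ids.foreach(id => files -= ((shuffleId, id)))
    }
}

object ToyShuffleDemo {
  def main(args: Array[String]): Unit = {
    val manager = new ToyShuffleManager
    manager.getWriter(0, 1L)()      // tracked write
    manager.writeUntracked(0, 2L)   // write that bypasses getWriter
    manager.unregisterShuffle(0)
    // The tracked file is gone; the untracked one, (0, 2), is left behind.
    println(manager.files)
  }
}
```

In this model, routing every write through `getWriter` is enough to make `unregisterShuffle` clean up all files, which mirrors the approach the PR describes.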