Kontinuation opened a new issue, #1567:
URL: https://github.com/apache/datafusion-comet/issues/1567
### Describe the bug

Running TPC-H SF=100 on a single node repeatedly will eventually run out of disk space when native or auto shuffle mode is enabled. The shuffle files generated while running the queries never get deleted. Setting `spark.cleaner.periodicGC.interval=60s` or manually triggering a driver GC does not help.

This problem only happens when `spark.comet.exec.shuffle.mode` is `native` or `auto`. It does not happen when the shuffle mode is `jvm`.

### Steps to reproduce

Running [tpcbench.py](https://github.com/apache/datafusion-benchmarks/blob/main/runners/datafusion-comet/tpcbench.py) with `--iterations 10` consumes hundreds of gigabytes of disk space. Here is an example:

```
spark-submit \
    --master local[8] \
    --conf spark.driver.memory=3g \
    --conf spark.memory.offHeap.enabled=true \
    --conf spark.memory.offHeap.size=16g \
    --conf spark.cleaner.periodicGC.interval=60s \
    --conf spark.jars=/path/to/comet-spark-spark3.5_2.12-0.8.0-SNAPSHOT.jar \
    --conf spark.driver.extraClassPath=/path/to/comet-spark-spark3.5_2.12-0.8.0-SNAPSHOT.jar \
    --conf spark.executor.extraClassPath=/path/to/comet-spark-spark3.5_2.12-0.8.0-SNAPSHOT.jar \
    --conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
    --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
    --conf spark.comet.enabled=true \
    --conf spark.comet.exec.shuffle.enabled=true \
    --conf spark.comet.exec.shuffle.mode=native \
    --conf spark.comet.exec.shuffle.fallbackToColumnar=true \
    --conf spark.comet.exec.shuffle.compression.codec=lz4 \
    --conf spark.comet.exec.replaceSortMergeJoin=true \
    tpcbench.py \
    --benchmark tpch \
    --data /path/to/tpch/sf100_parquet \
    --queries ../../tpch/queries \
    --output tpc-results \
    --iterations 10
```

### Expected behavior

Unused shuffle files should be deleted when GC is triggered on the Spark driver.

### Additional context

_No response_

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

For additional commands, e-mail: github-h...@datafusion.apache.org