[I] Shuffle files written by native CometExchange operator cannot be cleaned [datafusion-comet]

via GitHub Mon, 24 Mar 2025 09:03:30 -0700

Kontinuation opened a new issue, #1567:
URL: https://github.com/apache/datafusion-comet/issues/1567

   ### Describe the bug
   
   Running TPC-H SF=100 on a single node repeatedly will eventually run out of 
disk space when native or auto shuffle mode is enabled. The shuffle files 
generated when running the queries never get deleted. Setting 
`spark.cleaner.periodicGC.interval=60s` or manually triggering driver GC does 
not help.
   
   This problem only happens when `spark.comet.exec.shuffle.mode` is `native` 
or `auto`. It does not happen when shuffle mode is `jvm`.
   
   ### Steps to reproduce
   
   Running 
[tpcbench.py](https://github.com/apache/datafusion-benchmarks/blob/main/runners/datafusion-comet/tpcbench.py)
 with `--iterations 10` takes hundreds of gigabytes of disk space. Here is an 
example:
   
   ```
   spark-submit \
           --master local[8] \
           --conf spark.driver.memory=3g \
           --conf spark.memory.offHeap.enabled=true \
           --conf spark.memory.offHeap.size=16g \
           --conf spark.cleaner.periodicGC.interval=60s \
           --conf spark.jars=/path/to/comet-spark-spark3.5_2.12-0.8.0-SNAPSHOT.jar \
           --conf spark.driver.extraClassPath=/path/to/comet-spark-spark3.5_2.12-0.8.0-SNAPSHOT.jar \
           --conf spark.executor.extraClassPath=/path/to/comet-spark-spark3.5_2.12-0.8.0-SNAPSHOT.jar \
           --conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
           --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
           --conf spark.comet.enabled=true \
           --conf spark.comet.exec.shuffle.enabled=true \
           --conf spark.comet.exec.shuffle.mode=native \
           --conf spark.comet.exec.shuffle.fallbackToColumnar=true \
           --conf spark.comet.exec.shuffle.compression.codec=lz4 \
           --conf spark.comet.exec.replaceSortMergeJoin=true \
           tpcbench.py \
           --benchmark tpch \
           --data /path/to/tpch/sf100_parquet \
           --queries ../../tpch/queries \
           --output tpc-results \
           --iterations 10
   ```
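
To watch the growth between iterations, one option is to total the shuffle files under Spark's local directory while the benchmark runs. The sketch below is not part of the original report. It assumes the files follow Spark's usual `shuffle_<shuffleId>_<mapId>_<reduceId>.data` / `.index` naming and live somewhere under the directory configured by `spark.local.dir`. Whether Comet's native shuffle uses exactly these names is an assumption, and `shuffle_usage` is a hypothetical helper name.

```python
import os
import re

# Assumption: shuffle outputs use Spark's conventional file names,
# e.g. shuffle_12_345_0.data and shuffle_12_345_0.index.
SHUFFLE_FILE = re.compile(r"^shuffle_\d+_\d+_\d+\.(data|index)$")


def shuffle_usage(local_dir):
    """Return (file_count, total_bytes) for shuffle files under local_dir."""
    count = 0
    total = 0
    for dirpath, _dirnames, filenames in os.walk(local_dir):
        for name in filenames:
            if not SHUFFLE_FILE.match(name):
                continue
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                # The file disappeared between listing and stat,
                # for example because it was cleaned up; skip it.
                continue
            count += 1
    return count, total
```

Calling `shuffle_usage("/tmp")` (or whatever `spark.local.dir` points to) after each iteration and comparing the totals for `native` and `jvm` shuffle modes should show whether the files accumulate.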
   
   
   ### Expected behavior
   
   Unused shuffle files should be deleted when GC is triggered on the Spark driver.
   
   ### Additional context
   
   _No response_


