Hi all,

I raised this question in a different thread but did not get a response
there, so I am starting a new thread in the hope of catching someone's
attention.

We are seeing shuffle files being left behind and not cleaned up by Spark.
Since this is a Spark Streaming application, it is expected to stay up
indefinitely, so shuffle files not being cleaned up is a big problem right
now. Our max window size is 6 hours, so we have set up a cron job to delete
shuffle files older than 12 hours; otherwise they will eat up all our disk
space.
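
For anyone who wants to try the same workaround, here is a minimal sketch of
that kind of cleanup. It is not our exact script: the function name and the
directory passed to it are placeholders, and it assumes the shuffle files
live under whatever directory spark.local.dir points to on each worker.

```python
import os
import time

# 12 hours, i.e. twice the 6-hour max window mentioned above.
MAX_AGE_SECONDS = 12 * 60 * 60


def remove_stale_files(root, max_age_seconds=MAX_AGE_SECONDS, now=None):
    """Delete regular files under `root` last modified before the cutoff.

    `root` is a placeholder for the Spark local directory (an assumption).
    Returns the list of paths that were removed.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_seconds
    removed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getmtime(path) < cutoff:
                    os.remove(path)
                    removed.append(path)
            except FileNotFoundError:
                # The file vanished between listing and removal; skip it.
                continue
    return removed
```

A cron entry would then call something like
remove_stale_files("/path/to/spark/local/dir") on each worker, with the path
replaced by the real local directory. The age threshold has to stay
comfortably above the largest window, otherwise files that are still needed
could be deleted.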

Please see the following links. It seems that the lack of shuffle file
cleanup is being documented in 1.3.1.

https://github.com/apache/spark/pull/5074/files
https://issues.apache.org/jira/browse/SPARK-5836

Also, for some reason, the following JIRAs, which were reported as
functional issues, were closed as duplicates of the above documentation
bug. Does this mean that the underlying issue won't be tackled at all?

https://issues.apache.org/jira/browse/SPARK-3563
https://issues.apache.org/jira/browse/SPARK-4796
https://issues.apache.org/jira/browse/SPARK-6011

Any further insight into whether this is being looked into, and into how to
handle shuffle files in the meantime, would be greatly appreciated.

Thanks
NB
