Hi,
We set spark.cleaner.ttl to some reasonable time and also set
spark.streaming.unpersist=true. Together, those cleaned up the shuffle
files for us.

-Conor

On Tue, Apr 21, 2015 at 8:18 AM, N B <nb.nos...@gmail.com> wrote:

> We already do have a cron job in place to clean just the shuffle files.
> However, what I would really like to know is whether there is a "proper"
> way of telling Spark to clean up these files once it's done with them?
>
> Thanks
> NB
>
>
> On Mon, Apr 20, 2015 at 10:47 AM, Jeetendra Gangele <gangele...@gmail.com>
> wrote:
>
>> Write a cron job for this like below
>>
>> 12 * * * * find $SPARK_HOME/work -cmin +1440 -prune -exec rm -rf {} \+
>> 32 * * * * find /tmp -type d -cmin +1440 -name "spark-*-*-*" -prune -exec rm -rf {} \+
>> 52 * * * * find $SPARK_LOCAL_DIR -mindepth 1 -maxdepth 1 -type d -cmin +1440 -name "spark-*-*-*" -prune -exec rm -rf {} \+
>>
>>
>> On 20 April 2015 at 23:12, N B <nb.nos...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I had posed this query as part of a different thread but did not get a
>>> response there. So I am creating a new thread, hoping to catch someone's
>>> attention.
>>>
>>> We are experiencing this issue of shuffle files being left behind and
>>> not being cleaned up by Spark. Since this is a Spark Streaming application,
>>> it is expected to stay up indefinitely, so shuffle files not being cleaned
>>> up is a big problem right now. Our max window size is 6 hours, so we have
>>> set up a cron job to clean up shuffle files older than 12 hours; otherwise
>>> they will eat up all our disk space.
>>>
>>> Please see the following. It seems the non-cleaning of shuffle files is
>>> being documented in 1.3.1.
>>>
>>> https://github.com/apache/spark/pull/5074/files
>>> https://issues.apache.org/jira/browse/SPARK-5836
>>>
>>>
>>> Also, for some reason, the following JIRAs that were reported as
>>> functional issues were closed as Duplicates of the above Documentation bug.
>>> Does this mean that this issue won't be tackled at all?
>>>
>>> https://issues.apache.org/jira/browse/SPARK-3563
>>> https://issues.apache.org/jira/browse/SPARK-4796
>>> https://issues.apache.org/jira/browse/SPARK-6011
>>>
>>> Any further insight into whether this is being looked into and meanwhile
>>> how to handle shuffle files will be greatly appreciated.
>>>
>>> Thanks
>>> NB
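
For reference, a minimal sketch of how the two settings named at the top of
this message might look in conf/spark-defaults.conf. The 43200 value is only
an assumption for illustration (12 hours, expressed in seconds, chosen to sit
comfortably above the 6-hour max window described in the quoted thread); the
thread itself only says "some reasonable time". A TTL shorter than the longest
window would presumably risk cleaning data the job still needs.

```
# Assumed value: 12 hours in seconds; pick something longer than your largest window.
spark.cleaner.ttl            43200
# Let Spark Streaming unpersist RDDs it has generated once they are no longer needed.
spark.streaming.unpersist    true
```

The same two properties could also be passed per job, e.g. with
--conf spark.cleaner.ttl=43200 --conf spark.streaming.unpersist=true on the
spark-submit command line.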