Write a cron job for this, like below:

12 * * * *  find $SPARK_HOME/work -cmin +1440 -prune -exec rm -rf {} \+
32 * * * *  find /tmp -type d -cmin +1440 -name "spark-*-*-*" -prune -exec rm -rf {} \+
52 * * * *  find $SPARK_LOCAL_DIR -mindepth 1 -maxdepth 1 -type d -cmin +1440 -name "spark-*-*-*" -prune -exec rm -rf {} \+
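Before wiring entries like these into crontab, it can help to dry-run the same find with -print to see exactly what would be deleted (a sketch, assuming $SPARK_LOCAL_DIR points at your scratch directory; note that -cmin matches on inode change time, so freshly created directories will not qualify until 24 hours pass):

```shell
#!/bin/sh
# Dry run: list Spark scratch directories older than 24h without deleting them.
# Same predicates as the cron entries above, but -print instead of rm -rf.

# Assumption: fall back to a placeholder path if the variable is unset.
SPARK_LOCAL_DIR=${SPARK_LOCAL_DIR:-/tmp/spark-local}

find "$SPARK_LOCAL_DIR" -mindepth 1 -maxdepth 1 -type d \
     -cmin +1440 -name "spark-*-*-*" -prune -print

# Once the listing looks right, swap -print for the destructive action:
#   ... -prune -exec rm -rf {} +
```

Running it by hand once before scheduling avoids discovering a bad glob or an unset variable only after live shuffle files have been removed.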

On 20 April 2015 at 23:12, N B <nb.nos...@gmail.com> wrote:

> Hi all,
>
> I had posed this query as part of a different thread but did not get a
> response there. So creating a new thread hoping to catch someone's
> attention.
>
> We are experiencing this issue of shuffle files being left behind and not
> being cleaned up by Spark. Since this is a Spark streaming application, it
> is expected to stay up indefinitely, so shuffle files not being cleaned up
> is a big problem right now. Our max window size is 6 hours, so we have set
> up a cron job to clean up shuffle files older than 12 hours otherwise it
> will eat up all our disk space.
>
> Please see the following. It seems the non-cleaning of shuffle files is
> now being documented as known behavior in 1.3.1.
>
> https://github.com/apache/spark/pull/5074/files
> https://issues.apache.org/jira/browse/SPARK-5836
>
>
> Also, for some reason, the following JIRAs that were reported as
> functional issues were closed as duplicates of the documentation bug
> above. Does this mean that this issue won't be tackled at all?
>
> https://issues.apache.org/jira/browse/SPARK-3563
> https://issues.apache.org/jira/browse/SPARK-4796
> https://issues.apache.org/jira/browse/SPARK-6011
>
> Any further insight into whether this is being looked into and meanwhile
> how to handle shuffle files will be greatly appreciated.
>
> Thanks
> NB
>
>
