Hi,
I had to set up cron jobs to clean up $SPARK_HOME/work and
$SPARK_LOCAL_DIRS.
Here are the cron lines. Unfortunately they are for *nix machines; I guess
you will have to adapt them significantly for Windows (a rough Python
sketch follows below).
12 * * * * find $SPARK_HOME/work -cmin +1440 -prune -exec rm -rf {} \+
32 * * * * find /tmp -type d -cmin +1440 -name "spark-*-*-*" -prune -exec rm -rf {} \+
52 * * * * find $SPARK_LOCAL_DIR -mindepth 1 -maxdepth 1 -type d -cmin +1440 -name "spark-*-*-*" -prune -exec rm -rf {} \+
They remove directories older than a day.
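Since cron isn't available on Windows, here is a minimal Python sketch of
the same idea that you could run from the Windows Task Scheduler; the path
is the one from your mail, and the one-day threshold mirrors the cron
lines (both are assumptions to adapt):

# cleanup_spark_local.py -- a sketch, not battle-tested: deletes Spark
# temp dirs older than a day. LOCAL_DIR and MAX_AGE are assumptions.
import os
import shutil
import time

LOCAL_DIR = r"C:\temp\spark-temp"  # your SPARK_LOCAL_DIRS from the mail below
MAX_AGE = 24 * 60 * 60             # one day in seconds, same as the cron lines

now = time.time()
for name in os.listdir(LOCAL_DIR):
    path = os.path.join(LOCAL_DIR, name)
    # only touch Spark's own temp dirs, like the -name "spark-*" filter above
    if not (os.path.isdir(path) and name.startswith("spark-")):
        continue
    if now - os.path.getmtime(path) > MAX_AGE:
        shutil.rmtree(path, ignore_errors=True)

Scheduling it hourly should keep the folder bounded.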
The cron jobs have to be set up both on the executors AND on the driver
(the Spark local dir of the driver can be heavily used if you use a lot
of broadcast variables).
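If broadcasts are what fills the driver's local dir, explicitly
unpersisting them when you no longer need them also helps. A minimal
self-contained PySpark sketch (I believe Broadcast.unpersist() is
available from around 1.3; check your version):

# sketch: release a broadcast's blocks explicitly once it is done with
from pyspark import SparkContext

sc = SparkContext(appName="broadcast-cleanup-demo")
lookup = {"a": 1, "b": 2}          # placeholder data
bc = sc.broadcast(lookup)          # copies land in the local dirs too
hits = sc.parallelize(["a", "b", "c"]).map(lambda x: bc.value.get(x)).count()
bc.unpersist()                     # drop the broadcast's copies on the executors
sc.stop()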
I think that in recent versions of Spark, $SPARK_HOME/work is correctly
cleaned up on its own, but adding a cron job won't hurt.
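For standalone mode, I believe that cleanup is controlled by the
spark.worker.cleanup.* properties; something like this in
conf/spark-env.sh on each worker should enable it (appDataTtl is in
seconds, so 86400 keeps one day of application data, matching the cron
above):

SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=86400"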
Guillaume
Does anybody have an answer for this?
Thanks
Ningjun
*From:* Wang, Ningjun (LNG-NPV)
*Sent:* Thursday, April 02, 2015 12:14 PM
*To:* user@spark.apache.org
*Subject:* Is the disk space in SPARK_LOCAL_DIRS cleanned up?
I set SPARK_LOCAL_DIRS to C:\temp\spark-temp. When RDDs are shuffled,
Spark writes to this folder. I found that the disk usage of this folder
keeps increasing quickly, and at a certain point I will run out of disk
space.
I wonder: does Spark clean up the disk space in this folder once the
shuffle operation is done? If not, I need to write a job to clean it up
myself. But how do I know which subfolders can be removed?
Ningjun
--
eXenSa
*Guillaume PITEL, Président*
+33(0)626 222 431
eXenSa S.A.S. <http://www.exensa.com/>
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)184 163 677 / Fax +33(0)972 283 705