And to answer your original question: spark.cleaner.ttl is not safe, for exactly the reason you brought up. The PR Mark linked is intended to provide a much cleaner (and safer) solution.
On Tue, Mar 11, 2014 at 2:01 PM, Mark Hamstra <m...@clearstorydata.com> wrote:

> Actually, TD's work-in-progress is probably more what you want:
> https://github.com/apache/spark/pull/126
>
>
> On Tue, Mar 11, 2014 at 1:58 PM, Michael Allman <m...@allman.ms> wrote:
>
>> Hello,
>>
>> I've been trying to run an iterative Spark job that spills 1+ GB to disk
>> per iteration on a system with limited disk space. I believe there's enough
>> space if Spark would clean up unused data from previous iterations, but as
>> it stands the number of iterations I can run is limited by available disk
>> space.
>>
>> I found a thread on the usage of spark.cleaner.ttl on the old Spark Users
>> Google group here:
>>
>> https://groups.google.com/forum/#!topic/spark-users/9ebKcNCDih4
>>
>> I think this setting may be what I'm looking for; however, the cleaner
>> seems to delete data that's still in use. The effect is that I get bizarre
>> exceptions from Spark complaining about missing broadcast data or
>> ArrayIndexOutOfBounds. When is spark.cleaner.ttl safe to use? Is it
>> supposed to delete in-use data, or is this a bug/shortcoming?
>>
>> Cheers,
>>
>> Michael
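For context, a minimal sketch (not from the thread) of how the TTL-based cleaner is typically enabled via SparkConf. The app name, master, loop body, and numbers are hypothetical; the key point is that spark.cleaner.ttl is a wall-clock cutoff in seconds, so any iteration that runs longer than the TTL risks having its still-referenced shuffle or broadcast data deleted, which matches the errors described above:

    import org.apache.spark.{SparkConf, SparkContext}

    object TtlCleanerSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("ttl-cleaner-sketch")   // hypothetical app name
          .setMaster("local[2]")              // for a self-contained run
          // Purge metadata, shuffle, and broadcast data older than 1 hour.
          // This is time-based, not reference-based, hence the risk above.
          .set("spark.cleaner.ttl", "3600")

        val sc = new SparkContext(conf)

        // Hypothetical iterative job: each iteration must finish well within
        // the TTL window, or data it still depends on may be cleaned mid-run.
        var data = sc.parallelize(1 to 1000000)
        for (_ <- 1 to 10) {
          data = data.map(_ + 1).cache()
          data.count() // force materialization each iteration
        }

        sc.stop()
      }
    }

By contrast, the work in https://github.com/apache/spark/pull/126 aims to clean up data based on whether it is still referenced rather than on a fixed time-to-live, which is why it is the safer answer to the original question.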