Hi Keith,
I don't think we keep such references.
We do, however, hit exceptions during job execution that we catch and
retry (timeouts/network issues from different data sources).
Could these affect RDD cleanup?
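For context, our retry logic is roughly the following (a simplified
sketch; `runWithRetry` and the attempt count are illustrative, not our
exact code):

  import scala.util.control.NonFatal

  // Catch transient failures (timeouts, network issues) and rerun the job.
  def runWithRetry[T](attemptsLeft: Int)(job: => T): T =
    try job
    catch {
      case NonFatal(e) if attemptsLeft > 1 =>
        runWithRetry(attemptsLeft - 1)(job)
    }

  // e.g. runWithRetry(3) { runJobAgainstDataSource() }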
Thanks,
Alex
On Sun, Jul 21, 2019 at 10:49 PM Keith Chapman wrote:
Hi Alex,
Shuffle files in Spark are deleted when the object holding a reference to
the shuffle file on disk goes out of scope (i.e., is garbage collected by
the JVM). Could it be that you are keeping these objects alive?
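For example, a pattern like the following on the driver would keep
shuffle files alive indefinitely (a minimal sketch; `completedRdds` and
`runJob` are made-up names):

  import org.apache.spark.rdd.RDD
  import scala.collection.mutable.ListBuffer

  // Anti-pattern: a long-lived driver-side collection holding RDDs from
  // finished jobs. While these RDDs stay reachable, the JVM never
  // garbage-collects them, so ContextCleaner never deletes their
  // shuffle files from disk.
  val completedRdds = ListBuffer.empty[RDD[_]]

  def runJob(input: RDD[(String, Int)]): Unit = {
    val result = input.reduceByKey(_ + _) // creates a shuffle
    result.count()
    completedRdds += result // <-- keeps the shuffle alive forever
  }

Dropping such references (and letting a driver GC run) is what allows
the cleanup to happen.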
Regards,
Keith.
http://keith-chapman.com
On Sun, Jul 21, 2019, Alex wrote:
Thanks,
I looked into these options; the cleaner's periodic GC interval is set to
30 min by default.
The blocking option for shuffle,
*spark.cleaner.referenceTracking.blocking.shuffle*, is set to false by
default.
What are the implications of setting it to true?
Will it make the driver slower?
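For reference, the effective values can be checked from an active
SparkContext `sc` (a sketch; the defaults shown are the documented ones):

  println(sc.getConf.get("spark.cleaner.periodicGC.interval", "30min"))
  println(sc.getConf.get(
    "spark.cleaner.referenceTracking.blocking.shuffle", "false"))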
Thanks,
A
This is the job of ContextCleaner. There are a few properties you can tweak
to see if that helps (a sketch of setting them follows the list):
spark.cleaner.periodicGC.interval
spark.cleaner.referenceTracking
spark.cleaner.referenceTracking.blocking.shuffle
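For example, a minimal sketch of setting them when building the context
(the values are illustrative, not recommendations):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("long-running-app")
    // Run the cleaner's periodic driver GC more often than the 30 min default.
    .set("spark.cleaner.periodicGC.interval", "15min")
    // Track references so shuffles/RDDs/broadcasts can be cleaned up (default true).
    .set("spark.cleaner.referenceTracking", "true")
    // Make shuffle cleanup block the cleaner thread instead of running async.
    .set("spark.cleaner.referenceTracking.blocking.shuffle", "true")
  val sc = new SparkContext(conf)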
Regards
Prathmesh Ranaut
> On Jul 21, 2019, at 11:36 AM, Prathmesh Ranaut wrote:
Hi,
We are running a long-running Spark application (which executes lots of
quick jobs using our scheduler) on a Spark standalone cluster, version 2.4.0.
We see that old shuffle files (a week old, for example) are not deleted
while the application is running, which eventually leads to out-of-disk-space
errors.
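Roughly, the shape of the application is the following (a simplified
stand-in for our real scheduler and jobs):

  import org.apache.spark.{SparkConf, SparkContext}

  object LongRunningApp {
    def main(args: Array[String]): Unit = {
      // One SparkContext stays up for weeks; many short jobs run on it.
      val sc = new SparkContext(new SparkConf().setAppName("long-running-app"))

      while (true) {
        // Each job shuffles, leaving shuffle files on the workers' disks.
        val groups = sc.parallelize(1 to 1000000)
          .map(i => (i % 100, 1))
          .reduceByKey(_ + _)
          .count()
        println(s"job finished, groups: $groups")
        Thread.sleep(60 * 1000) // wait for the next scheduled job
      }
    }
  }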