To: Patrick Wendell <pwend...@gmail.com>
Cc: user@spark.apache.org
Subject: Re: Long-running job cleanup
Hi Patrick, to follow up on the discussion below, I am including a short code
snippet that produces this behavior.
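(The snippet itself did not survive in the archive; the sketch below is a
hypothetical reconstruction of a job of the shape described, a loop of shuffle
stages whose intermediate RDDs are cached and retained, so shuffle files and
per-stage broadcast metadata pile up. Names and sizes are made up.)

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD implicits for Spark < 1.3
import org.apache.spark.rdd.RDD
import scala.collection.mutable.ArrayBuffer

val sc = new SparkContext(new SparkConf().setAppName("metadata-growth-sketch"))
val base = sc.parallelize(1 to 10000000).map(x => (x % 1000, 1L)).cache()

// Retaining every intermediate RDD pins its shuffle output, so the driver
// can never garbage-collect the associated shuffle metadata.
val stages = ArrayBuffer.empty[RDD[(Int, Long)]]
var current = base
for (i <- 1 to 10000) {
  current = current.reduceByKey(_ + _).cache() // each iteration adds a shuffle stage
  stages += current
  current.count()                              // force materialization
}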
Cc: user@spark.apache.org
Subject: Re: Long-running job cleanup
Hi Patrick - is that cleanup present in 1.1?
The overhead I am talking about concerns what I believe is shuffle-related
metadata. If I watch the execution log I see small broadcast variables
created for every stage of execution, a few KB at a time, and a certain
number of MB remaining of...
What do you mean when you say "the overhead of Spark shuffles starts to
accumulate"? Could you elaborate more?
In newer versions of Spark, shuffle data is cleaned up automatically
when an RDD goes out of scope. It is safe to remove shuffle data at
this point because the RDD can no longer be referenced.
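To make "goes out of scope" concrete, here is a minimal sketch (not from the
original thread; it assumes Spark 1.x, where the driver-side ContextCleaner
removes shuffle files once the RDD that produced them is no longer reachable
on the driver):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD implicits for Spark < 1.3

val sc = new SparkContext(new SparkConf().setAppName("shuffle-cleanup-sketch"))
var grouped = sc.parallelize(1 to 1000000).map(x => (x % 100, 1L))
  .reduceByKey(_ + _).cache()
grouped.count()                    // materialize: shuffle files now exist

grouped.unpersist(blocking = true) // release the cached blocks explicitly
grouped = null                     // drop the last driver-side reference
System.gc()                        // a driver GC lets the ContextCleaner
                                   // remove the now-unreferenced shuffle data

Note that unpersist frees the cached blocks immediately, while the shuffle
files themselves go away only after the driver garbage-collects the dropped
reference.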
Hello all - can anyone please offer any advice on this issue?
-Ilya Ganelin
On Mon, Dec 22, 2014 at 5:36 PM, Ganelin, Ilya wrote:
> Hi all, I have a long-running job iterating over a huge dataset. Parts of
> this operation are cached. Since the job runs for so long, eventually the
> overhead of Spark shuffles starts to accumulate...
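One pattern that keeps this overhead bounded in an iterative job is to
release each intermediate RDD as soon as its successor is materialized,
sketched below under the same Spark 1.x assumptions as above (the input
path is hypothetical):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD implicits for Spark < 1.3
import org.apache.spark.rdd.RDD

val sc = new SparkContext(new SparkConf().setAppName("bounded-metadata-sketch"))
var current: RDD[(Int, Long)] =
  sc.textFile("hdfs:///data/huge")               // hypothetical input
    .map(line => (line.hashCode % 1000, 1L))
    .cache()

for (i <- 1 to 1000) {
  val next = current.reduceByKey(_ + _).cache()
  next.count()                        // materialize before releasing the parent
  current.unpersist(blocking = false) // free the old cached blocks
  current = next                      // the old RDD is now unreferenced, so its
                                      // shuffle metadata is eligible for cleanup
}

If the growing lineage itself becomes a problem, periodically checkpointing
current (after sc.setCheckpointDir) truncates it as well.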