Hello all - can anyone please offer any advice on this issue? -Ilya Ganelin
On Mon, Dec 22, 2014 at 5:36 PM, Ganelin, Ilya <ilya.gane...@capitalone.com> wrote:
> Hi all, I have a long-running job iterating over a huge dataset. Parts of
> this operation are cached. Because the job runs for so long, the overhead
> of Spark shuffles eventually accumulates, culminating in the driver
> starting to swap.
>
> I am aware of the spark.cleaner.ttl parameter that lets me configure when
> cleanup happens, but the problem is that this cleanup is not done safely:
> I can be in the middle of processing a stage when it fires, and my cached
> RDDs get cleared. That ultimately causes a KeyNotFoundException when I try
> to reference the now-cleared cached RDD. This behavior doesn't make much
> sense to me; I would expect the cached RDD either to be regenerated or, at
> the very least, for there to be an option to run this cleanup without
> deleting those RDDs.
>
> Is there a programmatically safe way of doing this cleanup that doesn't
> break everything?
>
> If I instead tear down the SparkContext and bring up a new context for
> every iteration (assuming each iteration is sufficiently long-lived),
> would memory get released appropriately?
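For reference, this is roughly how the TTL-based cleaner mentioned above is configured in Spark 1.x. A minimal sketch: the application name and TTL value are illustrative, and the cleaner drops old metadata and persisted RDDs purely by age, regardless of whether a running stage still references them, which is the failure mode described in the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative configuration of the TTL-based cleaner (Spark 1.x).
// Metadata and persisted RDDs older than the TTL (in seconds) are forgotten,
// even if a still-running stage references them.
val conf = new SparkConf()
  .setAppName("long-running-iterative-job") // name is illustrative
  .set("spark.cleaner.ttl", "3600")         // TTL in seconds; value is illustrative

val sc = new SparkContext(conf)
```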
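One programmatic alternative (a sketch, not something proposed in this thread) is to unpersist cached RDDs explicitly at points you know are safe, i.e. once an iteration no longer needs them, instead of relying on a global TTL:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Sketch of explicit per-iteration cleanup: cache what the iteration needs,
// force materialization, then unpersist once the iteration is finished.
// `runIteration` and the input path are hypothetical placeholders.
def runIteration(sc: SparkContext, i: Int): Unit = {
  val data: RDD[String] = sc.textFile(s"/data/part-$i") // placeholder input
  val cached = data.persist(StorageLevel.MEMORY_AND_DISK)
  cached.count()                      // materialize the cache
  // ... do the iteration's work with `cached` here ...
  cached.unpersist(blocking = true)   // safe: nothing downstream needs it anymore
}
```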
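As for the last question, stopping a SparkContext should release its executors along with their cached blocks and shuffle files, so a per-iteration restart along these lines ought to free that memory. A sketch only; the object name, iteration count, and runIteration body are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch of the "new context per iteration" idea from the question.
// Each iteration gets a fresh SparkContext and stops it when done, so cached
// blocks and shuffle data from earlier iterations cannot accumulate.
object IterativeJob {
  def runIteration(sc: SparkContext, i: Int): Unit = {
    // placeholder for the real per-iteration work
  }

  def main(args: Array[String]): Unit = {
    val numIterations = 10 // placeholder
    val conf = new SparkConf().setAppName("iterative-job")
    for (i <- 0 until numIterations) {
      val sc = new SparkContext(conf) // fresh context per iteration
      try runIteration(sc, i)
      finally sc.stop()               // releases executors and their cached/shuffle data
    }
  }
}
```

Note that only one SparkContext can be active at a time, so each context is stopped before the next one is created.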