Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



On Mon, Jun 30, 2014 at 8:09 PM, Yana Kadiyska <yana.kadiy...@gmail.com>
wrote:

> Hi,
>
> Our cluster seems to have a really hard time with OOM errors on the
> executor. Periodically we'd see a task get sent to a few executors,
> one would OOM, and then the job just stays active for hours
> (sometimes 30+, whereas normally it completes in under a minute).
>
> So I have a few questions:
>
> 1. Why am I seeing OOMs to begin with?
>
> I'm running with defaults for
> spark.storage.memoryFraction
> spark.shuffle.memoryFraction
>
> so my understanding is that if Spark exceeds 60% of available memory,
> data will be spilled to disk? Am I misunderstanding this? In the
> attached screenshot, I see a single stage with 2 tasks on the same
> executor -- no disk spills but OOM.
>
Make sure spark.shuffle.spill is set to true in your config. As for what is
causing the OOM: it could be that you are simply doing a sortByKey and the
keys are larger than the executor's memory. Can you post the stack trace?
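
For reference, a rough sketch of how those settings can be passed through
SparkConf in 0.9.x; the app name and master URL below are placeholders, and
the fraction values are just the documented defaults, not tuned
recommendations:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch only: app name and master URL are placeholders.
    val conf = new SparkConf()
      .setAppName("oom-debug")
      .setMaster("spark://master:7077")
      // Let shuffle data spill to disk instead of failing with OOM
      // (documented as the default in 0.9.x, but worth verifying).
      .set("spark.shuffle.spill", "true")
      // Heap fraction reserved for cached RDD storage (default 0.6).
      .set("spark.storage.memoryFraction", "0.6")
      // Heap fraction for in-memory shuffle buffers (default 0.3).
      .set("spark.shuffle.memoryFraction", "0.3")
    val sc = new SparkContext(conf)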

>
> 2. How can I reduce the likelihood of seeing OOMs -- I am a bit
> concerned that I don't see a spill at all, so I'm not sure if decreasing
> spark.storage.memoryFraction is what needs to be done
>


>
> 3. Why does an OOM seem to break the executor so hopelessly? I am
> seeing times upwards of 30hrs once an OOM occurs. Why is that -- the
> task *should* take under a minute, so even if the whole RDD was
> recomputed from scratch, 30hrs is very mysterious to me. Hadoop can
> process this in about 10-15 minutes, so I imagine even if the whole
> job went to disk it should still not take more than an hour
>
When an OOM occurs, the RDD may end up spilled to disk, and the retried task
may be forced to read its data back from disk, which causes the overall
slowdown. On top of that, the RDD partitions may be sent to a different
executor to be processed. Are you at least seeing the slow tasks as
PROCESS_LOCAL or NODE_LOCAL?
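
If they show up at lower locality levels (RACK_LOCAL or ANY), one knob worth
checking is the locality wait, which controls how long the scheduler holds a
task waiting for a local slot before shipping it elsewhere. A sketch, with
the value being just the documented 0.9.x default:

    // Milliseconds to wait for a data-local slot before launching
    // the task at a lower locality level (default 3000 in 0.9.x).
    conf.set("spark.locality.wait", "3000")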

>
> Any insight into this would be much appreciated.
> Running Spark 0.9.1
>
