Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>
On Mon, Jun 30, 2014 at 8:09 PM, Yana Kadiyska <yana.kadiy...@gmail.com> wrote:

> Hi,
>
> Our cluster seems to have a really hard time with OOM errors on the
> executor. Periodically we'd see a task that gets sent to a few
> executors, one would OOM, and then the job just stays active for hours
> (sometimes 30+, whereas normally it completes sub-minute).
>
> So I have a few questions:
>
> 1. Why am I seeing OOMs to begin with?
>
> I'm running with defaults for
> spark.storage.memoryFraction
> spark.shuffle.memoryFraction
>
> so my understanding is that if Spark exceeds 60% of available memory,
> data will be spilled to disk? Am I misunderstanding this? In the
> attached screenshot, I see a single stage with 2 tasks on the same
> executor -- no disk spills, but an OOM.

Make sure spark.shuffle.spill is set to true in your config. As for what is
causing you to OOM: it could be that you are simply doing a sortByKey and
the keys are bigger than the executor's memory. Can you post the stack
trace? (A quick sketch of the relevant settings is below the quoted
message.)

> 2. How can I reduce the likelihood of seeing OOMs? I am a bit concerned
> that I don't see a spill at all, so I'm not sure whether decreasing
> spark.storage.memoryFraction is what needs to be done.
>
> 3. Why does an OOM seem to break the executor so hopelessly? I am
> seeing times upwards of 30 hrs once an OOM occurs. Why is that -- the
> task *should* take under a minute, so even if the whole RDD were
> recomputed from scratch, 30 hrs is very mysterious to me. Hadoop can
> process this in about 10-15 minutes, so I imagine even if the whole
> job went to disk it should still not take more than an hour.

When an OOM occurs it can cause the RDD to spill to disk, and the retried
task may be forced to read the data back from disk, causing the overall
slowdown -- not to mention the RDD may be sent to a different executor to
be processed. Are you seeing the slow tasks as process_local, or at least
node_local?

> Any insight into this would be much appreciated.
> Running Spark 0.9.1
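For reference, a minimal sketch of setting those properties programmatically
(the values shown are the 0.9.x defaults as far as I know; the app name is
just a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("oom-debug")                     // placeholder app name
      .set("spark.shuffle.spill", "true")          // let shuffle buffers spill to disk (default: true)
      .set("spark.shuffle.memoryFraction", "0.3")  // heap fraction for in-memory shuffle buffers
      .set("spark.storage.memoryFraction", "0.6")  // heap fraction for cached RDD blocks
    val sc = new SparkContext(conf)

Keep in mind the two fractions govern separate pools: spilling protects the
shuffle buffers, but a single record larger than the remaining heap will
still OOM before anything can spill.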