Can you elaborate on why "You need to configure the spark.shuffle.spill
true again in the config"? The default for spark.shuffle.spill is
already true according to the docs
(https://spark.apache.org/docs/0.9.1/configuration.html).
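
For reference, a minimal sketch of setting it explicitly anyway, assuming
the driver builds its own SparkConf (the app name below is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    // spark.shuffle.spill is already true by default in 0.9.1; setting it
    // here only makes the intent explicit.
    val conf = new SparkConf()
      .setAppName("spill-config-check")       // placeholder app name
      .set("spark.shuffle.spill", "true")
    val sc = new SparkContext(conf)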

When the OOM occurred, the tasks were process_local, which I understand
is "as good as it gets" locality-wise, but the job has still been going
for 32+ hours.

On Wed, Jul 2, 2014 at 2:40 AM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:
>
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi
>
>
>
> On Mon, Jun 30, 2014 at 8:09 PM, Yana Kadiyska <yana.kadiy...@gmail.com>
> wrote:
>>
>> Hi,
>>
>> Our cluster seems to have a really hard time with OOM errors on the
>> executors. Periodically we'd see a task that gets sent to a few
>> executors, one would OOM, and then the job just stays active for hours
>> (sometimes 30+, whereas normally it completes in under a minute).
>>
>> So I have a few questions:
>>
>> 1. Why am I seeing OOMs to begin with?
>>
>> I'm running with the defaults for
>> spark.storage.memoryFraction
>> spark.shuffle.memoryFraction
>>
>> so my understanding is that if Spark exceeds 60% of available memory,
>> data will be spilled to disk. Am I misunderstanding this? In the
>> attached screenshot, I see a single stage with 2 tasks on the same
>> executor -- no disk spills, but an OOM.
>
> You need to configure spark.shuffle.spill to true again in the config.
> As for what is causing you to OOM: it could be that you are simply doing
> a sortByKey and the keys are bigger than the executor's memory, causing
> the OOM. Can you post the stack trace?
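
If the sort really is the culprit, one mitigation would be to pass a larger
numPartitions to sortByKey so each task holds fewer keys in memory. A rough
sketch (the path, key extraction, and 400 are placeholders, not values from
our job):

    import org.apache.spark.SparkContext._   // pair/ordered RDD implicits in 0.9.x

    // sc: an existing SparkContext; the input path is a placeholder.
    val pairs = sc.textFile("hdfs://namenode/path/to/input")
      .map(line => (line.split('\t')(0), line))   // key on the first tab-separated column
    // More, smaller partitions => each sort task buffers fewer keys at once.
    val sorted = pairs.sortByKey(ascending = true, numPartitions = 400)
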
>>
>>
>> 2. How can I reduce the likelihood of seeing OOMs? I am a bit
>> concerned that I don't see a spill at all, so I'm not sure whether
>> decreasing spark.storage.memoryFraction is what needs to be done.
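
For concreteness, the kind of tuning I have in mind is just this -- a
sketch, where 0.5 and 0.3 are arbitrary placeholders rather than
recommendations:

    import org.apache.spark.SparkConf

    // The 0.9.1 default for spark.storage.memoryFraction is 0.6 (the 60%
    // mentioned above); shrinking it leaves more headroom for task objects.
    val conf = new SparkConf()
      .set("spark.storage.memoryFraction", "0.5")
      .set("spark.shuffle.memoryFraction", "0.3")
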
>
>
>>
>>
>> 3. Why does an OOM seem to break the executor so hopelessly? I am
>> seeing times upwards of 30 hrs once an OOM occurs. Why is that? The
>> task *should* take under a minute, so even if the whole RDD were
>> recomputed from scratch, 30 hrs is very mysterious to me. Hadoop can
>> process this in about 10-15 minutes, so I imagine even if the whole
>> job went to disk it should still not take more than an hour.
>
> When an OOM occurs it could cause the RDD to spill to disk; the repeated
> task may be forced to read data from disk, causing the overall slowdown,
> not to mention that the RDD may be sent to a different executor to be
> processed. Are you seeing the slow tasks as process_local, or at least
> node_local?
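
One mitigation that might be worth testing, sketched here with placeholder
names (it is only an idea, not something suggested above): persist the
input RDD with an explicit disk fallback so a repeated task can read the
persisted blocks instead of recomputing the lineage from scratch.

    import org.apache.spark.storage.StorageLevel

    // sc: an existing SparkContext; the path is a placeholder.
    val input  = sc.textFile("hdfs://namenode/path/to/input")
    val cached = input.persist(StorageLevel.MEMORY_AND_DISK)
    cached.count()   // materialize once; later attempts read persisted blocks
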
>>
>>
>> Any insight into this would be much appreciated.
>> Running Spark 0.9.1
>
>
