Are you aware that you get an executor (and the 1.5GB) per machine, not per core?
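If so, that 1500m is shared by every task running concurrently on a node (up to 24 or 32 with your core counts), which works out to only a few tens of MB of JVM heap per task before any Python overhead. As a minimal sketch, assuming standalone mode, you could also set this from the driver side with SparkConf instead of spark-worker-env.sh (the 6g value and the app name below are purely illustrative, not a recommendation):

from pyspark import SparkConf, SparkContext

# Sketch only -- the values here are placeholders to tune for your cluster.
conf = (SparkConf()
        .setAppName("relational-modeling")       # hypothetical app name
        .set("spark.executor.memory", "6g")      # per executor, i.e. per machine in standalone mode
        .set("spark.cores.max", "256")           # total cores granted to the app across the cluster
        .set("spark.default.parallelism", "1024"))
sc = SparkContext(conf=conf)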
On Tue, Mar 11, 2014 at 12:52 PM, Aaron Olson <aaron.ol...@shopify.com> wrote:

> Hi Sandy,
>
> We're configuring that with the JAVA_OPTS environment variable in
> $SPARK_HOME/spark-worker-env.sh like this:
>
> # JAVA OPTS
> export SPARK_JAVA_OPTS="-Dspark.ui.port=0 -Dspark.default.parallelism=1024
> -Dspark.cores.max=256 -Dspark.executor.memory=1500m
> -Dspark.worker.timeout=500 -Dspark.akka.timeout=500 "
>
> Does that value seem low to you?
>
> -Aaron
>
>
> On Tue, Mar 11, 2014 at 3:08 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>
>> Hi Aaron,
>>
>> When you say "Java heap space is 1.5G per worker, 24 or 32 cores across
>> 46 nodes. It seems like we should have more than enough to do this
>> comfortably.", how are you configuring this?
>>
>> -Sandy
>>
>>
>> On Tue, Mar 11, 2014 at 10:11 AM, Aaron Olson <aaron.ol...@shopify.com> wrote:
>>
>>> Dear Sparkians,
>>>
>>> We are working on a system to do relational modeling on top of Spark,
>>> all done in pyspark. While we've been learning a lot about Spark
>>> internals so far, we're currently running into memory issues and
>>> wondering how best to profile and fix them. Here are our symptoms:
>>>
>>>    - We're operating on data sets of up to 80G of uncompressed JSON,
>>>      66 million records in the largest one.
>>>    - Sometimes we're joining those large data sets, but cardinality
>>>      never exceeds 66 million (unless we've got a bug somewhere).
>>>    - We're seeing various OOM problems: sometimes Python takes all
>>>      available memory, sometimes we OOM with no heap space left, and
>>>      occasionally we OOM with "GC overhead limit exceeded".
>>>    - Sometimes we also see what looks like a single huge message sent
>>>      over the wire that exceeds the wire format limitations.
>>>    - Java heap space is 1.5G per worker, 24 or 32 cores across 46
>>>      nodes. It seems like we should have more than enough to do this
>>>      comfortably.
>>>    - We're trying to isolate specific steps now, but every time it
>>>      errors we're partitioning (i.e. partitionByKey is involved
>>>      somewhere).
>>>
>>> We've been instrumenting according to the monitoring and tuning docs,
>>> but we're a bit at a loss as to where we're going wrong. We suspect
>>> poor/wrong partitioning on our part somehow. With that in mind, some
>>> questions:
>>>
>>>    - How exactly is partitioning information propagated? It looks like
>>>      within a pipelined RDD the parent partitioning is preserved
>>>      throughout unless we either specifically repartition or go through
>>>      a reduce. We're splitting as much as we can on maps and letting
>>>      reduces happen normally. Is that good practice?
>>>    - When doing e.g. partitionByKey, does an entire partition get sent
>>>      to one worker process?
>>>    - When does Spark stream data? Are there easy ways to sabotage the
>>>      streaming? Are there any knobs for us to twiddle here?
>>>    - Is there any way to specify the number of shuffles for a given
>>>      reduce step?
>>>    - How can we get better insight into what our workers are doing,
>>>      specifically around moving data in and out of Python land?
>>>
>>> I realise it's hard to troubleshoot in the absence of code, but any
>>> test case we have would be contrived. We're collecting more metrics and
>>> trying to reason about what might be happening, but any guidance at
>>> this point would be most helpful.
>>>
>>> Thanks!
>>>
>>> --
>>> Aaron Olson
>>> Data Engineer, Shopify
>>
>
>
> --
> Aaron Olson
> Data Engineer, Shopify
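Also, on the partitionByKey question: before tuning anything else it's worth checking whether one partition is wildly bigger than the rest. A quick sketch against the standard pyspark API (the `pairs` RDD name and the 1024 partition count are just placeholders):

# Sketch: look at how records are spread across partitions.
# `pairs` stands in for one of your (key, value) RDDs.
sizes = pairs.glom().map(len).collect()   # record count per partition
print "partitions:", len(sizes), "max:", max(sizes), "min:", min(sizes)

# Spread skewed data over more partitions before a join; 1024 is illustrative.
repartitioned = pairs.partitionBy(1024)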