Hi Sandy,

We're configuring that with the SPARK_JAVA_OPTS environment variable in $SPARK_HOME/spark-worker-env.sh like this:
# JAVA OPTS
export SPARK_JAVA_OPTS="-Dspark.ui.port=0 -Dspark.default.parallelism=1024 -Dspark.cores.max=256 -Dspark.executor.memory=1500m -Dspark.worker.timeout=500 -Dspark.akka.timeout=500 "

Does that value seem low to you? (For reference, a pyspark SparkConf sketch of the same settings is appended after the quoted thread.)

-Aaron

On Tue, Mar 11, 2014 at 3:08 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:

> Hi Aaron,
>
> When you say "Java heap space is 1.5G per worker, 24 or 32 cores across
> 46 nodes. It seems like we should have more than enough to do this
> comfortably.", how are you configuring this?
>
> -Sandy
>
>
> On Tue, Mar 11, 2014 at 10:11 AM, Aaron Olson <aaron.ol...@shopify.com> wrote:
>
>> Dear Sparkians,
>>
>> We are working on a system to do relational modeling on top of Spark, all
>> done in pyspark. While we've been learning a lot about Spark internals so
>> far, we're currently running into memory issues and wondering how best to
>> profile to fix them. Here are our symptoms:
>>
>>    - We're operating on data sets of up to 80G of uncompressed JSON,
>>      66 million records in the largest one.
>>    - Sometimes we're joining those large data sets, but cardinality
>>      never exceeds 66 million (unless we've got a bug somewhere).
>>    - We're seeing various OOM problems: sometimes Python takes all
>>      available memory, sometimes we OOM with no heap space left, and
>>      occasionally we OOM with "GC overhead limit exceeded".
>>    - Sometimes we also see what looks like a single huge message sent
>>      over the wire that exceeds the wire format limitations.
>>    - Java heap space is 1.5G per worker, 24 or 32 cores across 46 nodes.
>>      It seems like we should have more than enough to do this comfortably.
>>    - We're trying to isolate specific steps now, but every time it
>>      errors, we're partitioning (i.e. partitionByKey is involved somewhere).
>>
>> We've been instrumenting according to the monitoring and tuning docs, but
>> we're a bit at a loss as to where we're going wrong. We suspect poor/wrong
>> partitioning on our part somehow. With that in mind, some questions:
>>
>>    - How exactly is partitioning information propagated? It looks like
>>      within a pipelined RDD the parent partitioning is preserved throughout
>>      unless we either specifically repartition or go through a reduce. We're
>>      splitting as much as we can on maps and letting reduces happen
>>      normally. Is that good practice?
>>    - When doing e.g. partitionByKey, does an entire partition get sent
>>      to one worker process?
>>    - When does Spark stream data? Are there easy ways to sabotage the
>>      streaming? Are there any knobs for us to twiddle here?
>>    - Is there any way to specify the number of shuffles for a given
>>      reduce step?
>>    - How can we get better insight into what our workers are doing,
>>      specifically around moving data in and out of Python land?
>>
>> I realise it's hard to troubleshoot in the absence of code, but any test
>> case we have would be contrived. We're collecting more metrics and trying
>> to reason about what might be happening, but any guidance at this point
>> would be most helpful.
>>
>> Thanks!
>>
>> --
>> Aaron Olson
>> Data Engineer, Shopify

--
Aaron Olson
Data Engineer, Shopify
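
For reference, a minimal pyspark sketch of the same settings applied programmatically through SparkConf rather than SPARK_JAVA_OPTS. It assumes Spark 0.9+ (where pyspark exposes SparkConf), and the application name is a placeholder:

from pyspark import SparkConf, SparkContext

# Same properties as the SPARK_JAVA_OPTS block above, set on a SparkConf.
# "relational-modeling" is a placeholder application name.
conf = (SparkConf()
        .setAppName("relational-modeling")
        .set("spark.ui.port", "0")
        .set("spark.default.parallelism", "1024")
        .set("spark.cores.max", "256")
        .set("spark.executor.memory", "1500m")
        .set("spark.worker.timeout", "500")
        .set("spark.akka.timeout", "500"))

sc = SparkContext(conf=conf)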
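
And a second, purely illustrative sketch (reusing the sc from the sketch above) of the explicit partition-count knobs pyspark exposes on the shuffle steps raised in the quoted questions; the paths, keys, and partition counts below are made-up placeholders, not a recommendation:

import json

# Hypothetical pair RDDs keyed by a record id (placeholder paths).
orders = sc.textFile("hdfs:///data/orders.json") \
           .map(lambda line: (json.loads(line)["id"], line))
products = sc.textFile("hdfs:///data/products.json") \
             .map(lambda line: (json.loads(line)["id"], line))

# join and reduceByKey accept an explicit numPartitions, which sets how many
# shuffle partitions the step produces instead of spark.default.parallelism.
joined = orders.join(products, numPartitions=2048)
merged = joined.reduceByKey(lambda a, b: a + b, numPartitions=2048)

# partitionBy hash-partitions a pair RDD up front, so all values for a key
# land in the same partition (and hence the same worker process).
by_id = orders.partitionBy(4096)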