Are you aware that you get an executor (and the 1.5GB) per machine, not per core?
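If so, that 1500m is shared by every task running concurrently on a node (up to 24 or 32 with your core counts), which works out to only a few tens of MB of JVM heap per task before any Python overhead. As a minimal sketch, assuming standalone mode, you could also set this from the driver side with SparkConf instead of spark-worker-env.sh (the 6g value and the app name below are purely illustrative, not a recommendation):

from pyspark import SparkConf, SparkContext

# Sketch only -- the values here are placeholders to tune for your cluster.
conf = (SparkConf()
        .setAppName("relational-modeling")       # hypothetical app name
        .set("spark.executor.memory", "6g")      # per executor, i.e. per machine in standalone mode
        .set("spark.cores.max", "256")           # total cores granted to the app across the cluster
        .set("spark.default.parallelism", "1024"))
sc = SparkContext(conf=conf)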
On Tue, Mar 11, 2014 at 12:52 PM, Aaron Olson <aaron.ol...@shopify.com> wrote:

> Hi Sandy,
>
> We're configuring that with the JAVA_OPTS environment variable in
> $SPARK_HOME/spark-worker-env.sh like this:
>
> # JAVA OPTS
> export SPARK_JAVA_OPTS="-Dspark.ui.port=0 -Dspark.default.parallelism=1024
> -Dspark.cores.max=256 -Dspark.executor.memory=1500m
> -Dspark.worker.timeout=500 -Dspark.akka.timeout=500 "
>
> Does that value seem low to you?
>
> -Aaron
>
>
> On Tue, Mar 11, 2014 at 3:08 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>
>> Hi Aaron,
>>
>> When you say "Java heap space is 1.5G per worker, 24 or 32 cores across
>> 46 nodes. It seems like we should have more than enough to do this
>> comfortably.", how are you configuring this?
>>
>> -Sandy
>>
>>
>> On Tue, Mar 11, 2014 at 10:11 AM, Aaron Olson <aaron.ol...@shopify.com> wrote:
>>
>>> Dear Sparkians,
>>>
>>> We are working on a system to do relational modeling on top of Spark,
>>> all done in pyspark. While we've been learning a lot about Spark
>>> internals so far, we're currently running into memory issues and
>>> wondering how best to profile and fix them. Here are our symptoms:
>>>
>>>    - We're operating on data sets of up to 80G of uncompressed JSON,
>>>      66 million records in the largest one.
>>>    - Sometimes we're joining those large data sets, but cardinality
>>>      never exceeds 66 million (unless we've got a bug somewhere).
>>>    - We're seeing various OOM problems: sometimes Python takes all
>>>      available memory, sometimes we OOM with no heap space left, and
>>>      occasionally we OOM with "GC overhead limit exceeded".
>>>    - Sometimes we also see what looks like a single huge message sent
>>>      over the wire that exceeds the wire format limitations.
>>>    - Java heap space is 1.5G per worker, 24 or 32 cores across 46
>>>      nodes. It seems like we should have more than enough to do this
>>>      comfortably.
>>>    - We're trying to isolate specific steps now, but every time it
>>>      errors we're partitioning (i.e. partitionByKey is involved
>>>      somewhere).
>>>
>>> We've been instrumenting according to the monitoring and tuning docs,
>>> but we're a bit at a loss as to where we're going wrong. We suspect
>>> poor/wrong partitioning on our part somehow. With that in mind, some
>>> questions:
>>>
>>>    - How exactly is partitioning information propagated? It looks like
>>>      within a pipelined RDD the parent partitioning is preserved
>>>      throughout unless we either specifically repartition or go through
>>>      a reduce. We're splitting as much as we can on maps and letting
>>>      reduces happen normally. Is that good practice?
>>>    - When doing e.g. partitionByKey, does an entire partition get sent
>>>      to one worker process?
>>>    - When does Spark stream data? Are there easy ways to sabotage the
>>>      streaming? Are there any knobs for us to twiddle here?
>>>    - Is there any way to specify the number of shuffles for a given
>>>      reduce step?
>>>    - How can we get better insight into what our workers are doing,
>>>      specifically around moving data in and out of Python land?
>>>
>>> I realise it's hard to troubleshoot in the absence of code, but any
>>> test case we have would be contrived. We're collecting more metrics and
>>> trying to reason about what might be happening, but any guidance at
>>> this point would be most helpful.
>>>
>>> Thanks!
>>>
>>> --
>>> Aaron Olson
>>> Data Engineer, Shopify
>>
>
>
> --
> Aaron Olson
> Data Engineer, Shopify
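Also, on the partitionByKey question: before tuning anything else it's worth checking whether one partition is wildly bigger than the rest. A quick sketch against the standard pyspark API (the `pairs` RDD name and the 1024 partition count are just placeholders):

# Sketch: look at how records are spread across partitions.
# `pairs` stands in for one of your (key, value) RDDs.
sizes = pairs.glom().map(len).collect()   # record count per partition
print "partitions:", len(sizes), "max:", max(sizes), "min:", min(sizes)

# Spread skewed data over more partitions before a join; 1024 is illustrative.
repartitioned = pairs.partitionBy(1024)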