Hi,

I am having problems with large inputs that cause an RDD to have a wide
dependency, which creates a shuffled RDD. Somehow shuffled partitions get
lost and need to be refetched. In the web UI I see 3x the number of
successfully completed tasks
(picture: https://dl.dropboxusercontent.com/u/14789218/Stages.png)

In the web UI task details you can see how one task (already completed
previously) gets re-run so its shuffle output can be refetched.
(picture of a task example: https://dl.dropboxusercontent.com/u/14789218/Details.png)
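
To illustrate the kind of job I mean, here is a minimal sketch in Scala; the
input path, key extraction and aggregation are placeholders, not my actual
job. The reduceByKey is what introduces the wide dependency:

  import org.apache.spark.SparkContext
  import org.apache.spark.SparkContext._

  // Minimal sketch: a key-based aggregation has a wide (shuffle) dependency,
  // so every reducer has to fetch map outputs from all upstream partitions.
  val sc = new SparkContext("mesos://master:5050", "ShuffleExample")
  val lines = sc.textFile("hdfs:///data/large-input")        // large input
  val pairs = lines.map(line => (line.split("\t")(0), 1L))   // (key, 1)
  val counts = pairs.reduceByKey(_ + _)                      // shuffled RDD created here
  counts.saveAsTextFile("hdfs:///data/output")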

These are the relevant settings from my spark-env.sh:
export SPARK_JAVA_OPTS='-Dspark.local.dir=/tmp/spark-xvdb
-Dspark.mesos.coarse=true -Dspark.akka.frameSize=500
-Dspark.akka.askTimeout=60 -Dspark.worker.timeout=600
-Dspark.akka.timeout=200 -Dspark.shuffle.consolidateFiles=true
-XX:+UseCompressedOops -XX:+UseParallelGC -verbose:gc -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps'
ulimit -n 65536
export SPARK_DAEMON_JAVA_OPTS='-Dspark.mesos.coarse=true
-Dspark.akka.frameSize=500 -Dspark.worker.timeout=600
-Dspark.akka.askTimeout=60 -Dspark.akka.timeout=200
-Dspark.shuffle.consolidateFiles=true'
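
In case it helps, the same properties could also be set programmatically
before the SparkContext is constructed, instead of through spark-env.sh; a
sketch mirroring the values above:

  // Sketch: set the shuffle/Akka-related properties as Java system
  // properties before creating the SparkContext (values copied from my
  // spark-env.sh above).
  System.setProperty("spark.local.dir", "/tmp/spark-xvdb")
  System.setProperty("spark.mesos.coarse", "true")
  System.setProperty("spark.akka.frameSize", "500")
  System.setProperty("spark.akka.askTimeout", "60")
  System.setProperty("spark.worker.timeout", "600")
  System.setProperty("spark.akka.timeout", "200")
  System.setProperty("spark.shuffle.consolidateFiles", "true")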

Any ideas on how to configure Spark so it does not run into problems with
large shuffled RDDs?

Kind regards, Domen
