Hi, I am having problems with large inputs that cause an RDD to have a wide dependency, which creates a shuffled RDD. Somehow the shuffled partitions get lost and need to be refetched. In the web UI I see about 3x the expected number of successfully completed tasks (picture: <https://dl.dropboxusercontent.com/u/14789218/Stages.png>).
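For reference, here is a minimal sketch of the kind of job where this happens (the input path, key extraction, and variable names are illustrative, not my actual code):

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._  // pair-RDD implicits (needed on older Spark)

    // sc: SparkContext is assumed to be already created.
    // reduceByKey introduces a wide dependency, so Spark builds a shuffled RDD
    // whose partitions are what I see being lost and refetched.
    val counts = sc.textFile("hdfs:///path/to/large/input")  // illustrative path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)                                    // wide dependency -> shuffle
    counts.count()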
In the web UI task details you can see how one task that had already completed gets refetched (picture of a task example: <https://dl.dropboxusercontent.com/u/14789218/Details.png>).

These are the relevant settings from my spark-env.sh:

    export SPARK_JAVA_OPTS='-Dspark.local.dir=/tmp/spark-xvdb -Dspark.mesos.coarse=true -Dspark.akka.frameSize=500 -Dspark.akka.askTimeout=60 -Dspark.worker.timeout=600 -Dspark.akka.timeout=200 -Dspark.shuffle.consolidateFiles=true -XX:+UseCompressedOops -XX:+UseParallelGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps'
    ulimit -n 65536
    export SPARK_DAEMON_JAVA_OPTS='-Dspark.mesos.coarse=true -Dspark.akka.frameSize=500 -Dspark.worker.timeout=600 -Dspark.akka.askTimeout=60 -Dspark.akka.timeout=200 -Dspark.shuffle.consolidateFiles=true'

Any ideas on how to configure Spark so it does not have problems with large shuffled RDDs?

Kind regards,
Domen
