Hi
I am testing Spark on Amazon EMR using Python and the basic wordcount
example shipped with Spark.
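(For context, the core of that job is a reduceByKey(add) over (word, 1) pairs. The following plain-Python sketch mimics that logic with the standard library only; it stands in for the RDD API, so the SparkContext and actual RDD calls are omitted:)

```python
from operator import add

def reduce_by_key(pairs, func):
    """Mimic RDD.reduceByKey: combine all values sharing a key with func."""
    acc = {}
    for key, value in pairs:
        acc[key] = func(acc[key], value) if key in acc else value
    return list(acc.items())

lines = ["to be or not to be"]
words = [w for line in lines for w in line.split()]
counts = reduce_by_key([(w, 1) for w in words], add)
# counts holds pairs like ("to", 2), ("be", 2), ("or", 1), ("not", 1)
```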

After running the application, I noticed that in Stage 0, reduceByKey(add),
around 2.5GB of shuffle data is spilled to memory and 4GB is spilled to
disk. Since the wordcount example does not cache or persist any data, I
thought I could improve the performance of this application by giving the
shuffle a larger memoryFraction. So, in spark-defaults.conf, I added the
following:

spark.storage.memoryFraction    0.2
spark.shuffle.memoryFraction    0.6

However, I am still getting the same performance, and the same amount of
shuffle data is being spilled to disk and memory. I verified via the
Environment tab of the Spark UI that Spark is picking up these settings,
and I can see my changes there. Moreover, when I tried setting
spark.shuffle.spill to false, I got the performance I was looking for, and
all shuffle data was spilled to memory only.

So, what am I getting wrong here, and why is the extra shuffle memory
fraction not being utilized?

*My environment:*
Amazon EMR with Spark 1.3.1, launched with the -x argument
1 Master node: m3.xlarge
3 Core nodes: m3.xlarge
Application: wordcount.py
Input: 10 .gz files 90MB each (~350MB unarchived) stored in S3

*Submit command:*
/home/hadoop/spark/bin/spark-submit --deploy-mode client /mnt/wordcount.py
s3n://<input location>
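(For what it's worth, the same two fractions can also be passed per-run via spark-submit's --conf flag instead of spark-defaults.conf; a sketch using the same paths and values as above:)

```shell
/home/hadoop/spark/bin/spark-submit \
  --deploy-mode client \
  --conf spark.storage.memoryFraction=0.2 \
  --conf spark.shuffle.memoryFraction=0.6 \
  /mnt/wordcount.py s3n://<input location>
```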

*spark-defaults.conf:*
spark.eventLog.enabled          false
spark.executor.extraJavaOptions -verbose:gc -XX:+PrintGCDetails
-XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70
spark.driver.extraJavaOptions   -Dspark.driver.log.level=INFO
spark.master                    yarn
spark.executor.instances        3
spark.executor.cores            4
spark.executor.memory           9404M
spark.default.parallelism       12
spark.eventLog.enabled          true
spark.eventLog.dir              hdfs:///spark-logs/
spark.storage.memoryFraction    0.2
spark.shuffle.memoryFraction    0.6



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-spark-shuffle-memoryFraction-has-no-affect-tp23944.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
