Hi, I am testing Spark on Amazon EMR using Python and the basic wordcount example that ships with Spark.
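For reference, my wordcount.py is essentially the stock example under examples/src/main/python/ in the Spark distribution; roughly (the S3 path placeholder here is the one from my submit command below):

    from operator import add
    from pyspark import SparkContext

    sc = SparkContext(appName="PythonWordCount")
    lines = sc.textFile("s3n://<input location>")  # the S3 path passed on the command line
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add)  # the reduceByKey(add) stage discussed below
    for (word, count) in counts.collect():
        print "%s: %i" % (word, count)
    sc.stop()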
After running the application, I noticed that in Stage 0 (reduceByKey(add)) around 2.5 GB of shuffle data is spilled to memory and 4 GB is spilled to disk. Since the wordcount example does not cache or persist any data, I thought I could improve the performance of this application by giving the shuffle a larger memory fraction. So, in spark-defaults.conf, I added the following:

    spark.storage.memoryFraction 0.2
    spark.shuffle.memoryFraction 0.6

However, I am still getting the same performance, and the same amount of shuffle data is being spilled to disk and memory. I verified via the Environment tab of the Spark UI that Spark is reading these configurations, and I can see my changes there. Moreover, when I tried setting spark.shuffle.spill to false, I got the performance I am looking for, and all shuffle data was spilled to memory only.

So, what am I getting wrong here, and why is the extra shuffle memory fraction not being utilized?

*My environment:*
Amazon EMR with Spark 1.3.1, installed using the -x argument
1 master node: m3.xlarge
3 core nodes: m3.xlarge
Application: wordcount.py
Input: 10 .gz files, 90 MB each (~350 MB unarchived), stored in S3

*Submit command:*
/home/hadoop/spark/bin/spark-submit --deploy-mode client /mnt/wordcount.py s3n://<input location>

*spark-defaults.conf:*
spark.eventLog.enabled false
spark.executor.extraJavaOptions -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70
spark.driver.extraJavaOptions -Dspark.driver.log.level=INFO
spark.master yarn
spark.executor.instances 3
spark.executor.cores 4
spark.executor.memory 9404M
spark.default.parallelism 12
spark.eventLog.enabled true
spark.eventLog.dir hdfs:///spark-logs/
spark.storage.memoryFraction 0.2
spark.shuffle.memoryFraction 0.6
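In case it helps anyone reproduce this: as far as I know, the same properties can also be passed per job on the spark-submit command line via --conf instead of editing spark-defaults.conf, e.g.:

    /home/hadoop/spark/bin/spark-submit --deploy-mode client \
        --conf spark.storage.memoryFraction=0.2 \
        --conf spark.shuffle.memoryFraction=0.6 \
        /mnt/wordcount.py s3n://<input location>

and similarly --conf spark.shuffle.spill=false for the variant where everything stayed in memory.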