For your information, I've attached a Ganglia monitoring screen capture to the Stack Overflow question.
Please see: http://stackoverflow.com/questions/24622108/apache-spark-the-number-of-cores-vs-the-number-of-executors

From: innowireless TaeYun Kim [mailto:taeyun....@innowireless.co.kr]
Sent: Tuesday, July 08, 2014 9:43 AM
To: user@spark.apache.org
Subject: The number of cores vs. the number of executors

Hi,

I'm trying to understand the relationship between the number of cores and the number of executors when running a Spark job on YARN.

The test environment is as follows:
- # of data nodes: 3
- Data node machine spec:
  - CPU: Core i7-4790 (# of cores: 4, # of threads: 8)
  - RAM: 32GB (8GB x 4)
  - HDD: 8TB (2TB x 4)
- Network: 1Gb
- Spark job flow: sc.textFile -> filter -> map -> filter -> mapToPair -> reduceByKey -> map -> saveAsTextFile (a minimal sketch of this flow appears after this message)
- Input data:
  - Type: single text file
  - Size: 165GB
  - # of lines: 454,568,833
- Output:
  - # of lines after second filter: 310,640,717
  - # of lines of the result file: 99,848,268
  - Size of the result file: 41GB

The job was run with the following configurations:
1) --master yarn-client --executor-memory 19G --executor-cores 7 --num-executors 3 (one executor per data node, using as many cores as possible)
2) --master yarn-client --executor-memory 19G --executor-cores 4 --num-executors 3 (# of cores reduced)
3) --master yarn-client --executor-memory 4G --executor-cores 2 --num-executors 12 (fewer cores, more executors)

Elapsed times:
1) 50 min 15 sec
2) 55 min 48 sec
3) 31 min 23 sec

To my surprise, 3) was much faster. I thought that 1) would be faster, since there would be less inter-executor communication when shuffling. And although the total # of cores of 1) (3 x 7 = 21) is fewer than that of 3) (12 x 2 = 24), core count alone cannot be the key factor, since 2) (3 x 4 = 12 cores) performed almost as well as 1).

How can I understand this result?

Thanks.
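For reference, here is a minimal sketch of the job flow described above. It assumes the Spark Java API (the mapToPair step suggests JavaRDD rather than the Scala API); the input/output paths, filter predicates, and map functions are hypothetical placeholders, since the actual logic is not given in the post.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class JobFlowSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("cores-vs-executors-test");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // sc.textFile -> filter -> map -> filter -> mapToPair -> reduceByKey -> map -> saveAsTextFile
            JavaRDD<String> lines = sc.textFile("hdfs:///input/data.txt");  // hypothetical path

            JavaPairRDD<String, Long> counts = lines
                .filter(line -> !line.isEmpty())               // first filter (placeholder predicate)
                .map(String::trim)                             // map (placeholder transform)
                .filter(line -> line.contains("\t"))           // second filter (placeholder predicate)
                .mapToPair(line -> new Tuple2<>(line.split("\t")[0], 1L))  // key/value extraction (placeholder)
                .reduceByKey(Long::sum);                       // the shuffle happens here

            counts
                .map(kv -> kv._1() + "\t" + kv._2())           // format result lines
                .saveAsTextFile("hdfs:///output/result");      // hypothetical path

            sc.stop();
        }
    }

With configuration 3), for example, such a job would be launched along these lines (class and jar names are hypothetical):

    spark-submit --master yarn-client --executor-memory 4G --executor-cores 2 \
        --num-executors 12 --class JobFlowSketch job.jar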