Performance issue: a Spark job run on YARN takes about 4x as long as the same job in Spark standalone mode. However, in standalone mode, jobs often fail with "executor lost" errors.
Hardware configuration: 3 machines (1 master and 2 workers), each with 32 GB RAM, 8 cores (16 with hyper-threading), and a 1 TB HDD.

Spark configuration (standalone mode):

  spark.executor.memory                 7g
  spark.cores.max                       96
  spark.driver.memory                   5g
  spark.driver.maxResultSize            2g
  spark.sql.autoBroadcastJoinThreshold  -1
  executor instances                    4 per machine

Without spark.sql.autoBroadcastJoinThreshold=-1 the job fails, or takes 50x as long.

With the above Spark configuration, the job for the business flow of 17 million records completes in 8 minutes.

Problem area: when the same flow is run in yarn-client mode with the configuration below, it takes 33 to 42 minutes. Here is the yarn-site.xml:

<configuration>
  <property><name>yarn.label.enabled</name><value>true</value></property>
  <property><name>yarn.log-aggregation.enable-local-cleanup</name><value>false</value></property>
  <property><name>yarn.resourcemanager.scheduler.client.thread-count</name><value>64</value></property>
  <property><name>yarn.resourcemanager.resource-tracker.address</name><value>satish-NS1:8031</value></property>
  <property><name>yarn.resourcemanager.scheduler.address</name><value>satish-NS1:8030</value></property>
  <property><name>yarn.dispatcher.exit-on-error</name><value>true</value></property>
  <property><name>yarn.nodemanager.container-manager.thread-count</name><value>64</value></property>
  <property><name>yarn.nodemanager.local-dirs</name><value>/home/satish/yarn</value></property>
  <property><name>yarn.nodemanager.localizer.fetch.thread-count</name><value>20</value></property>
  <property><name>yarn.resourcemanager.address</name><value>satish-NS1:8032</value></property>
  <property><name>yarn.scheduler.increment-allocation-mb</name><value>512</value></property>
  <property><name>yarn.log.server.url</name><value>http://satish-NS1:19888/jobhistory/logs</value></property>
  <property><name>yarn.nodemanager.resource.memory-mb</name><value>28000</value></property>
  <property><name>yarn.nodemanager.labels</name><value>MASTER</value></property>
  <property><name>yarn.nodemanager.resource.cpu-vcores</name><value>48</value></property>
  <property><name>yarn.scheduler.minimum-allocation-mb</name><value>1024</value></property>
  <property><name>yarn.log-aggregation-enable</name><value>true</value></property>
  <property><name>yarn.nodemanager.localizer.client.thread-count</name><value>20</value></property>
  <property><name>yarn.app.mapreduce.am.labels</name><value>CORE</value></property>
  <property><name>yarn.log-aggregation.retain-seconds</name><value>172800</value></property>
  <property><name>yarn.nodemanager.address</name><value>${yarn.nodemanager.hostname}:8041</value></property>
  <property><name>yarn.resourcemanager.hostname</name><value>satish-NS1</value></property>
  <property><name>yarn.scheduler.maximum-allocation-mb</name><value>8192</value></property>
  <property><name>yarn.nodemanager.remote-app-log-dir</name><value>/home/satish/satish/hadoop-yarn/apps</value></property>
  <property><name>yarn.resourcemanager.resource-tracker.client.thread-count</name><value>64</value></property>
  <property><name>yarn.scheduler.maximum-allocation-vcores</name><value>1</value></property>
  <property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle,</value></property>
  <property><name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name><value>org.apache.hadoop.mapred.ShuffleHandler</value></property>
  <property><name>yarn.resourcemanager.client.thread-count</name><value>64</value></property>
  <property><name>yarn.nodemanager.container-metrics.enable</name><value>true</value></property>
  <property><name>yarn.nodemanager.log-dirs</name><value>/home/satish/hadoop-yarn/containers</value></property>
  <property><name>yarn.nodemanager.aux-services</name><value>spark_shuffle,mapreduce_shuffle</value></property>
  <property><name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name><value>org.apache.hadoop.mapred.ShuffleHandler</value></property>
  <property><name>yarn.nodemanager.aux-services.spark_shuffle.class</name><value>org.apache.spark.network.yarn.YarnShuffleService</value></property>
  <property><name>yarn.scheduler.minimum-allocation-vcores</name><value>1</value></property>
  <property><name>yarn.scheduler.increment-allocation-vcores</name><value>1</value></property>
  <property><name>yarn.resourcemanager.scheduler.class</name><value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value></property>
  <property><name>yarn.scheduler.fair.preemption</name><value>true</value></property>
</configuration>
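For reference, here is a reconstruction of how the yarn-client run is driven (I have not pasted the exact submit code; the object name, app name, and the 8-executor / 4-core split are stand-ins that mirror the standalone sizing). Unlike standalone mode, where spark.cores.max=96 claims every available core, Spark 1.6 on YARN starts only two 1-core, 1 GB executors unless the executor count, cores, and memory are set explicitly (or dynamic allocation is enabled):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: executor sizing made explicit for YARN, mirroring the
// standalone configuration above.
object YarnClientFlow {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("yarn-client")
      .setAppName("business-flow")                        // placeholder name
      .set("spark.executor.instances", "8")               // 4 per worker x 2 workers
      .set("spark.executor.cores", "4")                   // assumed split of 16 threads
      .set("spark.executor.memory", "7g")
      .set("spark.driver.maxResultSize", "2g")
      .set("spark.sql.autoBroadcastJoinThreshold", "-1")  // same as standalone mode
    // spark.driver.memory has no effect here in yarn-client mode (the driver
    // JVM is already running); pass --driver-memory 5g to spark-submit instead.
    val sc = new SparkContext(conf)
    // ... the 17-million-record business flow ...
    sc.stop()
  }
}

Note that yarn.scheduler.maximum-allocation-vcores is 1 in the yarn-site.xml above, which would cap any multi-core executor request like this one.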
In the capacity scheduler I am using the dominant resource calculator; I have also tried the fair and default schedulers.

To keep the test simple, I ran a plain sort on the same cluster in both yarn-client mode and Spark standalone mode (a sketch of the sort job is at the end of this message), and I can share the data for your comparative analysis:

  yarn-client mode:         136 seconds
  Spark standalone mode:     40 seconds

To conclude: I am looking for the reason behind the yarn-client performance problem, and for the best configuration to get comparable performance out of YARN. With spark.sql.autoBroadcastJoinThreshold=-1, the long-running jobs complete on time and rarely fail; without it I have a history of job failures, so I keep the option enabled. Please let me know how to get performance from yarn-client mode similar to Spark standalone.
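For reference, the sort test was of roughly the following shape. This is a reconstruction rather than the exact code (the record count, random keys, and partition count are stand-ins), but it shows the kind of shuffle-heavy job behind the 136-second vs 40-second numbers above:

import org.apache.spark.{SparkConf, SparkContext}

// Reconstructed sort benchmark; record count, key generation, and
// partitioning are stand-ins chosen to force a full shuffle and sort.
object SortBench {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sort-bench"))
    val n = 17000000L                         // same order as the 17M-record flow
    val data = sc
      .parallelize(1L to n, 96)               // 96 partitions, one per requested core
      .map(i => (scala.util.Random.nextLong(), i))
    val t0 = System.nanoTime()
    data.sortByKey().count()                  // count() forces the shuffle and sort
    println(f"sorted $n%d records in ${(System.nanoTime() - t0) / 1e9}%.1f s")
    sc.stop()
  }
}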