Performance issue: the time taken to complete a Spark job on YARN is 4x slower
than in Spark standalone mode. However, in standalone mode jobs often fail
with executor-lost errors.

Hardware configuration

3 nodes (1 master, 2 workers), each with 32 GB RAM, 8 cores (16 logical), and 1 TB HDD.

Spark configuration:

spark.executor.memory                 7g
spark.cores.max                       96
spark.driver.memory                   5g
spark.driver.maxResultSize            2g
spark.sql.autoBroadcastJoinThreshold  -1   (without this key the job fails, or takes 50x longer)
Number of executor instances: 4 per machine.
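For reference, the list above can be written as a spark-defaults.conf fragment (a sketch only; property names are from the Spark 1.6 configuration reference, and the instance count of 12 is my reading of "4 per machine" across 3 nodes):

```
spark.executor.memory                 7g
spark.driver.memory                   5g
spark.driver.maxResultSize            2g
spark.sql.autoBroadcastJoinThreshold  -1
# honored in standalone mode only:
spark.cores.max                       96
# YARN mode: assuming 4 executors per machine x 3 nodes:
spark.executor.instances              12
```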

With the above Spark configuration, the job for the business flow of 17
million records completes in 8 minutes.

Problem Area:


When the same flow is run in yarn-client mode with the configuration below,
it takes 33 to 42 minutes. Here is the yarn-site.xml configuration:

<configuration>
  <property><name>yarn.label.enabled</name><value>true</value></property>
  <property><name>yarn.log-aggregation.enable-local-cleanup</name><value>false</value></property>
  <property><name>yarn.resourcemanager.scheduler.client.thread-count</name><value>64</value></property>
  <property><name>yarn.resourcemanager.resource-tracker.address</name><value>satish-NS1:8031</value></property>
  <property><name>yarn.resourcemanager.scheduler.address</name><value>satish-NS1:8030</value></property>
  <property><name>yarn.dispatcher.exit-on-error</name><value>true</value></property>
  <property><name>yarn.nodemanager.container-manager.thread-count</name><value>64</value></property>
  <property><name>yarn.nodemanager.local-dirs</name><value>/home/satish/yarn</value></property>
  <property><name>yarn.nodemanager.localizer.fetch.thread-count</name><value>20</value></property>
  <property><name>yarn.resourcemanager.address</name><value>satish-NS1:8032</value></property>
  <property><name>yarn.scheduler.increment-allocation-mb</name><value>512</value></property>
  <property><name>yarn.log.server.url</name><value>http://satish-NS1:19888/jobhistory/logs</value></property>
  <property><name>yarn.nodemanager.resource.memory-mb</name><value>28000</value></property>
  <property><name>yarn.nodemanager.labels</name><value>MASTER</value></property>
  <property><name>yarn.nodemanager.resource.cpu-vcores</name><value>48</value></property>
  <property><name>yarn.scheduler.minimum-allocation-mb</name><value>1024</value></property>
  <property><name>yarn.log-aggregation-enable</name><value>true</value></property>
  <property><name>yarn.nodemanager.localizer.client.thread-count</name><value>20</value></property>
  <property><name>yarn.app.mapreduce.am.labels</name><value>CORE</value></property>
  <property><name>yarn.log-aggregation.retain-seconds</name><value>172800</value></property>
  <property><name>yarn.nodemanager.address</name><value>${yarn.nodemanager.hostname}:8041</value></property>
  <property><name>yarn.resourcemanager.hostname</name><value>satish-NS1</value></property>
  <property><name>yarn.scheduler.maximum-allocation-mb</name><value>8192</value></property>
  <property><name>yarn.nodemanager.remote-app-log-dir</name><value>/home/satish/satish/hadoop-yarn/apps</value></property>
  <property><name>yarn.resourcemanager.resource-tracker.client.thread-count</name><value>64</value></property>
  <property><name>yarn.scheduler.maximum-allocation-vcores</name><value>1</value></property>
  <property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle,</value></property>
  <property><name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name><value>org.apache.hadoop.mapred.ShuffleHandler</value></property>
  <property><name>yarn.resourcemanager.client.thread-count</name><value>64</value></property>
  <property><name>yarn.nodemanager.container-metrics.enable</name><value>true</value></property>
  <property><name>yarn.nodemanager.log-dirs</name><value>/home/satish/hadoop-yarn/containers</value></property>
  <property><name>yarn.nodemanager.aux-services</name><value>spark_shuffle,mapreduce_shuffle</value></property>
  <property><name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name><value>org.apache.hadoop.mapred.ShuffleHandler</value></property>
  <property><name>yarn.nodemanager.aux-services.spark_shuffle.class</name><value>org.apache.spark.network.yarn.YarnShuffleService</value></property>
  <property><name>yarn.scheduler.minimum-allocation-vcores</name><value>1</value></property>
  <property><name>yarn.scheduler.increment-allocation-vcores</name><value>1</value></property>
  <property><name>yarn.resourcemanager.scheduler.class</name><value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value></property>
  <property><name>yarn.scheduler.fair.preemption</name><value>true</value></property>
</configuration>
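As a sanity check on how the memory settings above interact, here is a small sketch of my own arithmetic. It assumes Spark 1.6's default spark.yarn.executor.memoryOverhead of max(384 MB, 10% of executor memory) and YARN rounding each container request up to a multiple of yarn.scheduler.increment-allocation-mb:

```python
import math

# Sketch: how a spark.executor.memory=7g request maps onto YARN containers
# with the yarn-site.xml values above. The overhead formula is an assumption
# based on Spark 1.6's default spark.yarn.executor.memoryOverhead.
def container_size_mb(executor_mem_mb,
                      min_alloc_mb=1024,     # yarn.scheduler.minimum-allocation-mb
                      increment_mb=512,      # yarn.scheduler.increment-allocation-mb
                      overhead_floor_mb=384,
                      overhead_frac=0.10):
    overhead = max(overhead_floor_mb, int(executor_mem_mb * overhead_frac))
    requested = executor_mem_mb + overhead
    # YARN rounds each request up to a multiple of the increment,
    # and never allocates below the minimum allocation.
    return max(min_alloc_mb, math.ceil(requested / increment_mb) * increment_mb)

size = container_size_mb(7 * 1024)
print(size)           # 8192 MB per executor container
print(28000 // size)  # 3 such executors fit in a 28000 MB NodeManager
```

Under these assumptions a 7g executor occupies an 8192 MB container, which only just fits under yarn.scheduler.maximum-allocation-mb (8192), and only 3 such containers fit per 28000 MB node rather than the 4 instances per machine used in standalone mode.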

Also, in the capacity scheduler I am using the DominantResourceCalculator. I
have tried the fair and default calculators as well.
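For completeness, the resource-calculator setting referred to above is typically configured in capacity-scheduler.xml like this (a sketch using the standard Hadoop property name; note my yarn-site.xml above actually selects the FairScheduler, so this applies only when the CapacityScheduler is active):

```
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
```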

To keep the test simple, I ran a sort job on the same cluster in yarn-client
mode and in Spark standalone mode. I can share the data for your comparative
analysis as well.

136 seconds - Yarn-client mode
40 seconds  - Spark Standalone mode

To conclude: I am looking for the reason behind the yarn-client performance
issue, a solution, and the best configuration possible to achieve comparable
performance from YARN.

With spark.sql.autoBroadcastJoinThreshold set to -1, jobs that otherwise run
long complete on time and fail far less often; I have a history of job
failures when running Spark without this option.

Please let me know how to get performance from yarn-client similar to Spark
standalone.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Performance-issue-when-running-Spark-1-6-1-in-yarn-client-mode-with-Hadoop-2-6-0-tp28747.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
