Hi,

I'm running Spark 1.6.0 on YARN on a Hadoop 2.6.0 cluster, and I'm
observing a very strange issue. I run a simple job that reads about
1 TB of JSON logs from a remote HDFS cluster, converts them to Parquet,
and then saves the result to the local HDFS of the Hadoop cluster.
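
In case it helps, the job is essentially just the following (the
SparkContext setup and the HDFS hosts/paths below are placeholders, not
the real ones):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("json-to-parquet"))
    val sqlContext = new SQLContext(sc)

    // read the JSON logs from the remote HDFS cluster
    val logs = sqlContext.read.json("hdfs://remote-nn:8020/logs/*")

    // convert to Parquet and save to the local HDFS of the Hadoop cluster
    logs.write.parquet("hdfs://local-nn:8020/warehouse/logs_parquet")

    sc.stop()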

I run it with 25 executors, each with sufficient resources. The strange
thing, however, is that only 2 of those executors end up doing most of
the read work.

For example, when I go to the Executors tab in the Spark UI and look at
the "Input" column, the difference between nodes is huge, sometimes
20 GB vs. 120 GB.

1. What is the cause of this behaviour?
2. Any ideas on how to achieve more balanced performance? (The only
   workaround I've been able to think of is sketched below.)
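
For what it's worth, the only workaround I can think of so far is an
explicit repartition right after the read, roughly like this (the
partition count of 200 is an arbitrary guess on my part), but I don't
know whether that just hides the real cause:

    // force a shuffle after the read so the conversion/write is spread
    // across all executors instead of the 2 that did the reading
    val balanced = logs.repartition(200)
    balanced.write.parquet("hdfs://local-nn:8020/warehouse/logs_parquet")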

Thanks,
Borislav



