Hi, I'm running Spark 1.6.0 on YARN on a Hadoop 2.6.0 cluster and I'm seeing some very strange behaviour. I run a simple job that reads about 1 TB of JSON logs from a remote HDFS cluster, converts them to Parquet, and saves them to the local HDFS of the Hadoop cluster.
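For reference, a minimal sketch of the kind of job I mean (the paths, namenode addresses, and partition count below are placeholders, not my actual code):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object JsonToParquet {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("json-to-parquet"))
        val sqlContext = new SQLContext(sc)

        // Read the JSON logs from the remote HDFS cluster
        val logs = sqlContext.read.json("hdfs://remote-nn:8020/logs/...")

        // Repartitioning here is just one illustrative way to spread the
        // write across executors; the number 200 is arbitrary
        logs.repartition(200)
          .write
          .parquet("hdfs://local-nn:8020/warehouse/logs_parquet")
      }
    }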
I run it with 25 executors, each with sufficient resources. The strange thing is that only 2 of those executors do most of the read work. For example, when I look at the "Input" column on the Executors tab of the Spark UI, the difference between nodes is huge, sometimes 20 GB vs. 120 GB.

1. What causes this behaviour?
2. Any ideas on how to achieve more balanced performance?

Thanks,
Borislav