Hi folks, I have a question regarding the scheduling of a Spark job on a YARN cluster.
Let's say there are 5 nodes in the YARN cluster: A, B, C, D, E. In my Spark job I'll be reading a huge text file from HDFS with sc.textFile(fileName) and creating an RDD. Assume that only nodes A and E contain the blocks of that text file. I'm curious about the following:

1. Does the Spark driver talk to the NameNode, or does the YARN ResourceManager talk to the NameNode, to learn which nodes hold the required input blocks? In other words, which component gets that information (which data blocks are on each node) from the NameNode? This information is needed when launching executors on worker nodes to exploit data locality.

2. How does Spark or YARN launch executors on the nodes that hold the required blocks? That is, how do Spark executors get launched on nodes A and E, but not on B, C, or D?

3. Or does the YARN RM launch executors on arbitrary nodes, and then, if a data block does not exist on a node, is an extra copy made over HDFS (copying data from A or E to whichever of {B, C, D} the executor got launched on)?

Thanks.
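P.S. In case it helps, here is a minimal Scala sketch of the kind of job I mean (the HDFS path and app name are made up). I noticed that RDD.preferredLocations reports hostnames per partition, which seems related to my question about who fetches the block locations from the NameNode:

import org.apache.spark.{SparkConf, SparkContext}

object LocalityCheck {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("locality-check") // made-up app name
    val sc = new SparkContext(conf)

    // Made-up path; in reality this is the huge text file on HDFS.
    val rdd = sc.textFile("hdfs:///data/huge.txt")

    // Print the hosts Spark prefers for each partition. I assume these
    // hostnames come from the HDFS block locations of the input splits,
    // but which component asks the NameNode for them is what I'm asking.
    rdd.partitions.foreach { p =>
      println(s"partition ${p.index}: " + rdd.preferredLocations(p).mkString(", "))
    }

    sc.stop()
  }
}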