Hi folks, I have a question regarding the scheduling of a Spark job on a YARN cluster.
Let's say there are 5 nodes in the YARN cluster: A, B, C, D, E. In my Spark job I'll be reading a huge text file from HDFS with sc.textFile(fileName) and creating an RDD. Assume that only nodes A and E contain the blocks of that text file. I'm curious about the following:

1. Does the Spark driver talk to the NameNode, or does the YARN ResourceManager talk to the NameNode, to learn which nodes hold the required input blocks? In other words, which component gets that information (which data blocks are on each node) from the NameNode? This information is needed when launching executors on worker nodes to exploit data locality.

2. How does Spark or YARN launch executors on the nodes that hold the required blocks? That is, how do Spark executors get launched on nodes A and E, but not on B, C, or D?

3. Or does the YARN RM launch executors on arbitrary nodes, and then, if a data block does not exist on a node, is an extra copy made over HDFS (copying data from A or E to whichever of {B, C, D} the executor got launched on)?

Thanks.
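P.S. In case it helps, here is a minimal Scala sketch of the kind of job I mean (the HDFS path and app name are made up). I noticed that RDD.preferredLocations reports hostnames per partition, which seems related to my question about who fetches the block locations from the NameNode:

import org.apache.spark.{SparkConf, SparkContext}

object LocalityCheck {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("locality-check") // made-up app name
    val sc = new SparkContext(conf)

    // Made-up path; in reality this is the huge text file on HDFS.
    val rdd = sc.textFile("hdfs:///data/huge.txt")

    // Print the hosts Spark prefers for each partition. I assume these
    // hostnames come from the HDFS block locations of the input splits,
    // but which component asks the NameNode for them is what I'm asking.
    rdd.partitions.foreach { p =>
      println(s"partition ${p.index}: " + rdd.preferredLocations(p).mkString(", "))
    }

    sc.stop()
  }
}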