Hi Brad,

YARN scheduling does take data locality into account. In YARN, tasks are not assigned to nodes directly on the basis of capacity. Instead, a certain number of containers is allocated on each node according to that node's capacity, and tasks are executed in those containers. When scheduling tasks onto containers, the YARN scheduler tries to satisfy their data-locality requirements. I am not very familiar with the Fair Scheduler, but if you check the source of FifoScheduler you will find a function 'assignContainersOnNode' that looks like the following:

  private int assignContainersOnNode(FiCaSchedulerNode node,
      FiCaSchedulerApp application, Priority priority) {
    // Data-local
    int nodeLocalContainers =
        assignNodeLocalContainers(node, application, priority);

    // Rack-local
    int rackLocalContainers =
        assignRackLocalContainers(node, application, priority);

    // Off-switch
    int offSwitchContainers =
        assignOffSwitchContainers(node, application, priority);

    LOG.debug("assignContainersOnNode:" +
        " node=" + node.getRMNode().getNodeAddress() +
        " application=" + application.getApplicationId().getId() +
        " priority=" + priority.getPriority() +
        " #assigned=" +
        (nodeLocalContainers + rackLocalContainers + offSwitchContainers));

    return (nodeLocalContainers + rackLocalContainers + offSwitchContainers);
  }

In this routine you will find that data-local tasks are scheduled first, then rack-local, and then off-switch. After this you may find a similar function in the Fair Scheduler too.

I hope this helps. Let me know if you have more questions or if something is wrong in my reasoning.

Regards,
Shekhar

On Thu, Apr 3, 2014 at 10:56 AM, Brad Childs <b...@redhat.com> wrote:
> Sorry if this is the wrong list, i am looking for deep technical/hadoop
> source help :)
>
> How does job scheduling work on yarn framework for map reduce jobs? I see
> the yarn scheduler discussed here:
> https://hadoop.apache.org/docs/r2.2.0/hadoop-yarn/hadoop-yarn-site/YARN.html
> which leads me to believe tasks are scheduled based on node capacity and
> not data locality. I've sifted through the fair scheduler and can't find
> anything about data location or locality.
>
> Where does data locality play into the scheduling of map/reduce tasks on
> yarn? Can someone point me to the hadoop 2.x source where the data block
> location is used to calculate node/container/task assignment (if thats
> still happening).
>
> -bc
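
To make the node-local, rack-local, off-switch ordering from the routine above more concrete, here is a minimal, self-contained Java sketch. It is not Hadoop code: the class LocalitySketch, the method classify, and the host and rack names are all hypothetical and exist only for illustration. As far as I can tell from the routine above, the real scheduler matches an application's outstanding container requests against the node, which this sketch does not attempt to model. It only shows how a free node could be ranked against the hosts that store a task's input.

```java
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Hadoop.
final class LocalitySketch {

    // The three locality tiers, in the order the scheduler prefers them.
    enum Locality { NODE_LOCAL, RACK_LOCAL, OFF_SWITCH }

    /**
     * Classifies how close a free node is to the data a task wants to read.
     *
     * @param freeNode   host that currently has a free container
     * @param blockHosts hosts storing a replica of the task's input block
     * @param rackOf     assumed host-to-rack mapping
     */
    static Locality classify(String freeNode,
                             List<String> blockHosts,
                             Map<String, String> rackOf) {
        // Best case: the free node itself stores a replica.
        if (blockHosts.contains(freeNode)) {
            return Locality.NODE_LOCAL;
        }
        // Next best: some replica lives on the same rack as the free node.
        String rack = rackOf.get(freeNode);
        if (rack != null) {
            for (String host : blockHosts) {
                if (rack.equals(rackOf.get(host))) {
                    return Locality.RACK_LOCAL;
                }
            }
        }
        // Otherwise the data has to cross racks.
        return Locality.OFF_SWITCH;
    }

    public static void main(String[] args) {
        Map<String, String> rackOf =
            Map.of("h1", "/r1", "h2", "/r1", "h3", "/r2");
        List<String> blockHosts = List.of("h1");

        System.out.println(classify("h1", blockHosts, rackOf)); // NODE_LOCAL
        System.out.println(classify("h2", blockHosts, rackOf)); // RACK_LOCAL
        System.out.println(classify("h3", blockHosts, rackOf)); // OFF_SWITCH
    }
}
```

A scheduler following the same preference would hand the container to the request with the best available tier, and fall back to the next tier only when nothing closer is waiting.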