Hello all,

In order to get a better understanding of Hadoop, I've started reading the source and have a question.

FileInputFormat reads in files, divides them into splits of a given split size (which may be bigger than the block size), and creates FileSplits. Each FileSplit contains the start, length *and* locations of the split. The LineRecordReader receives a split and emits records.
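
To make the model above concrete, here is a minimal sketch in plain Java that does not use any Hadoop classes. The names SplitSketch, Split, computeSplits and blocksSpanned are hypothetical, and taking a split's hosts from the block that holds its start offset is an assumption made for illustration, not confirmed behaviour of FileInputFormat.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitSketch {
    // Hypothetical stand-in for a FileSplit: start offset, length, candidate hosts.
    public static class Split {
        public final long start;
        public final long length;
        public final List<String> hosts;

        public Split(long start, long length, List<String> hosts) {
            this.start = start;
            this.length = length;
            this.hosts = hosts;
        }
    }

    // Cut a file of fileLength bytes into splits of at most splitSize bytes.
    // blockHosts[i] lists the machines holding a replica of block i.
    // Assumption: a split's hosts are those of the block containing its start offset.
    public static List<Split> computeSplits(long fileLength, long splitSize,
                                            long blockSize, String[][] blockHosts) {
        List<Split> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            long length = Math.min(splitSize, fileLength - start);
            int blockIndex = (int) (start / blockSize);
            splits.add(new Split(start, length, Arrays.asList(blockHosts[blockIndex])));
        }
        return splits;
    }

    // Return {first, last} indices of the blocks that a split overlaps.
    public static int[] blocksSpanned(Split split, long blockSize) {
        int first = (int) (split.start / blockSize);
        int last = (int) ((split.start + split.length - 1) / blockSize);
        return new int[] {first, last};
    }

    public static void main(String[] args) {
        // A 300-byte file stored as three 100-byte blocks, read with 150-byte splits.
        String[][] blockHosts = {
            {"hostA", "hostB"}, {"hostB", "hostC"}, {"hostC", "hostA"}
        };
        for (Split s : computeSplits(300, 150, 100, blockHosts)) {
            int[] span = blocksSpanned(s, 100);
            System.out.println("start=" + s.start + " length=" + s.length
                + " hosts=" + s.hosts + " blocks=" + span[0] + ".." + span[1]);
        }
    }
}
```

With a split size larger than the block size, each split in this example overlaps two blocks, while its host list under the stated assumption only reflects the first of them.
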
So far I think I'm correct (hopefully). Now, my questions: does the LineRecordReader run on a machine that is, in some sense, closest to the location of the split? That is:

Q1: If the split is smaller than the block size, then the split is located on one machine (apart from replicas). Does the LineRecordReader run on the machine that contains the split, or at least attempt to?

Q2: If a split is larger than the block size, it spans multiple blocks, which could reside on more than one machine. In this case, on which machine does the LineRecordReader run? The machine 'closest' to them?

Please correct me if I'm wrong.

Thank you,
Saptarshi

--
Saptarshi Guha
