Vamsi,

> I have a basic doubt about Hadoop input data placement.
>
> If I input some 30GB of data to a Hadoop program, it will place the
> 30GB into HDFS as some set of files, based on some input format.

Conceptually, it would be more accurate to say that the data is split
into 'blocks' that are managed in HDFS. Of course, implementation-wise,
these blocks do get stored as physical files on the datanodes.
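For concreteness: copying data into DFS is a step of its own, separate from running any job. A minimal sketch using the Java FileSystem API (the paths here are made up for illustration, and it assumes your Hadoop configuration is on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyToDfs {
    public static void main(String[] args) throws Exception {
        // Picks up the cluster settings (e.g. fs.default.name) from
        // the configuration files on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS. HDFS splits it into blocks
        // (64MB by default) and replicates them across datanodes.
        fs.copyFromLocalFile(new Path("/local/data/input-30gb.txt"),
                             new Path("/user/vamsi/input"));
    }
}

Once the data is in DFS like this, any number of jobs can read it without copying it again.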
> I have two doubts here:
>
> 1. Is the 30GB placed into HDFS each time I run the program, or how
> does it work?

What program? Are you talking about this 30GB as input to the program
or output from it? Assuming Map/Reduce input, the answer is, in
general, no. A typical M/R program takes an input path on DFS, and
this can point to data that has already been copied to DFS,
independently of the program itself.

> 2. Suppose I then want to run some other program on another 100GB of
> data, where both the data and the program are different from the
> above. Is the previous 30GB erased from HDFS, or how does it run?

Given that a program and its input are independent, the program will
not modify any existing data. In fact, most Map/Reduce applications do
not overwrite output data either; rather, they will refuse to start if
the output directory already exists.

Thanks
Hemanth
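P.S. In case it helps, here is a minimal sketch of a job driver using the org.apache.hadoop.mapred API; the class name and paths are made up for illustration:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class PathDemo {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(PathDemo.class);
        conf.setJobName("path-demo");

        // The input is just a DFS path; the data behind it was copied
        // into HDFS earlier, independent of this program.
        FileInputFormat.setInputPaths(conf, new Path("/user/vamsi/input"));

        // The output directory must not already exist. runJob() fails
        // with an exception before any task starts if it does, so
        // existing data is never overwritten.
        FileOutputFormat.setOutputPath(conf, new Path("/user/vamsi/output"));

        // No mapper/reducer is set here, so the identity map and
        // reduce run; a real job would set its own classes.
        JobClient.runJob(conf);
    }
}

Running PathDemo twice without removing /user/vamsi/output shows the second point directly: the second run refuses to start.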