Hi all, I have an architectural question. My app works over 50 GB of persistent data stored in HDFS as simple CSV files. The app also takes 10 GB of input data, also in CSV format. The persistent data and the input data have no common keys.
In my cluster I have 5 data nodes. The app simply matches every line of the input data against every line of the persistent data. I see two different approaches to this task:

1. Distribute the input file to every node using the -files option and run a single job (a rough sketch of what I mean is below). The drawback is that every map task has to go through the full 10 GB of input data.

2. Divide the input file (10 GB) into 5 parts (for instance) and run 5 independent jobs (one per data node, for instance), giving each job 2 GB of input. In this case every map task only has to go through 2 GB; in other words, each node gets its own slice of the input. The drawback of this approach is the extra work I have to do before starting the jobs and after they finish.

Or maybe there is a more subtle way to do this in Hadoop?
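
For reference, here is a minimal sketch of what I mean by approach 1, assuming the 50 GB persistent data is the normal job input and the 10 GB file is shipped with -files (so it shows up as a local symlink in the task's working directory). The class name LineMatchMapper and the matches() helper are just placeholders for my real comparison logic, and loading the whole side file into memory is only for illustration; with 10 GB I would have to stream or index it instead.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper for approach 1: the 50 GB persistent CSV is the job input,
// the 10 GB input CSV is distributed with -files and read in setup().
public class LineMatchMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

    // Side-file lines kept in memory purely for illustration;
    // a real 10 GB file would need streaming or an index instead.
    private final List<String> sideLines = new ArrayList<String>();

    @Override
    protected void setup(Context context) throws IOException {
        // "input.csv" is the symlink that -files creates in the task's
        // working directory (same base name as the local file passed in).
        BufferedReader reader = new BufferedReader(new FileReader("input.csv"));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                sideLines.add(line);
            }
        } finally {
            reader.close();
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Compare this persistent-data line against every input-data line.
        for (String candidate : sideLines) {
            if (matches(value.toString(), candidate)) {
                context.write(new Text(value.toString() + "\t" + candidate),
                              NullWritable.get());
            }
        }
    }

    // Placeholder for the application's real line-matching rule.
    private boolean matches(String persistentLine, String inputLine) {
        return persistentLine.contains(inputLine);
    }
}

I would launch it with something like the following (jar, driver and paths are made up; the driver would have to go through ToolRunner so that GenericOptionsParser picks up -files):

hadoop jar match.jar MatchDriver -files /local/path/input.csv /data/persistent /data/out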