Ok. I was able to get this to run but have a slight problem.

*File 1*
1 10
2 20
3 30
3 35
4 40
4 45
4 49
5 50
*File 2*
a 10 123
b 20 21321
c 45 2131
d 40 2131111

I want to join the above two based on the second column of file 1. Here's what I am getting as the output:

*Output*
1 a 123
b 21321 2
3
3
4 d 2131111
c 2131 4
4
5

Rows like "1 a 123" and "4 d 2111111" are in the format I want. Rows like "b 21321 2" and "c 2131 4" have their order reversed. How can I get them to be in the correct order too? Basically, the order in which the iterator iterates over the values is not consistent. How can I make it consistent?

Amandeep

Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz

On Fri, Feb 6, 2009 at 2:58 PM, Amandeep Khurana <[email protected]> wrote:

> Ok. Got it.
>
> Now, how would my reducer know whether the name is coming first or the
> address? Is it going to be in the same order in the iterator as the files
> are read (alphabetically) in the mapper?
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>
> On Fri, Feb 6, 2009 at 5:22 AM, Jeff Hammerbacher <[email protected]> wrote:
>
>> You put the files into a common directory and use that as the input to
>> the MapReduce job. You write a single Mapper class that has an "if"
>> statement examining the map.input.file property, outputting "number" as
>> the key for both files, but "address" as the value for one and "name"
>> for the other. By using a common key ("number"), you'll ensure that the
>> name and address make it to the same reducer after the shuffle. In the
>> reducer, you'll then have the relevant information (in the values) you
>> need to create the (name, address) pair.
>>
>> On Fri, Feb 6, 2009 at 2:17 AM, Amandeep Khurana <[email protected]> wrote:
>>
>> > Thanks Jeff...
>> > I am not 100% clear about the first solution you have given. How do I
>> > get the multiple files to be read and then fed into a single reducer?
>> > Should I have multiple mappers in the same class with different job
>> > configs for them, or run two separate jobs, with one outputting the key
>> > as (name, number) and the other outputting the value as (number,
>> > address) into the reducer?
>> > I'm not clear what I'll be doing with the map.input.file here...
>> >
>> > Amandeep
>> >
>> > Amandeep Khurana
>> > Computer Science Graduate Student
>> > University of California, Santa Cruz
>> >
>> > On Fri, Feb 6, 2009 at 1:55 AM, Jeff Hammerbacher <[email protected]> wrote:
>> >
>> > > Hey Amandeep,
>> > >
>> > > You can get the file name for a task via the "map.input.file"
>> > > property. For the join you're doing, you could inspect this property
>> > > and output (number, name) and (number, address) as your (key, value)
>> > > pairs, depending on the file you're working with. Then you can do the
>> > > combination in your reducer.
>> > >
>> > > You could also check out the join package in contrib/utils (
>> > > http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/contrib/utils/join/package-summary.html
>> > > ), but I'd say your job is simple enough that you'll get it done
>> > > faster with the above method.
>> > >
>> > > This task would be a simple join in Hive, so you could consider using
>> > > Hive to manage the data and perform the join.
>> > >
>> > > Later,
>> > > Jeff
>> > >
>> > > On Fri, Feb 6, 2009 at 1:34 AM, Amandeep Khurana <[email protected]> wrote:
>> > >
>> > > > Is it possible to write a map reduce job using multiple input files?
>> > > >
>> > > > For example:
>> > > > File 1 has data like - Name, Number
>> > > > File 2 has data like - Number, Address
>> > > >
>> > > > Using these, I want to create a third file which has something like
>> > > > - Name, Address
>> > > >
>> > > > How can a map reduce job be written to do this?
>> > > >
>> > > > Amandeep
>> > > >
>> > > > Amandeep Khurana
>> > > > Computer Science Graduate Student
>> > > > University of California, Santa Cruz
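
[Editor's note] The inconsistent ordering described at the top of the thread is the usual reduce-side join caveat: Hadoop makes no guarantee about the order in which values arrive in the reducer's iterator. The common fix is to tag each value in the mapper with which file it came from, then sort or bucket the values by tag in the reducer, so the File 1 value always comes first. A minimal sketch in plain Python, standing in for Hadoop's map/shuffle/reduce phases (the tags "F1"/"F2" and the helper names are illustrative, not from any actual code in this thread; the records are the sample data above):

```python
import random
from collections import defaultdict

def map_phase(file1_lines, file2_lines):
    """One 'mapper' that tags each value with its source file, the way an
    if-statement on map.input.file would inside a single Hadoop Mapper."""
    pairs = []
    for line in file1_lines:                 # File 1: value, join-key
        name, num = line.split()
        pairs.append((num, ("F1", name)))    # tag F1 = came from file 1
    for line in file2_lines:                 # File 2: letter, join-key, value
        letter, num, val = line.split()
        pairs.append((num, ("F2", letter + " " + val)))
    return pairs

def reduce_phase(pairs):
    """Group by key. The shuffle delivers values in arbitrary order, so we
    sort each group's values by tag before emitting: F1 always leads."""
    random.shuffle(pairs)                    # simulate nondeterministic order
    groups = defaultdict(list)
    for key, tagged in pairs:
        groups[key].append(tagged)
    rows = []
    for key in sorted(groups, key=int):
        values = sorted(groups[key])         # tag "F1" sorts before "F2"
        rows.append(" ".join(v for _tag, v in values))
    return rows

file1 = ["1 10", "2 20", "3 30", "3 35", "4 40", "4 45", "4 49", "5 50"]
file2 = ["a 10 123", "b 20 21321", "c 45 2131", "d 40 2131111"]
for row in reduce_phase(map_phase(file1, file2)):
    print(row)
```

Despite the simulated shuffle, every output row now leads with the File 1 value ("1 a 123", "2 b 21321", ..., "4 c 2131") on every run. In real Hadoop the same effect comes from sorting the tagged values in the reducer, or from a secondary sort that makes the framework deliver them in tag order.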
