Ok. I was able to get this to run but have a slight problem.

*File 1*
1 10
2 20
3 30
3 35
4 40
4 45
4 49
5 50
*File 2*
a 10 123
b 20 21321
c 45 2131
d 40 2131111

I want to join the above two based on the second column of file 1. Here's what I am getting as the output:

*Output*
1 a 123
b 21321 2
3
3
4 d 2131111
c 2131 4
4
5

Rows like "1 a 123" and "4 d 2111111" are in the format I want. Rows like "b 21321 2" and "c 2131 4" have their order reversed. How can I get them to be in the correct order too? Basically, the order in which the iterator iterates over the values is not consistent. How can I make it consistent?

Amandeep

Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz

On Fri, Feb 6, 2009 at 2:58 PM, Amandeep Khurana <[email protected]> wrote:

> Ok. Got it.
>
> Now, how would my reducer know whether the name is coming first or the
> address? Is it going to be in the same order in the iterator as the files
> are read (alphabetically) in the mapper?
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>
> On Fri, Feb 6, 2009 at 5:22 AM, Jeff Hammerbacher <[email protected]> wrote:
>
>> You put the files into a common directory and use that as the input to
>> the MapReduce job. You write a single Mapper class that has an "if"
>> statement examining the map.input.file property, outputting "number" as
>> the key for both files, but "address" as the value for one and "name"
>> for the other. By using a common key ("number"), you'll ensure that the
>> name and address make it to the same reducer after the shuffle. In the
>> reducer, you'll then have the relevant information (in the values) you
>> need to create the (name, address) pair.
>>
>> On Fri, Feb 6, 2009 at 2:17 AM, Amandeep Khurana <[email protected]> wrote:
>>
>> > Thanks Jeff...
>> > I am not 100% clear about the first solution you have given. How do I
>> > get the multiple files to be read and then fed into a single reducer?
>> > Should I have multiple mappers in the same class with different job
>> > configs for them, or run two separate jobs, with one outputting the key
>> > as (name, number) and the other outputting the value as (number,
>> > address) into the reducer?
>> > I'm not clear what I'll be doing with the map.input.file here...
>> >
>> > Amandeep
>> >
>> > Amandeep Khurana
>> > Computer Science Graduate Student
>> > University of California, Santa Cruz
>> >
>> > On Fri, Feb 6, 2009 at 1:55 AM, Jeff Hammerbacher <[email protected]> wrote:
>> >
>> > > Hey Amandeep,
>> > >
>> > > You can get the file name for a task via the "map.input.file"
>> > > property. For the join you're doing, you could inspect this property
>> > > and output (number, name) and (number, address) as your (key, value)
>> > > pairs, depending on the file you're working with. Then you can do the
>> > > combination in your reducer.
>> > >
>> > > You could also check out the join package in contrib/utils (
>> > > http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/contrib/utils/join/package-summary.html
>> > > ), but I'd say your job is simple enough that you'll get it done
>> > > faster with the above method.
>> > >
>> > > This task would be a simple join in Hive, so you could consider using
>> > > Hive to manage the data and perform the join.
>> > >
>> > > Later,
>> > > Jeff
>> > >
>> > > On Fri, Feb 6, 2009 at 1:34 AM, Amandeep Khurana <[email protected]> wrote:
>> > >
>> > > > Is it possible to write a map reduce job using multiple input files?
>> > > >
>> > > > For example:
>> > > > File 1 has data like - Name, Number
>> > > > File 2 has data like - Number, Address
>> > > >
>> > > > Using these, I want to create a third file which has something like
>> > > > - Name, Address
>> > > >
>> > > > How can a map reduce job be written to do this?
>> > > >
>> > > > Amandeep
>> > > >
>> > > > Amandeep Khurana
>> > > > Computer Science Graduate Student
>> > > > University of California, Santa Cruz
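
[Editor's note] The inconsistent ordering described at the top of the thread is the usual reduce-side join caveat: Hadoop makes no guarantee about the order in which values arrive in the reducer's iterator. The common fix is to tag each value in the mapper with which file it came from, then sort or bucket the values by tag in the reducer, so the File 1 value always comes first. A minimal sketch in plain Python, standing in for Hadoop's map/shuffle/reduce phases (the tags "F1"/"F2" and the helper names are illustrative, not from any actual code in this thread; the records are the sample data above):

```python
import random
from collections import defaultdict

def map_phase(file1_lines, file2_lines):
    """One 'mapper' that tags each value with its source file, the way an
    if-statement on map.input.file would inside a single Hadoop Mapper."""
    pairs = []
    for line in file1_lines:                 # File 1: value, join-key
        name, num = line.split()
        pairs.append((num, ("F1", name)))    # tag F1 = came from file 1
    for line in file2_lines:                 # File 2: letter, join-key, value
        letter, num, val = line.split()
        pairs.append((num, ("F2", letter + " " + val)))
    return pairs

def reduce_phase(pairs):
    """Group by key. The shuffle delivers values in arbitrary order, so we
    sort each group's values by tag before emitting: F1 always leads."""
    random.shuffle(pairs)                    # simulate nondeterministic order
    groups = defaultdict(list)
    for key, tagged in pairs:
        groups[key].append(tagged)
    rows = []
    for key in sorted(groups, key=int):
        values = sorted(groups[key])         # tag "F1" sorts before "F2"
        rows.append(" ".join(v for _tag, v in values))
    return rows

file1 = ["1 10", "2 20", "3 30", "3 35", "4 40", "4 45", "4 49", "5 50"]
file2 = ["a 10 123", "b 20 21321", "c 45 2131", "d 40 2131111"]
for row in reduce_phase(map_phase(file1, file2)):
    print(row)
```

Despite the simulated shuffle, every output row now leads with the File 1 value ("1 a 123", "2 b 21321", ..., "4 c 2131") on every run. In real Hadoop the same effect comes from sorting the tagged values in the reducer, or from a secondary sort that makes the framework deliver them in tag order.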
