This is discussed in chapter 8 of my book.

In short, if both data sets:

   - are sorted in the same key order,
   - are partitioned with the same partitioner, and
   - use the same input format (necessary for this simple example only),

then a map-side join will present all the key/value pairs of each pair of
partitions to a single map task, in key order.
Let:

Path dir1 = the directory containing the part-XXXXX files for data set 1
Path dir2 = the directory containing the part-XXXXX files for data set 2

Then use CompositeInputFormat.compose to build the join expression.

Set the input format to CompositeInputFormat:

conf.setInputFormat(CompositeInputFormat.class);

String joinStatement = CompositeInputFormat.compose("inner",
    SequenceFileInputFormat.class, dir1, dir2);
conf.set("mapred.join.expr", joinStatement);

(compose also needs the underlying InputFormat class of the data sets;
here I assume SequenceFileInputFormat.)
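Putting the configuration steps together, a minimal driver might look like the
sketch below. It uses the old org.apache.hadoop.mapred API; JoinJob,
MyJoinMapper, and the argument layout are hypothetical names for this example,
and SequenceFileInputFormat is assumed as the common input format of both data
sets.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;

public class JoinJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(JoinJob.class);
        conf.setJobName("map-side-join");

        // Directories holding the part-XXXXX files of each data set
        Path dir1 = new Path(args[0]);
        Path dir2 = new Path(args[1]);

        // CompositeInputFormat drives the join; the join expression names
        // the operation ("inner"), the underlying input format, and the
        // inputs in ordinal order (dir1 = ordinal 0, dir2 = ordinal 1).
        conf.setInputFormat(CompositeInputFormat.class);
        conf.set("mapred.join.expr",
                 CompositeInputFormat.compose("inner",
                         SequenceFileInputFormat.class, dir1, dir2));

        conf.setMapperClass(MyJoinMapper.class);   // hypothetical mapper
        conf.setNumReduceTasks(0);                 // map-only join
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(conf, new Path(args[2]));

        JobClient.runJob(conf);
    }
}
```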

The value class for your map method will be TupleWritable.
In the map method:

   - value.has(x) indicates whether the xth ordinal data set has a value for
   this key
   - value.get(x) returns the value from the xth ordinal data set for this
   key
   - value.size() returns the number of data sets in the join

In our example, dir1 would be ordinal 0, and dir2 would be ordinal 1.
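A mapper consuming these tuples could be sketched as follows, again using the
old mapred API. MyJoinMapper is a hypothetical name, and the Text key type is
an assumption that depends on what the underlying input format emits.

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.join.TupleWritable;

public class MyJoinMapper extends MapReduceBase
        implements Mapper<Text, TupleWritable, Text, Text> {

    public void map(Text key, TupleWritable value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // With an "inner" join every ordinal has a value for the key; with
        // an "outer" join, check value.has(i) before calling value.get(i).
        if (value.has(0) && value.has(1)) {
            Writable fromDir1 = value.get(0); // record from data set 1
            Writable fromDir2 = value.get(1); // record from data set 2
            output.collect(key, new Text(
                fromDir1.toString() + "\t" + fromDir2.toString()));
        }
    }
}
```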



On Sat, Apr 4, 2009 at 2:36 PM, Ken Krugler <[email protected]> wrote:

> I need to do some calculations that have to merge two sets of very large
>> data (basically calculate variance).
>> One set contains a set of "means" and the second  a set of objects tied to
>> a mean.
>>
>> Normally I would  send the set of means using the distributed cache, but
>> the set has become too large to keep in memory and it is going to grow in
>> the future.
>>
>
> You might want to check out Cascading (http://www.cascading.org), which is
> an API for doing data processing on Hadoop - it has support for SQL-style
> joins (sounds like what you want) via its CoGroup pipe.
>
> -- Ken
> --
> Ken Krugler
> +1 530-210-6378
>



-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
