Alpha chapters are available, and chapter 8 should be in the alphas as soon as draft one gets back from technical review.
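
To make the join described in the quoted message below concrete, here is a minimal plain-Java sketch (not Hadoop code, and not taken from the book). It assumes both inputs are already sorted by an integer id, and it assumes that an inner join emits one tuple for every combination of values sharing a key. The class and method names are hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: a sort-merge inner join over two key-sorted lists.
// Each record is a two-element array {key, value}. Nothing here is Hadoop API.
class MergeJoinSketch {

    // Hypothetical helper: parse the integer id stored in the record's first field.
    static int key(String[] record) {
        return Integer.parseInt(record[0]);
    }

    // Walks both inputs in key order and emits "key<tab>(leftValue,rightValue)"
    // for every pair of records that share a key. Keys present on only one
    // side are skipped, which is what makes the join "inner".
    static List<String> innerJoin(List<String[]> left, List<String[]> right) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.size() && j < right.size()) {
            int k = key(left.get(i));
            int cmp = Integer.compare(k, key(right.get(j)));
            if (cmp < 0) {
                i++;                      // key only on the left side so far
            } else if (cmp > 0) {
                j++;                      // key only on the right side so far
            } else {
                // Find the run of records with this key on each side.
                int iEnd = i;
                while (iEnd < left.size() && key(left.get(iEnd)) == k) iEnd++;
                int jEnd = j;
                while (jEnd < right.size() && key(right.get(jEnd)) == k) jEnd++;
                // Emit every left/right combination for the shared key.
                for (int a = i; a < iEnd; a++) {
                    for (int b = j; b < jEnd; b++) {
                        out.add(k + "\t(" + left.get(a)[1] + "," + right.get(b)[1] + ")");
                    }
                }
                i = iEnd;
                j = jEnd;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Data shaped like the files in the question: one matrix per id,
        // several vectors per id.
        List<String[]> matrices = Arrays.asList(
                new String[] {"1", "matrix"},
                new String[] {"2", "anothermatrix"});
        List<String[]> vectors = Arrays.asList(
                new String[] {"1", "vector1"},
                new String[] {"1", "vector2"},
                new String[] {"2", "vector3"});
        for (String line : innerJoin(matrices, vectors)) {
            System.out.println(line);
        }
    }
}
```

With the sample data in `main`, the sketch emits three joined records: id 1 with (matrix,vector1), id 1 with (matrix,vector2), and id 2 with (anothermatrix,vector3). Whether a real job sees exactly these calls depends on the two data sets meeting the ordering and partitioning conditions listed in the quoted message.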

On Sun, Apr 5, 2009 at 7:43 AM, Christian Ulrik Søttrup <[email protected]> wrote:

> jason hadoop wrote:
>
>> This is discussed in chapter 8 of my book.
>>
> What book? Is it out?
>
>> In short,
>> If both data sets are:
>>
>>    - in the same key order,
>>    - partitioned with the same partitioner,
>>    - the input format of each data set is the same (necessary for this
>>      simple example only)
>>
>> A map side join will present all the key value pairs of each partition
>> to a single map task, in key order.
>>
>> Path dir1 == the directory containing the part-XXXXX files for data set 1
>> Path dir2 == the directory containing the part-XXXXX files for data set 2
>> and use CompositeInputFormat.compose to build the join statement.
>>
>> Set the InputFormat to CompositeInputFormat:
>> conf.setInputFormat(CompositeInputFormat.class);
>>
>> String joinStatement = CompositeInputFormat.compose("inner", dir1, dir2);
>> conf.set("mapred.join.expr", joinStatement);
>>
>> The value class for your map method will be TupleWritable.
>> In the map method,
>>
>>    - value.has(x) indicates whether the Xth ordinal data set has a value
>>      for this key
>>    - value.get(x) returns the value from the Xth ordinal data set for
>>      this key
>>    - value.size() returns the number of data sets in the join
>>
>> In our example, dir1 would be ordinal 0, and dir2 would be ordinal 1.
>>
> The partitioner is normally used for the reduce step, but here it will
> already be used at the mapper stage?
>
> Basically my files look like:
> id<tab>matrix
> id2<tab>anothermatrix
> and
> id<tab>vector1
> id<tab>vector2
> id2<tab>vector3
>
> id is just an integer, and there is only one matrix but many vectors tied
> to the same id.
> I just want the values from both files that have the same id.
> Do I need a partitioner in this case? What happens if the file is split
> into blocks such that two blocks contain entries with the same key?
>
> Am I right that, using the example above, the mapper will be called three
> times with:
> key=id tuple=(matrix,vector1)
> key=id tuple=(matrix,vector2)
> key=id2 tuple=(anothermatrix,vector3)
>
> cheers,
> Christian
>

--
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
