For the classic map/reduce job, you have three requirements:

1) a comparator that sorts the keys in IP-address order, such that all keys in one of your ranges are contiguous when sorted with it;
2) a partitioner that ensures that all keys that belong together end up in the same partition;
3) an output value grouping comparator that considers all keys within a given range equal.
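Roughly, the wiring for those three pieces looks like this with the old org.apache.hadoop.mapred API. IpKeyComparator, IpRangePartitioner and IpRangeGroupingComparator are just placeholder names for implementations you would write yourself, so treat this as a sketch rather than working code:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class IpRangeJoinDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(IpRangeJoinDriver.class);
    conf.setJobName("ip-range-join");

    // 1) sort comparator: orders every key by IP address, so the plain IP keys
    //    sort right next to the begin key of the range that contains them
    conf.setOutputKeyComparatorClass(IpKeyComparator.class);

    // 2) partitioner: loads the summary/index of the search file's ranges,
    //    so a range and all of the IPs inside it land in the same reduce
    conf.setPartitionerClass(IpRangePartitioner.class);

    // 3) grouping comparator: treats every key inside one range as equal, so a
    //    single reduce() call sees the range record plus all matching IPs
    conf.setOutputValueGroupingComparator(IpRangeGroupingComparator.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}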
The comparator only sorts by the first part of the key: the search file has a two-part key (begin/end), while the input data has just a one-part key.

A partitioner that knows ahead of time the ranges in your search set, in the way the TeraSort example works, would be ideal: i.e., it builds an index of ranges from your search set so that the ranges get split roughly evenly between your reduces. This requires a pass over the search file to write out a summary file, which is then loaded by the partitioner.

The output value grouping comparator will get the keys in order of the first token. It defines the start of a group by the presence of a two-part key, and considers the group ended when either another two-part key appears or the key value is larger than the second part of the starting key. This does require that the grouping comparator maintain state.

At this point, your reduce will be called with the first key in the key equivalence group of (3), and with the values of all of the keys in that group. In your map, any address that is not in a range of interest is simply not passed to output.collect.

For the map-side join code, you have to define a comparator on the key type that implements your definition of equivalence and ordering, and register it with WritableComparator.define(Key.class, comparator) to force the join code to use your comparator.

For tables with duplicates (per the key comparator) in a map-side join, your map function will receive a row for every permutation of the duplicate keys: if you have one table with a,1; a,2 and another table with a,3; a,4, your map will receive 4 rows: a,1,3; a,1,4; a,2,3; a,2,4.

Rough sketches of such a stateful grouping comparator and of a range-aware key comparator for the map-side join are at the end of this message.

On Wed, Mar 25, 2009 at 11:19 PM, Tamir Kamara <tamirkam...@gmail.com> wrote:
> Thanks to all who replied.
>
> Stefan -
> I'm unable to see how converting IP ranges to network masks would help,
> because different ranges can have the same network mask, and with that I
> still have to do a comparison of two fields: the searched IP with
> from-IP & mask.
>
> Pig - I'm familiar with Pig and use it many times, but I can't think of a
> way to write a Pig script that will do this type of "join". I'll ask the
> Pig users group.
>
> The search file is indeed large in terms of the number of records. However,
> I don't see this as an issue yet, because I'm still puzzled about how to
> write the job in plain MR. The join code is looking for an exact match in
> the keys and that is not what I need. Would a custom comparator, which will
> look for a match within the ranges, be the right choice to do this?
>
> Thanks,
> Tamir
>
> On Wed, Mar 25, 2009 at 5:23 PM, jason hadoop <jason.had...@gmail.com> wrote:
> >
> > If the search file data set is large, the issue becomes ensuring that only
> > the required portion of the search file is actually read, and that those
> > reads are ordered, in the search file's key order.
> >
> > If the data set is small, most any of the common patterns will work.
> >
> > I haven't looked at Pig for a while; does Pig now use indexes in map files,
> > and take into account that a data set is sorted?
> > Out of the box, the map-side join code, org.apache.hadoop.mapred.join, will
> > do a decent job of this, but the entire search file set will be read.
> > To stop reading the entire search file, a record reader or join type would
> > need to be put together to:
> > a) skip to the first key of interest, using the index if available
> > b) finish when the last possible key of interest has been delivered.
> >
> > On Wed, Mar 25, 2009 at 6:05 AM, John Lee <j.benlin....@gmail.com> wrote:
> > > In addition to other suggestions, you could also take a look at
> > > building a Cascading job with a custom Joiner class.
> > >
> > > - John
> > >
> > > On Tue, Mar 24, 2009 at 7:33 AM, Tamir Kamara <tamirkam...@gmail.com> wrote:
> > > > Hi,
> > > >
> > > > We need to implement a join with a between operator instead of an equal.
> > > > What we are trying to do is search a file for a key where the key falls
> > > > between two fields in the search file, like this:
> > > >
> > > > main file (ip, a, b):
> > > > (80, zz, yy)
> > > > (125, vv, bb)
> > > >
> > > > search file (from-ip, to-ip, d, e):
> > > > (52, 75, xxx, yyy)
> > > > (78, 98, aaa, bbb)
> > > > (99, 115, xxx, ddd)
> > > > (125, 130, hhh, aaa)
> > > > (150, 162, qqq, sss)
> > > >
> > > > the outcome should be in the form (ip, a, b, d, e):
> > > > (80, zz, yy, aaa, bbb)
> > > > (125, vv, bb, hhh, aaa)
> > > >
> > > > We could convert the IP ranges in the search file to single-record IPs and
> > > > then do a regular join, but the number of single IPs is huge and this is
> > > > probably not a good way.
> > > > What would be a good course for doing this in Hadoop?
> > > >
> > > > Thanks,
> > > > Tamir
> >
> > --
> > Alpha Chapters of my book on Hadoop are available
> > http://www.apress.com/book/view/9781430219422

--
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
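For reference, here is the rough sketch of the stateful grouping comparator mentioned above. RangeKey, getIp(), getEndIp() and isRange() are placeholder names for a key type carrying the one- or two-part key, so this is an outline of the idea rather than tested code:

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class IpRangeGroupingComparator extends WritableComparator {
  // end IP of the two-part key that opened the current group
  private long currentGroupEnd = Long.MIN_VALUE;

  public IpRangeGroupingComparator() {
    // deserialize keys so compare(WritableComparable, WritableComparable) is used
    super(RangeKey.class, true);
  }

  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    RangeKey next = (RangeKey) b; // the key that arrives later in sort order
    if (next.isRange()) {
      // a new two-part (begin/end) key always opens a new group
      currentGroupEnd = next.getEndIp();
      return -1;
    }
    // a plain one-part key stays in the current group while it is <= the range end
    return next.getIp() <= currentGroupEnd ? 0 : -1;
  }
}

The reduce side only checks whether the grouping comparator returns zero, so returning -1 to force a new group is enough for this purpose.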
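And for the map-side join route, a range-aware key comparator might look something like this; IpRangeKey and its accessors are again placeholders for your own key type:

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class IpRangeJoinComparator extends WritableComparator {
  public IpRangeJoinComparator() {
    super(IpRangeKey.class, true);
  }

  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    IpRangeKey left = (IpRangeKey) a;
    IpRangeKey right = (IpRangeKey) b;
    // an IP that falls inside the other side's [from-ip, to-ip] range counts as equal
    if (left.isRange() && right.getIp() >= left.getIp() && right.getIp() <= left.getEndIp()) {
      return 0;
    }
    if (right.isRange() && left.getIp() >= right.getIp() && left.getIp() <= right.getEndIp()) {
      return 0;
    }
    // otherwise fall back to plain ordering on the first token
    return left.getIp() < right.getIp() ? -1 : (left.getIp() == right.getIp() ? 0 : 1);
  }
}

Register it once in the driver with WritableComparator.define(IpRangeKey.class, new IpRangeJoinComparator()) so the join framework picks it up for that key type.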