It will probably be available in a week or so, as draft one isn't quite finished :)
On Thu, Apr 2, 2009 at 1:45 AM, Stefan Podkowinski <spo...@gmail.com> wrote:
> .. and is not yet available as an alpha book chapter. Any chance of
> uploading it?
>
> On Thu, Apr 2, 2009 at 4:21 AM, jason hadoop <jason.had...@gmail.com> wrote:
> > Just for fun, chapter 9 in my book is a work-through of solving this
> > class of problem.
> >
> > On Thu, Mar 26, 2009 at 7:07 AM, jason hadoop <jason.had...@gmail.com> wrote:
> > > For the classic map/reduce job, you have 3 requirements:
> > >
> > > 1) a comparator that provides the keys in IP address order, such
> > > that all keys in one of your ranges are contiguous when sorted with
> > > the comparator;
> > > 2) a partitioner that ensures that all keys that should be together
> > > end up in the same partition;
> > > 3) an output value grouping comparator that considers all keys in a
> > > specified range equal.
> > >
> > > The comparator only sorts by the first part of the key; the search
> > > file has a 2-part key (begin/end), while the input data has just a
> > > 1-part key.
> > >
> > > A partitioner that knew ahead of time the group sets in your search
> > > set, in the way the terasort example works, would be ideal: i.e.,
> > > it builds an index of ranges from your search set so that the
> > > ranges get roughly evenly split between your reduces. This requires
> > > a pass over the search file to write out a summary file, which is
> > > then loaded by the partitioner.
> > >
> > > The output value grouping comparator will get the keys in order of
> > > the first token, will define the start of a group by the presence
> > > of a 2-part key, and will consider the group ended when either
> > > another 2-part key appears, or the key value is larger than the
> > > second part of the starting key. This does require that the
> > > grouping comparator maintain state.
> > >
> > > At this point, your reduce will be called with the first key in the
> > > key equivalence group of (3), with the values of all of the keys.
> > >
> > > In your map, any address that is not in a range of interest is not
> > > passed to output.collect.
> > >
> > > For the map side join code, you have to define a comparator on the
> > > key type that defines your notion of equivalence and ordering, and
> > > call WritableComparator.define(Key.class, Comparator.class) to
> > > force the join code to use your comparator.
> > >
> > > For tables with duplicate keys, per the key comparator, your map
> > > function in a map side join will receive a row for every
> > > permutation of the duplicate keys: if one table has (a,1), (a,2)
> > > and another table has (a,3), (a,4), your map will receive 4 rows:
> > > (a,1,3), (a,1,4), (a,2,3), (a,2,4).
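A minimal sketch of requirement 1 against the old mapred API. The
IpRangeKey type, its field names, and the tie-break rule are
assumptions made for illustration; the thread only fixes the idea that
both 1-part and 2-part keys sort by their first part:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    // Hypothetical key: 1-part keys (main file) carry only 'begin';
    // 2-part keys (search file) also carry 'end' and set hasEnd.
    public class IpRangeKey implements WritableComparable<IpRangeKey> {
      long begin;      // the IP, or the start of a range
      long end;        // meaningful only when hasEnd is true
      boolean hasEnd;  // true for search-file (range) keys

      public void write(DataOutput out) throws IOException {
        out.writeLong(begin);
        out.writeBoolean(hasEnd);
        out.writeLong(end);
      }

      public void readFields(DataInput in) throws IOException {
        begin = in.readLong();
        hasEnd = in.readBoolean();
        end = in.readLong();
      }

      // Sort by the first part only, so plain IPs land next to the
      // range keys they fall into. On a tie, the 2-part key sorts
      // first, so the reducer sees a range before the IPs inside it.
      public int compareTo(IpRangeKey o) {
        if (begin != o.begin) return begin < o.begin ? -1 : 1;
        if (hasEnd == o.hasEnd) return 0;
        return hasEnd ? -1 : 1;
      }
    }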
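The terasort-style partitioner could then look roughly like this. The
summary-file format and the loadSplitPoints() helper are invented for
the sketch:

    import java.util.Arrays;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Sketch of requirement 2: split points come from the one-pass
    // summary of the search file and are assumed sorted ascending.
    public class IpRangePartitioner implements Partitioner<IpRangeKey, Text> {
      private long[] splitPoints;

      public void configure(JobConf job) {
        splitPoints = loadSplitPoints(job);
      }

      public int getPartition(IpRangeKey key, Text value, int numPartitions) {
        // Binary-search the split points so that every key of a range,
        // and every IP inside it, lands in the same reduce.
        int idx = Arrays.binarySearch(splitPoints, key.begin);
        if (idx < 0) idx = -(idx + 1);
        return Math.min(idx, numPartitions - 1);
      }

      private long[] loadSplitPoints(JobConf job) {
        // Hypothetical helper: would read the summary file written by
        // the pre-pass over the search file. Stubbed out here.
        return new long[0];
      }
    }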
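A sketch of the stateful grouping comparator of requirement 3,
installed with JobConf.setOutputValueGroupingComparator(). The
framework compares adjacent keys of the sorted reduce input, and a
return value of 0 keeps the current group open; holding the open range
as instance state is one reading of the description above:

    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    // Every 1-part key inside the range opened by the last 2-part key
    // compares equal to it, so the reducer gets one call per range,
    // with the matching IPs among its values.
    public class IpRangeGroupingComparator extends WritableComparator {
      private long rangeBegin = Long.MIN_VALUE;  // currently open range
      private long rangeEnd = Long.MIN_VALUE;

      protected IpRangeGroupingComparator() {
        super(IpRangeKey.class, true);  // true: deserialize keys for compare()
      }

      @Override
      public int compare(WritableComparable a, WritableComparable b) {
        IpRangeKey next = (IpRangeKey) b;
        if (next.hasEnd) {        // a new 2-part key opens a new group
          rangeBegin = next.begin;
          rangeEnd = next.end;
          return -1;
        }
        if (next.begin >= rangeBegin && next.begin <= rangeEnd) {
          return 0;               // inside the open range: same group
        }
        return -1;                // outside every range: its own group
      }
    }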
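For the map side join route, the WritableComparator.define() call could
be wired up once before the job runs; a stateless comparator reusing
IpRangeKey.compareTo() is assumed here:

    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    // Registered as the default comparator for IpRangeKey so that
    // org.apache.hadoop.mapred.join orders and matches keys with it.
    public class IpRangeComparator extends WritableComparator {
      protected IpRangeComparator() {
        super(IpRangeKey.class, true);
      }

      @Override
      public int compare(WritableComparable a, WritableComparable b) {
        return ((IpRangeKey) a).compareTo((IpRangeKey) b);
      }

      static {  // runs when the class is loaded, before the join executes
        WritableComparator.define(IpRangeKey.class, new IpRangeComparator());
      }
    }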
> > > On Wed, Mar 25, 2009 at 11:19 PM, Tamir Kamara <tamirkam...@gmail.com> wrote:
> > > > Thanks to all who replied.
> > > >
> > > > Stefan -
> > > > I'm unable to see how converting IP ranges to network masks would
> > > > help, because different ranges can have the same network mask,
> > > > and with that I still have to compare two fields: the searched IP
> > > > against from-IP & mask.
> > > >
> > > > Pig - I'm familiar with Pig and use it often, but I can't think
> > > > of a way to write a Pig script that will do this type of "join".
> > > > I'll ask the Pig users group.
> > > >
> > > > The search file is indeed large in terms of the number of
> > > > records. However, I don't see this as an issue yet, because I'm
> > > > still puzzled over how to write the job in plain MR. The join
> > > > code looks for an exact match in the keys, and that is not what I
> > > > need. Would a custom comparator that looks for a match within the
> > > > ranges be the right way to do this?
> > > >
> > > > Thanks,
> > > > Tamir
> > > >
> > > > On Wed, Mar 25, 2009 at 5:23 PM, jason hadoop <jason.had...@gmail.com> wrote:
> > > > > If the search file data set is large, the issue becomes
> > > > > ensuring that only the required portion of the search file is
> > > > > actually read, and that those reads happen in the search file's
> > > > > key order.
> > > > >
> > > > > If the data set is small, most any of the common patterns will
> > > > > work.
> > > > >
> > > > > I haven't looked at Pig for a while; does Pig now use the
> > > > > indexes in map files, and take into account that a data set is
> > > > > sorted?
> > > > > Out of the box, the map side join code,
> > > > > org.apache.hadoop.mapred.join, will do a decent job of this,
> > > > > but the entire search file set will be read. To stop reading
> > > > > the entire search file, a record reader or join type would need
> > > > > to be put together to:
> > > > > a) skip to the first key of interest, using the index if
> > > > > available;
> > > > > b) finish when the last possible key of interest has been
> > > > > delivered.
> > > > >
> > > > > On Wed, Mar 25, 2009 at 6:05 AM, John Lee <j.benlin....@gmail.com> wrote:
> > > > > > In addition to the other suggestions, you could also take a
> > > > > > look at building a Cascading job with a custom Joiner class.
> > > > > >
> > > > > > - John
> > > > > >
> > > > > > On Tue, Mar 24, 2009 at 7:33 AM, Tamir Kamara <tamirkam...@gmail.com> wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > We need to implement a join with a between operator instead
> > > > > > > of an equals. What we are trying to do is search a file for
> > > > > > > a key where the key falls between two fields in the search
> > > > > > > file, like this:
> > > > > > >
> > > > > > > main file (ip, a, b):
> > > > > > > (80, zz, yy)
> > > > > > > (125, vv, bb)
> > > > > > >
> > > > > > > search file (from-ip, to-ip, d, e):
> > > > > > > (52, 75, xxx, yyy)
> > > > > > > (78, 98, aaa, bbb)
> > > > > > > (99, 115, xxx, ddd)
> > > > > > > (125, 130, hhh, aaa)
> > > > > > > (150, 162, qqq, sss)
> > > > > > >
> > > > > > > The outcome should be in the form (ip, a, b, d, e):
> > > > > > > (80, zz, yy, aaa, bbb)
> > > > > > > (125, vv, bb, hhh, aaa)
> > > > > > >
> > > > > > > We could convert the IP ranges in the search file to
> > > > > > > single-record IPs and then do a regular join, but the
> > > > > > > number of single IPs is huge, so this is probably not a
> > > > > > > good way. What would be a good course for doing this in
> > > > > > > Hadoop?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Tamir

--
Alpha Chapters of my book on Hadoop are available:
http://www.apress.com/book/view/9781430219422
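Pulling the reduce-side pieces together: both input files map onto the
sketch key type above before the sort. The comma-split layout and the
"M"/"S" value tags below are illustrative assumptions, not from the
thread:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // One mapper for both inputs: main-file lines "ip,a,b" become
    // 1-part keys, search-file lines "from-ip,to-ip,d,e" become 2-part
    // keys, distinguished here simply by field count. Per jason's
    // advice, an IP known to fall outside every range would just not
    // be collected.
    public class BetweenJoinMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, IpRangeKey, Text> {

      public void map(LongWritable offset, Text line,
                      OutputCollector<IpRangeKey, Text> out,
                      Reporter reporter) throws IOException {
        String[] f = line.toString().split(",");
        IpRangeKey key = new IpRangeKey();
        if (f.length == 3) {        // main file: (ip, a, b)
          key.begin = Long.parseLong(f[0].trim());
          out.collect(key, new Text("M," + f[1] + "," + f[2]));
        } else {                    // search file: (from-ip, to-ip, d, e)
          key.begin = Long.parseLong(f[0].trim());
          key.end = Long.parseLong(f[1].trim());
          key.hasEnd = true;
          out.collect(key, new Text("S," + f[2] + "," + f[3]));
        }
      }
    }

With the comparator, partitioner, and grouping comparator installed,
each reduce call then sees one "S" value carrying (d, e) followed by
the "M" values whose IPs fall in that range, which yields exactly the
(ip, a, b, d, e) rows Tamir asked for.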