It will probably be available in a week or so, as draft one isn't quite finished :)
On Thu, Apr 2, 2009 at 1:45 AM, Stefan Podkowinski <spo...@gmail.com> wrote:
> .. and is not yet available as an alpha book chapter. Any chance of
> uploading it?
>
> On Thu, Apr 2, 2009 at 4:21 AM, jason hadoop <jason.had...@gmail.com> wrote:
> > Just for fun, chapter 9 in my book is a work-through of solving this
> > class of problem.
> >
> > On Thu, Mar 26, 2009 at 7:07 AM, jason hadoop <jason.had...@gmail.com> wrote:
> > > For the classic map/reduce job, you have 3 requirements:
> > >
> > > 1) a comparator that provides the keys in IP address order, such
> > > that all keys in one of your ranges are contiguous when sorted with
> > > the comparator;
> > > 2) a partitioner that ensures that all keys that should be together
> > > end up in the same partition;
> > > 3) an output value grouping comparator that considers all keys in a
> > > specified range equal.
> > >
> > > The comparator only sorts by the first part of the key; the search
> > > file has a 2-part key (begin/end), while the input data has just a
> > > 1-part key.
> > >
> > > A partitioner that knew ahead of time the group sets in your search
> > > set, in the way the terasort example works, would be ideal: i.e.,
> > > it builds an index of ranges from your search set so that the
> > > ranges get roughly evenly split between your reduces. This requires
> > > a pass over the search file to write out a summary file, which is
> > > then loaded by the partitioner.
> > >
> > > The output value grouping comparator will get the keys in order of
> > > the first token, will define the start of a group by the presence
> > > of a 2-part key, and will consider the group ended when either
> > > another 2-part key appears, or the key value is larger than the
> > > second part of the starting key. This does require that the
> > > grouping comparator maintain state.
> > >
> > > At this point, your reduce will be called with the first key in the
> > > key equivalence group of (3), with the values of all of the keys.
> > >
> > > In your map, any address that is not in a range of interest is not
> > > passed to output.collect.
> > >
> > > For the map side join code, you have to define a comparator on the
> > > key type that defines your notion of equivalence and ordering, and
> > > call WritableComparator.define(Key.class, Comparator.class) to
> > > force the join code to use your comparator.
> > >
> > > For tables with duplicate keys, per the key comparator, your map
> > > function in a map side join will receive a row for every
> > > permutation of the duplicate keys: if one table has (a,1), (a,2)
> > > and another table has (a,3), (a,4), your map will receive 4 rows:
> > > (a,1,3), (a,1,4), (a,2,3), (a,2,4).
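A minimal sketch of requirement 1 against the old mapred API. The
IpRangeKey type, its field names, and the tie-break rule are
assumptions made for illustration; the thread only fixes the idea that
both 1-part and 2-part keys sort by their first part:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    // Hypothetical key: 1-part keys (main file) carry only 'begin';
    // 2-part keys (search file) also carry 'end' and set hasEnd.
    public class IpRangeKey implements WritableComparable<IpRangeKey> {
      long begin;      // the IP, or the start of a range
      long end;        // meaningful only when hasEnd is true
      boolean hasEnd;  // true for search-file (range) keys

      public void write(DataOutput out) throws IOException {
        out.writeLong(begin);
        out.writeBoolean(hasEnd);
        out.writeLong(end);
      }

      public void readFields(DataInput in) throws IOException {
        begin = in.readLong();
        hasEnd = in.readBoolean();
        end = in.readLong();
      }

      // Sort by the first part only, so plain IPs land next to the
      // range keys they fall into. On a tie, the 2-part key sorts
      // first, so the reducer sees a range before the IPs inside it.
      public int compareTo(IpRangeKey o) {
        if (begin != o.begin) return begin < o.begin ? -1 : 1;
        if (hasEnd == o.hasEnd) return 0;
        return hasEnd ? -1 : 1;
      }
    }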
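The terasort-style partitioner could then look roughly like this. The
summary-file format and the loadSplitPoints() helper are invented for
the sketch:

    import java.util.Arrays;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Sketch of requirement 2: split points come from the one-pass
    // summary of the search file and are assumed sorted ascending.
    public class IpRangePartitioner implements Partitioner<IpRangeKey, Text> {
      private long[] splitPoints;

      public void configure(JobConf job) {
        splitPoints = loadSplitPoints(job);
      }

      public int getPartition(IpRangeKey key, Text value, int numPartitions) {
        // Binary-search the split points so that every key of a range,
        // and every IP inside it, lands in the same reduce.
        int idx = Arrays.binarySearch(splitPoints, key.begin);
        if (idx < 0) idx = -(idx + 1);
        return Math.min(idx, numPartitions - 1);
      }

      private long[] loadSplitPoints(JobConf job) {
        // Hypothetical helper: would read the summary file written by
        // the pre-pass over the search file. Stubbed out here.
        return new long[0];
      }
    }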
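A sketch of the stateful grouping comparator of requirement 3,
installed with JobConf.setOutputValueGroupingComparator(). The
framework compares adjacent keys of the sorted reduce input, and a
return value of 0 keeps the current group open; holding the open range
as instance state is one reading of the description above:

    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    // Every 1-part key inside the range opened by the last 2-part key
    // compares equal to it, so the reducer gets one call per range,
    // with the matching IPs among its values.
    public class IpRangeGroupingComparator extends WritableComparator {
      private long rangeBegin = Long.MIN_VALUE;  // currently open range
      private long rangeEnd = Long.MIN_VALUE;

      protected IpRangeGroupingComparator() {
        super(IpRangeKey.class, true);  // true: deserialize keys for compare()
      }

      @Override
      public int compare(WritableComparable a, WritableComparable b) {
        IpRangeKey next = (IpRangeKey) b;
        if (next.hasEnd) {        // a new 2-part key opens a new group
          rangeBegin = next.begin;
          rangeEnd = next.end;
          return -1;
        }
        if (next.begin >= rangeBegin && next.begin <= rangeEnd) {
          return 0;               // inside the open range: same group
        }
        return -1;                // outside every range: its own group
      }
    }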
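For the map side join route, the WritableComparator.define() call could
be wired up once before the job runs; a stateless comparator reusing
IpRangeKey.compareTo() is assumed here:

    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    // Registered as the default comparator for IpRangeKey so that
    // org.apache.hadoop.mapred.join orders and matches keys with it.
    public class IpRangeComparator extends WritableComparator {
      protected IpRangeComparator() {
        super(IpRangeKey.class, true);
      }

      @Override
      public int compare(WritableComparable a, WritableComparable b) {
        return ((IpRangeKey) a).compareTo((IpRangeKey) b);
      }

      static {  // runs when the class is loaded, before the join executes
        WritableComparator.define(IpRangeKey.class, new IpRangeComparator());
      }
    }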
> > > On Wed, Mar 25, 2009 at 11:19 PM, Tamir Kamara <tamirkam...@gmail.com> wrote:
> > > > Thanks to all who replied.
> > > >
> > > > Stefan -
> > > > I'm unable to see how converting IP ranges to network masks would
> > > > help, because different ranges can have the same network mask,
> > > > and with that I still have to compare two fields: the searched IP
> > > > against from-IP & mask.
> > > >
> > > > Pig - I'm familiar with Pig and use it often, but I can't think
> > > > of a way to write a Pig script that will do this type of "join".
> > > > I'll ask the Pig users group.
> > > >
> > > > The search file is indeed large in terms of the number of
> > > > records. However, I don't see this as an issue yet, because I'm
> > > > still puzzled over how to write the job in plain MR. The join
> > > > code looks for an exact match in the keys, and that is not what I
> > > > need. Would a custom comparator that looks for a match within the
> > > > ranges be the right way to do this?
> > > >
> > > > Thanks,
> > > > Tamir
> > > >
> > > > On Wed, Mar 25, 2009 at 5:23 PM, jason hadoop <jason.had...@gmail.com> wrote:
> > > > > If the search file data set is large, the issue becomes
> > > > > ensuring that only the required portion of the search file is
> > > > > actually read, and that those reads happen in the search file's
> > > > > key order.
> > > > >
> > > > > If the data set is small, most any of the common patterns will
> > > > > work.
> > > > >
> > > > > I haven't looked at Pig for a while; does Pig now use the
> > > > > indexes in map files, and take into account that a data set is
> > > > > sorted?
> > > > > Out of the box, the map side join code,
> > > > > org.apache.hadoop.mapred.join, will do a decent job of this,
> > > > > but the entire search file set will be read. To stop reading
> > > > > the entire search file, a record reader or join type would need
> > > > > to be put together to:
> > > > > a) skip to the first key of interest, using the index if
> > > > > available;
> > > > > b) finish when the last possible key of interest has been
> > > > > delivered.
> > > > >
> > > > > On Wed, Mar 25, 2009 at 6:05 AM, John Lee <j.benlin....@gmail.com> wrote:
> > > > > > In addition to the other suggestions, you could also take a
> > > > > > look at building a Cascading job with a custom Joiner class.
> > > > > >
> > > > > > - John
> > > > > >
> > > > > > On Tue, Mar 24, 2009 at 7:33 AM, Tamir Kamara <tamirkam...@gmail.com> wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > We need to implement a join with a between operator instead
> > > > > > > of an equals. What we are trying to do is search a file for
> > > > > > > a key where the key falls between two fields in the search
> > > > > > > file, like this:
> > > > > > >
> > > > > > > main file (ip, a, b):
> > > > > > > (80, zz, yy)
> > > > > > > (125, vv, bb)
> > > > > > >
> > > > > > > search file (from-ip, to-ip, d, e):
> > > > > > > (52, 75, xxx, yyy)
> > > > > > > (78, 98, aaa, bbb)
> > > > > > > (99, 115, xxx, ddd)
> > > > > > > (125, 130, hhh, aaa)
> > > > > > > (150, 162, qqq, sss)
> > > > > > >
> > > > > > > The outcome should be in the form (ip, a, b, d, e):
> > > > > > > (80, zz, yy, aaa, bbb)
> > > > > > > (125, vv, bb, hhh, aaa)
> > > > > > >
> > > > > > > We could convert the IP ranges in the search file to
> > > > > > > single-record IPs and then do a regular join, but the
> > > > > > > number of single IPs is huge, so this is probably not a
> > > > > > > good way. What would be a good course for doing this in
> > > > > > > Hadoop?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Tamir

--
Alpha Chapters of my book on Hadoop are available:
http://www.apress.com/book/view/9781430219422
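Pulling the reduce-side pieces together: both input files map onto the
sketch key type above before the sort. The comma-split layout and the
"M"/"S" value tags below are illustrative assumptions, not from the
thread:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // One mapper for both inputs: main-file lines "ip,a,b" become
    // 1-part keys, search-file lines "from-ip,to-ip,d,e" become 2-part
    // keys, distinguished here simply by field count. Per jason's
    // advice, an IP known to fall outside every range would just not
    // be collected.
    public class BetweenJoinMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, IpRangeKey, Text> {

      public void map(LongWritable offset, Text line,
                      OutputCollector<IpRangeKey, Text> out,
                      Reporter reporter) throws IOException {
        String[] f = line.toString().split(",");
        IpRangeKey key = new IpRangeKey();
        if (f.length == 3) {        // main file: (ip, a, b)
          key.begin = Long.parseLong(f[0].trim());
          out.collect(key, new Text("M," + f[1] + "," + f[2]));
        } else {                    // search file: (from-ip, to-ip, d, e)
          key.begin = Long.parseLong(f[0].trim());
          key.end = Long.parseLong(f[1].trim());
          key.hasEnd = true;
          out.collect(key, new Text("S," + f[2] + "," + f[3]));
        }
      }
    }

With the comparator, partitioner, and grouping comparator installed,
each reduce call then sees one "S" value carrying (d, e) followed by
the "M" values whose IPs fall in that range, which yields exactly the
(ip, a, b, d, e) rows Tamir asked for.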