If the search file data set is large, the issue becomes ensuring that only the required portion of the search file is actually read, and that those reads happen in the search file's key order.
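For instance, if the search file were written as a MapFile keyed by from-ip (a hypothetical layout, with the to-ip and the payload fields tab-separated in the value), getClosest() with before=true uses the MapFile index to seek straight to the single candidate range instead of scanning the whole file. A minimal sketch:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

// Hypothetical layout: MapFile keyed by from-ip (LongWritable), value is
// "to-ip<TAB>d<TAB>e". Ranges are assumed sorted and non-overlapping.
public class RangeProbe {
  // Returns "d<TAB>e" for the range containing ip, or null if none does.
  public static String lookup(MapFile.Reader reader, long ip) throws IOException {
    LongWritable key = new LongWritable(ip);
    Text value = new Text();
    // before=true seeks (via the index) to the greatest from-ip <= ip,
    // so only that one entry is actually read.
    if (reader.getClosest(key, value, true) == null)
      return null;                 // ip is below every range
    String[] f = value.toString().split("\t");
    if (ip > Long.parseLong(f[0]))
      return null;                 // ip falls in a gap between ranges
    return f[1] + "\t" + f[2];
  }
}

The same index-driven seek is what step a) below would rely on to bound the scan.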
If the data set is small, most any of the common patterns will work. I haven't looked at Pig for a while; does Pig now use the indexes in map files, and take into account that a data set is sorted? (A sketch of the small-file pattern follows the quoted thread below.)

Out of the box, the map-side join code, org.apache.hadoop.mapred.join, will do a decent job of this, but the entire search file set will be read. To stop reading the entire search file, a record reader or join type would need to be put together to:

a) skip to the first key of interest, using the index if available, and
b) finish when the last possible key of interest has been delivered.

On Wed, Mar 25, 2009 at 6:05 AM, John Lee <j.benlin....@gmail.com> wrote:
> In addition to other suggestions, you could also take a look at
> building a Cascading job with a custom Joiner class.
>
> - John
>
> On Tue, Mar 24, 2009 at 7:33 AM, Tamir Kamara <tamirkam...@gmail.com> wrote:
> > Hi,
> >
> > We need to implement a join with a between operator instead of an equals.
> > What we are trying to do is search a file for a key, where the key falls
> > between two fields in the search file, like this:
> >
> > main file (ip, a, b):
> > (80, zz, yy)
> > (125, vv, bb)
> >
> > search file (from-ip, to-ip, d, e):
> > (52, 75, xxx, yyy)
> > (78, 98, aaa, bbb)
> > (99, 115, xxx, ddd)
> > (125, 130, hhh, aaa)
> > (150, 162, qqq, sss)
> >
> > The outcome should be in the form (ip, a, b, d, e):
> > (80, zz, yy, aaa, bbb)
> > (125, vv, bb, hhh, aaa)
> >
> > We could convert the ip ranges in the search file to single-record ips
> > and then do a regular join, but the number of single ips is huge and
> > this is probably not a good way.
> > What would be a good course for doing this in Hadoop?
> >
> > Thanks,
> > Tamir

--
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
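For the small search file case mentioned at the top of this reply, the usual pattern is a replicated (map-side) join: ship the search file to every mapper, load it into a sorted map once, and resolve each ip with a floor lookup. A minimal sketch, assuming plain comma-separated records and non-overlapping ranges (class and method names are hypothetical):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory between-join: keyed by from-ip; ranges are
// assumed non-overlapping, so floorEntry() finds the only candidate.
public class BetweenJoin {
  private final TreeMap<Long, String[]> ranges = new TreeMap<Long, String[]>();

  // Load "from-ip, to-ip, d, e" records, e.g. from a DistributedCache copy.
  public void load(String searchFile) throws IOException {
    BufferedReader in = new BufferedReader(new FileReader(searchFile));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        String[] f = line.split("\\s*,\\s*");
        ranges.put(Long.valueOf(f[0]), f);
      }
    } finally {
      in.close();
    }
  }

  // Returns {from-ip, to-ip, d, e} for the range containing ip, or null.
  public String[] lookup(long ip) {
    Map.Entry<Long, String[]> e = ranges.floorEntry(ip);
    if (e == null) return null;    // ip is below every range
    return ip <= Long.parseLong(e.getValue()[1]) ? e.getValue() : null;
  }
}

In a mapper, load() would run once in configure() against a copy shipped via the DistributedCache; lookup(80) then returns the (78, 98, aaa, bbb) range, giving the (80, zz, yy, aaa, bbb) tuple from Tamir's example.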