That's a good solution. To handle ranges that span more than one interval, you have to emit multiple "coarse-grained" join keys for them: one key for each interval the range overlaps.
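A rough sketch of the idea in plain Python (not the Flink API), assuming an interval width of 10 as in the example and integer division for the coarse key. The names `range_keys`, `number_key` and `range_join` are made up for illustration, and the extra range (8, 12) is only added to show a range that crosses an interval boundary:

```python
from collections import defaultdict

BUCKET = 10  # assumed interval width, as in the example


def range_keys(start, end, bucket=BUCKET):
    """Emit one (key, (start, end)) pair per interval the range overlaps."""
    for key in range(start // bucket, end // bucket + 1):
        yield key, (start, end)


def number_key(number, bucket=BUCKET):
    """Each number falls into exactly one interval."""
    return number // bucket, number


def range_join(ranges, numbers, bucket=BUCKET):
    # "FlatMap" side: a range spanning several intervals gets several keys.
    by_key = defaultdict(list)
    for start, end in ranges:
        for key, rng in range_keys(start, end, bucket):
            by_key[key].append(rng)

    # Equi-join on the coarse key, then filter with the exact predicate.
    result = []
    for n in numbers:
        key, _ = number_key(n, bucket)
        for start, end in by_key.get(key, []):
            if start <= n <= end:
                result.append(((start, end), n))
    return result


# (8, 12) overlaps intervals 0 and 1, so it is emitted under both keys.
pairs = range_join([(3, 6), (5, 7), (8, 12)], range(1, 15))
```

Since every number carries exactly one key, a range that is emitted under several keys still matches each number at most once, so no duplicate elimination is needed after the filter.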
Cheers,
Till

On Apr 26, 2015 11:22 PM, "Alexander Alexandrov" <alexander.s.alexand...@gmail.com> wrote:

> I thought about your problem over the weekend. Unfortunately the algorithm
> that you describe does not fit "regular" equi-join semantics, but I think
> it could be "fitted" with a more complex dataflow.
>
> To achieve that, I would partition the (active) domain of the two datasets
> into fine-granular intervals (for the sake of the example, let's say of
> size 10).
>
> You can prepare a "coarse-grained" join key on the inputs using an "x / 10"
> (integer division) (Flat)Map:
>
> One: (0, {3,6}), (0, {5,7})
> Two: (0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7)
>
> Upon that you can do a regular join on the "coarse-grained" key (in the
> first component of the tuples), and follow that with a filter that
> evaluates the actual "one.start <= two.number <= one.end" predicate.
>
> Regards,
> Alex
>
>
> 2015-04-24 20:55 GMT+02:00 Kirschnick, Johannes <
> johannes.kirschn...@tu-berlin.de>:
>
> > Hi
> > I have a small problem with doing a custom join that I would need some
> > help with. Maybe I'm also approaching the problem wrong.
> > So basically I have two datasets.
> > The simplified example: The first one has a start and end value. The
> > second dataset is just a list of ordered numbers and some value (value is
> > ignored in the example)
> > Example
> > One = {3,6},{5,7}
> > Two = 1,2,3,4,5,6,7
> > What I need is a sort of custom join that brings to the first dataset all
> > elements from the second that are within the range.
> > Something like .. join where one.start <= two.number <= one.end
> > So {3,6} from one would only need to "see" 3,4,5,6
> > Joining does not work out of the box here as the key is sort of "dynamic"
> > depending on the value of one.
> > I can just use a map for the first dataset and broadcast the second into
> > the mapper which can then select the required elements - but my assumption
> > is that the second dataset might actually be very large as well, but the
> > qualifying join "numbers" from two will actually be small.
> > Is there something I could do in this particular case?
> > Thanks a lot
> > Johannes