That's a good solution. To deal with ranges that span more than one
interval, you have to create multiple "coarse-grained" join keys: one key
for each interval that the range overlaps.
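
To make the idea concrete, here is a minimal sketch in plain Python (not
the Flink API). The names range_keys, range_join and WIDTH are made up for
illustration, and the interval width of 10 is just the value from Alex's
example. A range emits one key per interval it touches, each number gets
exactly one key, and the real predicate is applied as a filter after the
equi-join on the key:

```python
from collections import defaultdict

WIDTH = 10  # assumed interval width, taken from the example in this thread


def range_keys(start, end, width=WIDTH):
    """Return one coarse-grained key per interval that [start, end] touches."""
    return range(start // width, end // width + 1)


def range_join(ranges, numbers, width=WIDTH):
    """Join each (start, end) range with the numbers that fall inside it."""
    # "FlatMap" on dataset one: emit the range once per interval it touches.
    by_key = defaultdict(list)
    for start, end in ranges:
        for key in range_keys(start, end, width):
            by_key[key].append((start, end))

    # "Map" on dataset two: every number belongs to exactly one interval.
    # Equi-join on the coarse key, then filter with the actual predicate.
    result = []
    for n in numbers:
        for start, end in by_key.get(n // width, []):
            if start <= n <= end:
                result.append(((start, end), n))
    return result
```

Because every number carries exactly one key, a range that is replicated
under several keys still matches each qualifying number only once, so no
deduplication step is needed after the filter.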

Cheers,
Till
On Apr 26, 2015 11:22 PM, "Alexander Alexandrov" <
alexander.s.alexand...@gmail.com> wrote:

> I thought about your problem over the weekend. Unfortunately the algorithm
> that you describe does not fit "regular" equi-join semantics, but I think
> it could be "fitted" with a more complex dataflow.
>
> To achieve that, I would partition the (active) domain of the two datasets
> into fine-grained intervals (for the sake of the example, let's say of
> width 10).
>
> You can prepare a "coarse-grained" join key on the inputs using an
> integer-division "x / 10" (Flat)Map:
>
> One: (0, {3,6}), (0, {5,7})
> Two: (0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7)
>
> Upon that you can do a regular join on the "coarse-grained" key (in the
> first component of the tuples), and follow that with a filter that
> evaluates the actual "one.start <= two.number <= one.end" predicate.
>
> Regards,
> Alex
>
>
> 2015-04-24 20:55 GMT+02:00 Kirschnick, Johannes <
> johannes.kirschn...@tu-berlin.de>:
>
> > Hi
> > I have a small problem with doing a custom join, that I would need some
> > help with. Maybe I'm also approaching the problem wrong.
> > So basically I have two datasets.
> > The simplified example: The first one has a start and end value. The
> > second dataset is just a list of ordered numbers and some value (value is
> > ignored in the example)
> > Example
> > One = {3,6},{5,7}
> > Two = 1,2,3,4,5,6,7
> > What I need is a sort of custom join, that brings to the first dataset
> > all elements from the second that are within the range.
> > Something like .. join where one.start <= two.number <= one.end
> > So {3,6} from one would only need to "see" 3,4,5,6
> > Joining does not work out of the box here as the key is sort of "dynamic"
> > depending on the value of one.
> > I can just use a map for the first dataset and broadcast the second into
> > the mapper which can then select the required elements - but my
> > assumption is that the second dataset might actually be very large as
> > well, but the
> > qualifying join "numbers" from two will actually be small.
> > Is there something I could do in this particular case?
> > Thanks a lot
> > Johannes
> >
>
