Github user hvanhovell commented on the pull request:
https://github.com/apache/spark/pull/7379#issuecomment-122333357
No problem.
### Supporting N-Ary Predicates.
In order to make the range join work we need the predicates to define a
single interval for each side of the join. For instance the clause: ```a.low <
b.high && b.low < a.high``` implies that there are two intervals: [a.low,
a.high] & [b.low, b.high]. An open interval, for instance ```a.low < b.high```,
would also work.
When we use more than two clauses, we can potentially have multiple
intervals, in your example for instance ```a.key < b.key and a.key2 > b.key2
and a.key3>=b.key3``` would yield the following intervals: [a.key1, a.key2],
[a.key1, a.key3], [b.key2, b.key1] & [b.key2, b.key3]. Creating a working
index, that can deal with the (partially) uncorrelated intervals, will be quite
a challenge (I haven't really looked into this yet). We could offcourse pick
join on one pair of intervals and use filtering to take of the rest.
I think the Unary and Binary cases are the most common. Let's start there,
and see if there is demand for N-ary designs.
### Generalization
If you consider the fact that we are joining intervals (Ranges if you
will), range partitioning will not work because this assumes both intervals
will be entirely in the same partition (they can span multiple partitions).
When dealing with larger tables we would have to use a special interval-aware
partitioning, this would create partitions for a number of fully covering
non-overlapping intervals, and would multicast the rows to each interval it
belongs to. The subsequent step would be using an index or doing a
cartesian/BNL join.
Doing a Cartesian Join in a single partition performs horrible. I thought
it wouldn't be a problem either, but this completely killed the performance of
an analysis I was doing for a client (account balances at specific dates).
I do see opportunities for code re-use. But this would be by generalizing
HashedRelation and the BroadCast join family.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]