Hi Johann, hi Ricky, Thanks for reaching out to the mailing list before taking action!
I do also prefer option c. In principle, all inner join strategies can also be applied for all outer joins (for some hash strategies, a special HashTable implementation is required). I propose to add two methods for each join type, with and without JoinHint. The JoinHint variant should fail for strategies which are not applicable, yet (all hash-based). You are right that there is quite a bit of duplicated code in the API code. It could really use some refactoring and deduplication. However, I see this as a separate issue and would also go for the separate operator approach in the beginning. Maybe you can implement a common operator base for all outer joins. If you are interested, you can help refactoring the API as a follow up issue ;-) Cheers, Fabian 2015-09-01 18:34 GMT+02:00 Till Rohrmann <trohrm...@apache.org>: > Hi Johann, > > I'd prefer 1.c, because the different join variants are semantically > different and this should be IMO reflected in the API. Moreover, the > `JoinHints` are used to give hints for the selection of the underlying > strategy for the different join variants. For `leftOuterJoin` you could > either use a sort-merge-join or a hash-join where the left side is the > probe side. For the `rightOuterJoin` you will have different strategies: > Either you use a sort-merge-join or a hash-join where the right side is the > probe side. Thus, the `JoinHints` for both variants cannot be the same. > > Since I'm not too familiar with the code base of the join operator > implementation, I cannot give you a really good advice for question 2. In > general it would be good to avoid as much redundant code as possible since > it makes the code base harder to maintain. But if this should entail a > disproportional larger work effort, then I think it's also fine to follow > the CoGroup way. > > Cheers, > Till > > On Tue, Sep 1, 2015 at 6:02 PM, Johann Kovacs <m...@jkovacs.de> wrote: > > > Hi all, > > > > we (Ricky and I) are currently working on the outer join > > implementation for Flink (FLINK-687, previous pull requests #907, > > #1052). > > > > I am now looking for advice on 2 issues specifically regarding the > > integration of the outer join operator with the DataSet API > > (FLINK-2576). > > > > 1. There are several options of exposing the operator to the user via > > the DataSet API and I'd just like to hear your preferences between the > > following options (or other suggestions if I missed something): > > a. DataSet#outerJoin(DataSet other, OuterJoinType outerJoinType) > > [i.e. asking the user to pass an enum left-, right-, or full outer > > join] > > b. DataSet#join(DataSet other, JoinType joinType) [i.e. like option > > a, but generalized to work for all: inner-, left-, right-, full outer > > joins] > > c. DataSet#left/right/fullOuterJoin(DataSet other) [i.e. a fully > > qualified method for each operator] > > > > Personally I'm partial towards options a and c, although a does have > > the advantage of not blowing up the API too much (imagine adding > > additional optional parameters, such as JoinHint, to each of option > > c's methods). > > > > 2. I would have liked to implement the outer join operator API by > > reusing as much code & functionality as possible from > > org.apache.flink.api.java.operators.JoinOperator and JoinOperatorBase > > (especially all the KeySelector, semantic annotations, and tuple > > unwrapping stuff...) but I feel like this would bite me sooner or > > later due to incompatibilities or other minor differences between the > > behaviour of those operators. > > I imagine this is the reason why lots of this functionality was > > duplicated for the CoGroup operator implementation. Which makes me > > think I should probably go the same route and duplicate the necessary > > APIs, and then maybe try to refactor later? > > Any opinions or hints regarding this? > > > > Thanks in advance, > > Johann > > >