[
https://issues.apache.org/jira/browse/FLINK-3910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15302656#comment-15302656
]
Stephan Ewen commented on FLINK-3910:
-------------------------------------
Let me play devil's advocate here: Every time we add new specialized
constructs, we make our lives harder by committing to maintaining them.
There are two things to think about:
- How much difference do these three self-join variants make? Enough to have
them in the core? Or enough to have them in Utils or Gelly directly?
- How deeply do we want to embed them in the DataSet API? Anything added is
virtually impossible to remove, and a very long-term commitment.
Given the plethora of things one could add to the APIs, the cost of
maintenance, and the benefit of a concise API, I think we need to start
thinking about a staging process.
We could add constructs not directly to the API, but to something like
extension classes. Then, once a construct gets used very often (let's track this
by users who look for that operation but do not find the extension class), we
start moving it to the core API.
That would sort of act as a steering process that leads to having the common
and frequent operations on the core API, and the more specialized ones in
extension classes.
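To make the idea of an extension class concrete, here is a rough sketch. It uses plain Java collections rather than the DataSet API, and the names SelfJoinExtensions and selfJoin are hypothetical, not existing Flink code. The point is only that the operation lives in a separate utility class that can evolve or be dropped without touching the core API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical extension class: kept outside the core API so it stays cheap to change. */
final class SelfJoinExtensions {
    private SelfJoinExtensions() {}

    /** Pairs up distinct elements that share a key; each unordered pair is emitted once. */
    static <T, K> List<List<T>> selfJoin(List<T> input, Function<T, K> keyOf) {
        // Group elements by key, preserving input order within each group.
        Map<K, List<T>> groups = new LinkedHashMap<>();
        for (T element : input) {
            groups.computeIfAbsent(keyOf.apply(element), k -> new ArrayList<>()).add(element);
        }
        // Within each group, emit every pair (i, j) with i < j.
        List<List<T>> pairs = new ArrayList<>();
        for (List<T> group : groups.values()) {
            for (int i = 0; i < group.size(); i++) {
                for (int j = i + 1; j < group.size(); j++) {
                    pairs.add(List.of(group.get(i), group.get(j)));
                }
            }
        }
        return pairs;
    }
}
```

A caller would invoke SelfJoinExtensions.selfJoin(data, keyFunction) explicitly, which also makes it easy to see how often the extension is actually used before promoting it.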
What do you think?
> New self-join operator
> ----------------------
>
> Key: FLINK-3910
> URL: https://issues.apache.org/jira/browse/FLINK-3910
> Project: Flink
> Issue Type: New Feature
> Components: DataSet API, Java API, Scala API
> Affects Versions: 1.1.0
> Reporter: Greg Hogan
> Assignee: Greg Hogan
>
> Flink currently provides inner- and outer-joins as well as cogroup and the
> non-keyed cross. {{JoinOperator}} hints at future support for semi- and
> anti-joins.
> Many Gelly algorithms perform a self-join [0]. Still pending reviews,
> FLINK-3768 performs a self-join on non-skewed data in TriangleListing.java
> and FLINK-3780 performs a self-join on skewed data in JaccardSimilarity.java.
> A {{SelfJoinHint}} will select between skewed and non-skewed implementations.
> The object-reuse-disabled case can be simply handled with a new {{Operator}}.
> The object-reuse-enabled case requires either {{CopyableValue}} types (as in
> the code above) or a custom driver which has access to the serializer (or
> making the serializer accessible to rich functions, and I think there be
> dragons).
> If the idea of a self-join is agreeable, I'd like to work out a rough
> implementation and go from there.
> [0] https://en.wikipedia.org/wiki/Join_%28SQL%29#Self-join