[ https://issues.apache.org/jira/browse/FLINK-3910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15288635#comment-15288635 ]
Fabian Hueske commented on FLINK-3910:
--------------------------------------
I think it is a good idea to have strategies for self-joins. At the moment you
can join a data set with itself in two ways:
- Use a regular join: {{dataset.join(dataset)}}. In this case, Flink treats
the data set as two separate inputs, i.e., depending on the chosen strategy,
it shuffles the data twice, sorts it twice, and possibly buffers it temporarily.
- Use a reduce function and manually implement the join as done in the
TriangleEnumeration example. The problem here is that the join must be
implemented by hand and does not run in managed memory, so it might fail.
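To make the second option concrete, here is a minimal plain-Java sketch (no
Flink dependency) of a hand-written self-join: the input is grouped by key
once and the values of each group are paired up. The class and method names
are hypothetical. Note that it emits each unordered pair once, whereas a
regular equi-join of a data set with itself would emit all ordered pairs. The
per-group list lives on the heap, which mirrors the managed-memory concern.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not Flink code. Records are {key, value} int pairs.
final class SelfJoinSketch {

    /**
     * Joins the input with itself on element [0], grouping it only once.
     * Emits {key, a, b} for every pair of values a, b that share a key,
     * where a appears before b in the input.
     */
    static List<int[]> selfJoin(List<int[]> input) {
        // Single grouping pass (stands in for one shuffle and one sort).
        Map<Integer, List<Integer>> groups = new LinkedHashMap<>();
        for (int[] rec : input) {
            groups.computeIfAbsent(rec[0], k -> new ArrayList<>()).add(rec[1]);
        }

        // Pair up the buffered values within each group.
        List<int[]> out = new ArrayList<>();
        for (Map.Entry<Integer, List<Integer>> group : groups.entrySet()) {
            List<Integer> values = group.getValue(); // heap buffer, unbounded
            for (int i = 0; i < values.size(); i++) {
                for (int j = i + 1; j < values.size(); j++) {
                    out.add(new int[] {group.getKey(), values.get(i), values.get(j)});
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<int[]> edges = new ArrayList<>();
        edges.add(new int[] {1, 10});
        edges.add(new int[] {1, 20});
        edges.add(new int[] {2, 30});
        for (int[] row : selfJoin(edges)) {
            System.out.println(row[0] + ": " + row[1] + ", " + row[2]);
        }
    }
}
```

A group with n values produces n * (n - 1) / 2 pairs, so a single heavily
skewed key dominates both the buffer size and the output size.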
I would not add a dedicated {{selfjoin}} method to {{DataSet}} because a
self-join can be detected automatically when both inputs of a join are
identical. Extending {{JoinHint}} with strategies for self-joins sounds good
to me.
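As a rough illustration of the detection idea, the following plain-Java sketch
picks a strategy by checking whether both join inputs are the same object. All
names are hypothetical: the two enums are stand-ins and not Flink's
{{JoinHint}}, and defaulting to the non-skewed variant is an assumption made
only for this example.

```java
// Hypothetical sketch; not Flink code and not the real JoinHint enum.
final class SelfJoinDetectionSketch {

    /** Stand-in for a hint extended with a self-join option (assumed names). */
    enum Hint { OPTIMIZER_CHOOSES, SELF_JOIN_SKEWED }

    /** Stand-in for the driver strategy that would finally be executed. */
    enum Strategy { REGULAR_JOIN, SELF_JOIN_NON_SKEWED, SELF_JOIN_SKEWED }

    /**
     * Returns a self-join strategy only when both inputs are the very same
     * object; otherwise falls back to a regular two-input join.
     */
    static Strategy choose(Object left, Object right, Hint hint) {
        if (left != right) {
            return Strategy.REGULAR_JOIN; // different inputs: hint does not apply
        }
        // Assumption: without an explicit hint, use the non-skewed variant.
        return hint == Hint.SELF_JOIN_SKEWED
                ? Strategy.SELF_JOIN_SKEWED
                : Strategy.SELF_JOIN_NON_SKEWED;
    }

    public static void main(String[] args) {
        Object dataset = new Object();
        System.out.println(choose(dataset, dataset, Hint.OPTIMIZER_CHOOSES));
        System.out.println(choose(dataset, new Object(), Hint.SELF_JOIN_SKEWED));
    }
}
```

With this shape, user code keeps calling the existing join method and only
the hint changes, so no new {{DataSet}} method is required.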
[~greghogan], can you describe the driver strategies that you are planning to
implement for self-joins? What will characterize the skewed and non-skewed
variants?
> New self-join operator
> ----------------------
>
> Key: FLINK-3910
> URL: https://issues.apache.org/jira/browse/FLINK-3910
> Project: Flink
> Issue Type: New Feature
> Components: DataSet API, Java API, Scala API
> Affects Versions: 1.1.0
> Reporter: Greg Hogan
> Assignee: Greg Hogan
>
> Flink currently provides inner- and outer-joins as well as cogroup and the
> non-keyed cross. {{JoinOperator}} hints at future support for semi- and
> anti-joins.
> Many Gelly algorithms perform a self-join [0]. Still pending reviews,
> FLINK-3768 performs a self-join on non-skewed data in TriangleListing.java
> and FLINK-3780 performs a self-join on skewed data in JaccardSimilarity.java.
> A {{SelfJoinHint}} will select between skewed and non-skewed implementations.
> The object-reuse-disabled case can be simply handled with a new {{Operator}}.
> The object-reuse-enabled case requires either {{CopyableValue}} types (as in
> the code above) or a custom driver which has access to the serializer (or
> making the serializer accessible to rich functions, and I think there be
> dragons).
> If the idea of a self-join is agreeable, I'd like to work out a rough
> implementation and go from there.
> [0] https://en.wikipedia.org/wiki/Join_%28SQL%29#Self-join
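On the object-reuse-enabled case mentioned in the issue description: the
plain-Java sketch below shows why buffered elements must be copied when the
runtime hands out the same instance repeatedly. It does not use Flink; the
{{MutableInt}} type and its {{copy()}} method are hypothetical and only
loosely modeled on what {{CopyableValue}} offers.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical sketch; not Flink code.
final class ObjectReuseSketch {

    /** Mutable value type with an explicit copy method (assumed for the example). */
    static final class MutableInt {
        int value;

        MutableInt(int value) {
            this.value = value;
        }

        MutableInt copy() {
            return new MutableInt(value);
        }
    }

    /** Iterator that returns one shared instance on every call, mimicking object reuse. */
    static Iterator<MutableInt> reusingIterator(int[] data) {
        MutableInt reused = new MutableInt(0);
        return new Iterator<MutableInt>() {
            private int pos = 0;

            @Override
            public boolean hasNext() {
                return pos < data.length;
            }

            @Override
            public MutableInt next() {
                if (pos >= data.length) {
                    throw new NoSuchElementException();
                }
                reused.value = data[pos++]; // overwrite the shared instance
                return reused;
            }
        };
    }

    /** Buffers all elements, optionally copying them, and returns the buffered values. */
    static List<Integer> buffer(Iterator<MutableInt> it, boolean copy) {
        List<MutableInt> buffered = new ArrayList<>();
        while (it.hasNext()) {
            MutableInt v = it.next();
            buffered.add(copy ? v.copy() : v);
        }
        List<Integer> result = new ArrayList<>();
        for (MutableInt v : buffered) {
            result.add(v.value);
        }
        return result;
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3};
        System.out.println(buffer(reusingIterator(data), false)); // [3, 3, 3]
        System.out.println(buffer(reusingIterator(data), true));  // [1, 2, 3]
    }
}
```

Without the copy, every buffered reference points at the same instance, so a
self-join that pairs buffered elements would silently pair the last element
with itself.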