[
https://issues.apache.org/jira/browse/FLINK-3910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15299963#comment-15299963
]
Greg Hogan commented on FLINK-3910:
-----------------------------------
[~fhueske] thanks for looking at this idea. The current reduce-based
implementations of {{selfJoin}} only generate pairs from the "strictly upper
triangular matrix", so we never generate {{(x, x)}}, and for each unordered
pair we generate only {{(x, y)}} rather than both {{(x, y)}} and {{(y, x)}}.
If {{selfJoin}} is a new operation then we can retain the same algorithm
performance by outputting {{(x, null)}} pairs and allowing the user to assume
{{(y, x)}} when given {{(x, y)}}.
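To illustrate the "strictly upper triangular" pairing described above, here is a minimal sketch in plain Java. It is not the Gelly implementation; the class name {{UpperTriangularPairs}} and the method {{pairs}} are hypothetical, and the group is assumed to fit in memory as a list.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Flink or Gelly.
final class UpperTriangularPairs {

    // Emits (group[i], group[j]) only for i < j, so (x, x) is never produced
    // and each unordered pair appears exactly once.
    static <T> List<Map.Entry<T, T>> pairs(List<T> group) {
        List<Map.Entry<T, T>> out = new ArrayList<>();
        for (int i = 0; i < group.size(); i++) {
            for (int j = i + 1; j < group.size(); j++) {
                out.add(new SimpleImmutableEntry<>(group.get(i), group.get(j)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Three elements yield three pairs: (a, b), (a, c), (b, c).
        System.out.println(pairs(Arrays.asList("a", "b", "c")));
    }
}
```

A group of n elements yields n(n-1)/2 pairs this way, about half of the n^2 pairs a full cross of the group with itself would produce.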
The second listed method, using a reduce, requires that types implement
{{CopyableValue}} in order to enable object reuse, whereas a custom driver has
access to the serializer.
A third method for {{selfJoin}} is demonstrated in the recently committed
{{JaccardIndex}} using reduceGroup, flatMap, and reduceGroup to obviate data
skew.
A {{SelfJoinFunction}} would be configured with one input type and key set
rather than two as in {{JoinFunction}}. Also, wouldn't {{SelfJoinHint}} be
exclusive of {{JoinHint}}?
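As a rough illustration of the "one input type and key set" shape, the following plain-Java sketch defines a hypothetical {{SelfJoinFunction}} and applies it to each key group. Nothing here is existing Flink API: the interface, the {{selfJoin}} helper, and the key-selector parameter are assumptions made only for this example, and grouping is done in memory rather than by a Flink operator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical interface: a single input type IN, unlike a two-input join function.
interface SelfJoinFunction<IN, OUT> {
    OUT join(IN first, IN second);
}

// Hypothetical in-memory driver, for illustration only.
final class SelfJoinFunctionSketch {

    // Groups the input by a single key selector, then calls the function once
    // per unordered pair within each group (index i < j).
    static <IN, K extends Comparable<K>, OUT> List<OUT> selfJoin(
            List<IN> input,
            Function<IN, K> keySelector,
            SelfJoinFunction<IN, OUT> function) {
        // TreeMap keeps the groups in key order so the output is deterministic.
        Map<K, List<IN>> groups = new TreeMap<>();
        for (IN element : input) {
            groups.computeIfAbsent(keySelector.apply(element), k -> new ArrayList<>())
                  .add(element);
        }
        List<OUT> out = new ArrayList<>();
        for (List<IN> group : groups.values()) {
            for (int i = 0; i < group.size(); i++) {
                for (int j = i + 1; j < group.size(); j++) {
                    out.add(function.join(group.get(i), group.get(j)));
                }
            }
        }
        return out;
    }
}
```

Only one key selector is needed because both sides of the join come from the same input, which is the difference from a two-input join that the paragraph above points out.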
> New self-join operator
> ----------------------
>
> Key: FLINK-3910
> URL: https://issues.apache.org/jira/browse/FLINK-3910
> Project: Flink
> Issue Type: New Feature
> Components: DataSet API, Java API, Scala API
> Affects Versions: 1.1.0
> Reporter: Greg Hogan
> Assignee: Greg Hogan
>
> Flink currently provides inner- and outer-joins as well as cogroup and the
> non-keyed cross. {{JoinOperator}} hints at future support for semi- and
> anti-joins.
> Many Gelly algorithms perform a self-join [0]. Still pending reviews,
> FLINK-3768 performs a self-join on non-skewed data in TriangleListing.java
> and FLINK-3780 performs a self-join on skewed data in JaccardSimilarity.java.
> A {{SelfJoinHint}} will select between skewed and non-skewed implementations.
> The object-reuse-disabled case can be simply handled with a new {{Operator}}.
> The object-reuse-enabled case requires either {{CopyableValue}} types (as in
> the code above) or a custom driver which has access to the serializer (or
> making the serializer accessible to rich functions, and I think there be
> dragons).
> If the idea of a self-join is agreeable, I'd like to work out a rough
> implementation and go from there.
> [0] https://en.wikipedia.org/wiki/Join_%28SQL%29#Self-join