[ 
https://issues.apache.org/jira/browse/FLINK-3910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15302853#comment-15302853
 ] 

Greg Hogan commented on FLINK-3910:
-----------------------------------

I see this case as part of fleshing out the remaining join operators. Flink 
could get by with {{map}}, {{reduce}}, and {{join}}, but we are kindly given 
additional operators for clarity and performance. Outer joins have been quite 
useful even though we could instead use {{coGroup}}. Anti- and semi-joins 
would be similarly useful but are for now just comments in the code.

{{selfJoin}} can have a large impact on performance. For a key with {{n}} 
elements, a {{reduce}} does {{O(n)}} work but a self-join produces {{O(n^2)}} 
pairs, so data skew has a much larger effect.
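
To make the O(n) versus O(n^2) comparison concrete, here is a minimal sketch 
in plain Java (no Flink dependency). The class {{SelfJoinCost}} and its method 
names are hypothetical and only count work per key; this is not the proposed 
operator. It assumes the self-join emits each unordered pair of distinct 
elements once per key.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SelfJoinCost {
    // Elements visited by a per-key reduce: each element is touched once.
    static long reduceWork(Map<String, Long> keyCounts) {
        long total = 0;
        for (long n : keyCounts.values()) {
            total += n;
        }
        return total;
    }

    // Unordered pairs emitted by a per-key self-join: n * (n - 1) / 2 per key.
    static long selfJoinPairs(Map<String, Long> keyCounts) {
        long total = 0;
        for (long n : keyCounts.values()) {
            total += n * (n - 1) / 2;
        }
        return total;
    }

    // 1000 elements spread evenly over 10 keys.
    static Map<String, Long> uniform() {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (int i = 0; i < 10; i++) {
            counts.put("k" + i, 100L);
        }
        return counts;
    }

    // 1000 elements over 10 keys, with one hot key holding 910 of them.
    static Map<String, Long> skewed() {
        Map<String, Long> counts = new LinkedHashMap<>();
        counts.put("hot", 910L);
        for (int i = 0; i < 9; i++) {
            counts.put("k" + i, 10L);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println("uniform reduce work:     " + reduceWork(uniform()));    // 1000
        System.out.println("skewed reduce work:      " + reduceWork(skewed()));     // 1000
        System.out.println("uniform self-join pairs: " + selfJoinPairs(uniform())); // 49500
        System.out.println("skewed self-join pairs:  " + selfJoinPairs(skewed()));  // 414000
    }
}
```

With the same 1000 input elements, the reduce work is unchanged by skew, 
while the number of self-join pairs grows by roughly a factor of eight in 
this made-up distribution, nearly all of it on the single hot key.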

How would extension classes contrast with simply marking methods as 
{{@PublicEvolving}}?

I do see that it may be desirable to defer major features to the next release 
when there is insufficient time for them to settle.


> New self-join operator
> ----------------------
>
>                 Key: FLINK-3910
>                 URL: https://issues.apache.org/jira/browse/FLINK-3910
>             Project: Flink
>          Issue Type: New Feature
>          Components: DataSet API, Java API, Scala API
>    Affects Versions: 1.1.0
>            Reporter: Greg Hogan
>            Assignee: Greg Hogan
>
> Flink currently provides inner- and outer-joins as well as cogroup and the 
> non-keyed cross. {{JoinOperator}} hints at future support for semi- and 
> anti-joins.
> Many Gelly algorithms perform a self-join [0]. Still pending reviews, 
> FLINK-3768 performs a self-join on non-skewed data in TriangleListing.java 
> and FLINK-3780 performs a self-join on skewed data in JaccardSimilarity.java. 
> A {{SelfJoinHint}} will select between skewed and non-skewed implementations.
> The object-reuse-disabled case can be simply handled with a new {{Operator}}. 
> The object-reuse-enabled case requires either {{CopyableValue}} types (as in 
> the code above) or a custom driver which has access to the serializer (or 
> making the serializer accessible to rich functions, and I think there be 
> dragons).
> If the idea of a self-join is agreeable, I'd like to work out a rough 
> implementation and go from there.
> [0] https://en.wikipedia.org/wiki/Join_%28SQL%29#Self-join



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
