I would rewrite this to replicate the list into tuples: "foreach x in list: emit (x, list)" Then join on fields 0.
This replicates the lists, but makes the join very efficient. On Fri, Jul 17, 2015 at 12:26 AM, Fabian Hueske <fhue...@gmail.com> wrote: > Hi Martin, > > good to hear that you like Flink :-) > AFAIK, there are no plans to add a containment join. The Flink community > is currently working on adding support for outer joins. > Regarding a containment join, I am not sure about the number of use cases. > I would rather try to implement it on top of Flink's batch API instead of > adding it as an internal feature/operator to the system because this would > touch a lot of things (API, optimizer, operator implementation). > > There might be better ways to implement a containment join than using a > cross and a filter. > - Do you know a distributed algorithm for containment joins? Maybe it can > be implemented with Flink's API. > - I guess, you are implementing a generic graph framework, but can you > make certain assumptions about the data such as relative sizes of the > inputs or avg/max size of the lists, etc.? > > Contributions to Gelly (and Flink in general) are highly welcome. > > Best, Fabian > > > 2015-07-16 9:39 GMT+02:00 Martin Junghanns <martin.jungha...@gmx.net>: > >> Hi everyone, >> >> at first, thanks for building this great framework! We are using Flink >> and especially Gelly for building a graph analytics stack (gradoop.com). >> >> I was wondering if there is a [planned] support for a containment join >> operator. Consider the following example: >> >> DataSet<List<Int>> left := {[0, 1], [2, 3, 4], [5]} >> DataSet<Tuple2<Int, Int>> right := {<0, 1>, <1, 0>, <2, 1>, <5, 2>} >> >> What I want to compute is >> >> left.join(right).where(list).contains(tuple.f0) := >> >> { >> <[0, 1], <0,1>>, <[0, 1], <1, 0>>, >> <[2, 3, 4], <2, 1>>, >> <[5], <5, 2> >> } >> >> At the moment, I am solving that using cross and filter, which can be >> expensive. >> >> The generalization of that operator would be "set containment join", >> where you join if the right set is contained in the left set. >> >> If there is a general need for that operator, I would also like to >> contribute to its implementation. >> >> But maybe, there is already another nice solution which I didn't >> discover yet? >> >> Any help would be appreciated. Especially since I would also like to >> contribute some of our graph operators (e.g., graph summarization) back >> to Flink/Gelly (current WIP state can be found here: [1]). >> >> Thanks, >> >> Martin >> >> >> [1] >> >> https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/main/java/org/gradoop/model/impl/operators/Summarization.java >> >> >> >