Thanks, I am currently looking at the new ExecutionEnvironment API.

> I think Stephan is working on the scheduling to support this kind of
> program.
@Stephan: is there a feature branch for that somewhere?

2015-01-12 12:05 GMT+01:00 Ufuk Celebi <u...@apache.org>:

> Hey Alexander,
>
> On 12 Jan 2015, at 11:42, Alexander Alexandrov <alexander.s.alexand...@gmail.com> wrote:
>
> > Hi there,
> >
> > I wished for intermediate datasets, and Santa Ufuk made my wishes come
> > true (thank you, Santa)!
> >
> > Now that FLINK-986 is in the mainline, I want to ask some practical
> > questions.
> >
> > In Spark, there is a way to put a value from the local driver to the
> > distributed runtime via
> >
> > val x = env.parallelize(...) // expose x to the distributed runtime
> > val y = dataflow(env, x) // y is produced by a dataflow which reads from x
> >
> > and also to get a value from the distributed runtime back to the driver
> >
> > val z = env.collect("y")
> >
> > As far as I know, in Flink we have an equivalent for parallelize
> >
> > val x = env.fromCollection(...)
> >
> > but not for collect. Is this still the case?
>
> Yes, but this will change soon.
>
> > If yes, how hard would it be to add this feature at the moment? Can you
> > give me some pointers?
>
> There is an "alpha" version/hack of this using accumulators. See
> https://github.com/apache/flink/pull/210. The problem is that each
> collect call results in a new program being executed from the sources.
> I think Stephan is working on the scheduling to support this kind of
> program. From the runtime perspective, it is not a problem to transfer
> the produced intermediate results back to the job manager. The job
> manager can basically use the same mechanism that the task managers use.
> Even the accumulator version should be fine as an initial version.
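For anyone who wants to experiment while proper collect() support is pending, here is a minimal sketch of the accumulator-based approach in the Scala API. It is illustrative only: the accumulator name "collected-y", the ListAccumulator class, and the DiscardingOutputFormat sink are assumptions on my side, not the exact code from pull request 210. The idea is simply to register an accumulator inside a RichMapFunction, push every element into it, and read the merged result back from the JobExecutionResult on the driver.

import org.apache.flink.api.common.accumulators.ListAccumulator
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.java.io.DiscardingOutputFormat
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration

object CollectViaAccumulators {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // "parallelize" equivalent: ship a local collection to the distributed runtime.
    val x = env.fromCollection(Seq(1, 2, 3, 4))

    // Some dataflow producing y from x.
    val y = x.map(_ * 2)

    // "collect" workaround: funnel every element of y into an accumulator.
    val collected = y.map(new RichMapFunction[Int, Int] {
      private val acc = new ListAccumulator[Int]()

      override def open(parameters: Configuration): Unit = {
        // Register the accumulator under an (illustrative) name.
        getRuntimeContext.addAccumulator("collected-y", acc)
      }

      override def map(value: Int): Int = {
        acc.add(value) // side effect: remember the element
        value
      }
    })

    // A sink is needed only to force execution; the data itself is discarded.
    collected.output(new DiscardingOutputFormat[Int])

    // Executing the program merges the per-task accumulators on the job manager,
    // so the driver can read the full list back from the execution result.
    val result = env.execute("collect via accumulators")
    val z = result.getAccumulatorResult[java.util.ArrayList[Int]]("collected-y")
    println(z)
  }
}

As noted above, the price of this workaround is that every such "collect" is a separate program execution starting from the sources, which is exactly what the scheduling work on intermediate results is meant to avoid.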