@Stephan: yes, I would like to contribute (e.g. I can design the interfaces and merge PR 210).
Please reply with more information once you have the branch. I can find some time for that next week (at the expense of FLINK-1347 <https://issues.apache.org/jira/browse/FLINK-1347>, which hopefully can wait 3-4 more weeks).

Regards,
Alex

2015-01-13 16:50 GMT+01:00 Stephan Ewen <se...@apache.org>:

> Hi!
>
> To follow up on what Ufuk explained:
>
> - Ufuk is right, the problem is not getting the data set.
> https://github.com/apache/flink/pull/210 does that for anything that is
> not too gigantic, which is a good start. I think we should merge this as
> soon as we agree on the signature and names of the API methods. We can
> swap the internal realization for something more robust later.
>
> - For anything that just issues a program and wants the result back, this
> is actually perfectly fine.
>
> - For true interactive programs, we need to backtrack to intermediate
> results (rather than to the source) to avoid re-executing large parts.
> This is the biggest missing piece, next to the persistent materialization
> of intermediate results (Ufuk is working on this). The logic is the same
> as for fault tolerance, so it is part of that development.
>
> @alexander: I want to create the feature branch for that on Thursday. Are
> you interested in contributing to that feature?
>
> - For streaming results continuously back, we need a mechanism other than
> the accumulators. Let's create a design doc or thread and get working on
> that. It probably involves adding another set of Akka messages from
> TM -> JM -> Client. Or something like an extension to the BLOB manager
> for streams?
>
> Greetings,
> Stephan
>
>
> On Mon, Jan 12, 2015 at 12:25 PM, Alexander Alexandrov <
> alexander.s.alexand...@gmail.com> wrote:
>
> > Thanks, I am currently looking at the new ExecutionEnvironment API.
> >
> > > I think Stephan is working on the scheduling to support this kind of
> > > programs.
> >
> > @Stephan: is there a feature branch for that somewhere?
> >
> > 2015-01-12 12:05 GMT+01:00 Ufuk Celebi <u...@apache.org>:
> >
> > > Hey Alexander,
> > >
> > > On 12 Jan 2015, at 11:42, Alexander Alexandrov <
> > > alexander.s.alexand...@gmail.com> wrote:
> > >
> > > > Hi there,
> > > >
> > > > I wished for intermediate datasets, and Santa Ufuk made my wishes
> > > > come true (thank you, Santa)!
> > > >
> > > > Now that FLINK-986 is in the mainline, I want to ask some practical
> > > > questions.
> > > >
> > > > In Spark, there is a way to put a value from the local driver into
> > > > the distributed runtime via
> > > >
> > > > val x = env.parallelize(...) // expose x to the distributed runtime
> > > > val y = dataflow(env, x) // y is produced by a dataflow which reads
> > > > from x
> > > >
> > > > and also to get a value from the distributed runtime back to the
> > > > driver:
> > > >
> > > > val z = env.collect("y")
> > > >
> > > > As far as I know, in Flink we have an equivalent for parallelize
> > > >
> > > > val x = env.fromCollection(...)
> > > >
> > > > but not for collect. Is this still the case?
> > >
> > > Yes, but this will change soon.
> > >
> > > > If yes, how hard would it be to add this feature at the moment? Can
> > > > you give me some pointers?
> > >
> > > There is an "alpha" version/hack of this using accumulators. See
> > > https://github.com/apache/flink/pull/210. The problem is that each
> > > collect call results in a new program being executed from the
> > > sources. I think Stephan is working on the scheduling to support this
> > > kind of programs.
> > > From the runtime perspective, it is not a problem to transfer the
> > > produced intermediate results back to the job manager. The job
> > > manager can basically use the same mechanism that the task managers
> > > use. Even the accumulator version should be fine as an initial
> > > version. A sketch of that accumulator approach follows the thread
> > > below.
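
To make the accumulator-based collect idea from the thread concrete, here is a
minimal sketch against the Flink Scala API of that era. It assumes the
ListAccumulator and DiscardingOutputFormat classes from Flink's Java API; the
CollectHelper wrapper and the accumulator name "y" are hypothetical
illustrations, not the actual code of PR 210.

import org.apache.flink.api.common.accumulators.ListAccumulator
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.java.io.DiscardingOutputFormat
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration

// Hypothetical helper (not part of PR 210): pipes every element of a
// DataSet into a named list accumulator, so the driver can read the
// values from the JobExecutionResult after execute() returns.
class CollectHelper[T](name: String) extends RichMapFunction[T, T] {
  private var acc: ListAccumulator[T] = _

  override def open(parameters: Configuration): Unit = {
    acc = new ListAccumulator[T]
    getRuntimeContext.addAccumulator(name, acc)
  }

  override def map(value: T): T = {
    acc.add(value) // shipped back to the client with the accumulator report
    value
  }
}

object CollectSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // driver -> runtime: the existing equivalent of Spark's parallelize
    val x = env.fromCollection(Seq(1, 2, 3, 4))

    // some dataflow producing y
    val y = x.map(_ * 2)

    // runtime -> driver: route y through the accumulator, discard the sink
    y.map(new CollectHelper[Int]("y"))
      .output(new DiscardingOutputFormat[Int])

    val result = env.execute("accumulator collect sketch")
    val z: java.util.List[Int] = result.getAccumulatorResult("y")
    println(z) // [2, 4, 6, 8]
  }
}

As Stephan notes above, this channel only suits results that are not too
gigantic, and each such collect still triggers a fresh execution from the
sources until backtracking to intermediate results lands.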