Thanks, I am currently looking at the new ExecutionEnvironment API.

> I think Stephan is working on the scheduling to support this kind of
> program.
@Stephan: is there a feature branch for that somewhere?

2015-01-12 12:05 GMT+01:00 Ufuk Celebi <u...@apache.org>:

> Hey Alexander,
>
> On 12 Jan 2015, at 11:42, Alexander Alexandrov <alexander.s.alexand...@gmail.com> wrote:
>
> > Hi there,
> >
> > I wished for intermediate datasets, and Santa Ufuk made my wishes come
> > true (thank you, Santa)!
> >
> > Now that FLINK-986 is in the mainline, I want to ask some practical
> > questions.
> >
> > In Spark, there is a way to put a value from the local driver to the
> > distributed runtime via
> >
> > val x = env.parallelize(...) // expose x to the distributed runtime
> > val y = dataflow(env, x) // y is produced by a dataflow which reads from x
> >
> > and also to get a value from the distributed runtime back to the driver
> >
> > val z = env.collect("y")
> >
> > As far as I know, in Flink we have an equivalent for parallelize
> >
> > val x = env.fromCollection(...)
> >
> > but not for collect. Is this still the case?
>
> Yes, but this will change soon.
>
> > If yes, how hard would it be to add this feature at the moment? Can you
> > give me some pointers?
>
> There is an "alpha" version/hack of this using accumulators. See
> https://github.com/apache/flink/pull/210. The problem is that each
> collect call results in a new program being executed from the sources.
> I think Stephan is working on the scheduling to support this kind of
> program. From the runtime perspective, it is not a problem to transfer
> the produced intermediate results back to the job manager. The job
> manager can basically use the same mechanism that the task managers use.
> Even the accumulator version should be fine as an initial version.
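For anyone who wants to experiment while proper collect() support is pending, here is a minimal sketch of the accumulator-based approach in the Scala API. It is illustrative only: the accumulator name "collected-y", the ListAccumulator class, and the DiscardingOutputFormat sink are assumptions on my side, not the exact code from pull request 210. The idea is simply to register an accumulator inside a RichMapFunction, push every element into it, and read the merged result back from the JobExecutionResult on the driver.

import org.apache.flink.api.common.accumulators.ListAccumulator
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.java.io.DiscardingOutputFormat
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration

object CollectViaAccumulators {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // "parallelize" equivalent: ship a local collection to the distributed runtime.
    val x = env.fromCollection(Seq(1, 2, 3, 4))

    // Some dataflow producing y from x.
    val y = x.map(_ * 2)

    // "collect" workaround: funnel every element of y into an accumulator.
    val collected = y.map(new RichMapFunction[Int, Int] {
      private val acc = new ListAccumulator[Int]()

      override def open(parameters: Configuration): Unit = {
        // Register the accumulator under an (illustrative) name.
        getRuntimeContext.addAccumulator("collected-y", acc)
      }

      override def map(value: Int): Int = {
        acc.add(value) // side effect: remember the element
        value
      }
    })

    // A sink is needed only to force execution; the data itself is discarded.
    collected.output(new DiscardingOutputFormat[Int])

    // Executing the program merges the per-task accumulators on the job manager,
    // so the driver can read the full list back from the execution result.
    val result = env.execute("collect via accumulators")
    val z = result.getAccumulatorResult[java.util.ArrayList[Int]]("collected-y")
    println(z)
  }
}

As noted above, the price of this workaround is that every such "collect" is a separate program execution starting from the sources, which is exactly what the scheduling work on intermediate results is meant to avoid.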