@Stephan: yes, I would like to contribute (e.g. I can design the interfaces and merge PR 210).
Please reply with more information once you have the branch. I can find some time for that next week (at the expense of FLINK-1347 <https://issues.apache.org/jira/browse/FLINK-1347>, which hopefully can wait 3-4 more weeks).

Regards,
Alex

2015-01-13 16:50 GMT+01:00 Stephan Ewen <se...@apache.org>:

> Hi!
>
> To follow up on what Ufuk explained:
>
> - Ufuk is right, the problem is not getting the data set.
> https://github.com/apache/flink/pull/210 does that for anything that is
> not too gigantic, which is a good start. I think we should merge this as
> soon as we agree on the signature and names of the API methods. We can
> swap the internal realization for something more robust later.
>
> - For anything that just issues a program and wants the result back, this
> is actually perfectly fine.
>
> - For true interactive programs, we need to backtrack to intermediate
> results (rather than to the source) to avoid re-executing large parts.
> This is the biggest missing piece, next to the persistent materialization
> of intermediate results (Ufuk is working on this). The logic is the same
> as for fault tolerance, so it is part of that development.
>
> @alexander: I want to create the feature branch for that on Thursday. Are
> you interested in contributing to that feature?
>
> - For streaming results continuously back, we need a mechanism other than
> the accumulators. Let's create a design doc or thread and get working on
> that. It probably involves adding another set of Akka messages from
> TM -> JM -> Client. Or something like an extension to the BLOB manager
> for streams?
>
> Greetings,
> Stephan
>
>
> On Mon, Jan 12, 2015 at 12:25 PM, Alexander Alexandrov <
> alexander.s.alexand...@gmail.com> wrote:
>
> > Thanks, I am currently looking at the new ExecutionEnvironment API.
> >
> > > I think Stephan is working on the scheduling to support this kind of
> > > programs.
> >
> > @Stephan: is there a feature branch for that somewhere?
> >
> > 2015-01-12 12:05 GMT+01:00 Ufuk Celebi <u...@apache.org>:
> >
> > > Hey Alexander,
> > >
> > > On 12 Jan 2015, at 11:42, Alexander Alexandrov <
> > > alexander.s.alexand...@gmail.com> wrote:
> > >
> > > > Hi there,
> > > >
> > > > I wished for intermediate datasets, and Santa Ufuk made my wishes
> > > > come true (thank you, Santa)!
> > > >
> > > > Now that FLINK-986 is in the mainline, I want to ask some practical
> > > > questions.
> > > >
> > > > In Spark, there is a way to put a value from the local driver into
> > > > the distributed runtime via
> > > >
> > > > val x = env.parallelize(...) // expose x to the distributed runtime
> > > > val y = dataflow(env, x) // y is produced by a dataflow which reads
> > > > from x
> > > >
> > > > and also to get a value from the distributed runtime back to the
> > > > driver:
> > > >
> > > > val z = env.collect("y")
> > > >
> > > > As far as I know, in Flink we have an equivalent for parallelize
> > > >
> > > > val x = env.fromCollection(...)
> > > >
> > > > but not for collect. Is this still the case?
> > >
> > > Yes, but this will change soon.
> > >
> > > > If yes, how hard would it be to add this feature at the moment? Can
> > > > you give me some pointers?
> > >
> > > There is an "alpha" version/hack of this using accumulators. See
> > > https://github.com/apache/flink/pull/210. The problem is that each
> > > collect call results in a new program being executed from the
> > > sources. I think Stephan is working on the scheduling to support this
> > > kind of programs.
> > > From the runtime perspective, it is not a problem to transfer the
> > > produced intermediate results back to the job manager. The job
> > > manager can basically use the same mechanism that the task managers
> > > use. Even the accumulator version should be fine as an initial
> > > version. A sketch of that accumulator approach follows the thread
> > > below.
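
To make the accumulator-based collect idea from the thread concrete, here is a
minimal sketch against the Flink Scala API of that era. It assumes the
ListAccumulator and DiscardingOutputFormat classes from Flink's Java API; the
CollectHelper wrapper and the accumulator name "y" are hypothetical
illustrations, not the actual code of PR 210.

import org.apache.flink.api.common.accumulators.ListAccumulator
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.java.io.DiscardingOutputFormat
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration

// Hypothetical helper (not part of PR 210): pipes every element of a
// DataSet into a named list accumulator, so the driver can read the
// values from the JobExecutionResult after execute() returns.
class CollectHelper[T](name: String) extends RichMapFunction[T, T] {
  private var acc: ListAccumulator[T] = _

  override def open(parameters: Configuration): Unit = {
    acc = new ListAccumulator[T]
    getRuntimeContext.addAccumulator(name, acc)
  }

  override def map(value: T): T = {
    acc.add(value) // shipped back to the client with the accumulator report
    value
  }
}

object CollectSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // driver -> runtime: the existing equivalent of Spark's parallelize
    val x = env.fromCollection(Seq(1, 2, 3, 4))

    // some dataflow producing y
    val y = x.map(_ * 2)

    // runtime -> driver: route y through the accumulator, discard the sink
    y.map(new CollectHelper[Int]("y"))
      .output(new DiscardingOutputFormat[Int])

    val result = env.execute("accumulator collect sketch")
    val z: java.util.List[Int] = result.getAccumulatorResult("y")
    println(z) // [2, 4, 6, 8]
  }
}

As Stephan notes above, this channel only suits results that are not too
gigantic, and each such collect still triggers a fresh execution from the
sources until backtracking to intermediate results lands.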