Hi everyone,

I just wanted to give you a pointer to FLINK-1038:
https://github.com/apache/flink/pull/94
This is an output format that can send DataSet contents via Java RMI to, e.g., the driver program. I am currently using it a lot, and it seems to scale pretty well.
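
For illustration, usage looks roughly like this. This is a minimal sketch in Scala; the RemoteCollectorImpl class and its collectLocal/shutdownAll methods are my recollection of the PR, so treat the exact names and signatures as assumptions and check the PR for the real API:

    import org.apache.flink.api.scala._

    val env = ExecutionEnvironment.getExecutionEnvironment
    val data = env.fromElements(1, 2, 3).map(_ * 2)

    // Hypothetical API from the PR: register a driver-local collection;
    // the output format streams the DataSet contents into it via RMI
    // while the job runs.
    val results = new java.util.ArrayList[Int]()
    RemoteCollectorImpl.collectLocal(data, results)

    env.execute()
    // 'results' is now filled on the driver side; tear down the RMI registry.
    RemoteCollectorImpl.shutdownAll()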

Cheers,
Sebastian

-----Original Message-----
From: Ufuk Celebi [mailto:u...@apache.org] 
Sent: Monday, January 12, 2015 12:06
To: dev@flink.apache.org
Subject: Re: Gather a distributed dataset

Hey Alexander,

On 12 Jan 2015, at 11:42, Alexander Alexandrov 
<alexander.s.alexand...@gmail.com> wrote:

> Hi there,
> 
> I wished for intermediate datasets, and Santa Ufuk made my wishes come 
> true (thank you, Santa)!
> 
> Now that FLINK-986 is in the mainline, I want to ask some practical 
> questions.
> 
> In Spark, there is a way to put a value from the local driver to the 
> distributed runtime via
> 
> val x = env.parallelize(...) // expose x to the distributed runtime
> val y = dataflow(env, x)     // y is produced by a dataflow which reads from x
> 
> and also to get a value from the distributed runtime back to the 
> driver
> 
> val z = env.collect("y")
> 
> As far as I know, in Flink we have an equivalent for parallelize
> 
> val x = env.fromCollection(...)
> 
> but not for collect. Is this still the case?

Yes, but this will change soon.

> If yes, how hard would it be to add this feature at the moment? Can 
> you give me some pointers?

There is an "alpha" version/hack of this using accumulators. See 
https://github.com/apache/flink/pull/210. The problem is that each collect call 
results in a new program being executed from the sources. I think Stephan is 
working on the scheduling to support this kind of program. From the runtime 
perspective, it is not a problem to transfer the produced intermediate results 
back to the job manager; the job manager can basically use the same mechanism 
that the task managers use. Even the accumulator version should be fine as an 
initial version.
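
For reference, a collect-via-accumulators version could look roughly like this in Scala. This is a minimal sketch, not the code from the PR; it assumes ListAccumulator and DiscardingOutputFormat are available and uses an identity map purely to copy every element into the accumulator, which ships back to the driver with the job result:

    import org.apache.flink.api.common.accumulators.ListAccumulator
    import org.apache.flink.api.common.functions.RichMapFunction
    import org.apache.flink.api.java.io.DiscardingOutputFormat
    import org.apache.flink.api.scala._
    import org.apache.flink.configuration.Configuration

    val env = ExecutionEnvironment.getExecutionEnvironment
    val y = env.fromElements("a", "b", "c")

    // Identity mapper that copies each element into an accumulator.
    y.map(new RichMapFunction[String, String] {
      private var acc: ListAccumulator[String] = _
      override def open(parameters: Configuration): Unit = {
        acc = new ListAccumulator[String]()
        getRuntimeContext.addAccumulator("collect-y", acc)
      }
      override def map(value: String): String = {
        acc.add(value)
        value
      }
    }).output(new DiscardingOutputFormat[String])

    // Executing the program delivers the accumulator contents to the driver,
    // but note the drawback above: this re-executes the dataflow from the sources.
    val result = env.execute("collect via accumulators")
    val z: java.util.List[String] = result.getAccumulatorResult("collect-y")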
