API behavior with data sinks (lazy) and eager operations

Stephan Ewen Sun, 18 Jan 2015 16:44:11 -0800

Hi there!

With the upcoming, more interactive extensions to the API (operations that
return results from a program to the client and therefore need to be
evaluated eagerly), we need to define how the different actions should behave.


Currently, nothing gets executed until "env.execute()" is called.
That makes it possible to produce multiple data sources at the same time,
which is a good feature.

For certain operations, like the "count()" and "collect()" functions added
in https://github.com/apache/flink/pull/210 , we need to trigger execution
immediately.

The open question is how this should behave in connection with data sinks
that are already defined:

1) Should all data sinks defined so far be executed as well?
2) Should only the immediate operation be executed, with the data sinks
left pending until a call to "env.execute()"?

I am somewhat leaning towards the first option right now, because I think
that executing them later may force re-execution of larger parts of the
plan.
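
To make the trade-off concrete, here is a rough toy model of the two
options. It is only a sketch: "ToyEnv", "Policy", "writeTo" and
"sourceScans" are made-up names for illustration and are not part of the
Flink API, and the assumption that a later "execute()" has to re-read the
source is a simplification of what "re-execution of larger parts of the
plan" could mean.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only. None of these names are Flink's API.
final class ToyEnv {
    // Option 1 and option 2 from the list above.
    enum Policy { RUN_PENDING_SINKS, RUN_ONLY_EAGER_OP }

    private final Policy policy;
    private final List<Integer> source;
    private final List<String> pendingSinks = new ArrayList<>();
    final List<String> written = new ArrayList<>(); // sinks that have actually run
    int sourceScans = 0;                            // how often the source was evaluated

    ToyEnv(Policy policy, List<Integer> source) {
        this.policy = policy;
        this.source = source;
    }

    // Lazy: only registers the sink, nothing runs yet.
    void writeTo(String sinkName) {
        pendingSinks.add(sinkName);
    }

    // Eager: has to run immediately so the result can go back to the client.
    long count() {
        sourceScans++;
        if (policy == Policy.RUN_PENDING_SINKS) {
            flushSinks(); // option 1: sinks piggyback on the same run
        }
        return source.size();
    }

    // Runs whatever is still pending. Assumed here: this costs another source scan.
    void execute() {
        if (pendingSinks.isEmpty()) {
            return;
        }
        sourceScans++;
        flushSinks();
    }

    private void flushSinks() {
        written.addAll(pendingSinks);
        pendingSinks.clear();
    }
}

final class LazyVsEagerDemo {
    public static void main(String[] args) {
        for (ToyEnv.Policy policy : ToyEnv.Policy.values()) {
            ToyEnv env = new ToyEnv(policy, Arrays.asList(1, 2, 3));
            env.writeTo("out");   // defined before the eager call
            long n = env.count(); // triggers execution right away
            int scansAfterCount = env.sourceScans;
            env.execute();
            System.out.println(policy + ": count=" + n
                    + ", scans after count()=" + scansAfterCount
                    + ", scans after execute()=" + env.sourceScans);
        }
    }
}
```

In this toy, the first policy ends with one scan of the source, while the
second needs two, because the pending sink runs separately afterwards.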

In addition, I think that the output of the "print()" commands should go to
the client command line. In that sense, they would behave like
"collect().foreach(print)".


Greetings,
Stephan
