Let's make it clear that count()/collect()-style actions execute the
plan up to that point (including the data sinks). From a user
perspective, this seems most logical to me. The user might even rely on
the data generated by the sinks.
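
To make that concrete, here is a minimal sketch of the semantics I have in
mind (assuming the count()/collect() calls from PR 210; paths and data are
just placeholders):

import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class EagerCountSketch {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<String> lines = env.readTextFile("/tmp/input");   // placeholder path
        DataSet<String> nonEmpty = lines.filter(new FilterFunction<String>() {
            @Override
            public boolean filter(String line) {
                return !line.isEmpty();
            }
        });

        // A sink that is defined before the eager action.
        nonEmpty.writeAsText("/tmp/output");                      // placeholder path

        // Under the proposal, this eager action executes the plan up to this
        // point, including the writeAsText() sink above, so the user can rely
        // on /tmp/output being written once count() returns.
        long numLines = nonEmpty.count();
        System.out.println("non-empty lines: " + numLines);

        // Work that already ran as part of count() would not need a separate
        // env.execute() call afterwards.
    }
}

(A second sketch, of the client-side print() semantics Stephan suggests, is
at the bottom of this mail.)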

On Mon, Jan 19, 2015 at 11:46 AM, Fabian Hueske <fhue...@apache.org> wrote:
> This is a difficult question.
>
> A program might also later refer to some intermediate data set that would
> already have been computed if the sinks are executed together with the
> count() call, and that data set would then need to be computed again.
> Also, what do we do with sinks that are not connected to the collected or
> counted data set? Executing them as well would not give any benefit, but
> might result in higher costs if something on their path is used later.
> On the other hand, distinguishing between connected and unconnected sinks
> might not be easy to reason about.
>
> How about caching data sets where the data flow branches?
>
> In any case, the policy should be easy to understand for users.
>
> Cheers, Fabian
>
> 2015-01-19 11:19 GMT+01:00 Ufuk Celebi <u...@apache.org>:
>
>> I think the answer depends on how much the two subgraphs overlap. But in
>> general, I agree that the first approach seems more desirable from the
>> runtime view (multiple consumers at the branch point).
>>
>> On Mon, Jan 19, 2015 at 10:59 AM, Robert Metzger <rmetz...@apache.org>
>> wrote:
>>
>> > I would also execute the sinks immediately. I think it's a corner case,
>> > because the sinks are usually the last thing in a plan and all print() or
>> > collect() statements come earlier in the plan.
>> >
>> > print() should go to the client command line, yes.
>> >
>> > On Mon, Jan 19, 2015 at 1:42 AM, Stephan Ewen <se...@apache.org> wrote:
>> >
>> > > Hi there!
>> > >
>> > > With the upcoming, more interactive extensions to the API (operations
>> > > that go back to the client from a program and need to be eagerly
>> > > evaluated), we need to define how different actions should behave.
>> > >
>> > > Currently, nothing gets executed until the "env.execute()" call is
>> > > made. That allows producing multiple data sinks at the same time, which
>> > > is a good feature.
>> > >
>> > > For certain operations, like the "count()" and "collect()" functions
>> > > added in https://github.com/apache/flink/pull/210 , we need to trigger
>> > > execution immediately.
>> > >
>> > > The open question is how this should behave in connection with already
>> > > defined data sinks:
>> > >
>> > > 1) Should all previously defined data sinks be executed as well?
>> > > 2) Should only the immediate operation be executed, with the data sinks
>> > > pending until a call to "env.execute()"?
>> > >
>> > > I am somewhat leaning towards the first option right now, because I
>> > > think that executing them later may force re-execution of larger parts
>> > > of the plan.
>> > >
>> > > In addition: I think that the "print()" commands should go to the
>> > > client command line. In that sense, they would behave like
>> > > "collect().foreach(print)".
>> > >
>> > >
>> > > Greetings,
>> > > Stephan
>> > >
>> >
>>
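
For reference, Stephan's "collect().foreach(print)" idea would roughly look
like the following on the client side. This is only a sketch of the proposed
semantics, not the actual implementation; the example data is made up.

import java.util.List;

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class ClientSidePrintSketch {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<Integer> data = env.fromElements(1, 2, 3);

        // Proposed semantics: data.print() behaves like "collect().foreach(print)",
        // i.e. the data set is shipped back to the client and printed on the
        // client command line rather than into the task manager logs.
        List<Integer> collected = data.collect();   // eager: triggers execution
        for (Integer element : collected) {
            System.out.println(element);
        }
    }
}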
