Let's make it clear that count()/collect()-style actions execute the
plan up to that point (including the data sinks). From a user
perspective, this seems most logical to me. The user might even rely on
the data generated by the sinks.
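
To make the proposed behaviour concrete, here is a minimal sketch in plain Java. It is a toy model only: ToyEnv, ToyDataSet, writeTo() and pendingSinkCount() are hypothetical names made up for illustration and are not part of the actual API. It assumes that a sink is just a deferred write, and that count() first runs every sink defined so far and then returns the number of elements.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

// Toy model only: ToyEnv and ToyDataSet are hypothetical, not real API classes.
class EagerCountSketch {

    static class ToyEnv {
        // Sinks that the program has declared but that have not run yet.
        private final List<Runnable> pendingSinks = new ArrayList<>();

        // Runs every sink declared so far, then forgets them.
        void execute() {
            List<Runnable> toRun = new ArrayList<>(pendingSinks);
            pendingSinks.clear();
            for (Runnable sink : toRun) {
                sink.run();
            }
        }

        int pendingSinkCount() {
            return pendingSinks.size();
        }
    }

    static class ToyDataSet {
        private final ToyEnv env;
        private final Supplier<List<Integer>> plan;

        ToyDataSet(ToyEnv env, Supplier<List<Integer>> plan) {
            this.env = env;
            this.plan = plan;
        }

        // Declares a sink; the write itself is deferred until execution.
        void writeTo(List<Integer> target) {
            env.pendingSinks.add(() -> target.addAll(plan.get()));
        }

        // Eager action: runs the sinks defined up to this point, then counts.
        // This toy evaluates the plan once per consumer; a real runtime could
        // share the result at the branch point instead.
        long count() {
            env.execute();
            return plan.get().size();
        }
    }

    public static void main(String[] args) {
        ToyEnv env = new ToyEnv();
        ToyDataSet data = new ToyDataSet(env, () -> Arrays.asList(1, 2, 3));

        List<Integer> sinkOutput = new ArrayList<>();
        data.writeTo(sinkOutput);
        System.out.println("before count(): sink holds " + sinkOutput);

        long n = data.count();
        System.out.println("count() = " + n + ", sink holds " + sinkOutput);
    }
}
```

In this sketch the sink stays empty until count() is called and is filled as a side effect of that call, which is the behaviour a user relying on the sink data would expect.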
On Mon, Jan 19, 2015 at 11:46 AM, Fabian Hueske wrote:
This is a difficult question.
A program might also later refer to some intermediate data set that would
already have been computed if the sinks are executed together with the
count() call, and that would then need to be computed again.
Also what do we do with sinks that are not connected with the collected or
counted
I agree with Ufuk that it depends on how much both subgraphs and also
future subgraphs overlap. It is conceivable that the user will reuse
subgraphs of an already computed data sink after calling collect(). Then
we would also have to re-execute parts of the dataflow graph. I guess we
easily find e
I think this question depends on how much the two subgraphs overlap. But in
general, I agree that the first approach seems more desirable from the
runtime view (multiple consumers at the branch point).
On Mon, Jan 19, 2015 at 10:59 AM, Robert Metzger wrote:
I would also execute the sinks immediately. I think it's a corner case
because the sinks are usually the last thing in a plan, and all print() or
collect() statements come earlier in the plan.
print() should go to the client command line, yes.
On Mon, Jan 19, 2015 at 1:42 AM, Stephan Ewen wrote:
Hi there!
With the upcoming more interactive extensions to the API (operations that
go back to the client from a program and need to be eagerly evaluated) we
need to define how different actions should behave.
Currently, nothing gets executed until the "env.execute()" call is made.
That allows to