Re: Side-effects of DataSet::count

2016-05-31 Thread Ovidiu-Cristian MARCU
Hi Stephan and all, Could https://issues.apache.org/jira/browse/FLINK-2250 be a reference for this? I agree that your priorities on streaming are very high; creating a discussion/place for the design proposal would be a big +1 for the community

Re: Side-effects of DataSet::count

2016-05-31 Thread Aljoscha Krettek
That last section is a really good idea! I have several design docs floating around that were announced on the ML. Without a central place to store them they are hard to find, though. -Aljoscha On Tue, 31 May 2016 at 11:27 Stephan Ewen wrote: > Hi! > > There was some preliminary work on this.

Re: Side-effects of DataSet::count

2016-05-31 Thread Stephan Ewen
Hi! There was some preliminary work on this. By now, the requirements have grown a bit. The backtracking needs to handle - Scheduling for execution (the point raised here), possibly resuming from available intermediate results - Recovery from partially executed programs, where operators execu

Re: Side-effects of DataSet::count

2016-05-30 Thread Greg Hogan
Hi Stephan, Is there a design document, prior discussion, or background material on this enhancement? Am I correct in understanding that this only applies to DataSet since streams run indefinitely? Thanks, Greg On Mon, May 30, 2016 at 5:49 PM, Stephan Ewen wrote: > Hi Eron! > > Yes, the idea i

Re: Side-effects of DataSet::count

2016-05-30 Thread Greg Hogan
Hi Simone, This can be done with a map followed by a reduce. DataSet#count leverages accumulators which perform an inherent reduce. Also, DataSet#count implements RichOutputFormat as an optimization to only require a single operator. Previously the counting and accumulating was handled in a RichMa
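
The "map followed by a reduce" idea can be illustrated outside Flink. The following is a minimal sketch in plain Java streams, not Flink's DataSet API; the class name `LazyCountSketch` and the method `lazyCount` are hypothetical. Each element is mapped to 1 and the ones are summed, and the computation is wrapped in a `Supplier` so that nothing runs until the result is requested.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch: models "map then reduce" counting with plain Java,
// not with Flink's DataSet API.
class LazyCountSketch {

    // Returns a deferred count: no element is visited until get() is called.
    static <T> Supplier<Long> lazyCount(List<T> elements) {
        return () -> elements.stream()
                .map(e -> 1L)            // map: every element becomes 1
                .reduce(0L, Long::sum);  // reduce: add the ones together
    }

    public static void main(String[] args) {
        Supplier<Long> count = lazyCount(Arrays.asList("a", "b", "c"));
        System.out.println(count.get());
    }
}
```

The identity value `0L` makes an empty input count as zero; that is a choice of this sketch and says nothing about how a reduce over an empty DataSet behaves.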

Re: Side-effects of DataSet::count

2016-05-30 Thread Simone Robutti
On this same subject, I have a question. Is it possible to achieve a lazy count that transforms a DataSet[T] to a DataSet[Long] with a single value containing the length of the original DataSet? Otherwise what is the best way to count the elements lazily? 2016-05-30 23:49 GMT+02:00 Stephan Ewen :

Re: Side-effects of DataSet::count

2016-05-30 Thread Stephan Ewen
Hi Eron! Yes, the idea is to actually switch all executions to a backtracking scheduling mode. That simultaneously solves both fine-grained recovery and lazy execution, where later stages build on prior stages. With all the work around streaming, we have not gotten to this so far, but it is one f

Re: Side-effects of DataSet::count

2016-05-30 Thread Eron Wright
Thinking out loud now… Is the job graph fully mutable? Can it be cleared? For example, shouldn’t the count method remove the sink after execution completes? Can numerous job graphs co-exist within a single driver program? How would that relate to the session concept? Seems the count met

Re: Side-effects of DataSet::count

2016-05-29 Thread Márton Balassi
Hey Eron, Yes, the DataSet#collect and count methods implicitly trigger a JobGraph execution, thus they also trigger writing to any previously defined sinks. The idea behind this behavior is to enable interactive querying (the kind you are used to getting from a shell environment) and it is also a gre
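
The side effect described here, where an eager call also runs sinks defined earlier, can be pictured with a toy model. This is a hypothetical sketch in plain Java, not Flink code; `ToyEnvironment`, `addSink`, and `count` are illustrative names that only mimic the shape of the behavior.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model, not Flink code: an environment that remembers
// registered sinks and runs all of them whenever an eager action executes.
class ToyEnvironment {
    private final List<Runnable> pendingSinks = new ArrayList<>();
    final List<String> log = new ArrayList<>();

    // Registering a sink is lazy: nothing is written yet.
    void addSink(String name) {
        pendingSinks.add(() -> log.add("wrote " + name));
    }

    // An eager action such as count() executes the whole plan,
    // including every sink that was defined before it.
    long count(List<?> data) {
        for (Runnable sink : pendingSinks) {
            sink.run();
        }
        return data.size();
    }

    public static void main(String[] args) {
        ToyEnvironment env = new ToyEnvironment();
        env.addSink("output.csv");
        long n = env.count(java.util.Arrays.asList(1, 2, 3));
        System.out.println(n + " " + env.log);
    }
}
```

Whether sinks should stay registered after such an execution is exactly the open question in this thread; the toy keeps them registered only for simplicity.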