That last section is a really good idea! I have several design docs floating around that were announced on the ML. Without a central place to store them, they are hard to find, though.
-Aljoscha

On Tue, 31 May 2016 at 11:27 Stephan Ewen <se...@apache.org> wrote:

> Hi!
>
> There was some preliminary work on this. By now, the requirements have
> grown a bit. The backtracking needs to handle:
>
>   - Scheduling for execution (the point raised here), possibly resuming
>     from available intermediate results
>   - Recovery from partially executed programs, where operators execute
>     fully or not at all (batch style)
>   - Recovery from intermediate results since the latest completed
>     checkpoint
>   - Eventually, even recovery of superstep-based iterations
>
> So the design needs to be extended slightly. We do not have a design
> writeup for this, but I agree, it would be great to have one. I have a
> pretty good general idea about this; let me see if I can get to it next
> week.
>
> In general, for such things (long-standing ideas and designs), we should
> have something like Kafka has with its KIPs (Kafka Improvement
> Proposals): a place to collect them, refine them over time, and see how
> people react to them or step up to implement them. We could call them
> 3Fs (Flink Feature Forms) ;-)
>
> Greetings,
> Stephan
>
>
> On Tue, May 31, 2016 at 1:02 AM, Greg Hogan <c...@greghogan.com> wrote:
>
> > Hi Stephan,
> >
> > Is there a design document, prior discussion, or background material
> > on this enhancement? Am I correct in understanding that this only
> > applies to DataSet, since streams run indefinitely?
> >
> > Thanks,
> > Greg
> >
> > On Mon, May 30, 2016 at 5:49 PM, Stephan Ewen <se...@apache.org> wrote:
> >
> > > Hi Eron!
> > >
> > > Yes, the idea is to actually switch all executions to a backtracking
> > > scheduling mode. That simultaneously solves both fine-grained
> > > recovery and lazy execution, where later stages build on prior
> > > stages.
> > >
> > > With all the work around streaming, we have not gotten to this so
> > > far, but it is one feature still on the list...
> > >
> > > Greetings,
> > > Stephan
> > >
> > >
> > > On Mon, May 30, 2016 at 9:55 PM, Eron Wright <ewri...@live.com> wrote:
> > >
> > > > Thinking out loud now…
> > > >
> > > > Is the job graph fully mutable? Can it be cleared? For example,
> > > > shouldn’t the count method remove the sink after execution
> > > > completes?
> > > >
> > > > Can numerous job graphs co-exist within a single driver program?
> > > > How would that relate to the session concept?
> > > >
> > > > Seems the count method should use ‘backtracking’ schedule mode,
> > > > and only execute the minimum needed to materialize the count sink.
> > > >
> > > > > On May 29, 2016, at 3:08 PM, Márton Balassi
> > > > > <balassi.mar...@gmail.com> wrote:
> > > > >
> > > > > Hey Eron,
> > > > >
> > > > > Yes, the DataSet#collect and #count methods implicitly trigger a
> > > > > JobGraph execution, so they also trigger writing to any
> > > > > previously defined sinks. The idea behind this behavior is to
> > > > > enable interactive querying (the kind you are used to getting
> > > > > from a shell environment), and it is also a great debugging
> > > > > tool.
> > > > >
> > > > > Best,
> > > > >
> > > > > Marton
> > > > >
> > > > > On Sun, May 29, 2016 at 11:28 PM, Eron Wright <ewri...@live.com>
> > > > > wrote:
> > > > >
> > > > >> I was curious as to how the `count` method on DataSet worked,
> > > > >> and was surprised to see that it executes the entire program
> > > > >> graph. Wouldn’t this cause undesirable side-effects like
> > > > >> writing to sinks?
> > > > >> Also strange that the graph is mutated
> > > > >> with the addition of a sink (that isn’t subsequently removed).
> > > > >>
> > > > >> Surveying the Flink code, there aren’t many situations where
> > > > >> the program graph is implicitly executed (`collect` is
> > > > >> another). Nonetheless, this has deepened my appreciation for
> > > > >> how dynamic the application might be.
> > > > >>
> > > > >> // DataSet.java
> > > > >> public long count() throws Exception {
> > > > >>     final String id = new AbstractID().toString();
> > > > >>
> > > > >>     output(new Utils.CountHelper<T>(id)).name("count()");
> > > > >>
> > > > >>     JobExecutionResult res = getExecutionEnvironment().execute();
> > > > >>     return res.<Long> getAccumulatorResult(id);
> > > > >> }
> > > > >>
> > > > >> Eron
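To make the behavior under discussion concrete, here is a minimal sketch
against the Java DataSet API. It is not code from the thread; the class
name and output path are invented for illustration, and it simply shows
one way to observe the side effect Eron describes:

// CountSideEffectDemo.java -- hypothetical example, not from the thread.
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class CountSideEffectDemo {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Long> numbers = env.generateSequence(1, 100);

        // A sink defined earlier in the program, unrelated to the count.
        numbers.filter(n -> n % 2 == 0).writeAsText("/tmp/evens");

        // count() appends its own CountHelper sink and implicitly calls
        // env.execute(), so the writeAsText sink above runs as well.
        long total = numbers.count();
        System.out.println(total);
    }
}

Under the backtracking scheduling mode discussed above, count() would only
need to execute the minimal subgraph feeding its own sink, and the
writeAsText sink would be left alone.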