Hi!

There was some preliminary work on this. By now, the requirements have
grown a bit. The backtracking needs to handle

  - Scheduling for execution (the point raised here), possibly resuming
    from available intermediate results
  - Recovery from partially executed programs, where operators execute
    wholly or not at all (batch style)
  - Recovery from intermediate results since the latest completed checkpoint
  - Eventually, even recovery of superstep-based iterations

So the design needs to be extended slightly. We do not have a design
write-up for this, but I agree it would be great to have one.
I have a pretty good general idea about this; let me see if I can get to
that next week.
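
To make this concrete, here is a rough sketch of the backtracking walk
(illustration only; all names are made up, and this is not actual Flink
code): starting from the sinks that need to be materialized, walk the
graph backwards and schedule only the vertices whose results are not
already available.

// BacktrackingSketch.java - illustration only, not Flink code
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class BacktrackingSketch {

    static class Vertex {
        final String name;
        final List<Vertex> inputs = new ArrayList<>();
        boolean resultAvailable; // cached intermediate result or checkpoint
        Vertex(String name) { this.name = name; }
    }

    /** Returns the vertices that must (re)run to materialize the given sink. */
    static List<Vertex> verticesToRun(Vertex sink) {
        List<Vertex> toRun = new ArrayList<>();
        Set<Vertex> visited = new HashSet<>();
        Deque<Vertex> stack = new ArrayDeque<>();
        stack.push(sink);
        while (!stack.isEmpty()) {
            Vertex v = stack.pop();
            if (!visited.add(v)) continue;   // already considered
            if (v.resultAvailable) continue; // backtracking stops here: reuse the result
            toRun.add(v);
            v.inputs.forEach(stack::push);   // keep walking towards the sources
        }
        return toRun;
    }
}

The same walk covers the recovery cases above: a completed checkpoint or a
surviving intermediate result simply marks a vertex as available, so
everything upstream of it is skipped and only the rest is rescheduled.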

In general, for such things (long-standing ideas and designs), we should
have something like Kafka has with its KIPs (Kafka Improvement Proposals):
a place to collect them, refine them over time, and
see how people react to them or step up to implement them. We could call
them 3Fs (Flink Feature Forms) ;-)

Greetings,
Stephan


On Tue, May 31, 2016 at 1:02 AM, Greg Hogan <c...@greghogan.com> wrote:

> Hi Stephan,
>
> Is there a design document, prior discussion, or background material on
> this enhancement? Am I correct in understanding that this only applies to
> DataSet since streams run indefinitely?
>
> Thanks,
> Greg
>
> On Mon, May 30, 2016 at 5:49 PM, Stephan Ewen <se...@apache.org> wrote:
>
> > Hi Eron!
> >
> > Yes, the idea is to actually switch all executions to a backtracking
> > scheduling mode. That simultaneously solves both fine-grained recovery and
> > lazy execution, where later stages build on prior stages.
> >
> > With all the work around streaming, we have not gotten to this so far, but
> > it is one feature still on the list...
> >
> > Greetings,
> > Stephan
> >
> >
> > On Mon, May 30, 2016 at 9:55 PM, Eron Wright <ewri...@live.com> wrote:
> >
> > > Thinking out loud now…
> > >
> > > Is the job graph fully mutable?   Can it be cleared?   For example,
> > > shouldn’t the count method remove the sink after execution completes?
> > >
> > > Can numerous job graphs co-exist within a single driver program?    How
> > > would that relate to the session concept?
> > >
> > > Seems the count method should use ‘backtracking’ schedule mode, and only
> > > execute the minimum needed to materialize the count sink.
> > >
> > > > On May 29, 2016, at 3:08 PM, Márton Balassi <balassi.mar...@gmail.com> wrote:
> > > >
> > > > Hey Eron,
> > > >
> > > > Yes, the DataSet#collect and count methods implicitly trigger a JobGraph
> > > > execution, so they also trigger writing to any previously defined sinks.
> > > > The idea behind this behavior is to enable interactive querying (the kind
> > > > you are used to getting from a shell environment), and it is also a great
> > > > debugging tool.
> > > >
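> > > > For example (a quick, untested sketch; the output path is made up),
> > > > calling count after defining a sink runs the whole program, sink
> > > > included:
> > > >
> > > > // CountExample.java - untested illustration
> > > > import org.apache.flink.api.java.DataSet;
> > > > import org.apache.flink.api.java.ExecutionEnvironment;
> > > >
> > > > public class CountExample {
> > > >     public static void main(String[] args) throws Exception {
> > > >         ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
> > > >         DataSet<Long> data = env.generateSequence(1, 100);
> > > >         data.writeAsText("/tmp/out");  // a previously defined sink
> > > >         long n = data.count();         // triggers the whole graph; /tmp/out is written too
> > > >         System.out.println(n);
> > > >     }
> > > > }
> > > >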
> > > > Best,
> > > >
> > > > Marton
> > > >
> > > > On Sun, May 29, 2016 at 11:28 PM, Eron Wright <ewri...@live.com> wrote:
> > > >
> > > >> I was curious as to how the `count` method on DataSet worked, and was
> > > >> surprised to see that it executes the entire program graph.  Wouldn’t
> > > >> this cause undesirable side-effects like writing to sinks?  Also strange
> > > >> that the graph is mutated with the addition of a sink (that isn’t
> > > >> subsequently removed).
> > > >>
> > > >> Surveying the Flink code, there aren’t many situations where the program
> > > >> graph is implicitly executed (`collect` is another).  Nonetheless, this
> > > >> has deepened my appreciation for how dynamic the application might be.
> > > >>
> > > >> // DataSet.java
> > > >> public long count() throws Exception {
> > > >>   // Unique id under which the count is stored as an accumulator.
> > > >>   final String id = new AbstractID().toString();
> > > >>
> > > >>   // Mutates the graph: adds a counting sink to the program.
> > > >>   output(new Utils.CountHelper<T>(id)).name("count()");
> > > >>
> > > >>   // Executes the entire graph, including previously defined sinks.
> > > >>   JobExecutionResult res = getExecutionEnvironment().execute();
> > > >>   return res.<Long> getAccumulatorResult(id);
> > > >> }
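> > > >>
> > > >> For context, a simplified sketch of what a CountHelper-like sink might
> > > >> look like (not the actual Flink class; details may differ). It counts
> > > >> the records it receives and publishes the total as an accumulator under
> > > >> the given id:
> > > >>
> > > >> // CountingSink.java - simplified illustration, not the real Utils.CountHelper
> > > >> import org.apache.flink.api.common.io.RichOutputFormat;
> > > >> import org.apache.flink.configuration.Configuration;
> > > >>
> > > >> public class CountingSink<T> extends RichOutputFormat<T> {
> > > >>     private final String id;
> > > >>     private long counter;
> > > >>
> > > >>     public CountingSink(String id) { this.id = id; }
> > > >>
> > > >>     @Override public void configure(Configuration parameters) {}
> > > >>     @Override public void open(int taskNumber, int numTasks) { counter = 0; }
> > > >>     @Override public void writeRecord(T record) { counter++; } // count each record
> > > >>
> > > >>     @Override public void close() {
> > > >>         // per-task counts are merged into one accumulator by the runtime
> > > >>         getRuntimeContext().getLongCounter(id).add(counter);
> > > >>     }
> > > >> }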
> > > >> Eron
> > >
> > >
> >
>
