Re: Questions re: ExecutionGraph & ResultPartitions for interactive use a la Spark

kovas boguta Sat, 09 Jan 2016 21:21:07 -0800

Reposting in hopes relevant people have returned from vacation:

I'm impressed with the Flink API, it seems simpler and more composable than
what I've seen elsewhere.

I'm trying to see how to achieve a more interactive, REPL-driven
experience, similar to Spark. I'm consuming Flink from Clojure.

For now I'm only interested in smaller clusters & interactive usage, so the
failure recovery aspect of RDDs is less important relative to simply
caching intermediate results (in this context, keeping the ResultPartitions
around even when there are no consuming tasks), and dynamically
extending the jobgraph.

Three questions:

1) How can I prevent ResultPartitions from being released?

In interactive use, RPs should not necessarily be released when there are
no pending tasks to consume them.

Looking at the code, it's hard to understand the high-level logic of who
triggers their release, how the refcounting works, etc. For instance,
is releasePartitionsProducedBy called by the producer, or the consumer, or
both? Is the refcount initialized at the beginning of the task setup, and
decremented every time it's read out?

Ideally, I could force certain ResultPartitions to only be manually
releasable, so I can consume them over and over.
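
To make the desired behavior concrete, here is a minimal sketch in Java. It
is a hypothetical model, not Flink's actual ResultPartition code: the class
PinnablePartition and all of its methods are invented for illustration, and
the assumption that release is driven by a count of pending consumers is
exactly the part the question above asks about.

```java
// Hypothetical model of a ref-counted partition with a "pinned" flag.
// None of these names come from Flink; they only illustrate the idea.
final class PinnablePartition {
    private int pendingConsumers;   // consumers that have not finished reading yet
    private final boolean pinned;   // if true, only a manual release frees the data
    private boolean released;

    PinnablePartition(int expectedConsumers, boolean pinned) {
        this.pendingConsumers = expectedConsumers;
        this.pinned = pinned;
    }

    /** Called each time a consumer has finished reading the partition. */
    synchronized void onConsumed() {
        if (released) {
            throw new IllegalStateException("partition already released");
        }
        if (pendingConsumers > 0) {
            pendingConsumers--;
        }
        // An unpinned partition is released once the last expected consumer is done.
        if (pendingConsumers == 0 && !pinned) {
            released = true;
        }
    }

    /** Explicit release, the only way to free a pinned partition. */
    synchronized void releaseManually() {
        released = true;
    }

    synchronized boolean isReleased() {
        return released;
    }
}
```

With pinned set to true, onConsumed can be called any number of times and
the partition stays available until releaseManually is called.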

2) Can I call attachJobGraph on the ExecutionGraph after the job has begun
executing to add more nodes?

I read that Flink does not support changing the running topology. But
what about extending the topology to add more nodes?

If the IntermediateResultPartitions are just sitting around from previously
completed tasks, this seems straightforward in principle. Would one have
to fiddle with the scheduler/event system to kick things off again?
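
As a rough illustration of what "extending" would mean, the following Java
sketch models a graph that accepts new nodes after earlier ones have
finished, and then runs only the new ones. It is a toy model under stated
assumptions: ExtendableGraph, addNode and runPending are hypothetical names,
not Flink's ExecutionGraph API, and it assumes finished results stay
available to later consumers.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of an append-only job graph; not Flink's ExecutionGraph.
final class ExtendableGraph {
    // Insertion order is a valid topological order, because addNode
    // rejects any node whose inputs are not already in the graph.
    private final Map<String, List<String>> inputsOf = new LinkedHashMap<>();
    private final Set<String> finished = new HashSet<>();

    /** Adds a node that consumes the results of already-known nodes. */
    void addNode(String id, List<String> inputs) {
        if (inputsOf.containsKey(id)) {
            throw new IllegalArgumentException("duplicate node: " + id);
        }
        for (String input : inputs) {
            if (!inputsOf.containsKey(input)) {
                throw new IllegalArgumentException("unknown input: " + input);
            }
        }
        inputsOf.put(id, new ArrayList<>(inputs));
    }

    /** "Runs" every node that has not finished yet and returns their ids in order. */
    List<String> runPending() {
        List<String> ran = new ArrayList<>();
        for (String id : inputsOf.keySet()) {
            if (finished.add(id)) {   // add returns false for already-finished nodes
                ran.add(id);          // a real system would schedule the task here
            }
        }
        return ran;
    }
}
```

In this model, adding a node after a first runPending call and calling
runPending again executes only the new node, which is the "kick things off
again" step the question refers to.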

3) Can I have a ConnectionManager without a TaskManager?

For interactive use, I want to selectively pull data from ResultPartitions
into my local REPL process. But I don't want my local process to be a
TaskManager node, and have tasks assigned to it.

So this question boils down to: how do I hook a ConnectionManager into the
Flink communication/actor system?

Thanks!

On Tue, Jan 5, 2016 at 5:54 AM, Aljoscha Krettek <aljos...@apache.org>
wrote:

> Hi,
> these are certainly valid use cases. As far as I know, the people who know
> most in this area are on vacation right now. They should be back in a week,
> I think.
>
> They should be able to give you a proper description of the current
> situation and some pointers.
>
> Cheers,
> Aljoscha
> > On 04 Jan 2016, at 22:47, kovas boguta <kovas.bog...@gmail.com> wrote:
> >
> > I'm impressed with the Flink API, it seems simpler and more composable
> > than what I've seen elsewhere.
> >
> > I'm trying to see how to achieve a more interactive, REPL-driven
> > experience, similar to Spark. I'm consuming Flink from Clojure.
> >
> > For now I'm only interested in smaller clusters & interactive usage, so
> > the failure recovery aspect of RDDs is less important relative to simply
> > caching intermediate results (in this context, keeping the
> > ResultPartitions around even when there are no consuming tasks), and
> > dynamically extending the jobgraph.
> >
> > Three questions:
> >
> > 1) How can I prevent ResultPartitions from being released?
> >
> > In interactive use, RPs should not necessarily be released when there
> > are no pending tasks to consume them.
> >
> > Looking at the code, it's hard to understand the high-level logic of who
> > triggers their release, how the refcounting works, etc. For instance, is
> > releasePartitionsProducedBy called by the producer, or the consumer, or
> > both? Is the refcount initialized at the beginning of the task setup, and
> > decremented every time it's read out?
> >
> > Ideally, I could force certain ResultPartitions to only be manually
> > releasable, so I can consume them over and over.
> >
> > 2) Can I call attachJobGraph on the ExecutionGraph after the job has
> > begun executing to add more nodes?
> >
> > I read that Flink does not support changing the running topology.
> > But what about extending the topology to add more nodes?
> >
> > If the IntermediateResultPartitions are just sitting around from
> > previously completed tasks, this seems straightforward in principle.
> > Would one have to fiddle with the scheduler/event system to kick things
> > off again?
> >
> > 3) Can I have a ConnectionManager without a TaskManager?
> >
> > For interactive use, I want to selectively pull data from
> > ResultPartitions into my local REPL process. But I don't want my local
> > process to be a TaskManager node, and have tasks assigned to it.
> >
> > So this question boils down to, how to hook a ConnectionManager into the
> > Flink communication/actor system?
> >
> > Thanks!
