Re: [DISCUSS] FLIP-74: Flink JobClient API

Konstantin Knauf Sun, 29 Sep 2019 13:37:16 -0700

Hi Zili,

thanks for working on this topic. Just read through the FLIP and I have two
questions:


* should we add "cancelWithSavepoint" to a new public API, when we have
deprecated the corresponding REST API/CLI methods? In my understanding
there is no reason to use it anymore.
* should we call "stopWithSavepoint" simply "stop" as "stop" always
performs a savepoint?
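
To make the two questions concrete, here is a minimal sketch of what the job-level surface could look like if both suggestions were adopted: no "cancelWithSavepoint", and "stopWithSavepoint" shortened to "stop". This is not the interface from the FLIP. The signatures, the String savepoint path and the InMemoryJobClient class are assumptions made only for illustration.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch, NOT the interface proposed in FLIP-74.
interface JobClientSketch {
    // Cancels the job without taking a savepoint.
    CompletableFuture<Void> cancel();

    // Stops the job gracefully; the future completes with the savepoint path.
    CompletableFuture<String> stop(String savepointDirectory);
}

// Toy in-memory implementation, only here to make the sketch runnable.
final class InMemoryJobClient implements JobClientSketch {
    private final String jobId;
    private volatile String status = "RUNNING";

    InMemoryJobClient(String jobId) {
        this.jobId = jobId;
    }

    String status() {
        return status;
    }

    @Override
    public CompletableFuture<Void> cancel() {
        status = "CANCELED";
        return CompletableFuture.completedFuture(null);
    }

    @Override
    public CompletableFuture<String> stop(String savepointDirectory) {
        status = "FINISHED";
        // A real client would ask the cluster; here the path is just made up.
        return CompletableFuture.completedFuture(savepointDirectory + "/savepoint-" + jobId);
    }
}
```

With such a surface, "cancel and keep state" would have to be expressed as stop(...) rather than as a third method.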

Best,

Konstantin



On Fri, Sep 27, 2019 at 10:48 AM Aljoscha Krettek <aljos...@apache.org>
wrote:

> Hi Flavio,
>
> I agree that this would be good to have. But I also think that this is
> outside the scope of FLIP-74; I think it is an orthogonal feature.
>
> Best,
> Aljoscha
>
> > On 27. Sep 2019, at 10:31, Flavio Pompermaier <pomperma...@okkam.it>
> wrote:
> >
> > Hi all,
> > just a remark about the Flink REST APIs (and its client as well): almost
> > always we need a way to dynamically know which jobs are contained in
> > a jar file, and this could be exposed by the REST endpoint under
> > /jars/:jarid/entry-points (a simple way to implement this would be to check
> > the value of Main-class or Main-classes inside the Manifest of the jar if
> > they exist [1]).
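
The manifest check described above can be sketched with the standard java.util.jar API. This is only an illustration under assumptions: the class and method names are hypothetical, and treating "Main-classes" as a comma-separated list is a guess, since that attribute is not a standard manifest entry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

// Hypothetical helper, not part of Flink.
final class EntryPointSketch {

    // Collects candidate entry-point classes from the main manifest section.
    static List<String> entryPoints(Manifest manifest) {
        Attributes attributes = manifest.getMainAttributes();
        List<String> result = new ArrayList<>();

        // Standard attribute: a single main class.
        String single = attributes.getValue("Main-Class");
        if (single != null && !single.trim().isEmpty()) {
            result.add(single.trim());
        }

        // Assumption: "Main-classes" holds a comma-separated list of classes.
        String multiple = attributes.getValue("Main-classes");
        if (multiple != null) {
            for (String name : multiple.split(",")) {
                String trimmed = name.trim();
                if (!trimmed.isEmpty() && !result.contains(trimmed)) {
                    result.add(trimmed);
                }
            }
        }
        return result;
    }
}
```

For a jar on disk, the Manifest could be obtained with new java.util.jar.JarFile(path).getManifest(), which returns null when the jar has no manifest.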
> >
> > I understand that this is something that is not strictly required to
> > execute Flink jobs but IMHO it would ease A LOT the work of UI developers,
> > who would then have a way to show the users all available jobs inside a
> > jar + their configurable parameters.
> > For example, right now in the WebUI, you can upload a jar and then you have
> > to set (without any autocomplete or UI support) the main class and its
> > params (for example using a string like --param1 xx --param2 yy).
> > Adding this functionality to the REST API and the respective client would
> > enable the WebUI (and all UIs interacting with a Flink cluster) to prefill
> > a dropdown list containing the list of entry-point classes (i.e. Flink
> > jobs) and, once selected, their required (typed) parameters.
> >
> > Best,
> > Flavio
> >
> > [1] https://issues.apache.org/jira/browse/FLINK-10864
> >
> > On Fri, Sep 27, 2019 at 9:16 AM Zili Chen <wander4...@gmail.com> wrote:
> >
> >> modify
> >>
> >> /we just shut down the cluster on the exit of the client that is running
> >> inside the cluster/
> >>
> >> to
> >>
> >> we just shut down the cluster on both the exit of the client that is
> >> running inside the cluster and the finish of the job.
> >> Since the client is running inside the cluster, we can easily wait for the
> >> end of both in ClusterEntrypoint.
> >>
> >>
> >> On Fri, Sep 27, 2019 at 3:13 PM Zili Chen <wander4...@gmail.com> wrote:
> >>
> >>> About JobCluster
> >>>
> >>> Actually I am not quite sure what we gain from the DETACHED configuration
> >>> on the cluster side.
> >>> We don't in fact have a NON-DETACHED JobCluster in our codebase, right?
> >>>
> >>> This brings me to one major question we have to answer first.
> >>>
> >>> *What JobCluster conceptually is exactly*
> >>>
> >>> Related discussion can be found in JIRA[1] and mailing list[2]. Stephan
> >>> gives a nice
> >>> description of JobCluster:
> >>>
> >>> Two things to add:
> >>> - The job mode is very nice in the way that it runs the client inside
> >>> the cluster (in the same image/process that is the JM) and thus unifies
> >>> both applications and what the Spark world calls the "driver mode".
> >>> - Another thing I would add is that during the FLIP-6 design, we were
> >>> thinking about setups where Dispatcher and JobManager are separate
> >>> processes. A Yarn or Mesos Dispatcher of a session could run
> >>> independently (even as privileged processes executing no code). Then the
> >>> "per-job" mode could still be helpful: when a job is submitted to the
> >>> dispatcher, it launches the JM again in a per-job mode, so that JM and TM
> >>> processes are bound to the job only. For higher security setups, it is
> >>> important that processes are not reused across jobs.
> >>>
> >>> However, currently in "per-job" mode we generate the JobGraph on the
> >>> client side and launch the JobCluster, which then retrieves the JobGraph
> >>> for execution. So actually, we don't "run the client inside the cluster".
> >>>
> >>> Besides, referring to the discussion with Till[1], it would be helpful if,
> >>> from the user's perspective, "per-job" mode followed the same process as
> >>> session mode: we don't use OptimizedPlanEnvironment to create the
> >>> JobGraph, but directly deploy the Flink cluster in env.execute.
> >>>
> >>> Generally, 2 points:
> >>>
> >>> 1. Run the Flink job by invoking the user's main method and executing it
> >>> throughout, instead of creating the JobGraph from the main class.
> >>> 2. Run the client inside the cluster.
> >>>
> >>> If 1 and 2 are implemented, there is obviously no need for DETACHED mode
> >>> on the cluster side, because we just shut down the cluster on the exit of
> >>> the client that is running inside the cluster. Whether or not the result
> >>> is delivered is up to user code.
> >>>
> >>> [1]
> >>> https://issues.apache.org/jira/browse/FLINK-14051?focusedCommentId=16931388&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16931388
> >>> [2]
> >>> https://lists.apache.org/x/thread.html/e8f14a381be6c027e8945f884c3cfcb309ce49c1ba557d3749fca495@%3Cdev.flink.apache.org%3E
> >>>
> >>>
> >>> On Fri, Sep 27, 2019 at 2:13 PM Zili Chen <wander4...@gmail.com> wrote:
> >>>
> >>>> Thanks for your replies Kostas & Aljoscha!
> >>>>
> >>>> Below are replies point by point.
> >>>>
> >>>> 1. For DETACHED mode, what I said there is about the DETACHED mode on
> >>>> the client side.
> >>>> Two configurations overload the term DETACHED[1].
> >>>>
> >>>> On the client side, it means whether or not client.submitJob blocks
> >>>> until the job execution result is available.
> >>>> Because client.submitJob returns CompletableFuture<JobClient>,
> >>>> NON-DETACHED has no effect at all. The caller of submitJob decides
> >>>> whether or not to block to get the JobClient and request the job
> >>>> execution result. If the client crashes, it is a user-scope exception
> >>>> that should be handled in user code; if the client loses its connection
> >>>> to the cluster, we have a retry count and interval configuration that
> >>>> retries automatically and throws a user-scope exception if exceeded.
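
The point about CompletableFuture<JobClient> can be illustrated with a small sketch: the same submitJob call serves both styles, and only the caller decides whether to block. The types below are stand-ins written for this example, not Flink classes, and getJobResult is a hypothetical method name.

```java
import java.util.concurrent.CompletableFuture;

// Stand-in types for illustration only; these are NOT Flink classes.
final class DetachedSketch {

    interface JobClient {
        // Hypothetical accessor for the final job result.
        CompletableFuture<String> getJobResult();
    }

    // Submission is always asynchronous and yields a JobClient.
    static CompletableFuture<JobClient> submitJob(String jobName) {
        JobClient client = () -> CompletableFuture.supplyAsync(() -> jobName + ": FINISHED");
        return CompletableFuture.completedFuture(client);
    }

    // "Detached" style: obtain the JobClient and return without the result.
    static JobClient submitDetached(String jobName) {
        return submitJob(jobName).join();
    }

    // "Attached" style: the caller chooses to block until the result arrives.
    static String submitAndWait(String jobName) {
        return submitJob(jobName).thenCompose(JobClient::getJobResult).join();
    }
}
```

Both helpers are thin wrappers around the same future, which is why a separate client-side DETACHED flag adds nothing in this model.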
> >>>>
> >>>> Your comment about polling for the result or having it pushed sounds
> >>>> like a concern on the cluster side.
> >>>>
> >>>> On the cluster side, DETACHED mode exists only in JobCluster. If DETACHED
> >>>> is configured, JobCluster exits when the job finishes; if NON-DETACHED is
> >>>> configured, JobCluster exits when the job execution result is delivered.
> >>>> FLIP-74 doesn't propose changes in this scope; it is just left as is.
> >>>>
> >>>> However, it is an interesting part, and we can revisit this
> >>>> implementation a bit.
> >>>>
> >>>> <see the next email for compact reply in this one>
> >>>>
> >>>> 2. The retrieval of JobClient is so important that if we don't have a way
> >>>> to retrieve JobClient, it is a dumb public user-facing interface (what a
> >>>> strange state :P).
> >>>>
> >>>> About the retrieval of JobClient, as mentioned in the document, two ways
> >>>> should be supported.
> >>>>
> >>>> (1). Retrieved as the return type of job submission.
> >>>> (2). Retrieve a JobClient of an existing job (with job id).
> >>>>
> >>>> I highly respect your thoughts about how Executors should be and your
> >>>> thoughts on multi-layered clients.
> >>>> Although (2) is not supported by public interfaces, as summarized in the
> >>>> discussion above, we can discuss a bit the place of Executors in
> >>>> multi-layered clients and find a way to retrieve the JobClient of an
> >>>> existing job with a public client API. I will comment in the FLIP-73
> >>>> thread[2] since it is mostly about Executors.
> >>>>
> >>>> Best,
> >>>> tison.
> >>>>
> >>>> [1]
> >>>> https://docs.google.com/document/d/1E-8UjOLz4QPUTxetGWbU23OlsIH9VIdodpTsxwoQTs0/edit?disco=AAAADnLLvM8
> >>>> [2]
> >>>> https://lists.apache.org/x/thread.html/dc3a541709f96906b43df4155373af1cd09e08c3f105b0bd0ba3fca2@%3Cdev.flink.apache.org%3E
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Wed, Sep 25, 2019 at 9:29 PM Kostas Kloudas <kklou...@gmail.com> wrote:
> >>>>
> >>>>> Hi Tison,
> >>>>>
> >>>>> Thanks for the FLIP and launching the discussion!
> >>>>>
> >>>>> As a first note, big +1 on providing/exposing a JobClient to the users!
> >>>>>
> >>>>> Some points that would be nice to be clarified:
> >>>>> 1) You mention that we can get rid of the DETACHED mode: I agree that
> >>>>> at a high level, given that everything will now be asynchronous, there
> >>>>> is no need to keep the DETACHED mode but I think we should specify
> >>>>> some aspects. For example, without the explicit separation of the
> >>>>> modes, what happens when the job finishes? Does the client always
> >>>>> periodically poll for the result, or is the result pushed when
> >>>>> in NON-DETACHED mode? What happens if the client disconnects and
> >>>>> reconnects?
> >>>>>
> >>>>> 2) On the "how to retrieve a JobClient for a running Job", I think
> >>>>> this is related to the other discussion you opened in the ML about
> >>>>> multi-layered clients. First of all, I agree that exposing different
> >>>>> "levels" of clients would be a nice addition, and actually there have
> >>>>> been some discussions about doing so in the future. Now for this
> >>>>> specific discussion:
> >>>>>      i) I do not think that we should expose the
> >>>>> ClusterDescriptor/ClusterSpecification to the user, as this ties us to
> >>>>> a specific architecture which may change in the future.
> >>>>>     ii) I do not think it should be the Executor that will provide a
> >>>>> JobClient for an already running job (only for the Jobs that it
> >>>>> submits). The job of the executor should just be to execute() a
> >>>>> pipeline.
> >>>>>     iii) I think a solution that respects the separation of concerns
> >>>>> could be the addition of another component (in the future), something
> >>>>> like a ClientFactory, or ClusterFactory that will have methods like:
> >>>>> ClusterClient createCluster(Configuration), JobClient
> >>>>> retrieveJobClient(Configuration, JobId), maybe even (although not
> >>>>> sure) Executor getExecutor(Configuration) and maybe more. This
> >>>>> component would be responsible for interacting with a cluster manager
> >>>>> like Yarn and doing what is now being done by the ClusterDescriptor plus
> >>>>> some more stuff.
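
The component floated in iii) could be sketched roughly as follows. Only the two method signatures createCluster(Configuration) and retrieveJobClient(Configuration, JobId) come from the mail above. Every type here is a placeholder written for this example rather than the Flink class of the same name, and LocalClientFactory and the "cluster.id" key are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Placeholder types for illustration; these are NOT the Flink classes.
final class ClientFactorySketch {

    static final class Configuration {
        private final Map<String, String> values = new HashMap<>();

        Configuration set(String key, String value) {
            values.put(key, value);
            return this;
        }

        String get(String key, String defaultValue) {
            return values.getOrDefault(key, defaultValue);
        }
    }

    static final class JobId {
        final String value;

        JobId(String value) {
            this.value = value;
        }
    }

    interface ClusterClient {
        String clusterId();
    }

    interface JobClient {
        JobId jobId();
    }

    // The factory hides how a cluster is described and reached.
    interface ClientFactory {
        ClusterClient createCluster(Configuration configuration);

        JobClient retrieveJobClient(Configuration configuration, JobId jobId);
    }

    // Toy implementation; a real one would talk to a cluster manager.
    static final class LocalClientFactory implements ClientFactory {
        @Override
        public ClusterClient createCluster(Configuration configuration) {
            String id = configuration.get("cluster.id", "local"); // hypothetical key
            return () -> id;
        }

        @Override
        public JobClient retrieveJobClient(Configuration configuration, JobId jobId) {
            return () -> jobId;
        }
    }
}
```

In this shape the user only ever passes a Configuration, so ClusterDescriptor and ClusterSpecification could stay internal, as point i) asks.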
> >>>>>
> >>>>> Although under the hood all these abstractions (Environments,
> >>>>> Executors, ...) use the same clients, I believe their
> >>>>> job/existence is not contradictory; they simply hide some of the
> >>>>> complexity from the user, and give us, as developers, some freedom to
> >>>>> change some of the parts in the future. For example, the executor will
> >>>>> take a Pipeline, create a JobGraph and submit it, instead of requiring
> >>>>> the user to do each step separately. This allows us to, for example,
> >>>>> get rid of the Plan if in the future everything is DataStream.
> >>>>> Essentially, I think of these as layers of an onion with the clients
> >>>>> being close to the core. The higher you go, the more functionality is
> >>>>> included and hidden from the public eye.
> >>>>>
> >>>>> Point iii) by the way is just a thought and by no means final. I also
> >>>>> like the idea of multi-layered clients so this may spark up the
> >>>>> discussion.
> >>>>>
> >>>>> Cheers,
> >>>>> Kostas
> >>>>>
> >>>>> On Wed, Sep 25, 2019 at 2:21 PM Aljoscha Krettek <aljos...@apache.org>
> >>>>> wrote:
> >>>>>>
> >>>>>> Hi Tison,
> >>>>>>
> >>>>>> Thanks for proposing the document! I had some comments on the document.
> >>>>>>
> >>>>>> I think the only complex thing that we still need to figure out is how
> >>>>>> to get a JobClient for a job that is already running, as you mentioned
> >>>>>> in the document. Currently I’m thinking that it’s ok to add a method to
> >>>>>> Executor for retrieving a JobClient for a running job by providing an
> >>>>>> ID. Let’s see what Kostas has to say on the topic.
> >>>>>>
> >>>>>> Best,
> >>>>>> Aljoscha
> >>>>>>
> >>>>>>> On 25. Sep 2019, at 12:31, Zili Chen <wander4...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Hi all,
> >>>>>>>
> >>>>>>> As a summary of the discussion about introducing the Flink JobClient
> >>>>>>> API[1], we drafted FLIP-74[2] to gather thoughts and work towards a
> >>>>>>> standard public user-facing interface.
> >>>>>>>
> >>>>>>> This discussion thread aims at standardizing the job-level client API.
> >>>>>>> But I'd like to emphasize that how to retrieve a JobClient possibly
> >>>>>>> causes further discussion on the different levels of clients exposed
> >>>>>>> by Flink, so a follow-up thread will be started later to coordinate
> >>>>>>> FLIP-73 and FLIP-74 on the exposure issue.
> >>>>>>>
> >>>>>>> Looking forward to your opinions.
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> tison.
> >>>>>>>
> >>>>>>> [1]
> >>>>>>> https://lists.apache.org/thread.html/ce99cba4a10b9dc40eb729d39910f315ae41d80ec74f09a356c73938@%3Cdev.flink.apache.org%3E
> >>>>>>> [2]
> >>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-74%3A+Flink+JobClient+API
> >>>>>>
> >>>>>
> >>>>
>
>

-- 

Konstantin Knauf | Solutions Architect

+49 160 91394525


