Hi Zili, thanks for working on this topic. Just read through the FLIP and I have two questions:
* should we add "cancelWithSavepeoint" to a new public API, when we have deprecated the corresponding REST API/CLI methods? In my understanding there is no reason to use it anymore. * should we call "stopWithSavepoint" simply "stop" as "stop" always performs a savepoint? Best, Konstantin On Fri, Sep 27, 2019 at 10:48 AM Aljoscha Krettek <aljos...@apache.org> wrote: > Hi Flavio, > > I agree that this would be good to have. But I also think that this is > outside the scope of FLIP-74, I think it is an orthogonal feature. > > Best, > Aljoscha > > > On 27. Sep 2019, at 10:31, Flavio Pompermaier <pomperma...@okkam.it> > wrote: > > > > Hi all, > > just a remark about the Flink REST APIs (and its client as well): almost > > all the times we need a way to dynamically know which jobs are contained > in > > a jar file, and this could be exposed by the REST endpoint under > > /jars/:jarid/entry-points (a simple way to implement this would be to > check > > the value of Main-class or Main-classes inside the Manifest of the jar if > > they exists [1]). > > > > I understand that this is something that is not strictly required to > > execute Flink jobs but IMHO it would ease A LOT the work of UI developers > > that could have a way to show the users all available jobs inside a jar + > > their configurable parameters. > > For example, right now in the WebUI, you can upload a jar and then you > have > > to set (without any autocomplete or UI support) the main class and their > > params (for example using a string like --param1 xx --param2 yy). > > Adding this functionality to the REST API and the respective client would > > enable the WebUI (and all UIs interacting with a Flink cluster) to > prefill > > a dropdown list containing the list of entry-point classes (i.e. Flink > > jobs) and, once selected, their required (typed) parameters. > > > > Best, > > Flavio > > > > [1] https://issues.apache.org/jira/browse/FLINK-10864 > > > > On Fri, Sep 27, 2019 at 9:16 AM Zili Chen <wander4...@gmail.com> wrote: > > > >> modify > >> > >> /we just shutdown the cluster on the exit of client that running inside > >> cluster/ > >> > >> to > >> > >> we just shutdown the cluster on both the exit of client that running > inside > >> cluster and the finish of job. > >> Since client is running inside cluster we can easily wait for the end of > >> two both in ClusterEntrypoint. > >> > >> > >> Zili Chen <wander4...@gmail.com> 于2019年9月27日周五 下午3:13写道: > >> > >>> About JobCluster > >>> > >>> Actually I am not quite sure what we gains from DETACHED configuration > on > >>> cluster side. > >>> We don't have a NON-DETACHED JobCluster in fact in our codebase, right? > >>> > >>> It comes to me one major questions we have to answer first. > >>> > >>> *What JobCluster conceptually is exactly* > >>> > >>> Related discussion can be found in JIRA[1] and mailing list[2]. Stephan > >>> gives a nice > >>> description of JobCluster: > >>> > >>> Two things to add: - The job mode is very nice in the way that it runs > >> the > >>> client inside the cluster (in the same image/process that is the JM) > and > >>> thus unifies both applications and what the Spark world calls the > "driver > >>> mode". - Another thing I would add is that during the FLIP-6 design, we > >>> were thinking about setups where Dispatcher and JobManager are separate > >>> processes. A Yarn or Mesos Dispatcher of a session could run > >> independently > >>> (even as privileged processes executing no code). Then you the > "per-job" > >>> mode could still be helpful: when a job is submitted to the dispatcher, > >> it > >>> launches the JM again in a per-job mode, so that JM and TM processes > are > >>> bound to teh job only. For higher security setups, it is important that > >>> processes are not reused across jobs. > >>> > >>> However, currently in "per-job" mode we generate JobGraph in client > side, > >>> launching > >>> the JobCluster and retrieve the JobGraph for execution. So actually, we > >>> don't "run the > >>> client inside the cluster". > >>> > >>> Besides, refer to the discussion with Till[1], it would be helpful we > >>> follow the same process > >>> of session mode for that of "per-job" mode in user perspective, that we > >>> don't use > >>> OptimizedPlanEnvironment to create JobGraph, but directly deploy Flink > >>> cluster in env.execute. > >>> > >>> Generally 2 points > >>> > >>> 1. Running Flink job by invoke user main method and execute throughout, > >>> instead of create > >>> JobGraph from main-class. > >>> 2. Run the client inside the cluster. > >>> > >>> If 1 and 2 are implemented. There is obvious no need for DETACHED mode > in > >>> cluster side > >>> because we just shutdown the cluster on the exit of client that running > >>> inside cluster. Whether > >>> or not delivered the result is up to user code. > >>> > >>> [1] > >>> > >> > https://issues.apache.org/jira/browse/FLINK-14051?focusedCommentId=16931388&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16931388 > >>> [2] > >>> > >> > https://lists.apache.org/x/thread.html/e8f14a381be6c027e8945f884c3cfcb309ce49c1ba557d3749fca495@%3Cdev.flink.apache.org%3E > >>> > >>> > >>> Zili Chen <wander4...@gmail.com> 于2019年9月27日周五 下午2:13写道: > >>> > >>>> Thanks for your replies Kostas & Aljoscha! > >>>> > >>>> Below are replies point by point. > >>>> > >>>> 1. For DETACHED mode, what I said there is about the DETACHED mode in > >>>> client side. > >>>> There are two configurations overload the item DETACHED[1]. > >>>> > >>>> In client side, it means whether or not client.submitJob is blocking > to > >>>> job execution result. > >>>> Due to client.submitJob returns CompletableFuture<JobClient> > >> NON-DETACHED > >>>> is no > >>>> power at all. Caller of submitJob makes the decision whether or not > >>>> blocking to get the > >>>> JobClient and request for the job execution result. If client crashes, > >> it > >>>> is a user scope > >>>> exception that should be handled in user code; if client lost > connection > >>>> to cluster, we have > >>>> a retry times and interval configuration that automatically retry and > >>>> throws an user scope > >>>> exception if exceed. > >>>> > >>>> Your comment about poll for result or job result sounds like a concern > >> on > >>>> cluster side. > >>>> > >>>> In cluster side, DETACHED mode is alive only in JobCluster. If > DETACHED > >>>> configured, > >>>> JobCluster exits on job finished; if NON-DETACHED configured, > JobCluster > >>>> exits on job > >>>> execution result delivered. FLIP-74 doesn't stick to changes on this > >>>> scope, it is just remained. > >>>> > >>>> However, it is an interesting part we can revisit this implementation > a > >>>> bit. > >>>> > >>>> <see the next email for compact reply in this one> > >>>> > >>>> 2. The retrieval of JobClient is so important that if we don't have a > >> way > >>>> to retrieve JobClient it is > >>>> a dumb public user-facing interface(what a strange state :P). > >>>> > >>>> About the retrieval of JobClient, as mentioned in the document, two > ways > >>>> should be supported. > >>>> > >>>> (1). Retrieved as return type of job submission. > >>>> (2). Retrieve a JobClient of existing job.(with job id) > >>>> > >>>> I highly respect your thoughts about how Executors should be and > >> thoughts > >>>> on multi-layered clients. > >>>> Although, (2) is not supported by public interfaces as summary of > >>>> discussion above, we can discuss > >>>> a bit on the place of Executors on multi-layered clients and find a > way > >>>> to retrieve JobClient of > >>>> existing job with public client API. I will comment in FLIP-73 > thread[2] > >>>> since it is almost about Executors. > >>>> > >>>> Best, > >>>> tison. > >>>> > >>>> [1] > >>>> > >> > https://docs.google.com/document/d/1E-8UjOLz4QPUTxetGWbU23OlsIH9VIdodpTsxwoQTs0/edit?disco=AAAADnLLvM8 > >>>> [2] > >>>> > >> > https://lists.apache.org/x/thread.html/dc3a541709f96906b43df4155373af1cd09e08c3f105b0bd0ba3fca2@%3Cdev.flink.apache.org%3E > >>>> > >>>> > >>>> > >>>> > >>>> Kostas Kloudas <kklou...@gmail.com> 于2019年9月25日周三 下午9:29写道: > >>>> > >>>>> Hi Tison, > >>>>> > >>>>> Thanks for the FLIP and launching the discussion! > >>>>> > >>>>> As a first note, big +1 on providing/exposing a JobClient to the > users! > >>>>> > >>>>> Some points that would be nice to be clarified: > >>>>> 1) You mention that we can get rid of the DETACHED mode: I agree that > >>>>> at a high level, given that everything will now be asynchronous, > there > >>>>> is no need to keep the DETACHED mode but I think we should specify > >>>>> some aspects. For example, without the explicit separation of the > >>>>> modes, what happens when the job finishes. Does the client > >>>>> periodically poll for the result always or the result is pushed when > >>>>> in NON-DETACHED mode? What happens if the client disconnects and > >>>>> reconnects? > >>>>> > >>>>> 2) On the "how to retrieve a JobClient for a running Job", I think > >>>>> this is related to the other discussion you opened in the ML about > >>>>> multi-layered clients. First of all, I agree that exposing different > >>>>> "levels" of clients would be a nice addition, and actually there have > >>>>> been some discussions about doing so in the future. Now for this > >>>>> specific discussion: > >>>>> i) I do not think that we should expose the > >>>>> ClusterDescriptor/ClusterSpecification to the user, as this ties us > to > >>>>> a specific architecture which may change in the future. > >>>>> ii) I do not think it should be the Executor that will provide a > >>>>> JobClient for an already running job (only for the Jobs that it > >>>>> submits). The job of the executor should just be to execute() a > >>>>> pipeline. > >>>>> iii) I think a solution that respects the separation of concerns > >>>>> could be the addition of another component (in the future), something > >>>>> like a ClientFactory, or ClusterFactory that will have methods like: > >>>>> ClusterClient createCluster(Configuration), JobClient > >>>>> retrieveJobClient(Configuration , JobId), maybe even (although not > >>>>> sure) Executor getExecutor(Configuration ) and maybe more. This > >>>>> component would be responsible to interact with a cluster manager > like > >>>>> Yarn and do what is now being done by the ClusterDescriptor plus some > >>>>> more stuff. > >>>>> > >>>>> Although under the hood all these abstractions (Environments, > >>>>> Executors, ...) underneath use the same clients, I believe their > >>>>> job/existence is not contradicting but they simply hide some of the > >>>>> complexity from the user, and give us, as developers some freedom to > >>>>> change in the future some of the parts. For example, the executor > will > >>>>> take a Pipeline, create a JobGraph and submit it, instead of > requiring > >>>>> the user to do each step separately. This allows us to, for example, > >>>>> get rid of the Plan if in the future everything is DataStream. > >>>>> Essentially, I think of these as layers of an onion with the clients > >>>>> being close to the core. The higher you go, the more functionality is > >>>>> included and hidden from the public eye. > >>>>> > >>>>> Point iii) by the way is just a thought and by no means final. I also > >>>>> like the idea of multi-layered clients so this may spark up the > >>>>> discussion. > >>>>> > >>>>> Cheers, > >>>>> Kostas > >>>>> > >>>>> On Wed, Sep 25, 2019 at 2:21 PM Aljoscha Krettek < > aljos...@apache.org> > >>>>> wrote: > >>>>>> > >>>>>> Hi Tison, > >>>>>> > >>>>>> Thanks for proposing the document! I had some comments on the > >> document. > >>>>>> > >>>>>> I think the only complex thing that we still need to figure out is > >> how > >>>>> to get a JobClient for a job that is already running. As you > mentioned > >> in > >>>>> the document. Currently I’m thinking that its ok to add a method to > >>>>> Executor for retrieving a JobClient for a running job by providing an > >> ID. > >>>>> Let’s see what Kostas has to say on the topic. > >>>>>> > >>>>>> Best, > >>>>>> Aljoscha > >>>>>> > >>>>>>> On 25. Sep 2019, at 12:31, Zili Chen <wander4...@gmail.com> wrote: > >>>>>>> > >>>>>>> Hi all, > >>>>>>> > >>>>>>> Summary from the discussion about introducing Flink JobClient > >> API[1] > >>>>> we > >>>>>>> draft FLIP-74[2] to > >>>>>>> gather thoughts and towards a standard public user-facing > >> interfaces. > >>>>>>> > >>>>>>> This discussion thread aims at standardizing job level client API. > >>>>> But I'd > >>>>>>> like to emphasize that > >>>>>>> how to retrieve JobClient possibly causes further discussion on > >>>>> different > >>>>>>> level clients exposed from > >>>>>>> Flink so that a following thread will be started later to > >> coordinate > >>>>>>> FLIP-73 and FLIP-74 on > >>>>>>> expose issue. > >>>>>>> > >>>>>>> Looking forward to your opinions. > >>>>>>> > >>>>>>> Best, > >>>>>>> tison. > >>>>>>> > >>>>>>> [1] > >>>>>>> > >>>>> > >> > https://lists.apache.org/thread.html/ce99cba4a10b9dc40eb729d39910f315ae41d80ec74f09a356c73938@%3Cdev.flink.apache.org%3E > >>>>>>> [2] > >>>>>>> > >>>>> > >> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-74%3A+Flink+JobClient+API > >>>>>> > >>>>> > >>>> > > -- Konstantin Knauf | Solutions Architect +49 160 91394525 Follow us @VervericaData Ververica <https://www.ververica.com/> -- Join Flink Forward <https://flink-forward.org/> - The Apache Flink Conference Stream Processing | Event Driven | Real Time -- Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany -- Ververica GmbH Registered at Amtsgericht Charlottenburg: HRB 158244 B Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji (Tony) Cheng