Re: [DISCUSS] FLIP-85: Delayed Job Graph Generation

Yang Wang Wed, 04 Mar 2020 22:30:26 -0800

Hi Peter,
Thanks a lot for your response.

Hi all @Kostas Kloudas <kklou...@gmail.com> @Zili Chen
<wander4...@gmail.com> @Peter Huang <huangzhenqiu0...@gmail.com> @Rong Rong
<walter...@gmail.com>
It seems that we have reached an agreement. The “application mode”
is regarded as an enhanced “per-job” mode. It is
orthogonal to “cluster deploy”. Currently, we bind “per-job” to
`run-user-main-on-client` and “application mode”
to `run-user-main-on-cluster`.
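
The binding above can be written down as a small lookup. This is only an illustrative sketch, not Flink code: the `MainLocation` enum, the `MODE_TO_MAIN_LOCATION` table and the `main_location` helper are hypothetical names. Only the two mode names and the two `run-user-main-on-*` labels come from this thread.

```python
from enum import Enum


class MainLocation(Enum):
    """Where the user main() runs (labels taken from the thread)."""
    CLIENT = "run-user-main-on-client"
    CLUSTER = "run-user-main-on-cluster"


# Hypothetical table: the current binding of deployment mode to main() location.
MODE_TO_MAIN_LOCATION = {
    "per-job": MainLocation.CLIENT,
    "application": MainLocation.CLUSTER,
}


def main_location(mode: str) -> MainLocation:
    """Return where the user main() runs for the given deployment mode."""
    try:
        return MODE_TO_MAIN_LOCATION[mode]
    except KeyError:
        raise ValueError(f"unknown deployment mode: {mode!r}") from None
```

Keeping the binding in one table reflects that it is a current choice ("currently, we bind ...") rather than a fixed property of either mode.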


Do you have any other concerns about moving FLIP-85 to a vote?


Best,
Yang

Peter Huang <huangzhenqiu0...@gmail.com> wrote on Thu, Mar 5, 2020 at 12:48 PM:

> Hi Yang and Kostas,
>
> Thanks for the clarification. It makes more sense to me if the long-term
> goal is to replace per-job mode with application mode
> in the future (once multiple `execute()` calls can be supported).
> Before that, it will be better to keep the concept of
> application mode internal. As Yang suggested, users only need to use a
> `-R/-- remote-deploy` CLI option to launch
> a per-job cluster with the main function executed in the cluster
> entry point. +1 for the execution plan.
>
>
>
> Best Regards
> Peter Huang
>
>
>
>
> On Tue, Mar 3, 2020 at 7:11 AM Yang Wang <danrtsey...@gmail.com> wrote:
>
>> Hi Peter,
>>
>> Having the application mode does not mean we will drop the cluster-deploy
>> option. I just want to share some thoughts about “Application Mode”.
>>
>>
>> 1. The application mode could cover the per-job semantics. Its lifecycle is
>> bound
>> to the user `main()`, and all the jobs in the user main will be executed
>> in the same
>> Flink cluster. In the first phase of the FLIP-85 implementation, running the user
>> main on the
>> cluster side could be supported in application mode.
>>
>> 2. Maybe in the future we will also need to support multiple `execute()` calls on
>> the client side
>> in the same Flink cluster. Then the per-job mode will evolve into application
>> mode.
>>
>> 3. From the user's perspective, only a `-R/-- remote-deploy` CLI option is
>> visible. They
>> are not aware of the application mode.
>>
>> 4. In the first phase, the application mode works as “per-job” (only
>> one job in
>> the user main). We just leave more potential for the future.
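
The difference in how `execute()` calls map to clusters can be modelled with a toy simulation. This is a sketch under stated assumptions, not Flink code: `run_per_job`, `run_application` and `user_main` are hypothetical names, and a "cluster" is just a list of job names. It reflects two statements from this thread: in application mode all jobs of a user main run in the same cluster, while in client mode each `execute()` deploys its own cluster.

```python
from typing import Callable, List

Execute = Callable[[str], None]


def run_per_job(user_main: Callable[[Execute], None]) -> List[List[str]]:
    """Client-side per-job: every execute() call deploys its own cluster."""
    clusters: List[List[str]] = []

    def execute(job_name: str) -> None:
        clusters.append([job_name])  # a fresh cluster holding only this job

    user_main(execute)
    return clusters


def run_application(user_main: Callable[[Execute], None]) -> List[List[str]]:
    """Application mode: one cluster whose lifecycle is bound to main()."""
    cluster: List[str] = []

    def execute(job_name: str) -> None:
        cluster.append(job_name)  # all jobs share the same cluster

    user_main(execute)
    return [cluster]


def user_main(execute: Execute) -> None:
    """A hypothetical user program with two execute() calls."""
    execute("job-a")
    execute("job-b")
```

With a single `execute()` in the user main, both functions return one cluster with one job, which matches point 4: in the first phase application mode behaves like per-job.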
>>
>>
>> I am not against calling it “cluster deploy mode” if you all think
>> it is clearer for users.
>>
>>
>>
>> Best,
>> Yang
>>
>> Kostas Kloudas <kklou...@gmail.com> wrote on Tue, Mar 3, 2020 at 6:49 PM:
>>
>>> Hi Peter,
>>>
>>> I understand your point. This is why I was also a bit torn about the
>>> name and my proposal was a bit aligned with yours (something along the
>>> lines of "cluster deploy" mode).
>>>
>>> But many of the other participants in the discussion suggested the
>>> "Application Mode". I think that the reasoning is that now the user's
>>> Application is more self-contained.
>>> It will be submitted to the cluster and the user can just disconnect.
>>> In addition, as discussed briefly in the doc, in the future there may
>>> be better support for multi-execute applications which will bring us
>>> one step closer to the true "Application Mode". But this is how I
>>> interpreted their arguments, of course they can also express their
>>> thoughts on the topic :)
>>>
>>> Cheers,
>>> Kostas
>>>
>>> On Mon, Mar 2, 2020 at 6:15 PM Peter Huang <huangzhenqiu0...@gmail.com>
>>> wrote:
>>> >
>>> > Hi Kostas,
>>> >
>>> > Thanks for updating the wiki. We are aligned on the implementation
>>> > in the doc. But I feel the naming is still a little confusing
>>> > from a user's perspective. It is well known that Flink supports per-job
>>> > clusters and session clusters. That concept sits at the layer of how a job is
>>> > managed within Flink. The method introduced until now is a kind of mix of
>>> > job and session cluster that compromises on implementation complexity. We
>>> > probably don't need to label it as Application Mode at the same layer as
>>> > per-job cluster and session cluster. Conceptually, I think it is still a
>>> > cluster-mode implementation for the per-job cluster.
>>> >
>>> > To minimize confusion for users, I think it would be better to make it just an
>>> > option of the per-job cluster for each type of cluster manager. What do you
>>> > think?
>>> >
>>> >
>>> > Best Regards
>>> > Peter Huang
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > On Mon, Mar 2, 2020 at 7:22 AM Kostas Kloudas <kklou...@gmail.com>
>>> wrote:
>>> >>
>>> >> Hi Yang,
>>> >>
>>> >> The difference between per-job and application mode is that, as you
>>> >> described, in the per-job mode the main is executed on the client
>>> >> while in the application mode, the main is executed on the cluster.
>>> >> I do not think we have to offer "application mode" with running the
>>> >> main on the client side as this is exactly what the per-job mode does
>>> >> currently and, as you described also, it would be redundant.
>>> >>
>>> >> Sorry if this was not clear in the document.
>>> >>
>>> >> Cheers,
>>> >> Kostas
>>> >>
>>> >> On Mon, Mar 2, 2020 at 3:17 PM Yang Wang <danrtsey...@gmail.com>
>>> wrote:
>>> >> >
>>> >> > Hi Kostas,
>>> >> >
>>> >> > Thanks a lot for your summary and for updating the FLIP-85 wiki.
>>> >> > Currently, I have no more
>>> >> > questions about the motivation, approach, fault tolerance and the first
>>> >> > phase implementation.
>>> >> >
>>> >> > I think the new title "Flink Application Mode" makes a lot of sense
>>> >> > to me. Especially for
>>> >> > containerized environments, the cluster deploy option will be very
>>> >> > useful.
>>> >> >
>>> >> > Just one concern: how do we introduce this new application mode to
>>> >> > our users?
>>> >> > Each user program (i.e. `main()`) is an application. Currently, we
>>> >> > intend to support only one
>>> >> > `execute()`. So what's the difference between per-job and
>>> >> > application mode?
>>> >> >
>>> >> > For per-job, the user `main()` is always executed on the client side. And
>>> >> > for application mode, the user
>>> >> > `main()` could be executed on the client or master side (configured via
>>> >> > a CLI option).
>>> >> > Right? We need to have a clear concept. Otherwise, users will
>>> >> > be more and more confused.
>>> >> >
>>> >> >
>>> >> > Best,
>>> >> > Yang
>>> >> >
>>> >> > Kostas Kloudas <kklou...@gmail.com> wrote on Mon, Mar 2, 2020 at 5:58 PM:
>>> >> >>
>>> >> >> Hi all,
>>> >> >>
>>> >> >> I updated
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-85+Flink+Application+Mode
>>> >> >> based on the discussion we had here:
>>> >> >>
>>> >> >>
>>> https://docs.google.com/document/d/1ji72s3FD9DYUyGuKnJoO4ApzV-nSsZa0-bceGXW7Ocw/edit#
>>> >> >>
>>> >> >> Please let me know what you think and please keep the discussion
>>> in the ML :)
>>> >> >>
>>> >> >> Thanks for starting the discussion and I hope that soon we will be
>>> >> >> able to vote on the FLIP.
>>> >> >>
>>> >> >> Cheers,
>>> >> >> Kostas
>>> >> >>
>>> >> >> On Thu, Jan 16, 2020 at 3:40 AM Yang Wang <danrtsey...@gmail.com>
>>> wrote:
>>> >> >> >
>>> >> >> > Hi all,
>>> >> >> >
>>> >> >> > Thanks a lot for the feedback from @Kostas Kloudas. All your
>>> >> >> > concerns are on point. FLIP-85 is mainly
>>> >> >> > focused on supporting cluster mode for per-job, since it is more
>>> >> >> > urgent and has many more use
>>> >> >> > cases in both Yarn and Kubernetes deployments. For session
>>> >> >> > clusters, we could have more discussion
>>> >> >> > in a new thread later.
>>> >> >> >
>>> >> >> > #1, How to download the user jars and dependencies for per-job
>>> >> >> > in cluster mode?
>>> >> >> > For Yarn, we could register the user jars and dependencies as
>>> >> >> > LocalResource. They will be distributed
>>> >> >> > by Yarn, and once the JobManager and TaskManager are launched, the
>>> >> >> > jars already exist.
>>> >> >> > For Standalone per-job and K8s, we expect that the user jars
>>> >> >> > and dependencies are built into the image.
>>> >> >> > Alternatively, an InitContainer could be used for downloading. It is
>>> >> >> > natively
>>> >> >> > distributed, so we will not have a bottleneck.
>>> >> >> >
>>> >> >> > #2, Job graph recovery
>>> >> >> > We could have an optimization to store the job graph on the DFS.
>>> >> >> > However, I suggest that building a new job graph
>>> >> >> > from the configuration be the default option, since we will not
>>> >> >> > always have
>>> >> >> > a DFS store when deploying a
>>> >> >> > Flink per-job cluster. Of course, we assume that using the same
>>> >> >> > configuration (e.g. job_id, user_jar, main_class,
>>> >> >> > main_args, parallelism, savepoint_settings, etc.) will produce the
>>> >> >> > same job
>>> >> >> > graph. I think the standalone per-job
>>> >> >> > already has similar behavior.
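
The assumption that the same configuration yields the same job graph could be guarded by fingerprinting the submission configuration and comparing fingerprints across restarts. This is a minimal sketch under that assumption, not Flink code: `job_graph_fingerprint` is a hypothetical helper, and it hashes only the configuration, not a real job graph. The key names mirror the ones listed in this message.

```python
import hashlib
import json


def job_graph_fingerprint(config: dict) -> str:
    """Return a stable SHA-256 fingerprint of a submission configuration.

    Keys are sorted so that the fingerprint does not depend on the order
    in which the configuration entries were written.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical example values; only the key names come from the thread.
submitted = {
    "job_id": "example-job",
    "user_jar": "example.jar",
    "main_class": "com.example.Main",
    "main_args": ["--input", "in"],
    "parallelism": 4,
}
recovered = dict(reversed(list(submitted.items())))  # same entries, other order
changed = {**submitted, "parallelism": 8}
```

A fingerprint mismatch on recovery would indicate that the rebuilt job graph may differ from the one originally submitted. A matching fingerprint does not prove the graphs are identical if the user program itself is non-deterministic.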
>>> >> >> >
>>> >> >> > #3, What happens with jobs that have multiple execute calls?
>>> >> >> > Currently, this is really a problem. Even if we use a local client on
>>> >> >> > the Flink
>>> >> >> > master side, it will behave differently from
>>> >> >> > client mode. In client mode, if we execute multiple times, then
>>> >> >> > we will
>>> >> >> > deploy a separate Flink cluster for each execute.
>>> >> >> > I am not quite sure whether that is reasonable. However, I still
>>> >> >> > think using
>>> >> >> > the local client is a good choice. We could
>>> >> >> > continue the discussion in a new thread.
>>> >> >> > @Zili Chen <wander4...@gmail.com> Do
>>> >> >> > you want to drive this?
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > Best,
>>> >> >> > Yang
>>> >> >> >
>>> >> >> > Peter Huang <huangzhenqiu0...@gmail.com> wrote on Thu, Jan 16, 2020 at 1:55 AM:
>>> >> >> >
>>> >> >> > > Hi Kostas,
>>> >> >> > >
>>> >> >> > > Thanks for this feedback. I couldn't agree more. The
>>> >> >> > > cluster mode should be added
>>> >> >> > > first to the per-job cluster.
>>> >> >> > >
>>> >> >> > > 1) For job cluster implementation
>>> >> >> > > 1. Job graph recovery: either rebuild it from the configuration or store a static
>>> >> >> > > job graph, as in the
>>> >> >> > > session cluster. I think the static one will be better because of the shorter
>>> >> >> > > recovery time.
>>> >> >> > > Let me update the doc for details.
>>> >> >> > >
>>> >> >> > > 2. For jobs that execute multiple times, I think @Zili Chen
>>> >> >> > > <wander4...@gmail.com> has proposed the local client solution
>>> >> >> > > that can
>>> >> >> > > actually run the program in the cluster entry point. We can
>>> >> >> > > put the
>>> >> >> > > implementation in the second stage,
>>> >> >> > > or even in a new FLIP for further discussion.
>>> >> >> > >
>>> >> >> > > 2) For session cluster implementation
>>> >> >> > > We can disable the cluster mode for the session cluster in the
>>> >> >> > > first
>>> >> >> > > stage. I agree the jar downloading will be painful.
>>> >> >> > > We can consider a PoC and performance evaluation first. If
>>> >> >> > > the end-to-end
>>> >> >> > > experience is good enough, then we can consider
>>> >> >> > > proceeding with the solution.
>>> >> >> > >
>>> >> >> > > Looking forward to more opinions from @Yang Wang <
>>> danrtsey...@gmail.com> @Zili
>>> >> >> > > Chen <wander4...@gmail.com> @Dian Fu <dian0511...@gmail.com>.
>>> >> >> > >
>>> >> >> > >
>>> >> >> > > Best Regards
>>> >> >> > > Peter Huang
>>> >> >> > >
>>> >> >> > > On Wed, Jan 15, 2020 at 7:50 AM Kostas Kloudas <
>>> kklou...@gmail.com> wrote:
>>> >> >> > >
>>> >> >> > >> Hi all,
>>> >> >> > >>
>>> >> >> > >> I am writing here as the discussion on the Google Doc seems
>>> to be a
>>> >> >> > >> bit difficult to follow.
>>> >> >> > >>
>>> >> >> > >> I think that in order to be able to make progress, it would
>>> be helpful
>>> >> >> > >> to focus on per-job mode for now.
>>> >> >> > >> The reason is that:
>>> >> >> > >>  1) making the (unique) JobSubmitHandler responsible for
>>> creating the
>>> >> >> > >> jobgraphs,
>>> >> >> > >>   which includes downloading dependencies, is not an optimal
>>> solution
>>> >> >> > >>  2) even if we put the responsibility on the JobMaster,
>>> currently each
>>> >> >> > >> job has its own
>>> >> >> > >>   JobMaster but they all run on the same process, so we have
>>> again a
>>> >> >> > >> single entity.
>>> >> >> > >>
>>> >> >> > >> Of course after this is done, and if we feel comfortable with
>>> the
>>> >> >> > >> solution, then we can go to the session mode.
>>> >> >> > >>
>>> >> >> > >> A second comment has to do with fault-tolerance in the
>>> per-job,
>>> >> >> > >> cluster-deploy mode.
>>> >> >> > >> In the document, it is suggested that upon recovery, the
>>> JobMaster of
>>> >> >> > >> each job re-creates the JobGraph.
>>> >> >> > >> I am just wondering if it is better to create and store the
>>> jobGraph
>>> >> >> > >> upon submission and only fetch it
>>> >> >> > >> upon recovery so that we have a static jobGraph.
>>> >> >> > >>
>>> >> >> > >> Finally, I have a question which is what happens with jobs
>>> that have
>>> >> >> > >> multiple execute calls?
>>> >> >> > >> The semantics seem to change compared to the current
>>> behaviour, right?
>>> >> >> > >>
>>> >> >> > >> Cheers,
>>> >> >> > >> Kostas
>>> >> >> > >>
>>> >> >> > >> On Wed, Jan 8, 2020 at 8:05 PM tison <wander4...@gmail.com>
>>> wrote:
>>> >> >> > >> >
>>> >> >> > >> > Not always. Yang Wang is also not yet a committer, but he
>>> >> >> > >> > can join the
>>> >> >> > >> > channel. I cannot find your ID by clicking “Add new member
>>> >> >> > >> > in channel”, so
>>> >> >> > >> > I came to ask you to try out the link. Possibly I will
>>> >> >> > >> > find other ways,
>>> >> >> > >> > but the original purpose is that the Slack channel is a
>>> >> >> > >> > public area where we
>>> >> >> > >> > discuss development...
>>> >> >> > >> > Best,
>>> >> >> > >> > tison.
>>> >> >> > >> >
>>> >> >> > >> >
>>> >> >> > >> > Peter Huang <huangzhenqiu0...@gmail.com> wrote on Thu, Jan 9, 2020 at 2:44 AM:
>>> >> >> > >> >
>>> >> >> > >> > > Hi Tison,
>>> >> >> > >> > >
>>> >> >> > >> > > I am not a Flink committer yet, so I think I can't join
>>> >> >> > >> > > it either.
>>> >> >> > >> > >
>>> >> >> > >> > >
>>> >> >> > >> > > Best Regards
>>> >> >> > >> > > Peter Huang
>>> >> >> > >> > >
>>> >> >> > >> > > On Wed, Jan 8, 2020 at 9:39 AM tison <
>>> wander4...@gmail.com> wrote:
>>> >> >> > >> > >
>>> >> >> > >> > > > Hi Peter,
>>> >> >> > >> > > >
>>> >> >> > >> > > > Could you try out this link?
>>> >> >> > >> > > https://the-asf.slack.com/messages/CNA3ADZPH
>>> >> >> > >> > > >
>>> >> >> > >> > > > Best,
>>> >> >> > >> > > > tison.
>>> >> >> > >> > > >
>>> >> >> > >> > > >
>>> >> >> > >> > > > Peter Huang <huangzhenqiu0...@gmail.com> wrote on Thu, Jan 9, 2020 at 1:22 AM:
>>> >> >> > >> > > >
>>> >> >> > >> > > > > Hi Tison,
>>> >> >> > >> > > > >
>>> >> >> > >> > > > > I can't join the group with the shared link. Would you
>>> >> >> > >> > > > > please add me to the
>>> >> >> > >> > > > > group? My Slack account is huangzhenqiu0825.
>>> >> >> > >> > > > > Thank you in advance.
>>> >> >> > >> > > > >
>>> >> >> > >> > > > >
>>> >> >> > >> > > > > Best Regards
>>> >> >> > >> > > > > Peter Huang
>>> >> >> > >> > > > >
>>> >> >> > >> > > > > On Wed, Jan 8, 2020 at 12:02 AM tison <
>>> wander4...@gmail.com>
>>> >> >> > >> wrote:
>>> >> >> > >> > > > >
>>> >> >> > >> > > > > > Hi Peter,
>>> >> >> > >> > > > > >
>>> >> >> > >> > > > > > As described above, this effort should get
>>> >> >> > >> > > > > > attention from the people developing
>>> >> >> > >> > > > > > FLIP-73, a.k.a. Executor abstractions. I recommend
>>> >> >> > >> > > > > > that you join the public
>>> >> >> > >> > > > > > Slack channel[1] for Flink Client API Enhancement,
>>> >> >> > >> > > > > > where you can share
>>> >> >> > >> > > > > > your detailed thoughts. It will possibly get more
>>> >> >> > >> > > > > > concrete attention.
>>> >> >> > >> > > > > >
>>> >> >> > >> > > > > > Best,
>>> >> >> > >> > > > > > tison.
>>> >> >> > >> > > > > >
>>> >> >> > >> > > > > > [1]
>>> >> >> > >> > > > > > https://slack.com/share/IS21SJ75H/Rk8HhUly9FuEHb7oGwBZ33uL/enQtODg2MDYwNjE5MTg3LTA2MjIzNDc1M2ZjZDVlMjdlZjk1M2RkYmJhNjAwMTk2ZDZkODQ4NmY5YmI4OGRhNWJkYTViMTM1NzlmMzc4OWM
>>> >> >> > >> > > > > >
>>> >> >> > >> > > > > >
>>> >> >> > >> > > > > > Peter Huang <huangzhenqiu0...@gmail.com> wrote on Tue, Jan 7, 2020 at 5:09 AM:
>>> >> >> > >> > > > > >
>>> >> >> > >> > > > > > > Dear All,
>>> >> >> > >> > > > > > >
>>> >> >> > >> > > > > > > Happy new year! Based on the existing feedback
>>> >> >> > >> > > > > > > from the community, we
>>> >> >> > >> > > > > > > revised the doc to cover session
>>> >> >> > >> > > > > > > cluster support, the
>>> >> >> > >> > > > > > > concrete interface changes needed, and the execution
>>> >> >> > >> > > > > > > plan. Please take one more
>>> >> >> > >> > > > > > > round of review at your convenience.
>>> >> >> > >> > > > > > >
>>> >> >> > >> > > > > > >
>>> >> >> > >> > > > > > > https://docs.google.com/document/d/1aAwVjdZByA-0CHbgv16Me-vjaaDMCfhX7TzVVTuifYM/edit#
>>> >> >> > >> > > > > > >
>>> >> >> > >> > > > > > >
>>> >> >> > >> > > > > > > Best Regards
>>> >> >> > >> > > > > > > Peter Huang
>>> >> >> > >> > > > > > >
>>> >> >> > >> > > > > > >
>>> >> >> > >> > > > > > >
>>> >> >> > >> > > > > > >
>>> >> >> > >> > > > > > >
>>> >> >> > >> > > > > > > On Thu, Jan 2, 2020 at 11:29 AM Peter Huang <
>>> >> >> > >> > > > > huangzhenqiu0...@gmail.com>
>>> >> >> > >> > > > > > > wrote:
>>> >> >> > >> > > > > > >
>>> >> >> > >> > > > > > > > Hi Dian,
>>> >> >> > >> > > > > > > > Thanks for giving us valuable feedback.
>>> >> >> > >> > > > > > > >
>>> >> >> > >> > > > > > > > 1) It's better to have a whole design for this feature
>>> >> >> > >> > > > > > > > Regarding the suggestion of enabling the cluster mode for the
>>> >> >> > >> > > > > > > > session cluster as well, I think Flink already supports it.
>>> >> >> > >> > > > > > > > WebSubmissionExtension already allows
>>> >> >> > >> > > > > > > > users to start a job with a specified jar through the web UI.
>>> >> >> > >> > > > > > > > But we need to enable the feature from the CLI for both local
>>> >> >> > >> > > > > > > > and remote jars.
>>> >> >> > >> > > > > > > > I will align with Yang Wang on the details first and then
>>> >> >> > >> > > > > > > > update the design doc.
>>> >> >> > >> > > > > > > >
>>> >> >> > >> > > > > > > > 2) It's better to consider the convenience for
>>> users, such
>>> >> >> > >> as
>>> >> >> > >> > > > > debugging
>>> >> >> > >> > > > > > > >
>>> >> >> > >> > > > > > > > I am wondering whether we can store the exception from job graph
>>> >> >> > >> > > > > > > > generation in the application master. As no streaming graph can be
>>> >> >> > >> > > > > > > > scheduled in this case, no more TMs will be requested from FlinkRM.
>>> >> >> > >> > > > > > > > If the AM is still running, users can still query it from the CLI. As
>>> >> >> > >> > > > > > > > it requires more changes, we can get some feedback from
>>> >> >> > >> > > > > > > > <aljos...@apache.org> and @zjf...@gmail.com <zjf...@gmail.com>.
>>> >> >> > >> > > > > > > >
>>> >> >> > >> > > > > > > > 3) It's better to consider the impact to the
>>> stability of
>>> >> >> > >> the
>>> >> >> > >> > > > cluster
>>> >> >> > >> > > > > > > >
>>> >> >> > >> > > > > > > > I agree with Yang Wang's opinion.
>>> >> >> > >> > > > > > > >
>>> >> >> > >> > > > > > > >
>>> >> >> > >> > > > > > > >
>>> >> >> > >> > > > > > > > Best Regards
>>> >> >> > >> > > > > > > > Peter Huang
>>> >> >> > >> > > > > > > >
>>> >> >> > >> > > > > > > >
>>> >> >> > >> > > > > > > > On Sun, Dec 29, 2019 at 9:44 PM Dian Fu <
>>> >> >> > >> dian0511...@gmail.com>
>>> >> >> > >> > > > > wrote:
>>> >> >> > >> > > > > > > >
>>> >> >> > >> > > > > > > >> Hi all,
>>> >> >> > >> > > > > > > >>
>>> >> >> > >> > > > > > > >> Sorry to jump into this discussion. Thanks, everyone, for the
>>> >> >> > >> > > > > > > >> discussion.
>>> >> >> > >> > > > > > > >> I'm very interested in this topic although I'm not an expert in
>>> >> >> > >> > > > > > > >> this part.
>>> >> >> > >> > > > > > > >> So I'm glad to share my thoughts as follows:
>>> >> >> > >> > > > > > > >>
>>> >> >> > >> > > > > > > >> 1) It's better to have a whole design for this feature
>>> >> >> > >> > > > > > > >> As we know, there are two deployment modes: per-job mode and session
>>> >> >> > >> > > > > > > >> mode. I'm wondering which mode really needs this feature. As the design
>>> >> >> > >> > > > > > > >> doc mentioned, per-job mode is more often used for streaming jobs and
>>> >> >> > >> > > > > > > >> session mode is usually used for batch jobs (of course, the job types
>>> >> >> > >> > > > > > > >> and the deployment modes are orthogonal). Usually a streaming job only
>>> >> >> > >> > > > > > > >> needs to be submitted once and will run for days or weeks, while batch
>>> >> >> > >> > > > > > > >> jobs will be submitted more frequently than streaming jobs. This means
>>> >> >> > >> > > > > > > >> that maybe session mode also needs this feature. However, if we support
>>> >> >> > >> > > > > > > >> this feature in session mode, the application master will become a new
>>> >> >> > >> > > > > > > >> centralized service (which should be solved). So in this case, it's
>>> >> >> > >> > > > > > > >> better to have a complete design for both per-job mode and session
>>> >> >> > >> > > > > > > >> mode. Furthermore, even if we can do it phase by phase, we need to have
>>> >> >> > >> > > > > > > >> a whole picture of how it works in both per-job mode and session mode.
>>> >> >> > >> > > > > > > >>
>>> >> >> > >> > > > > > > >> 2) It's better to consider the convenience for users, such as debugging
>>> >> >> > >> > > > > > > >> After we finish this feature, the job graph will be compiled in the
>>> >> >> > >> > > > > > > >> application master, which means that users cannot easily get the
>>> >> >> > >> > > > > > > >> exception message synchronously in the job client if there are problems
>>> >> >> > >> > > > > > > >> during job graph compilation (especially for platform users), for
>>> >> >> > >> > > > > > > >> example when the resource path is incorrect or the user program itself
>>> >> >> > >> > > > > > > >> has some problems. What I'm thinking is that maybe we should throw the
>>> >> >> > >> > > > > > > >> exceptions as early as possible (during the job submission stage).
>>> >> >> > >> > > > > > > >>
>>> >> >> > >> > > > > > > >> 3) It's better to consider the impact on the stability of the cluster
>>> >> >> > >> > > > > > > >> If we perform the compilation in the application master, we should
>>> >> >> > >> > > > > > > >> consider the impact of compilation errors. Although YARN could resume
>>> >> >> > >> > > > > > > >> the application master in case of failures, in some cases a compilation
>>> >> >> > >> > > > > > > >> failure may waste cluster resources and may impact the stability of the
>>> >> >> > >> > > > > > > >> cluster and the other jobs in the cluster, for example when the
>>> >> >> > >> > > > > > > >> resource path is incorrect or the user program itself has some problems
>>> >> >> > >> > > > > > > >> (in this case, job failover cannot solve this kind of problem). In the
>>> >> >> > >> > > > > > > >> current implementation, compilation errors are handled on the client
>>> >> >> > >> > > > > > > >> side and there is no impact on the cluster at all.
>>> >> >> > >> > > > > > > >>
>>> >> >> > >> > > > > > > >> Regarding 1), the design doc clearly points out that only per-job
>>> >> >> > >> > > > > > > >> mode will be supported. However, I think it's better to also consider
>>> >> >> > >> > > > > > > >> session mode in the design doc.
>>> >> >> > >> > > > > > > >> Regarding 2) and 3), I have not seen related sections in the design
>>> >> >> > >> > > > > > > >> doc. It would be good if we could cover them in the design doc.
>>> >> >> > >> > > > > > > >>
>>> >> >> > >> > > > > > > >> Feel free to correct me if there is anything I misunderstand.
>>> >> >> > >> > > > > > > >>
>>> >> >> > >> > > > > > > >> Regards,
>>> >> >> > >> > > > > > > >> Dian
>>> >> >> > >> > > > > > > >>
>>> >> >> > >> > > > > > > >>
>>> >> >> > >> > > > > > > >> > On Dec 27, 2019, at 3:13 AM, Peter Huang <huangzhenqiu0...@gmail.com> wrote:
>>> >> >> > >> > > > > > > >> >
>>> >> >> > >> > > > > > > >> > Hi Yang,
>>> >> >> > >> > > > > > > >> >
>>> >> >> > >> > > > > > > >> > I couldn't agree more. The effort definitely needs to align with
>>> >> >> > >> > > > > > > >> > the final goal of FLIP-73.
>>> >> >> > >> > > > > > > >> > I am thinking about whether we can achieve the goal in two phases.
>>> >> >> > >> > > > > > > >> >
>>> >> >> > >> > > > > > > >> > 1) Phase I
>>> >> >> > >> > > > > > > >> > As the CliFrontend will not be deprecated soon, we can still use
>>> >> >> > >> > > > > > > >> > the deployMode flag there,
>>> >> >> > >> > > > > > > >> > pass the program info through the Flink configuration, and use the
>>> >> >> > >> > > > > > > >> > ClassPathJobGraphRetriever
>>> >> >> > >> > > > > > > >> > to generate the job graph in the ClusterEntrypoints of Yarn and
>>> >> >> > >> > > > > > > >> > Kubernetes.
>>> >> >> > >> > > > > > > >> >
>>> >> >> > >> > > > > > > >> > 2) Phase II
>>> >> >> > >> > > > > > > >> > In AbstractJobClusterExecutor, the job graph is generated in the
>>> >> >> > >> > > > > > > >> > execute function. We can still
>>> >> >> > >> > > > > > > >> > use the deployMode in it. With deployMode = cluster, the execute
>>> >> >> > >> > > > > > > >> > function only starts the cluster.
>>> >> >> > >> > > > > > > >> >
>>> >> >> > >> > > > > > > >> > When {Yarn/Kubernetes}PerJobClusterEntrypoint starts, it will start
>>> >> >> > >> > > > > > > >> > the dispatcher first; then we can use
>>> >> >> > >> > > > > > > >> > a ClusterEnvironment similar to ContextEnvironment to submit the job
>>> >> >> > >> > > > > > > >> > with jobName to the local
>>> >> >> > >> > > > > > > >> > dispatcher. For the details, we need more investigation. Let's wait
>>> >> >> > >> > > > > > > >> > for @Aljoscha Krettek <aljos...@apache.org> and
>>> >> >> > >> > > > > > > >> > @Till Rohrmann <trohrm...@apache.org>'s
>>> >> >> > >> > > > > > > >> > feedback after the holiday season.
>>> >> >> > >> > > > > > > >> >
>>> >> >> > >> > > > > > > >> > Thank you in advance. Merry Christmas and Happy New Year!!!
>>> >> >> > >> > > > > > > >> >
>>> >> >> > >> > > > > > > >> >
>>> >> >> > >> > > > > > > >> > Best Regards
>>> >> >> > >> > > > > > > >> > Peter Huang
>>> >> >> > >> > > > > > > >> >
>>> >> >> > >> > > > > > > >> >
>>> >> >> > >> > > > > > > >> >
>>> >> >> > >> > > > > > > >> >
>>> >> >> > >> > > > > > > >> >
>>> >> >> > >> > > > > > > >> >
>>> >> >> > >> > > > > > > >> >
>>> >> >> > >> > > > > > > >> >
>>> >> >> > >> > > > > > > >> > On Wed, Dec 25, 2019 at 1:08 AM Yang Wang <
>>> >> >> > >> > > > danrtsey...@gmail.com>
>>> >> >> > >> > > > > > > >> wrote:
>>> >> >> > >> > > > > > > >> >
>>> >> >> > >> > > > > > > >> >> Hi Peter,
>>> >> >> > >> > > > > > > >> >>
>>> >> >> > >> > > > > > > >> >> I think we need to reconsider tison's suggestion seriously. After
>>> >> >> > >> > > > > > > >> >> FLIP-73, deployJobCluster has
>>> >> >> > >> > > > > > > >> >> been moved into `JobClusterExecutor#execute`. It should not be
>>> >> >> > >> > > > > > > >> >> perceived by `CliFrontend`. That
>>> >> >> > >> > > > > > > >> >> means the user program will *ALWAYS* be executed on the client side.
>>> >> >> > >> > > > > > > >> >> This is the behavior by design.
>>> >> >> > >> > > > > > > >> >> So we cannot just add `if(client mode) .. else if(cluster mode) ...`
>>> >> >> > >> > > > > > > >> >> code in `CliFrontend` to bypass
>>> >> >> > >> > > > > > > >> >> the executor. We need to find a clean way to decouple executing the
>>> >> >> > >> > > > > > > >> >> user program from deploying the per-job
>>> >> >> > >> > > > > > > >> >> cluster. Based on this, we could support executing the user program
>>> >> >> > >> > > > > > > >> >> on the client or the master side.
>>> >> >> > >> > > > > > > >> >>
>>> >> >> > >> > > > > > > >> >> Maybe Aljoscha and Jeff could give some good suggestions.
>>> >> >> > >> > > > > > > >> >>
>>> >> >> > >> > > > > > > >> >>
>>> >> >> > >> > > > > > > >> >>
>>> >> >> > >> > > > > > > >> >> Best,
>>> >> >> > >> > > > > > > >> >> Yang
>>> >> >> > >> > > > > > > >> >>
>>> >> >> > >> > > > > > > >> >> Peter Huang <huangzhenqiu0...@gmail.com> wrote on Wed, Dec 25, 2019 at 4:03 AM:
>>> >> >> > >> > > > > > > >> >>
>>> >> >> > >> > > > > > > >> >>> Hi Jingjing,
>>> >> >> > >> > > > > > > >> >>>
>>> >> >> > >> > > > > > > >> >>> The improvement proposed is a deployment
>>> option for
>>> >> >> > >> CLI. For
>>> >> >> > >> > > > SQL
>>> >> >> > >> > > > > > > based
>>> >> >> > >> > > > > > > >> >>> Flink application, It is more convenient
>>> to use the
>>> >> >> > >> existing
>>> >> >> > >> > > > > model
>>> >> >> > >> > > > > > > in
>>> >> >> > >> > > > > > > >> >>> SqlClient in which
>>> >> >> > >> > > > > > > >> >>> the job graph is generated within
>>> SqlClient. After
>>> >> >> > >> adding
>>> >> >> > >> > > the
>>> >> >> > >> > > > > > > delayed
>>> >> >> > >> > > > > > > >> job
>>> >> >> > >> > > > > > > >> >>> graph generation, I think there is no
>>> change needed
>>> >> >> > >> for
>>> >> >> > >> > > > your
>>> >> >> > >> > > > > > > side.
>>> >> >> > >> > > > > > > >> >>>
>>> >> >> > >> > > > > > > >> >>>
>>> >> >> > >> > > > > > > >> >>> Best Regards
>>> >> >> > >> > > > > > > >> >>> Peter Huang
>>> >> >> > >> > > > > > > >> >>>
>>> >> >> > >> > > > > > > >> >>>
>>> >> >> > >> > > > > > > >> >>> On Wed, Dec 18, 2019 at 6:01 AM jingjing
>>> bai <
>>> >> >> > >> > > > > > > >> baijingjing7...@gmail.com>
>>> >> >> > >> > > > > > > >> >>> wrote:
>>> >> >> > >> > > > > > > >> >>>
>>> >> >> > >> > > > > > > >> >>>> hi peter:
>>> >> >> > >> > > > > > > >> >>>>    we have extended SqlClient to support
>>> sql job
>>> >> >> > >> submit in
>>> >> >> > >> > > web
>>> >> >> > >> > > > > > base
>>> >> >> > >> > > > > > > on
>>> >> >> > >> > > > > > > >> >>>> flink 1.9.   we support submit to yarn on
>>> per job
>>> >> >> > >> mode too.
>>> >> >> > >> > > > > > > >> >>>>    in this case, the job graph generated
>>> on client
>>> >> >> > >> side
>>> >> >> > >> > > .  I
>>> >> >> > >> > > > > > think
>>> >> >> > >> > > > > > > >> >>> this
>>> >> >> > >> > > > > > > >> >>>> discussion is mainly to improve API programs.
>>> but in my
>>> >> >> > >> case ,
>>> >> >> > >> > > > > there
>>> >> >> > >> > > > > > is
>>> >> >> > >> > > > > > > >> no
>>> >> >> > >> > > > > > > >> >>>> jar to upload but only a sql string .
>>> >> >> > >> > > > > > > >> >>>>    do you have more suggestions to improve
>>> for sql mode
>>> >> >> > >> or it
>>> >> >> > >> > > is
>>> >> >> > >> > > > > > only a
>>> >> >> > >> > > > > > > >> >>>> switch for API programs?
>>> >> >> > >> > > > > > > >> >>>>
>>> >> >> > >> > > > > > > >> >>>>
>>> >> >> > >> > > > > > > >> >>>> best
>>> >> >> > >> > > > > > > >> >>>> bai jj
>>> >> >> > >> > > > > > > >> >>>>
>>> >> >> > >> > > > > > > >> >>>>
>>> >> >> > >> > > > > > > >> >>>> Yang Wang <danrtsey...@gmail.com>
>>> wrote on Wed, Dec 18, 2019
>>> >> >> > >> at 7:21 PM:
>>> >> >> > >> > > > > > > >> >>>>
>>> >> >> > >> > > > > > > >> >>>>> I just want to revive this discussion.
>>> >> >> > >> > > > > > > >> >>>>>
>>> >> >> > >> > > > > > > >> >>>>> Recently, I have been thinking about how to
>>> natively run
>>> >> >> > >> flink
>>> >> >> > >> > > > > per-job
>>> >> >> > >> > > > > > > >> >>> cluster on
>>> >> >> > >> > > > > > > >> >>>>> Kubernetes.
>>> >> >> > >> > > > > > > >> >>>>> The per-job mode on Kubernetes is very
>>> different
>>> >> >> > >> from on
>>> >> >> > >> > > > Yarn.
>>> >> >> > >> > > > > > And
>>> >> >> > >> > > > > > > >> we
>>> >> >> > >> > > > > > > >> >>> will
>>> >> >> > >> > > > > > > >> >>>>> have
>>> >> >> > >> > > > > > > >> >>>>> the same deployment requirements to the
>>> client and
>>> >> >> > >> entry
>>> >> >> > >> > > > > point.
>>> >> >> > >> > > > > > > >> >>>>>
>>> >> >> > >> > > > > > > >> >>>>> 1. The Flink client does not always need a local
>>> jar to start
>>> >> >> > >> a
>>> >> >> > >> > > Flink
>>> >> >> > >> > > > > > > per-job
>>> >> >> > >> > > > > > > >> >>>>> cluster. We could
>>> >> >> > >> > > > > > > >> >>>>> support multiple schemes. For example,
>>> >> >> > >> > > > file:///path/of/my.jar
>>> >> >> > >> > > > > > > means
>>> >> >> > >> > > > > > > >> a
>>> >> >> > >> > > > > > > >> >>> jar
>>> >> >> > >> > > > > > > >> >>>>> located
>>> >> >> > >> > > > > > > >> >>>>> at client side,
>>> >> >> > >> hdfs://myhdfs/user/myname/flink/my.jar
>>> >> >> > >> > > > means a
>>> >> >> > >> > > > > > jar
>>> >> >> > >> > > > > > > >> >>> located
>>> >> >> > >> > > > > > > >> >>>>> at
>>> >> >> > >> > > > > > > >> >>>>> remote hdfs,
>>> local:///path/in/image/my.jar means a
>>> >> >> > >> jar
>>> >> >> > >> > > > located
>>> >> >> > >> > > > > > at
>>> >> >> > >> > > > > > > >> >>>>> jobmanager side.
>>> >> >> > >> > > > > > > >> >>>>>
>>> >> >> > >> > > > > > > >> >>>>> 2. Support running user program on
>>> master side. This
>>> >> >> > >> also
>>> >> >> > >> > > > > means
>>> >> >> > >> > > > > > > the
>>> >> >> > >> > > > > > > >> >>> entry
>>> >> >> > >> > > > > > > >> >>>>> point
>>> >> >> > >> > > > > > > >> >>>>> will generate the job graph on master
>>> side. We could
>>> >> >> > >> use
>>> >> >> > >> > > the
>>> >> >> > >> > > > > > > >> >>>>> ClasspathJobGraphRetriever
>>> >> >> > >> > > > > > > >> >>>>> or start a local Flink client to achieve
>>> this
>>> >> >> > >> purpose.
>>> >> >> > >> > > > > > > >> >>>>>
>>> >> >> > >> > > > > > > >> >>>>>
>>> >> >> > >> > > > > > > >> >>>>> cc tison, Aljoscha & Kostas Do you think
>>> this is the
>>> >> >> > >> right
>>> >> >> > >> > > > > > > >> direction we
>>> >> >> > >> > > > > > > >> >>>>> need to work?
>>> >> >> > >> > > > > > > >> >>>>>
>>> >> >> > >> > > > > > > >> >>>>> tison <wander4...@gmail.com>
>>> wrote on Thu, Dec 12, 2019
>>> >> >> > >> at 4:48 PM:
>>> >> >> > >> > > > > > > >> >>>>>
>>> >> >> > >> > > > > > > >> >>>>>> A quick idea is that we separate the
>>> deployment
>>> >> >> > >> from user
>>> >> >> > >> > > > > > program
>>> >> >> > >> > > > > > > >> >>> that
>>> >> >> > >> > > > > > > >> >>>>> it
>>> >> >> > >> > > > > > > >> >>>>>> has always been done
>>> >> >> > >> > > > > > > >> >>>>>> outside the program. On user program
>>> executed there
>>> >> >> > >> is
>>> >> >> > >> > > > > always a
>>> >> >> > >> > > > > > > >> >>>>>> ClusterClient that communicates with
>>> >> >> > >> > > > > > > >> >>>>>> an existing cluster, remote or local.
>>> It will be
>>> >> >> > >> another
>>> >> >> > >> > > > > thread
>>> >> >> > >> > > > > > > so
>>> >> >> > >> > > > > > > >> >>> just
>>> >> >> > >> > > > > > > >> >>>>> for
>>> >> >> > >> > > > > > > >> >>>>>> your information.
>>> >> >> > >> > > > > > > >> >>>>>>
>>> >> >> > >> > > > > > > >> >>>>>> Best,
>>> >> >> > >> > > > > > > >> >>>>>> tison.
>>> >> >> > >> > > > > > > >> >>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>
>>> >> >> > >> > > > > > > >> >>>>>> tison <wander4...@gmail.com>
>>> wrote on Thu, Dec 12, 2019
>>> >> >> > >> at 4:40 PM:
>>> >> >> > >> > > > > > > >> >>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>> Hi Peter,
>>> >> >> > >> > > > > > > >> >>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>> Another concern I realized recently is
>>> that with
>>> >> >> > >> current
>>> >> >> > >> > > > > > > Executors
>>> >> >> > >> > > > > > > >> >>>>>>> abstraction(FLIP-73)
>>> >> >> > >> > > > > > > >> >>>>>>> I'm afraid that user program is
>>> designed to ALWAYS
>>> >> >> > >> run
>>> >> >> > >> > > on
>>> >> >> > >> > > > > the
>>> >> >> > >> > > > > > > >> >>> client
>>> >> >> > >> > > > > > > >> >>>>>> side.
>>> >> >> > >> > > > > > > >> >>>>>>> Specifically,
>>> >> >> > >> > > > > > > >> >>>>>>> we deploy the job in executor when
>>> env.execute
>>> >> >> > >> is called.
>>> >> >> > >> > > > This
>>> >> >> > >> > > > > > > >> >>>>> abstraction
>>> >> >> > >> > > > > > > >> >>>>>>> possibly prevents
>>> >> >> > >> > > > > > > >> >>>>>>> Flink from running user programs on the cluster
>>> side.
>>> >> >> > >> > > > > > > >> >>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>> For your proposal, in this case we
>>> already
>>> >> >> > >> compiled the
>>> >> >> > >> > > > > > program
>>> >> >> > >> > > > > > > >> and
>>> >> >> > >> > > > > > > >> >>>>> run
>>> >> >> > >> > > > > > > >> >>>>>> on
>>> >> >> > >> > > > > > > >> >>>>>>> the client side,
>>> >> >> > >> > > > > > > >> >>>>>>> even if we deploy a cluster and retrieve
>>> job graph
>>> >> >> > >> from
>>> >> >> > >> > > > program
>>> >> >> > >> > > > > > > >> >>>>> metadata, it
>>> >> >> > >> > > > > > > >> >>>>>>> doesn't make
>>> >> >> > >> > > > > > > >> >>>>>>> much sense.
>>> >> >> > >> > > > > > > >> >>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>> cc Aljoscha & Kostas what do you think
>>> about this
>>> >> >> > >> > > > > constraint?
>>> >> >> > >> > > > > > > >> >>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>> Best,
>>> >> >> > >> > > > > > > >> >>>>>>> tison.
>>> >> >> > >> > > > > > > >> >>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>> Peter Huang <
>>> huangzhenqiu0...@gmail.com>
>>> >> >> > >> wrote on Tue, Dec 10, 2019
>>> >> >> > >> > > > > > > >> at 12:45 PM:
>>> >> >> > >> > > > > > > >> >>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>> Hi Tison,
>>> >> >> > >> > > > > > > >> >>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>> Yes, you are right. I think I made
>>> the wrong
>>> >> >> > >> argument
>>> >> >> > >> > > in
>>> >> >> > >> > > > > the
>>> >> >> > >> > > > > > > doc.
>>> >> >> > >> > > > > > > >> >>>>>>>> Basically, the packaging jar problem
>>> is only for
>>> >> >> > >> > > platform
>>> >> >> > >> > > > > > > users.
>>> >> >> > >> > > > > > > >> >>> In
>>> >> >> > >> > > > > > > >> >>>>> our
>>> >> >> > >> > > > > > > >> >>>>>>>> internal deploy service,
>>> >> >> > >> > > > > > > >> >>>>>>>> we further optimized the deployment
>>> latency by
>>> >> >> > >> letting
>>> >> >> > >> > > > > users
>>> >> >> > >> > > > > > to
>>> >> >> > >> > > > > > > >> >>>>>> package
>>> >> >> > >> > > > > > > >> >>>>>>>> flink-runtime together with the uber
>>> jar, so that
>>> >> >> > >> we
>>> >> >> > >> > > > don't
>>> >> >> > >> > > > > > need
>>> >> >> > >> > > > > > > >> to
>>> >> >> > >> > > > > > > >> >>>>>>>> consider
>>> >> >> > >> > > > > > > >> >>>>>>>> multiple flink version
>>> >> >> > >> > > > > > > >> >>>>>>>> support for now. In the session
>>> client mode, as
>>> >> >> > >> Flink
>>> >> >> > >> > > > libs
>>> >> >> > >> > > > > > will
>>> >> >> > >> > > > > > > >> be
>>> >> >> > >> > > > > > > >> >>>>>> shipped
>>> >> >> > >> > > > > > > >> >>>>>>>> anyway as local resources of yarn.
>>> Users actually
>>> >> >> > >> don't
>>> >> >> > >> > > > > need
>>> >> >> > >> > > > > > to
>>> >> >> > >> > > > > > > >> >>>>> package
>>> >> >> > >> > > > > > > >> >>>>>>>> those libs into job jar.
>>> >> >> > >> > > > > > > >> >>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>> Best Regards
>>> >> >> > >> > > > > > > >> >>>>>>>> Peter Huang
>>> >> >> > >> > > > > > > >> >>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>> On Mon, Dec 9, 2019 at 8:35 PM tison <
>>> >> >> > >> > > > wander4...@gmail.com
>>> >> >> > >> > > > > >
>>> >> >> > >> > > > > > > >> >>> wrote:
>>> >> >> > >> > > > > > > >> >>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>>> 3. What do you mean about the
>>> package? Do users
>>> >> >> > >> need
>>> >> >> > >> > > to
>>> >> >> > >> > > > > > > >> >>> compile
>>> >> >> > >> > > > > > > >> >>>>>> their
>>> >> >> > >> > > > > > > >> >>>>>>>>> jars
>>> >> >> > >> > > > > > > >> >>>>>>>>> including flink-clients,
>>> flink-optimizer,
>>> >> >> > >> flink-table
>>> >> >> > >> > > > > codes?
>>> >> >> > >> > > > > > > >> >>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>> The answer should be no because they
>>> exist in
>>> >> >> > >> system
>>> >> >> > >> > > > > > > classpath.
>>> >> >> > >> > > > > > > >> >>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>> Best,
>>> >> >> > >> > > > > > > >> >>>>>>>>> tison.
>>> >> >> > >> > > > > > > >> >>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>> Yang Wang <danrtsey...@gmail.com>
>>> wrote on Tue, Dec 10, 2019
>>> >> >> > >> > > > > at 12:18 PM:
>>> >> >> > >> > > > > > > >> >>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>>> Hi Peter,
>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>>> Thanks a lot for starting this
>>> discussion. I
>>> >> >> > >> think
>>> >> >> > >> > > this
>>> >> >> > >> > > > > is
>>> >> >> > >> > > > > > a
>>> >> >> > >> > > > > > > >> >>> very
>>> >> >> > >> > > > > > > >> >>>>>>>> useful
>>> >> >> > >> > > > > > > >> >>>>>>>>>> feature.
>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>>> Not only for Yarn, I am focused on
>>> flink on
>>> >> >> > >> > > Kubernetes
>>> >> >> > >> > > > > > > >> >>>>> integration
>>> >> >> > >> > > > > > > >> >>>>>> and
>>> >> >> > >> > > > > > > >> >>>>>>>>> come
>>> >> >> > >> > > > > > > >> >>>>>>>>>> across the same
>>> >> >> > >> > > > > > > >> >>>>>>>>>> problem. I do not want the job
>>> graph generated
>>> >> >> > >> on
>>> >> >> > >> > > > client
>>> >> >> > >> > > > > > > side.
>>> >> >> > >> > > > > > > >> >>>>>>>> Instead,
>>> >> >> > >> > > > > > > >> >>>>>>>>> the
>>> >> >> > >> > > > > > > >> >>>>>>>>>> user jars are built in
>>> >> >> > >> > > > > > > >> >>>>>>>>>> a user-defined image. When the job
>>> manager
>>> >> >> > >> launched,
>>> >> >> > >> > > we
>>> >> >> > >> > > > > > just
>>> >> >> > >> > > > > > > >> >>>>> need to
>>> >> >> > >> > > > > > > >> >>>>>>>>>> generate the job graph
>>> >> >> > >> > > > > > > >> >>>>>>>>>> based on local user jars.
>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>>> I have some small suggestions about
>>> this.
>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>>> 1. `ProgramJobGraphRetriever` is
>>> very similar to
>>> >> >> > >> > > > > > > >> >>>>>>>>>> `ClasspathJobGraphRetriever`, the
>>> differences
>>> >> >> > >> > > > > > > >> >>>>>>>>>> are the former needs
>>> `ProgramMetadata` and the
>>> >> >> > >> latter
>>> >> >> > >> > > > > needs
>>> >> >> > >> > > > > > > >> >>> some
>>> >> >> > >> > > > > > > >> >>>>>>>>> arguments.
>>> >> >> > >> > > > > > > >> >>>>>>>>>> Is it possible to
>>> >> >> > >> > > > > > > >> >>>>>>>>>> have a unified `JobGraphRetriever`
>>> to support
>>> >> >> > >> both?
>>> >> >> > >> > > > > > > >> >>>>>>>>>> 2. Is it possible to not use a
>>> local user jar to
>>> >> >> > >> > > start
>>> >> >> > >> > > > a
>>> >> >> > >> > > > > > > >> >>> per-job
>>> >> >> > >> > > > > > > >> >>>>>>>> cluster?
>>> >> >> > >> > > > > > > >> >>>>>>>>>> In your case, the user jars have
>>> >> >> > >> > > > > > > >> >>>>>>>>>> existed on hdfs already and we do
>>> need to
>>> >> >> > >> download
>>> >> >> > >> > > the
>>> >> >> > >> > > > > jars
>>> >> >> > >> > > > > > > to
>>> >> >> > >> > > > > > > >> >>>>>>>> deployer
>>> >> >> > >> > > > > > > >> >>>>>>>>>> service. Currently, we
>>> >> >> > >> > > > > > > >> >>>>>>>>>> always need a local user jar to
>>> start a flink
>>> >> >> > >> > > cluster.
>>> >> >> > >> > > > It
>>> >> >> > >> > > > > > would
>>> >> >> > >> > > > > > > >> >>> be
>>> >> >> > >> > > > > > > >> >>>>>> great
>>> >> >> > >> > > > > > > >> >>>>>>>> if
>>> >> >> > >> > > > > > > >> >>>>>>>>> we
>>> >> >> > >> > > > > > > >> >>>>>>>>>> could support remote user jars.
>>> >> >> > >> > > > > > > >> >>>>>>>>>>>> In the implementation, we assume
>>> users package
>>> >> >> > >> > > > > > > >> >>> flink-clients,
>>> >> >> > >> > > > > > > >> >>>>>>>>>> flink-optimizer, flink-table
>>> together within
>>> >> >> > >> the job
>>> >> >> > >> > > > jar.
>>> >> >> > >> > > > > > > >> >>>>> Otherwise,
>>> >> >> > >> > > > > > > >> >>>>>>>> the
>>> >> >> > >> > > > > > > >> >>>>>>>>>> job graph generation within
>>> >> >> > >> JobClusterEntryPoint will
>>> >> >> > >> > > > > fail.
>>> >> >> > >> > > > > > > >> >>>>>>>>>> 3. What do you mean about the
>>> package? Do users
>>> >> >> > >> need
>>> >> >> > >> > > to
>>> >> >> > >> > > > > > > >> >>> compile
>>> >> >> > >> > > > > > > >> >>>>>> their
>>> >> >> > >> > > > > > > >> >>>>>>>>> jars
>>> >> >> > >> > > > > > > >> >>>>>>>>>> including flink-clients,
>>> flink-optimizer,
>>> >> >> > >> flink-table
>>> >> >> > >> > > > > > codes?
>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>>> Best,
>>> >> >> > >> > > > > > > >> >>>>>>>>>> Yang
>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>>> Peter Huang <
>>> huangzhenqiu0...@gmail.com>
>>> >> >> > >> > > > wrote on Tue, Dec 10, 2019
>>> >> >> > >> > > > > > > >> >>>>> at 2:37 AM:
>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>>>> Dear All,
>>> >> >> > >> > > > > > > >> >>>>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>>>> Recently, the Flink community
>>> started to
>>> >> >> > >> improve the
>>> >> >> > >> > > > yarn
>>> >> >> > >> > > > > > > >> >>>>> cluster
>>> >> >> > >> > > > > > > >> >>>>>>>>>> descriptor
>>> >> >> > >> > > > > > > >> >>>>>>>>>>> to make job jar and config files
>>> configurable
>>> >> >> > >> from
>>> >> >> > >> > > > CLI.
>>> >> >> > >> > > > > It
>>> >> >> > >> > > > > > > >> >>>>>> improves
>>> >> >> > >> > > > > > > >> >>>>>>>> the
>>> >> >> > >> > > > > > > >> >>>>>>>>>>> flexibility of  Flink deployment
>>> Yarn Per Job
>>> >> >> > >> Mode.
>>> >> >> > >> > > > For
>>> >> >> > >> > > > > > > >> >>>>> platform
>>> >> >> > >> > > > > > > >> >>>>>>>> users
>>> >> >> > >> > > > > > > >> >>>>>>>>>> who
>>> >> >> > >> > > > > > > >> >>>>>>>>>>> manage tens of hundreds of
>>> streaming pipelines
>>> >> >> > >> for
>>> >> >> > >> > > the
>>> >> >> > >> > > > > > whole
>>> >> >> > >> > > > > > > >> >>>>> org
>>> >> >> > >> > > > > > > >> >>>>>> or
>>> >> >> > >> > > > > > > >> >>>>>>>>>>> company, we found the job graph
>>> generation in
>>> >> >> > >> > > > > client-side
>>> >> >> > >> > > > > > is
>>> >> >> > >> > > > > > > >> >>>>>> another
>>> >> >> > >> > > > > > > >> >>>>>>>>>>> pain point. Thus, we want to propose
>>> a
>>> >> >> > >> configurable
>>> >> >> > >> > > > > feature
>>> >> >> > >> > > > > > > >> >>> for
>>> >> >> > >> > > > > > > >> >>>>>>>>>>> FlinkYarnSessionCli. The feature
>>> can allow
>>> >> >> > >> users to
>>> >> >> > >> > > > > choose
>>> >> >> > >> > > > > > > >> >>> the
>>> >> >> > >> > > > > > > >> >>>>> job
>>> >> >> > >> > > > > > > >> >>>>>>>>> graph
>>> >> >> > >> > > > > > > >> >>>>>>>>>>> generation in Flink
>>> ClusterEntryPoint so that
>>> >> >> > >> the
>>> >> >> > >> > > job
>>> >> >> > >> > > > > jar
>>> >> >> > >> > > > > > > >> >>>>> doesn't
>>> >> >> > >> > > > > > > >> >>>>>>>> need
>>> >> >> > >> > > > > > > >> >>>>>>>>> to
>>> >> >> > >> > > > > > > >> >>>>>>>>>>> be local for the job graph
>>> generation. The
>>> >> >> > >> > > proposal
>>> >> >> > >> > > > is
>>> >> >> > >> > > > > > > >> >>>>> organized
>>> >> >> > >> > > > > > > >> >>>>>>>> as a
>>> >> >> > >> > > > > > > >> >>>>>>>>>>> FLIP
>>> >> >> > >> > > > > > > >> >>>>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>
>>> >> >> > >> > > > > > > >> >>>>>
>>> >> >> > >> > > > > > > >> >>>
>>> >> >> > >> > > > > > > >>
>>> >> >> > >> > > > > > >
>>> >> >> > >> > > > > >
>>> >> >> > >> > > > >
>>> >> >> > >> > > >
>>> >> >> > >> > >
>>> >> >> > >>
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-85+Delayed+JobGraph+Generation
>>> >> >> > >> > > > > > > >> >>>>>>>>>>> .
>>> >> >> > >> > > > > > > >> >>>>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>>>> Any questions and suggestions are
>>> welcomed.
>>> >> >> > >> Thank
>>> >> >> > >> > > you
>>> >> >> > >> > > > in
>>> >> >> > >> > > > > > > >> >>>>> advance.
>>> >> >> > >> > > > > > > >> >>>>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>>>> Best Regards
>>> >> >> > >> > > > > > > >> >>>>>>>>>>> Peter Huang
>>> >> >> > >> > > > > > > >> >>>>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>
>>> >> >> > >> > > > > > > >> >>>>>
>>> >> >> > >> > > > > > > >> >>>>
>>> >> >> > >> > > > > > > >> >>>
>>> >> >> > >> > > > > > > >> >>
>>> >> >> > >> > > > > > > >>
>>> >> >> > >> > > > > > > >>
>>> >> >> > >> > > > > > >
>>> >> >> > >> > > > > >
>>> >> >> > >> > > > >
>>> >> >> > >> > > >
>>> >> >> > >> > >
>>> >> >> > >>
>>> >> >> > >
>>> >> >>
>>>
>>
