Re: [DISCUSS] FLIP-6 Problems

Renjie Liu Wed, 06 Jun 2018 00:45:49 -0700

That's really great! I'll help contribute to the process.

On Wed, Jun 6, 2018 at 3:17 PM Till Rohrmann <trohrm...@apache.org> wrote:


> Hi Renjie,
>
> there is already an issue for introducing further scheduling constraints
> (e.g. tags) to achieve TM isolation when using the session mode [1]. What
> it does not cover is the isolation of the JMs which need to be executed in
> their own processes. At the moment they share the same process with the
> Dispatcher because it was simpler to do it like that as a first iteration.
> Here is the issue for isolating JobManagers [2].
>
> Concerning the resource specification, the corresponding issue can be found
> here [3].
>
> [1] https://issues.apache.org/jira/browse/FLINK-8886
> [2] https://issues.apache.org/jira/browse/FLINK-9537
> [3] https://issues.apache.org/jira/browse/FLINK-5131
>
> Cheers,
> Till
>
> On Wed, Jun 6, 2018 at 2:13 AM Renjie Liu <liurenjie2...@gmail.com> wrote:
>
> > Hi, Stephan:
> >
> > Yes, that's what I mean. In fact, the most important thing is to share the
> > dispatcher so that we can have *a centralized gateway for Flink job
> > management and submission. The problem with the per job cluster is that
> > we can't have a centralized gateway.*
> >
> > I didn't realize before that the job manager also needs to run user code,
> > and yes, that means the job manager should also be isolated.
> >
> > Wouldn't it be better to separate the job manager from the dispatcher so
> > that the user code of different jobs doesn't interfere? In fact, it seems
> > that in most production environments job isolation is required, since
> > nobody wants their job to be affected by others.
> >
> > On Tue, Jun 5, 2018 at 11:34 PM Stephan Ewen <se...@apache.org> wrote:
> >
> > > Hi Renjie,
> > >
> > > When you suggest having TaskManager isolation in session mode, do you
> > > mean to have a shared JobManager / Dispatcher, but job-specific
> > > TaskManagers? Is this mainly to reduce the overhead of the per-job
> > > JobManager?
> > >
> > > One assumption so far was that if TaskManager isolation is required,
> > > JobManager isolation is also required, because some user code
> > > potentially also runs on the JobManager, like CheckpointHooks,
> > > Input/Output Formats, ...
> > >
> > > Best,
> > > Stephan
> > >
> > >
> > >
> > > On Tue, Jun 5, 2018 at 4:20 PM, Renjie Liu <liurenjie2...@gmail.com> wrote:
> > >
> > > > Hi, Till:
> > > >
> > > >
> > > >    1. Does the community have any plan to add task manager isolation
> > > >    to the session mode?
> > > >    2. Is there an issue to track this feature? I want to help
> > > >    contribute.
> > > >    3. Thanks for the information, but it won't help if task manager
> > > >    isolation is not present.
> > > >
> > > >
> > > > On Tue, Jun 5, 2018 at 7:28 PM Till Rohrmann <trohrm...@apache.org> wrote:
> > > >
> > > > > Hi Renjie,
> > > > >
> > > > > 1) You're right that the Flink session mode does not give you proper
> > > > > job isolation. It is the same as with the Flink 1.4 session mode. If
> > > > > this is a strong requirement for you, then I recommend using the per
> > > > > job mode.
> > > > >
> > > > > 2) At the moment it is also not possible to define per job resource
> > > > > requirements when using the session mode. This is a feature which
> > > > > the community has started implementing, but it is not yet fully
> > > > > done. I assume that the community will continue working on it. At
> > > > > the moment, the solution would be to use the per job mode to avoid
> > > > > wasting resources.
> > > > >
> > > > > 3) I think the assigned ResourceID for a TaskManager is shown in
> > > > > the web UI and when querying the "/taskmanagers" REST endpoint. The
> > > > > resource id is derived from the Mesos task id. Would that help to
> > > > > identify which TM is running on which Mesos task?
> > > > >
> > > > > Cheers,
> > > > > Till
> > > > >
> > > > > On Tue, Jun 5, 2018 at 5:13 AM Renjie Liu <liurenjie2...@gmail.com> wrote:
> > > > >
> > > > > > ---------- Forwarded message ---------
> > > > > > From: Renjie Liu <liurenjie2...@gmail.com>
> > > > > > Date: Tue, Jun 5, 2018 at 10:43 AM
> > > > > > Subject: [DISCUSS] FLIP-6 Problems
> > > > > > To: user <u...@flink.apache.org>
> > > > > >
> > > > > >
> > > > > > Hi:
> > > > > >
> > > > > > We've deployed Flink 1.5.0 and tested the new cluster manager; it's
> > > > > > really great for Flink to be elastic. However, we've also found
> > > > > > some problems that block us from deploying it to the production
> > > > > > environment.
> > > > > >
> > > > > > 1. Task manager isolation. Currently Flink allows different jobs
> > > > > > to execute on the same task managers. This is unacceptable in a
> > > > > > production environment, since a badly written job may kill task
> > > > > > managers and affect other jobs.
> > > > > > 2. Per job resource configuration. Currently a Flink session
> > > > > > cluster can only allocate task managers of the same size and
> > > > > > configuration. This may waste a lot of resources if we have many
> > > > > > jobs with different resource requirements.
> > > > > > 3. Task manager names are meaningless. This is a problem since we
> > > > > > can't monitor the status of containers in a Mesos environment.
> > > > > >
> > > > > > One solution to the above problems is to use a per job cluster,
> > > > > > but a centralized cluster manager can help to manage Flink
> > > > > > deployments and jobs better.
> > > > > >
> > > > > > What do you guys think about this? If the community agrees with
> > > > > > us, we would like to propose a design and implementation to
> > > > > > enhance the Flink cluster manager.
> > > > > > --
> > > > > > Liu, Renjie
> > > > > > Software Engineer, MVAD
> > > > > >
> > > > >
> > > > --
> > > > Liu, Renjie
> > > > Software Engineer, MVAD
> > > >
> > >
> > --
> > Liu, Renjie
> > Software Engineer, MVAD
> >
>
-- 
Liu, Renjie
Software Engineer, MVAD
