Our main use cases are on Mesos, so maybe we can start with Mesos support.

On Wed, Jun 6, 2018 at 5:00 PM Stephan Ewen <se...@apache.org> wrote:
> The FLIP-6 design was specifically such that it allows for separation of
> Dispatcher, ResourceManager, and JobManagers.
> So that could be another extension at some point.
>
> It should be conceptually rather simple: the dispatcher creates, per job, a
> new container launch context with the "JobManagerRunner" and starts that.
> In practice, it is still quite a bit of work, with all the details of Yarn
> to take care of.
>
> On Wed, Jun 6, 2018 at 9:45 AM, Renjie Liu <liurenjie2...@gmail.com> wrote:
>
> > That's really great! I'll help to contribute to the process.
> >
> > On Wed, Jun 6, 2018 at 3:17 PM Till Rohrmann <trohrm...@apache.org> wrote:
> >
> > > Hi Renjie,
> > >
> > > there is already an issue for introducing further scheduling constraints
> > > (e.g. tags) to achieve TM isolation when using the session mode [1]. What
> > > it does not cover is the isolation of the JMs which need to be executed in
> > > their own processes. At the moment they share the same process with the
> > > Dispatcher because it was simpler to do it like that as a first iteration.
> > > Here is the issue for isolating JobManagers [2].
> > >
> > > Concerning the resource specification, the corresponding issue can be found
> > > here [3].
> > >
> > > [1] https://issues.apache.org/jira/browse/FLINK-8886
> > > [2] https://issues.apache.org/jira/browse/FLINK-9537
> > > [3] https://issues.apache.org/jira/browse/FLINK-5131
> > >
> > > Cheers,
> > > Till
> > >
> > > On Wed, Jun 6, 2018 at 2:13 AM Renjie Liu <liurenjie2...@gmail.com> wrote:
> > >
> > > > Hi, Stephan:
> > > >
> > > > Yes that's what I mean. In fact the most important thing is to share the
> > > > dispatcher so that we can have *a centralized gateway for flink job
> > > > management and submission.
> > > > The problem with per job cluster is that we
> > > > can't have a centralized gateway.*
> > > >
> > > > I didn't realize before that the job manager also needs to run user code,
> > > > and yes, that means the job manager should also be isolated.
> > > >
> > > > Wouldn't it be better to separate the job manager from the dispatcher so
> > > > that the user code of different jobs doesn't interfere with each other? In
> > > > fact it seems that in most production environments job isolation is
> > > > required since nobody wants their job to be affected by others.
> > > >
> > > > On Tue, Jun 5, 2018 at 11:34 PM Stephan Ewen <se...@apache.org> wrote:
> > > >
> > > > > Hi Renjie,
> > > > >
> > > > > When you suggest to have TaskManager isolation in session mode, do you
> > > > > mean to have a shared JobManager / Dispatcher, but job-specific
> > > > > TaskManagers? Is this mainly to reduce the overhead of the per-job
> > > > > JobManager?
> > > > >
> > > > > One assumption so far was that if TaskManager isolation is required,
> > > > > JobManager isolation is also required, because some user code potentially
> > > > > also runs on the JobManager, like CheckpointHooks, Input/Output Formats,
> > > > > ...
> > > > >
> > > > > Best,
> > > > > Stephan
> > > > >
> > > > > On Tue, Jun 5, 2018 at 4:20 PM, Renjie Liu <liurenjie2...@gmail.com> wrote:
> > > > >
> > > > > > Hi, Till:
> > > > > >
> > > > > > 1. Does the community have any plan to add task manager isolation
> > > > > >    into the session mode?
> > > > > > 2. Are there any issues to track this feature? I want to help
> > > > > >    contribute.
> > > > > > 3. Thanks for the knowledge but it can't help if task manager
> > > > > >    isolation is not present.
> > > > > >
> > > > > > On Tue, Jun 5, 2018 at 7:28 PM Till Rohrmann <trohrm...@apache.org> wrote:
> > > > > >
> > > > > > > Hi Renjie,
> > > > > > >
> > > > > > > 1) you're right that the Flink session mode does not give you proper
> > > > > > > job isolation. It is the same as with Flink 1.4 session mode. If this
> > > > > > > is a strong requirement for you, then I recommend using the per job
> > > > > > > mode.
> > > > > > >
> > > > > > > 2) At the moment it is also not possible to define per job resource
> > > > > > > requirements when using the session mode. This is a feature which the
> > > > > > > community has started implementing but it is not yet fully done. I
> > > > > > > assume that the community will continue working on it. At the moment,
> > > > > > > the solution would be to use the per job mode to not waste unnecessary
> > > > > > > resources.
> > > > > > >
> > > > > > > 3) I think the assigned ResourceID for a TaskManager is shown in the
> > > > > > > web UI and when querying the "/taskmanagers" REST endpoint. The
> > > > > > > resource id is derived from the Mesos task id. Would that help to
> > > > > > > identify which TM is running on which Mesos task?
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Till
> > > > > > >
> > > > > > > On Tue, Jun 5, 2018 at 5:13 AM Renjie Liu <liurenjie2...@gmail.com> wrote:
> > > > > > >
> > > > > > > > ---------- Forwarded message ---------
> > > > > > > > From: Renjie Liu <liurenjie2...@gmail.com>
> > > > > > > > Date: Tue, Jun 5, 2018 at 10:43 AM
> > > > > > > > Subject: [DISCUSS] FLIP-6 Problems
> > > > > > > > To: user <u...@flink.apache.org>
> > > > > > > >
> > > > > > > > Hi:
> > > > > > > >
> > > > > > > > We've deployed flink 1.5.0 and tested the new cluster manager. It's
> > > > > > > > really great for flink to be elastic. However, we've also found some
> > > > > > > > problems that block us from deploying it to a production environment.
> > > > > > > >
> > > > > > > > 1. Task manager isolation. Currently flink allows different jobs to
> > > > > > > > execute on the same task managers. This is unacceptable in a
> > > > > > > > production environment since a badly written job may kill task
> > > > > > > > managers and affect other jobs.
> > > > > > > > 2. Per job resource configuration. Currently a flink session cluster
> > > > > > > > can only allocate task managers of the same size and configuration.
> > > > > > > > This may waste a lot of resources if we have a lot of jobs with
> > > > > > > > different resource requirements.
> > > > > > > > 3. The task manager's name is meaningless. This is a problem since we
> > > > > > > > can't monitor the status of containers in a mesos environment.
> > > > > > > >
> > > > > > > > One solution to the above problems is to use a per job cluster, but a
> > > > > > > > centralized cluster manager can help to manage flink deployment and
> > > > > > > > jobs better.
> > > > > > > >
> > > > > > > > What do you think about these?
> > > > > > > > If the community agrees with us, we
> > > > > > > > would like to propose a design and implementation to enhance the
> > > > > > > > flink cluster manager.
> > > > > > > > --
> > > > > > > > Liu, Renjie
> > > > > > > > Software Engineer, MVAD
> > > > > >
> > > > > > --
> > > > > > Liu, Renjie
> > > > > > Software Engineer, MVAD
> > > >
> > > > --
> > > > Liu, Renjie
> > > > Software Engineer, MVAD
> >
> > --
> > Liu, Renjie
> > Software Engineer, MVAD

--
Liu, Renjie
Software Engineer, MVAD
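
On point 3 of Till's reply quoted above (the ResourceID reported by the "/taskmanagers" REST endpoint), here is a minimal sketch of how those IDs could be listed and then compared by hand against the Mesos task IDs. It is only an illustration under assumptions: it assumes the endpoint returns a JSON object with a "taskmanagers" array whose entries carry an "id" field, the function names are made up for this example, and the sample payload is hypothetical rather than output from a real cluster.

```python
import json
from urllib.request import urlopen


def taskmanager_ids(payload):
    """Return the resource IDs found in a parsed /taskmanagers response body."""
    # Assumption: entries live under "taskmanagers" and each has an "id" field.
    return [tm["id"] for tm in payload.get("taskmanagers", [])]


def fetch_taskmanager_ids(base_url):
    """Query <base_url>/taskmanagers and return the TaskManager resource IDs."""
    with urlopen(base_url.rstrip("/") + "/taskmanagers") as response:
        return taskmanager_ids(json.load(response))


# Hypothetical payload, used so the sketch runs without a cluster.
sample = json.loads(
    '{"taskmanagers": [{"id": "taskmanager-00001"}, {"id": "taskmanager-00002"}]}'
)
print(taskmanager_ids(sample))
```

Against a running cluster one would call fetch_taskmanager_ids with the REST address of the session cluster (the address is deployment-specific and not given in this thread) and look the returned IDs up in the Mesos UI.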