Thank you for writing this FLIP.

I cannot really give much input into the mechanics of GPU-aware scheduling
and GPU allocation, as I have no experience with that.

One thought I had when reading the proposal is whether it makes sense to
look at the "GPU Manager" as an "External Resource Manager", with GPU being
one such resource.
The way I understand the ResourceProfile and ResourceSpec, that is how it
is done there.
It has the advantage that it looks more extensible. Maybe there is a GPU
Resource, a specialized NVIDIA GPU Resource, an FPGA Resource, an Alibaba
TPU Resource, etc.
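
To make the extensibility point concrete, here is a minimal sketch of what a
generic "external resource" abstraction could look like. All names in it
(ExternalResourceSketch, ExternalResource, NamedResource,
ExternalResourceRegistry) are hypothetical and made up for illustration; they
are not taken from the FLIP or from Flink's existing API. The only point is
that a GPU becomes one named resource among several, so adding an FPGA or TPU
would not need a new manager class.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: none of these types exist in Flink or in FLIP-108.
final class ExternalResourceSketch {

    /** A generic external resource: a name (e.g. "gpu") plus an amount. */
    interface ExternalResource {
        String name();
        long amount();
    }

    /** Plain value implementation that can describe a GPU, FPGA, TPU, etc. */
    static final class NamedResource implements ExternalResource {
        private final String name;
        private final long amount;

        NamedResource(String name, long amount) {
            if (name == null || name.isEmpty()) {
                throw new IllegalArgumentException("resource name must not be empty");
            }
            if (amount < 0) {
                throw new IllegalArgumentException("amount must not be negative");
            }
            this.name = name;
            this.amount = amount;
        }

        @Override
        public String name() {
            return name;
        }

        @Override
        public long amount() {
            return amount;
        }
    }

    /** Registry keyed by resource name; amounts for the same name are summed. */
    static final class ExternalResourceRegistry {
        private final Map<String, ExternalResource> resources = new LinkedHashMap<>();

        void register(ExternalResource resource) {
            resources.merge(
                    resource.name(),
                    resource,
                    (a, b) -> new NamedResource(a.name(), a.amount() + b.amount()));
        }

        /** Returns the registered amount, or 0 if the resource is unknown. */
        long amountOf(String name) {
            ExternalResource resource = resources.get(name);
            return resource == null ? 0L : resource.amount();
        }
    }

    public static void main(String[] args) {
        ExternalResourceRegistry registry = new ExternalResourceRegistry();
        registry.register(new NamedResource("gpu", 2));
        registry.register(new NamedResource("fpga", 1));
        registry.register(new NamedResource("gpu", 1));
        System.out.println("gpu=" + registry.amountOf("gpu"));
        System.out.println("fpga=" + registry.amountOf("fpga"));
        System.out.println("tpu=" + registry.amountOf("tpu"));
    }
}
```

In this sketch a vendor-specific resource would only be another name (or
another ExternalResource implementation) registered in the same registry,
which is the kind of extensibility meant above.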

Best,
Stephan


On Tue, Mar 3, 2020 at 7:57 AM Becket Qin <becket....@gmail.com> wrote:

> Thanks for the FLIP Yangze. GPU resource management support is a must-have
> for machine learning use cases. Actually it is one of the most frequently
> asked questions from users who are interested in using Flink for ML.
>
> Some quick comments / questions on the wiki.
> 1. The WebUI / REST API should probably also be mentioned in the public
> interface section.
> 2. Is the data structure that holds GPU info also a public API?
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Tue, Mar 3, 2020 at 10:15 AM Xintong Song <tonysong...@gmail.com>
> wrote:
>
> > Thanks for drafting the FLIP and kicking off the discussion, Yangze.
> >
> > Big +1 for this feature. Supporting the use of GPUs in Flink is
> > significant, especially for the ML scenarios.
> > I've reviewed the FLIP wiki doc and it looks good to me. I think it's a
> > very good first step for Flink's GPU support.
> >
> > Thank you~
> >
> > Xintong Song
> >
> >
> >
> > On Mon, Mar 2, 2020 at 12:06 PM Yangze Guo <karma...@gmail.com> wrote:
> >
> > > Hi everyone,
> > >
> > > We would like to start a discussion thread on "FLIP-108: Add GPU
> > > support in Flink"[1].
> > >
> > > This FLIP mainly discusses the following issues:
> > >
> > > - Enable users to configure how many GPUs a task executor has and
> > > forward such requirements to the external resource managers (for
> > > Kubernetes/Yarn/Mesos setups).
> > > - Provide information about available GPU resources to operators.
> > >
> > > Key changes proposed in the FLIP are as follows:
> > >
> > > - Forward GPU resource requirements to Yarn/Kubernetes.
> > > - Introduce GPUManager as one of the task manager services to discover
> > > and expose GPU resource information to the context of functions.
> > > - Introduce the default script for GPU discovery, in which we provide
> > > the privilege mode to help users achieve worker-level isolation in
> > > standalone mode.
> > >
> > > Please find more details in the FLIP wiki document [1]. Looking forward
> > > to your feedback.
> > >
> > > [1]
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-108%3A+Add+GPU+support+in+Flink
> > >
> > > Best,
> > > Yangze Guo
> > >
> >
>
