Re: [DISCUSS] FLIP-108: Add GPU support in Flink

Yangze Guo Tue, 03 Mar 2020 19:20:22 -0800

Thanks for all the feedback.

@Becket
Regarding the WebUI and GPUInfo, you're right, I'll add them to the
Public API section.



@Stephan @Becket
Regarding the general extended resource mechanism, I second Xintong's
suggestion.
- It's better to leverage ResourceProfile and ResourceSpec once we
support fine-grained GPU scheduling. As a first-step proposal, I
prefer not to include it in the scope of this FLIP.
- Regarding the "Extended Resource Manager", if I understand
correctly, it is just a code refactoring at the moment: we could
extract the open/close/allocateExtendResources of GPUManager to that
interface. If that is the case, +1 to do it during implementation.
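
To illustrate the refactoring mentioned above, here is a minimal sketch,
written in Python for brevity. It assumes the extracted interface keeps
only the three operations named above; the names ExtendedResourceManager
and allocate_extended_resources, and the list-of-indexes representation
of GPUs, are hypothetical and not taken from the FLIP.

```python
from abc import ABC, abstractmethod
from typing import List


class ExtendedResourceManager(ABC):
    """Hypothetical interface extracted from GPUManager (names are assumptions)."""

    @abstractmethod
    def open(self) -> None:
        """Make the discovered resources available for allocation."""

    @abstractmethod
    def close(self) -> None:
        """Release everything that open() made available."""

    @abstractmethod
    def allocate_extended_resources(self, amount: int) -> List[str]:
        """Hand out `amount` resource identifiers."""


class GPUManager(ExtendedResourceManager):
    def __init__(self, discovered_indexes: List[str]):
        # Assumption: in a real setup these would come from a discovery script.
        self._discovered = list(discovered_indexes)
        self._free: List[str] = []

    def open(self) -> None:
        self._free = list(self._discovered)

    def close(self) -> None:
        self._free = []

    def allocate_extended_resources(self, amount: int) -> List[str]:
        if amount < 0 or amount > len(self._free):
            raise ValueError(
                f"cannot allocate {amount} GPUs, {len(self._free)} free"
            )
        # Take the first `amount` free GPUs and keep the rest for later calls.
        allocated, self._free = self._free[:amount], self._free[amount:]
        return allocated
```

With this shape, a manager for another resource type would only need to
implement the same three methods.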

@Xingbo
As Xintong said, we looked into how Spark supports a general "Custom
Resource Scheduling" before and decided to introduce a common resource
configuration schema
(taskmanager.resource.{resourceName}.amount/discovery-script)
to make it more extensible. I think "resource" is the proper level
to contain all the configs of extended resources.
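
As a rough illustration of why the shared "resource" level helps, the
sketch below groups flat config entries by resource name using the key
schema quoted above. It is only a sketch in Python: the function name
group_by_resource, the example values, and the script path are
hypothetical.

```python
import re
from typing import Dict

# Key pattern assumed from the schema quoted above:
# taskmanager.resource.{resourceName}.amount / discovery-script
_KEY = re.compile(r"^taskmanager\.resource\.([^.]+)\.(amount|discovery-script)$")


def group_by_resource(conf: Dict[str, str]) -> Dict[str, Dict[str, str]]:
    """Collect the options of every extended resource found in `conf`."""
    resources: Dict[str, Dict[str, str]] = {}
    for key, value in conf.items():
        match = _KEY.match(key)
        if match:
            name, option = match.groups()
            resources.setdefault(name, {})[option] = value
    return resources


conf = {
    "taskmanager.resource.gpu.amount": "2",
    # Hypothetical path, for illustration only.
    "taskmanager.resource.gpu.discovery-script": "/path/to/gpu-discovery.sh",
    "taskmanager.resource.fpga.amount": "1",
    # Unrelated options are ignored by the pattern.
    "taskmanager.memory.process.size": "4g",
}
print(group_by_resource(conf))
```

A new resource type then needs no new parsing code, only new keys under
the same prefix.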

Best,
Yangze Guo

On Wed, Mar 4, 2020 at 10:48 AM Xingbo Huang <hxbks...@gmail.com> wrote:
>
> Thanks a lot for the FLIP, Yangze.
>
> There is no doubt that GPU resource management support will greatly
> facilitate the development of AI-related applications by PyFlink users.
>
> I have only one comment about this wiki:
>
> Regarding the names of several GPU configurations, I think it is better to
> delete the "resource" field so that they are consistent with the names of
> other resource-related configurations in TaskManagerOption.
>
> e.g. taskmanager.resource.gpu.discovery-script.path ->
> taskmanager.gpu.discovery-script.path
>
> Best,
>
> Xingbo
>
>
> On Wed, Mar 4, 2020 at 10:39 AM Xintong Song <tonysong...@gmail.com> wrote:
>
> > @Stephan, @Becket,
> >
> > Actually, Yangze, Yang and I also had an offline discussion about turning
> > the "GPU Support" into a general "Extended Resource Support". We believe
> > supporting extended resources through a general mechanism is definitely a
> > good and extensible approach. The reason we propose narrowing the scope of
> > this FLIP down to GPU alone is mainly the concern about the extra effort
> > and review capacity needed for a general mechanism.
> >
> > To come up with a good design for a general extended resource management
> > mechanism, we would need to investigate more how people use different
> > kinds of resources in practice. For GPU, we learnt such knowledge from the
> > experts, Becket and his team members. But for FPGA, or other potential
> > extended resources, we don't have such convenient information sources,
> > so the investigation would require more effort, which I tend to think is
> > not necessary at the moment.
> >
> > On the other hand, we also looked into how Spark supports a general "Custom
> > Resource Scheduling". Assuming we want to have a similar general extended
> > resource mechanism in the future, we believe that the current GPU support
> > design can be easily extended in an incremental way without too much
> > rework.
> >
> >    - The most important part is probably the user interfaces. Spark offers
> >    configuration options to define the amount, discovery script and vendor
> >    (on k8s) on a per-resource-type basis [1], which is very similar to what
> >    we proposed in this FLIP. I think it's not necessary to expose config
> >    options in the general way at the moment, since we do not support other
> >    resource types now. If we later decide to have per-resource-type config
> >    options, we can keep backwards compatibility with the currently proposed
> >    options through simple key mapping.
> >    - For the GPU Manager, if needed later we can change it to an "Extended
> >    Resource Manager" (or whatever it is called). That should be a pure
> >    component-internal refactoring.
> >    - For ResourceProfile and ResourceSpec, there are already fields for
> >    general extended resources. We can of course leverage them when
> >    supporting fine-grained GPU scheduling. That is also not in the scope of
> >    this first-step proposal, and would require FLIP-56 to be finished first.
> >
> > To sum up, I agree with Becket that we should have a separate FLIP for the
> > general extended resource mechanism, and keep it in mind when discussing
> > and implementing the current one.
> >
> > Thank you~
> >
> > Xintong Song
> >
> >
> > [1]
> >
> > https://spark.apache.org/docs/3.0.0-preview/configuration.html#custom-resource-scheduling-and-configuration-overview
> >
> > On Wed, Mar 4, 2020 at 9:18 AM Becket Qin <becket....@gmail.com> wrote:
> >
> > > That's a good point, Stephan. It makes total sense to generalize the
> > > resource management to support custom resources. Having that allows users
> > > to add new resources by themselves. The general resource management may
> > > involve two different aspects:
> > >
> > > 1. The custom resource type definition. It is supported by the extended
> > > resources in ResourceProfile and ResourceSpec. This will likely cover
> > > the majority of the cases.
> > >
> > > 2. The custom resource allocation logic, i.e. how to assign the resources
> > > to different tasks, operators, and so on. This may require two levels /
> > > steps:
> > >     a. Subtask level - make sure the subtasks are put into suitable
> > >     slots. It is done by the global RM and is not customizable right now.
> > >     b. Operator level - map the exact resource to the operators in the TM,
> > >     e.g. GPU 1 for operator A, GPU 2 for operator B. This step is needed
> > >     assuming the global RM does not distinguish individual resources of
> > >     the same type. It is true for memory, but not for GPU.
> > >
> > > The GPU manager is designed to do 2.b here. So it should discover the
> > > physical GPU information and bind/match them to each operator. Making
> > > this general will fill in the missing piece to support custom resource
> > > type definition. But I'd avoid calling it an "External Resource Manager"
> > > to avoid confusion with RM; maybe something like "Operator Resource
> > > Assigner" would be more accurate. So for each resource type users can
> > > have an optional "Operator Resource Assigner" in the TM. For memory,
> > > users don't need this, but for other extended resources, users may need
> > > that.
> > >
> > > Personally I think a pluggable "Operator Resource Assigner" is achievable
> > > in this FLIP. But I am also OK with having that in a separate FLIP
> > > because the interface between the "Operator Resource Assigner" and the
> > > operator may take a while to settle down if we want to make it generic.
> > > But I think our implementation should take this future work into
> > > consideration so that we don't need to break backwards compatibility
> > > once we have that.
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> > > On Wed, Mar 4, 2020 at 12:27 AM Stephan Ewen <se...@apache.org> wrote:
> > >
> > > > Thank you for writing this FLIP.
> > > >
> > > > I cannot really give much input into the mechanics of GPU-aware
> > > > scheduling and GPU allocation, as I have no experience with that.
> > > >
> > > > One thought I had when reading the proposal is whether it makes sense
> > > > to look at the "GPU Manager" as an "External Resource Manager", with
> > > > GPU being one such resource.
> > > > The way I understand the ResourceProfile and ResourceSpec, that is how
> > > > it is done there.
> > > > It has the advantage that it looks more extensible. Maybe there is a
> > > > GPU Resource, a specialized NVIDIA GPU Resource, an FPGA Resource, an
> > > > Alibaba TPU Resource, etc.
> > > >
> > > > Best,
> > > > Stephan
> > > >
> > > >
> > > > > On Tue, Mar 3, 2020 at 7:57 AM Becket Qin <becket....@gmail.com> wrote:
> > > >
> > > > > > Thanks for the FLIP, Yangze. GPU resource management support is a
> > > > > > must-have for machine learning use cases. Actually it is one of the
> > > > > > most frequently asked questions from users who are interested in
> > > > > > using Flink for ML.
> > > > > >
> > > > > > Some quick comments / questions on the wiki.
> > > > > > 1. The WebUI / REST API should probably also be mentioned in the
> > > > > > public interface section.
> > > > > > 2. Is the data structure that holds GPU info also a public API?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jiangjie (Becket) Qin
> > > > >
> > > > > > On Tue, Mar 3, 2020 at 10:15 AM Xintong Song <tonysong...@gmail.com> wrote:
> > > > >
> > > > > > > Thanks for drafting the FLIP and kicking off the discussion,
> > > > > > > Yangze.
> > > > > > >
> > > > > > > Big +1 for this feature. Supporting the use of GPUs in Flink is
> > > > > > > significant, especially for ML scenarios.
> > > > > > > I've reviewed the FLIP wiki doc and it looks good to me. I think
> > > > > > > it's a very good first step for Flink's GPU support.
> > > > > >
> > > > > > Thank you~
> > > > > >
> > > > > > Xintong Song
> > > > > >
> > > > > >
> > > > > >
> > > > > > > On Mon, Mar 2, 2020 at 12:06 PM Yangze Guo <karma...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi everyone,
> > > > > > >
> > > > > > > We would like to start a discussion thread on "FLIP-108: Add GPU
> > > > > > > support in Flink"[1].
> > > > > > >
> > > > > > > This FLIP mainly discusses the following issues:
> > > > > > >
> > > > > > > > - Enable users to configure how many GPUs a task executor has and
> > > > > > > > forward such requirements to the external resource managers (for
> > > > > > > > Kubernetes/Yarn/Mesos setups).
> > > > > > > > - Provide information about available GPU resources to operators.
> > > > > > > >
> > > > > > > > Key changes proposed in the FLIP are as follows:
> > > > > > > >
> > > > > > > > - Forward GPU resource requirements to Yarn/Kubernetes.
> > > > > > > > - Introduce GPUManager as one of the task manager services to
> > > > > > > > discover and expose GPU resource information to the context of
> > > > > > > > functions.
> > > > > > > > - Introduce the default script for GPU discovery, in which we
> > > > > > > > provide the privilege mode to help users achieve worker-level
> > > > > > > > isolation in standalone mode.
> > > > > > > >
> > > > > > > > Please find more details in the FLIP wiki document [1]. Looking
> > > > > > > > forward to your feedback.
> > > > > > > >
> > > > > > > > [1]
> > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-108%3A+Add+GPU+support+in+Flink
> > > > > > >
> > > > > > > Best,
> > > > > > > Yangze Guo
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
