Thank you for writing this FLIP. I cannot offer much input on the mechanics of GPU-aware scheduling and GPU allocation, as I have no experience with either.
One thought I had when reading the proposal is whether it makes sense to look at the "GPU Manager" as an "External Resource Manager", with GPU being one such resource. The way I understand the ResourceProfile and ResourceSpec, that is how it is done there. This has the advantage of looking more extensible. Maybe there is a GPU Resource, a specialized NVIDIA GPU Resource, an FPGA Resource, an Alibaba TPU Resource, etc.

Best,
Stephan

On Tue, Mar 3, 2020 at 7:57 AM Becket Qin <becket....@gmail.com> wrote:

> Thanks for the FLIP Yangze. GPU resource management support is a must-have
> for machine learning use cases. Actually it is one of the mostly asked
> question from the users who are interested in using Flink for ML.
>
> Some quick comments / questions to the wiki.
> 1. The WebUI / REST API should probably also be mentioned in the public
> interface section.
> 2. Is the data structure that holds GPU info also a public API?
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Tue, Mar 3, 2020 at 10:15 AM Xintong Song <tonysong...@gmail.com>
> wrote:
>
> > Thanks for drafting the FLIP and kicking off the discussion, Yangze.
> >
> > Big +1 for this feature. Supporting using of GPU in Flink is significant,
> > especially for the ML scenarios.
> > I've reviewed the FLIP wiki doc and it looks good to me. I think it's a
> > very good first step for Flink's GPU supports.
> >
> > Thank you~
> >
> > Xintong Song
> >
> > On Mon, Mar 2, 2020 at 12:06 PM Yangze Guo <karma...@gmail.com> wrote:
> >
> > > Hi everyone,
> > >
> > > We would like to start a discussion thread on "FLIP-108: Add GPU
> > > support in Flink"[1].
> > >
> > > This FLIP mainly discusses the following issues:
> > >
> > > - Enable user to configure how many GPUs in a task executor and
> > > forward such requirements to the external resource managers (for
> > > Kubernetes/Yarn/Mesos setups).
> > > - Provide information of available GPU resources to operators.
> > >
> > > Key changes proposed in the FLIP are as follows:
> > >
> > > - Forward GPU resource requirements to Yarn/Kubernetes.
> > > - Introduce GPUManager as one of the task manager services to discover
> > > and expose GPU resource information to the context of functions.
> > > - Introduce the default script for GPU discovery, in which we provide
> > > the privilege mode to help user to achieve worker-level isolation in
> > > standalone mode.
> > >
> > > Please find more details in the FLIP wiki document [1]. Looking forward to
> > > your feedbacks.
> > >
> > > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-108%3A+Add+GPU+support+in+Flink
> > >
> > > Best,
> > > Yangze Guo