Re: [DISCUSS] FLIP-108: Add GPU support in Flink

Xintong Song Thu, 26 Mar 2020 19:42:25 -0700

Thanks for updating the FLIP, Yangze.

I agree with Till that we probably want to separate the K8s/Yarn decorator
calls. Users can still configure a single driver class, and we can use
`instanceof` to check whether the driver implements the K8s/Yarn-specific
interfaces.
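
As a rough sketch of the `instanceof` check (every interface and class name
below is hypothetical and not taken from the FLIP; plain strings stand in for
the real `ContainerRequest` / `Pod` decoration):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical base interface: only deployment-agnostic operations live here.
interface ExternalResourceDriver {
    String resourceName();
}

// Hypothetical deployment-specific interfaces; a driver opts in by implementing them.
interface YarnResourceDecorator {
    String describeYarnRequest();
}

interface KubernetesResourceDecorator {
    String describeKubernetesPod();
}

// Example driver that supports Yarn only, so it never needs K8s classes.
class YarnOnlyGpuDriver implements ExternalResourceDriver, YarnResourceDecorator {
    @Override
    public String resourceName() {
        return "gpu";
    }

    @Override
    public String describeYarnRequest() {
        return "yarn:" + resourceName();
    }
}

final class DriverDispatch {
    private DriverDispatch() {}

    // Applies only those drivers that implement the Yarn-specific interface.
    static List<String> applyForYarn(List<? extends ExternalResourceDriver> drivers) {
        List<String> applied = new ArrayList<>();
        for (ExternalResourceDriver driver : drivers) {
            if (driver instanceof YarnResourceDecorator) {
                applied.add(((YarnResourceDecorator) driver).describeYarnRequest());
            }
        }
        return applied;
    }

    // Applies only those drivers that implement the Kubernetes-specific interface.
    static List<String> applyForKubernetes(List<? extends ExternalResourceDriver> drivers) {
        List<String> applied = new ArrayList<>();
        for (ExternalResourceDriver driver : drivers) {
            if (driver instanceof KubernetesResourceDecorator) {
                applied.add(((KubernetesResourceDecorator) driver).describeKubernetesPod());
            }
        }
        return applied;
    }
}
```

A driver that implements only the Yarn interface is simply skipped on
Kubernetes, which is how "not supported on this deployment" would be expressed.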


Moreover, I'm not sure about exposing the entire `ContainerRequest` / `Pod`
(`AbstractKubernetesStepDecorator` directly manipulates the `Pod`) to user
code. It gives user code more access than is needed for defining an external
resource, which might cause problems. Instead, I would suggest having an
interface like `Map<String, String> getYarn/KubernetesExternalResource()`
(mapping resource key to value) and assembling the returned entries into the
`ContainerRequest` / `Pod` in the Yarn/KubernetesResourceManager.
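
A minimal sketch of that idea, under loud assumptions: all type names are
hypothetical, `ContainerRequestSketch` is only a stand-in for the real Yarn
`ContainerRequest`, and the `yarn.io/gpu` key is just an illustrative value.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical narrow interface: user code only returns key/value pairs.
interface YarnExternalResourceProvider {
    Map<String, String> getYarnExternalResource();
}

// Example user-defined provider asking for a fixed number of GPUs.
class GpuResourceProvider implements YarnExternalResourceProvider {
    private final long amount;

    GpuResourceProvider(long amount) {
        this.amount = amount;
    }

    @Override
    public Map<String, String> getYarnExternalResource() {
        Map<String, String> resources = new HashMap<>();
        resources.put("yarn.io/gpu", Long.toString(amount)); // illustrative key
        return resources;
    }
}

// Stand-in for the real container request; user code never sees this object.
final class ContainerRequestSketch {
    final Map<String, String> externalResources = new LinkedHashMap<>();
}

final class ResourceManagerSketch {
    private ResourceManagerSketch() {}

    // The resource manager, not user code, assembles the entries into the request.
    static ContainerRequestSketch buildRequest(
            List<? extends YarnExternalResourceProvider> providers) {
        ContainerRequestSketch request = new ContainerRequestSketch();
        for (YarnExternalResourceProvider provider : providers) {
            request.externalResources.putAll(provider.getYarnExternalResource());
        }
        return request;
    }
}
```

With this shape the provider has no dependency on Hadoop or Kubernetes classes;
only the resource manager side touches the deployment-specific request object.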

Thank you~

Xintong Song



On Fri, Mar 27, 2020 at 1:10 AM Till Rohrmann <[email protected]> wrote:

> Hi everyone,
>
> I'm a bit late to the party. I think the current proposal looks good.
>
> Concerning the ExternalResourceDriver interface defined in the FLIP [1], I
> would suggest to not include the decorator calls for Kubernetes and Yarn in
> the base interface. Instead I would suggest to segregate the deployment
> specific decorator calls into separate interfaces. That way an
> ExternalResourceDriver does not have to support all deployments from the
> very beginning. Moreover, some resources might not be supported by a
> specific deployment target and the natural way to express this would be to
> not implement the respective deployment specific interface.
>
> Moreover, having void
> addExternalResourceToRequest(AMRMClient.ContainerRequest containerRequest)
> in the ExternalResourceDriver interface would require Hadoop on Flink's
> classpath whenever the external resource driver is being used.
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-108%3A+Add+GPU+support+in+Flink
>
> Cheers,
> Till
>
> On Thu, Mar 26, 2020 at 12:45 PM Stephan Ewen <[email protected]> wrote:
>
> > Nice, thanks a lot!
> >
> > On Thu, Mar 26, 2020 at 10:21 AM Yangze Guo <[email protected]> wrote:
> >
> > > Thanks for the suggestion, @Stephan, @Becket and @Xintong.
> > >
> > > I've updated the FLIP accordingly. I do not add a
> > > ResourceInfoProvider. Instead, I introduce the ExternalResourceDriver,
> > > which takes the responsibility of all relevant operations on both RM
> > > and TM sides.
> > > After a rethink about decoupling the management of external resources
> > > from TaskExecutor, I think we could do the same thing on the
> > > ResourceManager side. We do not need to add a specific allocation
> > > logic to the ResourceManager each time we add a specific external
> > > resource.
> > > - For Yarn, we need the ExternalResourceDriver to edit the
> > > containerRequest.
> > > > - For Kubernetes, ExternalResourceDriver could provide a decorator for
> > > the TM pod.
> > >
> > > In this way, just like MetricReporter, we allow users to define their
> > > custom ExternalResourceDriver. It is more extensible and fits the
> > > separation of concerns. For more details, please take a look at [1].
> > >
> > > [1]
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-108%3A+Add+GPU+support+in+Flink
> > >
> > > Best,
> > > Yangze Guo
> > >
> > > On Wed, Mar 25, 2020 at 7:32 PM Stephan Ewen <[email protected]> wrote:
> > > >
> > > > This sounds good to go ahead from my side.
> > > >
> > > > I like the approach that Becket suggested - in that case the core
> > > > abstraction that everyone would need to understand would be "external
> > > > resource allocation" and the "ResourceInfoProvider", and the GPU specific
> > > > code would be a specific implementation only known to that component that
> > > > allocates the external resource. That fits the separation of concerns
> > > > well.
> > > >
> > > > I also understand that it should not be over-engineered in the first
> > > > version, so some simplification makes sense, and then gradually expand
> > > > from there.
> > > >
> > > > So +1 to go ahead with what was suggested above (Xintong / Becket) from my
> > > > side.
> > > >
> > > > On Mon, Mar 23, 2020 at 6:55 AM Xintong Song <[email protected]> wrote:
> > > >
> > > > > Thanks for the comments, Stephan & Becket.
> > > > >
> > > > > @Stephan
> > > > >
> > > > > I see your concern, and I completely agree with you that we should first
> > > > > think about the "library" / "plugin" / "extension" style if possible.
> > > > >
> > > > > > If GPUs are sliced and assigned during scheduling, there may be reason,
> > > > > > although it looks that it would belong to the slot then. Is that what we
> > > > > > are doing here?
> > > > >
> > > > >
> > > > > In the current proposal, we do not have the GPUs sliced and assigned to
> > > > > slots, because it could be problematic without dynamic slot allocation.
> > > > > E.g., the number of GPUs might not be evenly divisible by the number of
> > > > > slots.
> > > > >
> > > > > I think it makes sense to eventually have the GPUs assigned to slots. Even
> > > > > then, we might still need a TM level GPUManager (or ResourceProvider like
> > > > > Becket suggested). For memory, in each slot we can simply request the
> > > > > amount of memory, leaving it to JVM / OS to decide which memory (address)
> > > > > should be assigned. For GPU, and potentially other resources like FPGA, we
> > > > > need to explicitly specify which GPU (index) should be used. Therefore, we
> > > > > need some component at the TM level to coordinate which slot uses which
> > > > > GPU.
> > > > >
> > > > > IMO, unless we say Flink will not support slot-level GPU slicing at least
> > > > > in the foreseeable future, I don't see a good way to avoid touching the TM
> > > > > core. To that end, I think Becket's suggestion points to a good direction,
> > > > > that supports more features (GPU, FPGA, etc.) with less coupling to the TM
> > > > > core (only needs to understand the general interfaces). The detailed
> > > > > implementation for specific resource types can even be encapsulated as a
> > > > > library.
> > > > >
> > > > > @Becket
> > > > >
> > > > > Thanks for sharing your thought on the final state. Despite the details how
> > > > > the interfaces should look like, I think this is a really good abstraction
> > > > > for supporting general resource types.
> > > > >
> > > > > I'd like to further clarify that, the following three things are all that
> > > > > the "Flink core" needs to understand.
> > > > >
> > > > >    - The *amount* of resource, for scheduling. Actually, we already have
> > > > >    the Resource class in ResourceProfile and ResourceSpec for extended
> > > > >    resource. It's just not really used.
> > > > >    - The *info*, that Flink provides to the operators / user codes.
> > > > >    - The *provider*, which generates the info based on the amount.
> > > > >
> > > > > The "core" does not need to understand the specific implementation details
> > > > > of the above three. They can even be implemented in a 3rd-party library.
> > > > > Similar to how we allow users to define their custom MetricReporter.
> > > > >
> > > > > Thank you~
> > > > >
> > > > > Xintong Song
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Mar 23, 2020 at 8:45 AM Becket Qin <[email protected]> wrote:
> > > > >
> > > > > > Thanks for the comment, Stephan.
> > > > > >
> > > > > > >   - If everything becomes a "core feature", it will make the project hard
> > > > > > > to develop in the future. Thinking "library" / "plugin" / "extension"
> > > > > > > style where possible helps.
> > > > > >
> > > > > >
> > > > > > Completely agree. It is much more important to design a mechanism than
> > > > > > to focus on a specific case. Here is what I am thinking of to fully support
> > > > > > custom resource management:
> > > > > > 1. On the JM / RM side, use ResourceProfile and ResourceSpec to define the
> > > > > > resource and the amount required. They will be used to find suitable TM
> > > > > > slots to run the tasks. At this point, the resources are only measured by
> > > > > > amount, i.e. they do not have individual IDs.
> > > > > >
> > > > > > 2. On the TM side, have something like *"ResourceInfoProvider"* to identify
> > > > > > and provide the detailed information of the individual resource, e.g. GPU
> > > > > > ID. It is important because the operator may have to explicitly interact
> > > > > > with the physical resource it uses. The ResourceInfoProvider might look
> > > > > > like something below.
> > > > > > interface ResourceInfoProvider<INFO> {
> > > > > >     Map<AbstractID, INFO> retrieveResourceInfo(
> > > > > >         OperatorId opId, ResourceProfile resourceProfile);
> > > > > > }
> > > > > >
> > > > > > - There could be several "*ResourceInfoProvider*" configured on the TM to
> > > > > > retrieve the information for different resources.
> > > > > > - The TM will be responsible to assign those individual resources to each
> > > > > > operator according to their requested amount.
> > > > > > - The operators will be able to get the ResourceInfo from their
> > > > > > RuntimeContext.
> > > > > >
> > > > > > If we agree this is a reasonable final state, we can adapt the current
> > > > > > FLIP to it. In fact, it does not sound like a big change to me. All the
> > > > > > proposed configuration can stay as is; it is just that Flink itself won't
> > > > > > care about them, and instead a GPUInfoProvider implementing the
> > > > > > ResourceInfoProvider will use them.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jiangjie (Becket) Qin
> > > > > >
> > > > > > On Mon, Mar 23, 2020 at 1:47 AM Stephan Ewen <[email protected]> wrote:
> > > > > >
> > > > > > > Hi all!
> > > > > > >
> > > > > > > > The main point I wanted to throw into the discussion is the following:
> > > > > > > >   - With more and more use cases, more and more tools go into Flink
> > > > > > > >   - If everything becomes a "core feature", it will make the project hard
> > > > > > > > to develop in the future. Thinking "library" / "plugin" / "extension"
> > > > > > > > style where possible helps.
> > > > > > > >
> > > > > > > >   - A good thought experiment is always: How many future developers have to
> > > > > > > > interact with this code (and possibly understand it partially), even if the
> > > > > > > > features they touch have nothing to do with GPU support. If many
> > > > > > > > contributors to unrelated features will have to touch it and understand it,
> > > > > > > > then let's think if there is a different solution. Maybe there is not, but
> > > > > > > > then we should be sure why.
> > > > > > > >
> > > > > > > >   - That led me to raising this issue: If the GPU manager becomes a core
> > > > > > > > service in the TaskManager, Environment, RuntimeContext, etc. then everyone
> > > > > > > > developing TM and streaming tasks need to understand the GPU manager. That
> > > > > > > > seems oddly specific, is my impression.
> > > > > > > >
> > > > > > > > Access to configuration seems not the right reason to do that. We should
> > > > > > > > expose the Flink configuration from the RuntimeContext anyways.
> > > > > > > >
> > > > > > > > If GPUs are sliced and assigned during scheduling, there may be reason,
> > > > > > > > although it looks that it would belong to the slot then. Is that what we
> > > > > > > > are doing here?
> > > > > > >
> > > > > > > Best,
> > > > > > > Stephan
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Mar 20, 2020 at 2:58 AM Xintong Song <
> > > [email protected]>
> > > > > > > wrote:
> > > > > > >
> > > > > > > >  Thanks for the feedback, Becket.
> > > > > > > >
> > > > > > > > IMO, eventually an operator should only see info of GPUs that
> > are
> > > > > > > dedicated
> > > > > > > > for it, instead of all GPUs on the machine/container in the
> > > current
> > > > > > > design.
> > > > > > > > It does not make sense to let the user who writes a UDF to
> > worry
> > > > > about
> > > > > > > > coordination among multiple operators running on the same
> > > machine.
> > > > > And
> > > > > > if
> > > > > > > > we want to limit the GPU info an operator sees, we should not
> > > let the
> > > > > > > > operator to instantiate GPUManager, which means we have to
> > expose
> > > > > > > something
> > > > > > > > through runtime context, either GPU info or some kind of
> > limited
> > > > > access
> > > > > > > to
> > > > > > > > the GPUManager.
> > > > > > > >
> > > > > > > > Thank you~
> > > > > > > >
> > > > > > > > Xintong Song
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Thu, Mar 19, 2020 at 5:48 PM Becket Qin <
> > [email protected]
> > > >
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > It probably make sense for us to first agree on the final
> > > state.
> > > > > More
> > > > > > > > > specifically, will the resource info be exposed through
> > runtime
> > > > > > context
> > > > > > > > > eventually?
> > > > > > > > >
> > > > > > > > > If that is the final state and we have a seamless migration
> > > story
> > > > > > from
> > > > > > > > this
> > > > > > > > > FLIP to that final state, Personally I think it is OK to
> > > expose the
> > > > > > GPU
> > > > > > > > > info in the runtime context.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > >
> > > > > > > > > Jiangjie (Becket) Qin
> > > > > > > > >
> > > > > > > > > On Mon, Mar 16, 2020 at 11:21 AM Xintong Song <
> > > > > [email protected]
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > @Yangze,
> > > > > > > > > > I think what Stephan means (@Stephan, please correct me
> if
> > > I'm
> > > > > > wrong)
> > > > > > > > is
> > > > > > > > > > that, we might not need to hold and maintain the
> GPUManager
> > > as a
> > > > > > > > service
> > > > > > > > > in
> > > > > > > > > > TaskManagerServices or RuntimeContext. An alternative is
> to
> > > > > create
> > > > > > /
> > > > > > > > > > retrieve the GPUManager only in the operators that need
> it,
> > > e.g.,
> > > > > > > with
> > > > > > > > a
> > > > > > > > > > static method `GPUManager.get()`.
> > > > > > > > > >
> > > > > > > > > > @Stephan,
> > > > > > > > > > I agree with you on excluding GPUManager from
> > > > > TaskManagerServices.
> > > > > > > > > >
> > > > > > > > > >    - For the first step, where we provide unified
> TM-level
> > > GPU
> > > > > > > > > information
> > > > > > > > > >    to all operators, it should be fine to have operators
> > > access /
> > > > > > > > > >    lazy-initiate GPUManager by themselves.
> > > > > > > > > >    - In future, we might have some more fine-grained GPU
> > > > > > management,
> > > > > > > > > where
> > > > > > > > > >    we need to maintain GPUManager as a service and put
> GPU
> > > info
> > > > > in
> > > > > > > slot
> > > > > > > > > >    profiles. But at least for now it's not necessary to
> > > introduce
> > > > > > > such
> > > > > > > > > >    complexity.
> > > > > > > > > >
> > > > > > > > > > However, I have some concerns on excluding GPUManager
> from
> > > > > > > > RuntimeContext
> > > > > > > > > > and let operators access it directly.
> > > > > > > > > >
> > > > > > > > > >    - Configurations needed for creating the GPUManager is
> > not
> > > > > > always
> > > > > > > > > >    available for operators.
> > > > > > > > > >    - If later we want to have fine-grained control over
> GPU
> > > > > (e.g.,
> > > > > > > > > >    operators in each slot can only see GPUs reserved for
> > that
> > > > > > slot),
> > > > > > > > the
> > > > > > > > > >    approach cannot be easily extended.
> > > > > > > > > >
> > > > > > > > > > I would suggest to wrap the GPUManager behind
> > RuntimeContext
> > > and
> > > > > > only
> > > > > > > > > > expose the GPUInfo to users. For now, we can declare a
> > method
> > > > > > > > > > `getGPUInfo()` in RuntimeContext, with a default
> definition
> > > that
> > > > > > > calls
> > > > > > > > > > `GPUManager.get()` to get the lazily-created GPUManager.
> If
> > > later
> > > > > > we
> > > > > > > > want
> > > > > > > > > > to create / retrieve GPUManager in a different way, we
> can
> > > simply
> > > > > > > > change
> > > > > > > > > > how `getGPUInfo` is implemented, without needing to
> change
> > > any
> > > > > > public
> > > > > > > > > > interfaces.
> > > > > > > > > >
> > > > > > > > > > Thank you~
> > > > > > > > > >
> > > > > > > > > > Xintong Song
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Sat, Mar 14, 2020 at 10:09 AM Yangze Guo <
> > > [email protected]>
> > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > > @Stephan
> > > > > > > > > > > Do you mean Minicluster? Yes, it makes sense to share
> the
> > > GPU
> > > > > > > Manager
> > > > > > > > > > > in such scenario.
> > > > > > > > > > > If that's what you worry about, I'm +1 for holding
> > > > > > > > > > > GPUManager(ExternalResourceManagers) in TaskExecutor
> > > instead of
> > > > > > > > > > > TaskManagerServices.
> > > > > > > > > > >
> > > > > > > > > > > Regarding the RuntimeContext/FunctionContext, it just
> > > holds the
> > > > > > GPU
> > > > > > > > > > > info instead of the GPU Manager. AFAIK, it's the only
> > > place we
> > > > > > > could
> > > > > > > > > > > pass GPU info to the RichFunction/UserDefinedFunction.
> > > > > > > > > > >
> > > > > > > > > > > Best,
> > > > > > > > > > > Yangze Guo
> > > > > > > > > > >
> > > > > > > > > > > On Sat, Mar 14, 2020 at 4:06 AM Isaac Godfried <
> > > > > > > [email protected]
> > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > ---- On Fri, 13 Mar 2020 15:58:20 +0000
> > [email protected]
> > > > > wrote
> > > > > > > > ----
> > > > > > > > > > > >
> > > > > > > > > > > > > > Can we somehow keep this out of the TaskManager
> > > services
> > > > > > > > > > > > > I fear that we could not. IMO, the GPUManager(or
> > > > > > > > > > > > > ExternalServicesManagers in future) is conceptually
> > > one of
> > > > > > the
> > > > > > > > task
> > > > > > > > > > > > > manager services, just like MemoryManager before
> > 1.10.
> > > > > > > > > > > > > - It maintains/holds the GPU resource at TM level
> and
> > > all
> > > > > of
> > > > > > > the
> > > > > > > > > > > > > operators allocate the GPU resources from it. So,
> it
> > > should
> > > > > > be
> > > > > > > > > > > > > exclusive to a single TaskExecutor.
> > > > > > > > > > > > > - We could add a collection called
> > > ExternalResourceManagers
> > > > > > to
> > > > > > > > hold
> > > > > > > > > > > > > all managers of other external resources in the
> > future.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Can you help me understand why this needs the
> addition
> > in
> > > > > > > > > > > > TaskManagerServices
> > > > > > > > > > > > or in the RuntimeContext?
> > > > > > > > > > > > Are you worried about the case when multiple Task
> > > Executors
> > > > > run
> > > > > > > in
> > > > > > > > > the
> > > > > > > > > > > same
> > > > > > > > > > > > JVM? That's not common, but wouldn't it actually be
> > good
> > > in
> > > > > > that
> > > > > > > > case
> > > > > > > > > > to
> > > > > > > > > > > > share the GPU Manager, given that the GPU is shared?
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Stephan
> > > > > > > > > > > >
> > > > > > > > > > > > ---------------------------
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > What parts need information about this?
> > > > > > > > > > > > > In this FLIP, operators need the information. Thus,
> > we
> > > > > expose
> > > > > > > GPU
> > > > > > > > > > > > > information to the RuntimeContext/FunctionContext.
> > The
> > > slot
> > > > > > > > profile
> > > > > > > > > > is
> > > > > > > > > > > > > not aware of GPU resources as GPU is TM level
> > resource
> > > now.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Can the GPU Manager be a "self contained" thing
> > that
> > > > > simply
> > > > > > > > takes
> > > > > > > > > > the
> > > > > > > > > > > > > configuration, and then abstracts everything
> > > internally?
> > > > > > > > > > > > > Yes, we just pass the path/args of the discover
> > script
> > > and
> > > > > > how
> > > > > > > > many
> > > > > > > > > > > > > GPUs per TM to it. It takes the responsibility to
> get
> > > the
> > > > > GPU
> > > > > > > > > > > > > information and expose them to the
> > > > > > > RuntimeContext/FunctionContext
> > > > > > > > > of
> > > > > > > > > > > > > Operators. Meanwhile, we'd better not allow
> operators
> > > to
> > > > > > > directly
> > > > > > > > > > > > > access GPUManager, it should get what they want
> from
> > > > > Context.
> > > > > > > We
> > > > > > > > > > could
> > > > > > > > > > > > > then decouple the interface/implementation of
> > > GPUManager
> > > > > and
> > > > > > > > Public
> > > > > > > > > > > > > API.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Best,
> > > > > > > > > > > > > Yangze Guo
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Mar 13, 2020 at 7:26 PM Stephan Ewen <
> > > > > > [email protected]
> > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > It sounds fine to initially start with GPU
> specific
> > > > > support
> > > > > > > and
> > > > > > > > > > think
> > > > > > > > > > > > > about
> > > > > > > > > > > > > > generalizing this once we better understand the
> > > space.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > About the implementation suggested in FLIP-108:
> > > > > > > > > > > > > > - Can we somehow keep this out of the TaskManager
> > > > > services?
> > > > > > > > > > Anything
> > > > > > > > > > > we
> > > > > > > > > > > > > > have to pull through all layers of the TM makes
> the
> > > TM
> > > > > > > > components
> > > > > > > > > > yet
> > > > > > > > > > > > > more
> > > > > > > > > > > > > > complex and harder to maintain.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > - What parts need information about this?
> > > > > > > > > > > > > > -> do the slot profiles need information about
> the
> > > GPU?
> > > > > > > > > > > > > > -> Can the GPU Manager be a "self contained"
> thing
> > > that
> > > > > > > simply
> > > > > > > > > > takes
> > > > > > > > > > > > > > the configuration, and then abstracts everything
> > > > > > internally?
> > > > > > > > > > > Operators
> > > > > > > > > > > > > can
> > > > > > > > > > > > > > access it via "GPUManager.get()" or so?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Mar 4, 2020 at 4:19 AM Yangze Guo <
> > > > > > > [email protected]>
> > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks for all the feedbacks.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > @Becket
> > > > > > > > > > > > > > > Regarding the WebUI and GPUInfo, you're right,
> > > I'll add
> > > > > > > them
> > > > > > > > to
> > > > > > > > > > the
> > > > > > > > > > > > > > > Public API section.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > @Stephan @Becket
> > > > > > > > > > > > > > > Regarding the general extended resource
> > mechanism,
> > > I
> > > > > > second
> > > > > > > > > > > Xintong's
> > > > > > > > > > > > > > > suggestion.
> > > > > > > > > > > > > > > - It's better to leverage ResourceProfile and
> > > > > > ResourceSpec
> > > > > > > > > after
> > > > > > > > > > we
> > > > > > > > > > > > > > > supporting fine-grained GPU scheduling. As a
> > first
> > > step
> > > > > > > > > > proposal, I
> > > > > > > > > > > > > > > prefer to not include it in the scope of this
> > FLIP.
> > > > > > > > > > > > > > > - Regarding the "Extended Resource Manager",
> if I
> > > > > > > understand
> > > > > > > > > > > > > > > correctly, it just a code refactoring atm, we
> > could
> > > > > > extract
> > > > > > > > the
> > > > > > > > > > > > > > > open/close/allocateExtendResources of
> GPUManager
> > to
> > > > > that
> > > > > > > > > > > interface. If
> > > > > > > > > > > > > > > that is the case, +1 to do it during
> > > implementation.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > @Xingbo
> > > > > > > > > > > > > > > As Xintong said, we looked into how Spark
> > supports
> > > a
> > > > > > > general
> > > > > > > > > > > "Custom
> > > > > > > > > > > > > > > Resource Scheduling" before and decided to
> > > introduce a
> > > > > > > common
> > > > > > > > > > > resource
> > > > > > > > > > > > > > > configuration
> > > > > > > > > > > > > > >
> > > > > > > > > >
> > > > > schema(taskmanager.resource.{resourceName}.amount/discovery-script)
> > > > > > > > > > > > > > > to make it more extensible. I think the
> > "resource"
> > > is a
> > > > > > > > proper
> > > > > > > > > > > level
> > > > > > > > > > > > > > > to contain all the configs of extended
> resources.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > > Yangze Guo
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Wed, Mar 4, 2020 at 10:48 AM Xingbo Huang <
> > > > > > > > > [email protected]
> > > > > > > > > > >
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks a lot for the FLIP, Yangze.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > There is no doubt that GPU resource
> management
> > > > > support
> > > > > > > will
> > > > > > > > > > > greatly
> > > > > > > > > > > > > > > > facilitate the development of AI-related
> > > applications
> > > > > > by
> > > > > > > > > > PyFlink
> > > > > > > > > > > > > users.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I have only one comment about this wiki:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Regarding the names of several GPU
> > > configurations, I
> > > > > > > think
> > > > > > > > it
> > > > > > > > > > is
> > > > > > > > > > > > > better
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > delete the resource field makes it consistent
> > > with
> > > > > the
> > > > > > > > names
> > > > > > > > > of
> > > > > > > > > > > other
> > > > > > > > > > > > > > > > resource-related configurations in
> > > TaskManagerOption.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > e.g.
> > > taskmanager.resource.gpu.discovery-script.path
> > > > > ->
> > > > > > > > > > > > > > > > taskmanager.gpu.discovery-script.path
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Xingbo
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Xintong Song <[email protected]>
> > > 于2020年3月4日周三
> > > > > > > > 上午10:39写道：
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > @Stephan, @Becket,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Actually, Yangze, Yang and I also had an
> > > offline
> > > > > > > > discussion
> > > > > > > > > > > about
> > > > > > > > > > > > > > > making
> > > > > > > > > > > > > > > > > the "GPU Support" as some general "Extended
> > > > > Resource
> > > > > > > > > > Support".
> > > > > > > > > > > We
> > > > > > > > > > > > > > > believe
> > > > > > > > > > > > > > > > > supporting extended resources in a general
> > > > > mechanism
> > > > > > is
> > > > > > > > > > > definitely
> > > > > > > > > > > > > a
> > > > > > > > > > > > > > > good
> > > > > > > > > > > > > > > > > and extensible way. The reason we propose
> > this
> > > FLIP
> > > > > > > > > narrowing
> > > > > > > > > > > its
> > > > > > > > > > > > > scope
> > > > > > > > > > > > > > > > > down to GPU alone, is mainly for the
> concern
> > on
> > > > > extra
> > > > > > > > > efforts
> > > > > > > > > > > and
> > > > > > > > > > > > > > > review
> > > > > > > > > > > > > > > > > capacity needed for a general mechanism.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > To come up with a well design on a general
> > > extended
> > > > > > > > > resource
> > > > > > > > > > > > > management
> > > > > > > > > > > > > > > > > mechanism, we would need to investigate
> more
> > > on how
> > > > > > > > people
> > > > > > > > > > use
> > > > > > > > > > > > > > > different
> > > > > > > > > > > > > > > > > kind of resources in practice. For GPU, we
> > > learnt
> > > > > > such
> > > > > > > > > > > knowledge
> > > > > > > > > > > > > from
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > experts, Becket and his team members. But
> for
> > > FPGA,
> > > > > > or
> > > > > > > > > other
> > > > > > > > > > > > > potential
> > > > > > > > > > > > > > > > > extended resources, we don't have such
> > > convenient
> > > > > > > > > information
> > > > > > > > > > > > > sources,
> > > > > > > > > > > > > > > > > making the investigation requires more
> > efforts,
> > > > > > which I
> > > > > > > > > tend
> > > > > > > > > > to
> > > > > > > > > > > > > think
> > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > > not necessary atm.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On the other hand, we also looked into how
> > > Spark
> > > > > > > > supports a
> > > > > > > > > > > general
> > > > > > > > > > > > > > > "Custom
> > > > > > > > > > > > > > > > > Resource Scheduling". Assuming we want to
> > have
> > > a
> > > > > > > similar
> > > > > > > > > > > general
> > > > > > > > > > > > > > > extended
> > > > > > > > > > > > > > > > > resource mechanism in the future, we
> believe
> > > that
> > > > > the
> > > > > > > > > current
> > > > > > > > > > > GPU
> > > > > > > > > > > > > > > support
> > > > > > > > > > > > > > > > > design can be easily extended, in an
> > > incremental
> > > > > way
> > > > > > > > > without
> > > > > > > > > > > too
> > > > > > > > > > > > > many
> > > > > > > > > > > > > > > > > reworks.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > - The most important part is probably user
> > > > > > interfaces.
> > > > > > > > > Spark
> > > > > > > > > > > > > offers
> > > > > > > > > > > > > > > > > configuration options to define the amount,
> > > > > discovery
> > > > > > > > > script
> > > > > > > > > > > and
> > > > > > > > > > > > > > > vendor
> > > > > > > > > > > > > > > > > (on
> > > > > > > > > > > > > > > > > > k8s) on a per resource type basis [1], which
> > is
> > > very
> > > > > > > > similar
> > > > > > > > > > to
> > > > > > > > > > > > > what
> > > > > > > > > > > > > > > we
> > > > > > > > > > > > > > > > > proposed in this FLIP. I think it's not
> > > necessary
> > > > > to
> > > > > > > > expose
> > > > > > > > > > > > > config
> > > > > > > > > > > > > > > > > options
> > > > > > > > > > > > > > > > > in the general way atm, since we do not
> have
> > > > > supports
> > > > > > > for
> > > > > > > > > > other
> > > > > > > > > > > > > > > resource
> > > > > > > > > > > > > > > > > types now. If later we decided to have per
> > > resource
> > > > > > > type
> > > > > > > > > > config
> > > > > > > > > > > > > > > > > options, we
> > > > > > > > > > > > > > > > > can have backwards compatibility on the
> > current
> > > > > > > proposed
> > > > > > > > > > > options
> > > > > > > > > > > > > > > with
> > > > > > > > > > > > > > > > > simple key mapping.
> > > > > > > > > > > > > > > > > - For the GPU Manager, if later needed we
> can
> > > > > change
> > > > > > it
> > > > > > > > to
> > > > > > > > > a
> > > > > > > > > > > > > > > "Extended
> > > > > > > > > > > > > > > > > Resource Manager" (or whatever it is
> called).
> > > That
> > > > > > > should
> > > > > > > > > be
> > > > > > > > > > a
> > > > > > > > > > > > > pure
> > > > > > > > > > > > > > > > > component-internal refactoring.
> > > > > > > > > > > > > > > > > - For ResourceProfile and ResourceSpec,
> there
> > > are
> > > > > > > already
> > > > > > > > > > > > > fields for
> > > > > > > > > > > > > > > > > general extended resource. We can of course
> > > > > leverage
> > > > > > > them
> > > > > > > > > > when
> > > > > > > > > > > > > > > > > supporting
> > > > > > > > > > > > > > > > > fine grained GPU scheduling. That is also
> not
> > > in
> > > > > the
> > > > > > > > scope
> > > > > > > > > of
> > > > > > > > > > > > > this
> > > > > > > > > > > > > > > first
> > > > > > > > > > > > > > > > > step proposal, and would require FLIP-56 to
> > be
> > > > > > finished
> > > > > > > > > > first.
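The "simple key mapping" mentioned in the first bullet above could look roughly like the following. This is a minimal sketch only: the class name and every option key in it are hypothetical placeholders, not options defined by the FLIP or by Flink. It rewrites GPU-specific keys to per-resource-type keys and never overrides a per-resource-type key the user set explicitly.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch only: all option keys below are hypothetical, not actual Flink options. */
final class GpuKeyMapping {

    // Hypothetical GPU-specific key -> hypothetical per-resource-type key.
    private static final Map<String, String> LEGACY_TO_GENERIC = new HashMap<>();

    static {
        LEGACY_TO_GENERIC.put("taskmanager.gpu.amount", "external-resource.gpu.amount");
        LEGACY_TO_GENERIC.put(
                "taskmanager.gpu.discovery-script", "external-resource.gpu.discovery-script");
    }

    /** Returns a copy of the config in which legacy keys are rewritten to generic keys. */
    static Map<String, String> migrate(Map<String, String> userConfig) {
        Map<String, String> result = new HashMap<>();
        // First pass: copy every entry that is not a legacy key.
        for (Map.Entry<String, String> e : userConfig.entrySet()) {
            if (!LEGACY_TO_GENERIC.containsKey(e.getKey())) {
                result.put(e.getKey(), e.getValue());
            }
        }
        // Second pass: map legacy keys, keeping any explicitly set generic key.
        for (Map.Entry<String, String> e : userConfig.entrySet()) {
            String generic = LEGACY_TO_GENERIC.get(e.getKey());
            if (generic != null) {
                result.putIfAbsent(generic, e.getValue());
            }
        }
        return result;
    }
}
```

With a mapping of this kind, a job configured only with the GPU-specific keys would keep working after per-resource-type options are introduced.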
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > To sum up, I agree with Becket that we should
> have
> > a
> > > > > > separate
> > > > > > > > > FLIP
> > > > > > > > > > > for
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > general extended resource mechanism, and
> keep
> > > it in
> > > > > > > mind
> > > > > > > > > when
> > > > > > > > > > > > > > > discussing
> > > > > > > > > > > > > > > > > and implementing the current one.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thank you~
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Xintong Song
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > [1]
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > >
> >
> https://spark.apache.org/docs/3.0.0-preview/configuration.html#custom-resource-scheduling-and-configuration-overview
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Wed, Mar 4, 2020 at 9:18 AM Becket Qin <
> > > > > > > > > > > [email protected]>
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > That's a good point, Stephan. It makes
> > total
> > > > > sense
> > > > > > to
> > > > > > > > > > > generalize
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > resource management to support custom
> > > resources.
> > > > > > > Having
> > > > > > > > > > that
> > > > > > > > > > > > > allows
> > > > > > > > > > > > > > > users
> > > > > > > > > > > > > > > > > > to add new resources by themselves. The
> > > general
> > > > > > > > resource
> > > > > > > > > > > > > management
> > > > > > > > > > > > > > > may
> > > > > > > > > > > > > > > > > > involve two different aspects:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > 1. The custom resource type definition.
> It
> > is
> > > > > > > supported
> > > > > > > > > by
> > > > > > > > > > > the
> > > > > > > > > > > > > > > extended
> > > > > > > > > > > > > > > > > > resources in ResourceProfile and
> > > ResourceSpec.
> > > > > This
> > > > > > > > will
> > > > > > > > > > > likely
> > > > > > > > > > > > > cover
> > > > > > > > > > > > > > > > > > > the majority of the cases.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > 2. The custom resource allocation logic,
> > > i.e. how
> > > > > > to
> > > > > > > > > assign
> > > > > > > > > > > the
> > > > > > > > > > > > > > > resources
> > > > > > > > > > > > > > > > > > to different tasks, operators, and so on.
> > > This
> > > > > may
> > > > > > > > > require
> > > > > > > > > > > two
> > > > > > > > > > > > > > > levels /
> > > > > > > > > > > > > > > > > > steps:
> > > > > > > > > > > > > > > > > > a. Subtask level - make sure the subtasks
> > > are put
> > > > > > > into
> > > > > > > > > > > > > suitable
> > > > > > > > > > > > > > > > > slots.
> > > > > > > > > > > > > > > > > > It is done by the global RM and is not
> > > > > customizable
> > > > > > > > right
> > > > > > > > > > > now.
> > > > > > > > > > > > > > > > > > b. Operator level - map the exact
> resource
> > > to the
> > > > > > > > > operators
> > > > > > > > > > > > > in
> > > > > > > > > > > > > > > TM.
> > > > > > > > > > > > > > > > > e.g.
> > > > > > > > > > > > > > > > > > GPU 1 for operator A, GPU 2 for operator
> B.
> > > This
> > > > > > step
> > > > > > > > is
> > > > > > > > > > > needed
> > > > > > > > > > > > > > > assuming
> > > > > > > > > > > > > > > > > > the global RM does not distinguish
> > individual
> > > > > > > resources
> > > > > > > > > of
> > > > > > > > > > > the
> > > > > > > > > > > > > same
> > > > > > > > > > > > > > > type.
> > > > > > > > > > > > > > > > > > It is true for memory, but not for GPU.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > The GPU manager is designed to do 2.b
> here.
> > > So it
> > > > > > > > should
> > > > > > > > > > > > > discover the
> > > > > > > > > > > > > > > > > > physical GPU information and bind/match
> > them
> > > to
> > > > > > each
> > > > > > > > > > > > operator.
> > > > > > > > > > > > > > > Making
> > > > > > > > > > > > > > > > > this
> > > > > > > > > > > > > > > > > > general will fill in the missing piece to
> > > support
> > > > > > > > custom
> > > > > > > > > > > resource
> > > > > > > > > > > > > > > type
> > > > > > > > > > > > > > > > > > definition. But I'd avoid calling it a
> > > "External
> > > > > > > > Resource
> > > > > > > > > > > > > Manager" to
> > > > > > > > > > > > > > > > > avoid
> > > > > > > > > > > > > > > > > > confusion with RM, maybe something like
> > > "Operator
> > > > > > > > > Resource
> > > > > > > > > > > > > Assigner"
> > > > > > > > > > > > > > > > > would
> > > > > > > > > > > > > > > > > > be more accurate. So for each resource
> type
> > > users
> > > > > > can
> > > > > > > > > have
> > > > > > > > > > an
> > > > > > > > > > > > > > > optional
> > > > > > > > > > > > > > > > > > "Operator Resource Assigner" in the TM.
> For
> > > > > memory,
> > > > > > > > users
> > > > > > > > > > > don't
> > > > > > > > > > > > > need
> > > > > > > > > > > > > > > > > this,
> > > > > > > > > > > > > > > > > > but for other extended resources, users
> may
> > > need
> > > > > > > that.
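To make step 2.b above concrete, here is a minimal sketch of what a pluggable "Operator Resource Assigner" might look like. The interface name follows the suggestion in this message, but the method signature, the `GpuAssigner` class, and the use of plain string GPU indexes are all assumptions made for illustration. None of it comes from the FLIP or from Flink's code base.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical plug-in point; the signature is assumed, not taken from the FLIP. */
interface OperatorResourceAssigner {
    /** Picks `amount` concrete resource ids for the operator, or throws if too few are free. */
    List<String> assign(String operatorId, int amount);
}

/** Example assigner that hands each discovered GPU index to at most one operator. */
final class GpuAssigner implements OperatorResourceAssigner {
    private final List<String> free;
    private final Map<String, List<String>> assigned = new HashMap<>();

    GpuAssigner(List<String> discoveredGpuIndexes) {
        // Copy so that the caller's list is never modified.
        this.free = new ArrayList<>(discoveredGpuIndexes);
    }

    @Override
    public List<String> assign(String operatorId, int amount) {
        if (amount < 0 || amount > free.size()) {
            throw new IllegalStateException(
                    "Requested " + amount + " GPUs, but " + free.size() + " are free");
        }
        // Take the first `amount` free GPUs and remove them from the free pool.
        List<String> picked = new ArrayList<>(free.subList(0, amount));
        free.subList(0, amount).clear();
        assigned.computeIfAbsent(operatorId, k -> new ArrayList<>()).addAll(picked);
        return picked;
    }
}
```

For a resource such as memory, where individual units are interchangeable, no assigner would be registered at all, which matches the "optional" nature described above.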
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Personally I think a pluggable "Operator
> > > Resource
> > > > > > > > > Assigner"
> > > > > > > > > > > is
> > > > > > > > > > > > > > > achievable
> > > > > > > > > > > > > > > > > > in this FLIP. But I am also OK with
> having
> > > that
> > > > > in
> > > > > > a
> > > > > > > > > > separate
> > > > > > > > > > > > > FLIP
> > > > > > > > > > > > > > > > > because
> > > > > > > > > > > > > > > > > > the interface between the "Operator
> > Resource
> > > > > > > Assigner"
> > > > > > > > > and
> > > > > > > > > > > > > operator
> > > > > > > > > > > > > > > may
> > > > > > > > > > > > > > > > > > take a while to settle down if we want to
> > > make it
> > > > > > > > > generic.
> > > > > > > > > > > But I
> > > > > > > > > > > > > > > think
> > > > > > > > > > > > > > > > > our
> > > > > > > > > > > > > > > > > > implementation should take this future
> work
> > > into
> > > > > > > > > > > consideration so
> > > > > > > > > > > > > > > that we
> > > > > > > > > > > > > > > > > > don't need to break backwards
> compatibility
> > > once
> > > > > we
> > > > > > > > have
> > > > > > > > > > > that.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Jiangjie (Becket) Qin
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Wed, Mar 4, 2020 at 12:27 AM Stephan
> > Ewen
> > > <
> > > > > > > > > > > [email protected]>
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Thank you for writing this FLIP.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > I cannot really give much input into
> the
> > > > > > mechanics
> > > > > > > of
> > > > > > > > > > > GPU-aware
> > > > > > > > > > > > > > > > > > scheduling
> > > > > > > > > > > > > > > > > > > and GPU allocation, as I have no
> > experience
> > > > > with
> > > > > > > > that.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > One thought I had when reading the
> > > proposal is
> > > > > if
> > > > > > > it
> > > > > > > > > > makes
> > > > > > > > > > > > > sense to
> > > > > > > > > > > > > > > > > look
> > > > > > > > > > > > > > > > > > at
> > > > > > > > > > > > > > > > > > > the "GPU Manager" as an "External
> > Resource
> > > > > > > Manager",
> > > > > > > > > and
> > > > > > > > > > > GPU
> > > > > > > > > > > > > is one
> > > > > > > > > > > > > > > > > such
> > > > > > > > > > > > > > > > > > > resource.
> > > > > > > > > > > > > > > > > > > The way I understand the
> ResourceProfile
> > > and
> > > > > > > > > > ResourceSpec,
> > > > > > > > > > > > > that is
> > > > > > > > > > > > > > > how
> > > > > > > > > > > > > > > > > it
> > > > > > > > > > > > > > > > > > > is done there.
> > > > > > > > > > > > > > > > > > > It has the advantage that it looks more
> > > > > > extensible.
> > > > > > > > > Maybe
> > > > > > > > > > > > > there is
> > > > > > > > > > > > > > > a
> > > > > > > > > > > > > > > > > GPU
> > > > > > > > > > > > > > > > > > > Resource, a specialized NVIDIA GPU
> > > Resource,
> > > > > > an
> > > > > > > FPGA
> > > > > > > > > > > > > > Resource, an
> > > > > > > > > > > > > > > > > Alibaba
> > > > > > > > > > > > > > > > > > > TPU Resource, etc.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > > > > > > Stephan
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > On Tue, Mar 3, 2020 at 7:57 AM Becket
> > Qin <
> > > > > > > > > > > > > [email protected]>
> > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Thanks for the FLIP Yangze. GPU
> > resource
> > > > > > > management
> > > > > > > > > > > support
> > > > > > > > > > > > > is a
> > > > > > > > > > > > > > > > > > > must-have
> > > > > > > > > > > > > > > > > > > > for machine learning use cases.
> > Actually
> > > it
> > > > > is
> > > > > > > one
> > > > > > > > of
> > > > > > > > > > the
> > > > > > > > > > > > > > most
> > > > > > > > > > > > > > > > > > asked
> > > > > > > > > > > > > > > > > > > > > questions from the users who are
> > > interested in
> > > > > > > using
> > > > > > > > > > Flink
> > > > > > > > > > > > > for ML.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Some quick comments / questions to
> the
> > > wiki.
> > > > > > > > > > > > > > > > > > > > 1. The WebUI / REST API should
> probably
> > > also
> > > > > be
> > > > > > > > > > > mentioned in
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > public
> > > > > > > > > > > > > > > > > > > > interface section.
> > > > > > > > > > > > > > > > > > > > 2. Is the data structure that holds
> GPU
> > > info
> > > > > > > also a
> > > > > > > > > > > public
> > > > > > > > > > > > > API?
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Jiangjie (Becket) Qin
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > On Tue, Mar 3, 2020 at 10:15 AM
> Xintong
> > > Song
> > > > > <
> > > > > > > > > > > > > > > [email protected]>
> > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Thanks for drafting the FLIP and
> > > kicking
> > > > > off
> > > > > > > the
> > > > > > > > > > > > > discussion,
> > > > > > > > > > > > > > > > > Yangze.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Big +1 for this feature. Supporting
> > > the use
> > > > > of
> > > > > > > GPU
> > > > > > > > in
> > > > > > > > > > > Flink
> > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > > > > significant,
> > > > > > > > > > > > > > > > > > > > > especially for the ML scenarios.
> > > > > > > > > > > > > > > > > > > > > I've reviewed the FLIP wiki doc and
> > it
> > > > > looks
> > > > > > > good
> > > > > > > > > to
> > > > > > > > > > > me. I
> > > > > > > > > > > > > > > think
> > > > > > > > > > > > > > > > > > it's a
> > > > > > > > > > > > > > > > > > > > > very good first step for Flink's
> GPU
> > > > > > > support.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Thank you~
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Xintong Song
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > On Mon, Mar 2, 2020 at 12:06 PM
> > Yangze
> > > Guo
> > > > > <
> > > > > > > > > > > > > [email protected]
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Hi everyone,
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > We would like to start a
> discussion
> > > > > thread
> > > > > > on
> > > > > > > > > > > "FLIP-108:
> > > > > > > > > > > > > Add
> > > > > > > > > > > > > > > GPU
> > > > > > > > > > > > > > > > > > > > > > support in Flink"[1].
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > This FLIP mainly discusses the
> > > following
> > > > > > > > issues:
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > - Enable user to configure how
> many
> > > GPUs
> > > > > > in a
> > > > > > > > > task
> > > > > > > > > > > > > executor
> > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > > > > > > forward such requirements to the
> > > external
> > > > > > > > > resource
> > > > > > > > > > > > > managers
> > > > > > > > > > > > > > > (for
> > > > > > > > > > > > > > > > > > > > > > Kubernetes/Yarn/Mesos setups).
> > > > > > > > > > > > > > > > > > > > > > - Provide information of
> available
> > > GPU
> > > > > > > > resources
> > > > > > > > > to
> > > > > > > > > > > > > > > operators.
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Key changes proposed in the FLIP
> > are
> > > as
> > > > > > > > follows:
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > - Forward GPU resource
> requirements
> > > to
> > > > > > > > > > > Yarn/Kubernetes.
> > > > > > > > > > > > > > > > > > > > > > - Introduce GPUManager as one of
> > the
> > > task
> > > > > > > > manager
> > > > > > > > > > > > > services to
> > > > > > > > > > > > > > > > > > > discover
> > > > > > > > > > > > > > > > > > > > > > and expose GPU resource
> information
> > > to
> > > > > the
> > > > > > > > > context
> > > > > > > > > > of
> > > > > > > > > > > > > > > functions.
> > > > > > > > > > > > > > > > > > > > > > - Introduce the default script
> for
> > > GPU
> > > > > > > > discovery,
> > > > > > > > > > in
> > > > > > > > > > > > > which we
> > > > > > > > > > > > > > > > > > provide
> > > > > > > > > > > > > > > > > > > > > > the privilege mode to help user
> to
> > > > > achieve
> > > > > > > > > > > worker-level
> > > > > > > > > > > > > > > isolation
> > > > > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > > > > > > > standalone mode.
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Please find more details in the
> > FLIP
> > > wiki
> > > > > > > > > document
> > > > > > > > > > > [1].
> > > > > > > > > > > > > > > Looking
> > > > > > > > > > > > > > > > > > > forward
> > > > > > > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > > > > > your feedback.
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > [1]
> > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-108%3A+Add+GPU+support+in+Flink
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > > > > > > > > > Yangze Guo
> > > > > > > > > > > > > > > > > > > > > >
