Thank you all for your participation! I'll start voting for this FLIP. Best, Yangze Guo
On Wed, Apr 1, 2020 at 4:55 PM Stephan Ewen <se...@apache.org> wrote:
>
> Sounds good!
>
> On Tue, Mar 31, 2020 at 4:32 AM Yangze Guo <karma...@gmail.com> wrote:
>
> > Hi everyone,
> > I've updated the FLIP accordingly. The key change is replacing the two
> > resource allocation interfaces with config options.
> >
> > If there are no further comments, I would like to start a voting
> > thread by tomorrow.
> >
> > Best,
> > Yangze Guo
> >
> > On Mon, Mar 30, 2020 at 9:15 PM Till Rohrmann <trohrm...@apache.org> wrote:
> > >
> > > If there is no need for the ExternalResourceDriver on the RM side, then it
> > > is always a good idea to keep it simple and not introduce it. One can
> > > always change things once one realizes that there is a need for it.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Mon, Mar 30, 2020 at 12:00 PM Yangze Guo <karma...@gmail.com> wrote:
> > >
> > > > Hi @Till, @Xintong
> > > >
> > > > I think even without the credential concerns, replacing the interfaces
> > > > with configuration options is a good idea from my side.
> > > > - Currently, I don't see any external resource that is not compatible
> > > >   with this mechanism.
> > > > - It reduces the burden on users of implementing a plugin themselves.
> > > > WDYT?
> > > >
> > > > Best,
> > > > Yangze Guo
> > > >
> > > > On Mon, Mar 30, 2020 at 5:44 PM Xintong Song <tonysong...@gmail.com> wrote:
> > > > >
> > > > > I also agree that the pluggable ExternalResourceDriver should be loaded by
> > > > > the cluster class loader. Although the plugin might be implemented by users,
> > > > > external resources (as part of task executor resources) should be cluster
> > > > > configurations, unlike job-level user code such as UDFs, because the task
> > > > > executors belong to the cluster rather than to jobs.
> > > > >
> > > > > IIUC, the concern Stephan raised is about the potential credential problem
> > > > > when executing user code on the RM with the cluster class loader. The concern
> > > > > makes sense to me, and I think what Yangze suggested should be a good
> > > > > approach to preventing such credential problems. The only reason we
> > > > > tried to execute user code (i.e. getKubernetes/YarnExternalResource) on the RM
> > > > > was that we need to set these key-value pairs on pod/container requests.
> > > > > By replacing the interfaces getKubernetes/YarnExternalResource with the
> > > > > configuration options
> > > > > 'external-resource.{resourceName}.yarn/kubernetes.key/amount',
> > > > > we can still fulfill that purpose without the credential risks.
> > > > >
> > > > > Thank you~
> > > > >
> > > > > Xintong Song
> > > > >
> > > > > On Mon, Mar 30, 2020 at 5:17 PM Till Rohrmann <trohrm...@apache.org> wrote:
> > > > >
> > > > > > At the moment the RM does not have a user code class loader and I agree
> > > > > > with Stephan that it should stay like this. This, however, does not mean
> > > > > > that we cannot support pluggable components in the RM. As long as the
> > > > > > plugins are on the system's class path, it should be fine for the RM to
> > > > > > load them. For example, we could add external resources via Flink's plugin
> > > > > > mechanism or something similar.
> > > > > >
> > > > > > A very simple implementation of such an ExternalResourceDriver could be a
> > > > > > class which simply returns what is written in the flink-conf.yaml under a
> > > > > > given key.
> > > > > >
> > > > > > Cheers,
> > > > > > Till
> > > > > >
> > > > > > On Mon, Mar 30, 2020 at 5:39 AM Yangze Guo <karma...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi, Stephan,
> > > > > > >
> > > > > > > I see your concern and I totally agree with you.
> > > > > > >
> > > > > > > The interface on the RM side is now `Map<String key, String/Long value>
> > > > > > > getYarn/KubernetesExternalResource()`. The only valid information the RM
> > > > > > > gets from it is the configuration key of that external resource in
> > > > > > > Yarn/K8s. The "String/Long value" would be the same as
> > > > > > > external-resource.{resourceName}.amount.
> > > > > > > So, I think it makes sense to replace these two interfaces with two
> > > > > > > configs, i.e. external-resource.{resourceName}.yarn/kubernetes.key. We
> > > > > > > may lose some extensibility, but AFAIK it could work with common
> > > > > > > external resources like GPU and FPGA. WDYT?
> > > > > > >
> > > > > > > Best,
> > > > > > > Yangze Guo
> > > > > > >
> > > > > > > On Fri, Mar 27, 2020 at 7:59 PM Stephan Ewen <se...@apache.org> wrote:
> > > > > > > >
> > > > > > > > Maybe one final comment: It is probably not an issue, but let's try and
> > > > > > > > keep user code (via user code classloader) out of the ResourceManager, if
> > > > > > > > possible.
> > > > > > > >
> > > > > > > > As background:
> > > > > > > >
> > > > > > > > There were thoughts in the past to support setups where the RM must run
> > > > > > > > with "superuser" credentials, but we cannot run JM/TM with these
> > > > > > > > credentials, as the user code might access them otherwise.
> > > > > > > > This is actually possible today: you can run the RM in a different JVM or
> > > > > > > > in a different container, and give it more credentials than JMs / TMs. But
> > > > > > > > for this to be feasible, we cannot allow any user-defined code to be in the
> > > > > > > > JVM, because that instantaneously breaks the isolation of credentials.
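A minimal sketch of the configuration-based approach Yangze describes above, in which the RM only reads plain option values and never runs user code. The option names follow the pattern proposed in this thread (external-resource.{resourceName}.amount and external-resource.{resourceName}.yarn/kubernetes.key) and may differ from what Flink finally ships. The class ExternalResourceConfigSketch, its method, and the use of a plain Map in place of Flink's Configuration are assumptions made purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not part of Flink: derives request entries from config only. */
final class ExternalResourceConfigSketch {

    private ExternalResourceConfigSketch() {}

    /**
     * Builds the key/amount entry the RM would attach to a pod or container request.
     *
     * @param conf         flat key/value configuration (stand-in for flink-conf.yaml)
     * @param resourceName logical resource name, e.g. "gpu"
     * @param deployment   "yarn" or "kubernetes"
     * @return one entry if both the deployment key and the amount are configured, else empty
     */
    static Map<String, Long> toRequestEntries(
            Map<String, String> conf, String resourceName, String deployment) {
        String prefix = "external-resource." + resourceName + ".";
        String deploymentKey = conf.get(prefix + deployment + ".key");
        String amount = conf.get(prefix + "amount");

        Map<String, Long> entries = new HashMap<>();
        if (deploymentKey != null && amount != null) {
            // The value sent to Yarn/K8s is simply the configured amount.
            entries.put(deploymentKey, Long.parseLong(amount.trim()));
        }
        return entries;
    }
}
```

With external-resource.gpu.amount set to 2 and external-resource.gpu.kubernetes.key set to nvidia.com/gpu, the helper yields a single entry mapping nvidia.com/gpu to 2 for Kubernetes and nothing for Yarn, since no Yarn key is configured.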
> > > > > > > >
> > > > > > > > On Fri, Mar 27, 2020 at 4:01 AM Yangze Guo <karma...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Thanks for the feedback, @Till and @Xintong.
> > > > > > > > >
> > > > > > > > > Regarding separating the interface, I'm also +1 with it.
> > > > > > > > >
> > > > > > > > > Regarding the resource allocation interface, true, it's dangerous to
> > > > > > > > > give much access to user code. Changing the return type to Map<String
> > > > > > > > > key, String/Long value> makes sense to me. AFAIK, it is compatible
> > > > > > > > > with all the first-party supported resources for Yarn/Kubernetes. It
> > > > > > > > > could also free us from the potential dependency issue.
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Yangze Guo
> > > > > > > > >
> > > > > > > > > On Fri, Mar 27, 2020 at 10:42 AM Xintong Song <tonysong...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Thanks for updating the FLIP, Yangze.
> > > > > > > > > >
> > > > > > > > > > I agree with Till that we probably want to separate the K8s/Yarn decorator
> > > > > > > > > > calls. Users can still configure one driver class, and we can use
> > > > > > > > > > `instanceof` to check whether the driver implemented K8s/Yarn specific
> > > > > > > > > > interfaces.
> > > > > > > > > >
> > > > > > > > > > Moreover, I'm not sure about exposing the entire `ContainerRequest` / `Pod`
> > > > > > > > > > (`AbstractKubernetesStepDecorator` directly manipulates `Pod`) to user
> > > > > > > > > > code. It gives more access to user code than needed for defining an external
> > > > > > > > > > resource, which might cause problems.
> > > > > > > > > > Instead, I would suggest having an
> > > > > > > > > > interface like `Map<String key, String value>
> > > > > > > > > > getYarn/KubernetesExternalResource()` and assembling the entries into
> > > > > > > > > > `ContainerRequest` / `Pod` in Yarn/KubernetesResourceManager.
> > > > > > > > > >
> > > > > > > > > > Thank you~
> > > > > > > > > >
> > > > > > > > > > Xintong Song
> > > > > > > > > >
> > > > > > > > > > On Fri, Mar 27, 2020 at 1:10 AM Till Rohrmann <trohrm...@apache.org> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi everyone,
> > > > > > > > > > >
> > > > > > > > > > > I'm a bit late to the party. I think the current proposal looks good.
> > > > > > > > > > >
> > > > > > > > > > > Concerning the ExternalResourceDriver interface defined in the FLIP [1], I
> > > > > > > > > > > would suggest not including the decorator calls for Kubernetes and Yarn in
> > > > > > > > > > > the base interface. Instead I would suggest segregating the deployment
> > > > > > > > > > > specific decorator calls into separate interfaces. That way an
> > > > > > > > > > > ExternalResourceDriver does not have to support all deployments from the
> > > > > > > > > > > very beginning. Moreover, some resources might not be supported by a
> > > > > > > > > > > specific deployment target and the natural way to express this would be to
> > > > > > > > > > > not implement the respective deployment specific interface.
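The interface segregation suggested here, combined with the `instanceof` check Xintong mentions, could be sketched roughly as follows. All type and method names are hypothetical stand-ins rather than the FLIP's final API, and the value "nvidia.com/gpu" is only an illustrative Kubernetes resource key.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Deployment-agnostic base interface (hypothetical shape). */
interface ExternalResourceDriver {
    String getResourceName();
}

/** Implemented only by drivers that support Kubernetes. */
interface KubernetesExternalResource {
    Map<String, String> getKubernetesExternalResource();
}

/** Implemented only by drivers that support Yarn. */
interface YarnExternalResource {
    Map<String, String> getYarnExternalResource();
}

/** Example driver that supports Kubernetes but deliberately not Yarn. */
final class GpuDriverSketch implements ExternalResourceDriver, KubernetesExternalResource {
    @Override
    public String getResourceName() {
        return "gpu";
    }

    @Override
    public Map<String, String> getKubernetesExternalResource() {
        return Collections.singletonMap("nvidia.com/gpu", "1");
    }
}

/** What a deployment-specific ResourceManager might do with the configured drivers. */
final class DriverInspectionSketch {
    private DriverInspectionSketch() {}

    static Map<String, String> kubernetesEntries(List<ExternalResourceDriver> drivers) {
        Map<String, String> entries = new LinkedHashMap<>();
        for (ExternalResourceDriver driver : drivers) {
            // Drivers without Kubernetes support are skipped rather than rejected.
            if (driver instanceof KubernetesExternalResource) {
                entries.putAll(((KubernetesExternalResource) driver).getKubernetesExternalResource());
            }
        }
        return entries;
    }

    static Map<String, String> yarnEntries(List<ExternalResourceDriver> drivers) {
        Map<String, String> entries = new LinkedHashMap<>();
        for (ExternalResourceDriver driver : drivers) {
            if (driver instanceof YarnExternalResource) {
                entries.putAll(((YarnExternalResource) driver).getYarnExternalResource());
            }
        }
        return entries;
    }
}
```

Because the Yarn-specific interface is separate, a driver like GpuDriverSketch never references Yarn types, which is the property that keeps Hadoop off the classpath for non-Yarn deployments.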
> > > > > > > > > > >
> > > > > > > > > > > Moreover, having void
> > > > > > > > > > > addExternalResourceToRequest(AMRMClient.ContainerRequest containerRequest)
> > > > > > > > > > > in the ExternalResourceDriver interface would require Hadoop on Flink's
> > > > > > > > > > > classpath whenever the external resource driver is being used.
> > > > > > > > > > >
> > > > > > > > > > > [1]
> > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-108%3A+Add+GPU+support+in+Flink
> > > > > > > > > > >
> > > > > > > > > > > Cheers,
> > > > > > > > > > > Till
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Mar 26, 2020 at 12:45 PM Stephan Ewen <se...@apache.org> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Nice, thanks a lot!
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Mar 26, 2020 at 10:21 AM Yangze Guo <karma...@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for the suggestion, @Stephan, @Becket and @Xintong.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I've updated the FLIP accordingly. I do not add a
> > > > > > > > > > > > > ResourceInfoProvider. Instead, I introduce the ExternalResourceDriver,
> > > > > > > > > > > > > which takes the responsibility of all relevant operations on both RM
> > > > > > > > > > > > > and TM sides.
> > > > > > > > > > > > > After a rethink about decoupling the management of external resources
> > > > > > > > > > > > > from TaskExecutor, I think we could do the same thing on the
> > > > > > > > > > > > > ResourceManager side.
> > > > > > > > > > > > > We do not need to add a specific allocation
> > > > > > > > > > > > > logic to the ResourceManager each time we add a specific external
> > > > > > > > > > > > > resource.
> > > > > > > > > > > > > - For Yarn, we need the ExternalResourceDriver to edit the
> > > > > > > > > > > > >   containerRequest.
> > > > > > > > > > > > > - For Kubernetes, ExternalResourceDriver could provide a decorator for
> > > > > > > > > > > > >   the TM pod.
> > > > > > > > > > > > >
> > > > > > > > > > > > > In this way, just like MetricReporter, we allow users to define their
> > > > > > > > > > > > > custom ExternalResourceDriver. It is more extensible and fits the
> > > > > > > > > > > > > separation of concerns. For more details, please take a look at [1].
> > > > > > > > > > > > >
> > > > > > > > > > > > > [1]
> > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-108%3A+Add+GPU+support+in+Flink
> > > > > > > > > > > > >
> > > > > > > > > > > > > Best,
> > > > > > > > > > > > > Yangze Guo
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Mar 25, 2020 at 7:32 PM Stephan Ewen <se...@apache.org> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > This sounds good to go ahead from my side.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > I like the approach that Becket suggested - in that > > > > case > > > > > > the > > > > > > > core > > > > > > > > > > > > > > abstraction that everyone would need to understand > > > > would be > > > > > > > > > "external > > > > > > > > > > > > > > resource allocation" and the > > "ResourceInfoProvider", > > > > and > > > > > > the > > > > > > > GPU > > > > > > > > > > > > specific > > > > > > > > > > > > > > code would be a specific implementation only known > > to > > > > that > > > > > > > > > component > > > > > > > > > > > > that > > > > > > > > > > > > > > allocates the external resource. That fits the > > > > separation > > > > > > of > > > > > > > > > concerns > > > > > > > > > > > > > well. > > > > > > > > > > > > > > > > > > > > > > > > > > > > I also understand that it should not be > > > > over-engineered in > > > > > > > the > > > > > > > > > first > > > > > > > > > > > > > > version, so some simplification makes sense, and > > then > > > > > > > gradually > > > > > > > > > > > expand > > > > > > > > > > > > > from > > > > > > > > > > > > > > there. > > > > > > > > > > > > > > > > > > > > > > > > > > > > So +1 to go ahead with what was suggested above > > > > (Xintong / > > > > > > > > > Becket) > > > > > > > > > > > from > > > > > > > > > > > > > my > > > > > > > > > > > > > > side. > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Mon, Mar 23, 2020 at 6:55 AM Xintong Song < > > > > > > > > > tonysong...@gmail.com> > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks for the comments, Stephan & Becket. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > @Stephan > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I see your concern, and I completely agree with > > you > > > > that > > > > > > we > > > > > > > > > should > > > > > > > > > > > > > first > > > > > > > > > > > > > > > think about the "library" / "plugin" / > > "extension" > > > > style > > > > > > if > > > > > > > > > > > possible. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > If GPUs are sliced and assigned during > > scheduling, > > > > there > > > > > > > may be > > > > > > > > > > > > reason, > > > > > > > > > > > > > > > > although it looks that it would belong to the > > slot > > > > > > then. > > > > > > > Is > > > > > > > > > that > > > > > > > > > > > > > what we > > > > > > > > > > > > > > > > are doing here? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > In the current proposal, we do not have the GPUs > > > > sliced > > > > > > and > > > > > > > > > > > assigned > > > > > > > > > > > > to > > > > > > > > > > > > > > > slots, because it could be problematic without > > > > dynamic > > > > > > slot > > > > > > > > > > > > allocation. > > > > > > > > > > > > > > > E.g., the number of GPUs might not be evenly > > > > divisible by > > > > > > > the > > > > > > > > > > > number > > > > > > > > > > > > of > > > > > > > > > > > > > > > slots. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I think it makes sense to eventually have the > > GPUs > > > > > > > assigned to > > > > > > > > > > > slots. > > > > > > > > > > > > > Even > > > > > > > > > > > > > > > then, we might still need a TM level GPUManager > > (or > > > > > > > > > > > ResourceProvider > > > > > > > > > > > > > like > > > > > > > > > > > > > > > Becket suggested). 
For memory, in each slot we > > can > > > > simply > > > > > > > > > request > > > > > > > > > > > the > > > > > > > > > > > > > > > amount of memory, leaving it to JVM / OS to > > decide > > > > which > > > > > > > memory > > > > > > > > > > > > > (address) > > > > > > > > > > > > > > > should be assigned. For GPU, and potentially > > other > > > > > > > resources > > > > > > > > > like > > > > > > > > > > > > > FPGA, we > > > > > > > > > > > > > > > need to explicitly specify which GPU (index) > > should > > > > be > > > > > > > used. > > > > > > > > > > > > > Therefore, we > > > > > > > > > > > > > > > need some component at the TM level to coordinate > > > > which > > > > > > > slot > > > > > > > > > uses > > > > > > > > > > > > which > > > > > > > > > > > > > > > GPU. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > IMO, unless we say Flink will not support > > slot-level > > > > GPU > > > > > > > > > slicing at > > > > > > > > > > > > > least > > > > > > > > > > > > > > > in the foreseeable future, I don't see a good > > way to > > > > > > avoid > > > > > > > > > touching > > > > > > > > > > > > > the TM > > > > > > > > > > > > > > > core. To that end, I think Becket's suggestion > > > > points to > > > > > > a > > > > > > > good > > > > > > > > > > > > > direction, > > > > > > > > > > > > > > > that supports more features (GPU, FPGA, etc.) > > with > > > > less > > > > > > > > > coupling to > > > > > > > > > > > > > the TM > > > > > > > > > > > > > > > core (only needs to understand the general > > > > interfaces). > > > > > > The > > > > > > > > > > > detailed > > > > > > > > > > > > > > > implementation for specific resource types can > > even > > > > be > > > > > > > > > encapsulated > > > > > > > > > > > > as > > > > > > > > > > > > > a > > > > > > > > > > > > > > > library. 
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > @Becket
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks for sharing your thoughts on the final state. Regardless of the
> > > > > > > > > > > > > > > details of how the interfaces should look, I think this is a really good
> > > > > > > > > > > > > > > abstraction for supporting general resource types.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I'd like to further clarify that the following three things are all that
> > > > > > > > > > > > > > > the "Flink core" needs to understand.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >    - The *amount* of resource, for scheduling. Actually, we already have
> > > > > > > > > > > > > > >      the Resource class in ResourceProfile and ResourceSpec for extended
> > > > > > > > > > > > > > >      resources. It's just not really used.
> > > > > > > > > > > > > > >    - The *info* that Flink provides to the operators / user code.
> > > > > > > > > > > > > > >    - The *provider*, which generates the info based on the amount.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The "core" does not need to understand the specific implementation details
> > > > > > > > > > > > > > > of the above three. They can even be implemented in a 3rd-party library,
> > > > > > > > > > > > > > > similar to how we allow users to define their custom MetricReporter.
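The amount / info / provider split listed above can be illustrated with a small sketch. It is deliberately simpler than the ResourceInfoProvider interface Becket proposes in the quoted message further down: every name here is hypothetical, the "info" is reduced to a GPU index, and the provider simply hands out the next free indices.

```java
import java.util.ArrayList;
import java.util.List;

/** The *info* handed to operators; here only a device index (hypothetical type). */
final class GpuInfoSketch {
    private final int index;

    GpuInfoSketch(int index) {
        this.index = index;
    }

    int getIndex() {
        return index;
    }
}

/** The *provider*: the only contract the core would need to know (hypothetical). */
interface ResourceInfoProviderSketch<INFO> {
    /** Turns a requested *amount* into that many concrete resource infos. */
    List<INFO> provide(long amount);
}

/** GPU-specific implementation that could live outside the core, e.g. in a library. */
final class GpuInfoProviderSketch implements ResourceInfoProviderSketch<GpuInfoSketch> {
    private final int totalGpus;
    private int nextFree = 0;

    GpuInfoProviderSketch(int totalGpus) {
        this.totalGpus = totalGpus;
    }

    @Override
    public synchronized List<GpuInfoSketch> provide(long amount) {
        if (amount < 0 || amount > totalGpus - nextFree) {
            throw new IllegalStateException(
                    "Requested " + amount + " GPUs but only " + (totalGpus - nextFree) + " are free");
        }
        List<GpuInfoSketch> assigned = new ArrayList<>();
        for (long i = 0; i < amount; i++) {
            // Each request receives indices that no earlier request was given.
            assigned.add(new GpuInfoSketch(nextFree++));
        }
        return assigned;
    }
}
```

The core would only ever call provide(amount); which indices exist and how they are discovered stays inside the GPU-specific class.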
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thank you~ > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Xintong Song > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Mon, Mar 23, 2020 at 8:45 AM Becket Qin < > > > > > > > > > becket....@gmail.com> > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks for the comment, Stephan. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - If everything becomes a "core feature", it > > will > > > > > > make > > > > > > > the > > > > > > > > > > > > project > > > > > > > > > > > > > hard > > > > > > > > > > > > > > > > > to develop in the future. Thinking "library" > > / > > > > > > > "plugin" / > > > > > > > > > > > > > "extension" > > > > > > > > > > > > > > > > style > > > > > > > > > > > > > > > > > where possible helps. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Completely agree. It is much more important to > > > > design a > > > > > > > > > mechanism > > > > > > > > > > > > > than > > > > > > > > > > > > > > > > focusing on a specific case. Here is what I am > > > > thinking > > > > > > > to > > > > > > > > > fully > > > > > > > > > > > > > support > > > > > > > > > > > > > > > > custom resource management: > > > > > > > > > > > > > > > > 1. On the JM / RM side, use ResourceProfile and > > > > > > > ResourceSpec > > > > > > > > > to > > > > > > > > > > > > > define > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > > resource and the amount required. They will be > > > > used to > > > > > > > find > > > > > > > > > > > > suitable > > > > > > > > > > > > > TMs > > > > > > > > > > > > > > > > slots to run the tasks. At this point, the > > > > resources > > > > > > are > > > > > > > only > > > > > > > > > > > > > measured by > > > > > > > > > > > > > > > > amount, i.e. they do not have individual ID. 
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 2. On the TM side, have something like *"ResourceInfoProvider"* to identify
> > > > > > > > > > > > > > > > and provide the detailed information of the individual resource, e.g. GPU
> > > > > > > > > > > > > > > > ID. It is important because the operator may have to explicitly interact
> > > > > > > > > > > > > > > > with the physical resource it uses. The ResourceInfoProvider might look
> > > > > > > > > > > > > > > > like something below.
> > > > > > > > > > > > > > > > interface ResourceInfoProvider<INFO> {
> > > > > > > > > > > > > > > >     Map<AbstractID, INFO> retrieveResourceInfo(OperatorId opId,
> > > > > > > > > > > > > > > >         ResourceProfile resourceProfile);
> > > > > > > > > > > > > > > > }
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > - There could be several "*ResourceInfoProvider*" configured on the TM to
> > > > > > > > > > > > > > > >   retrieve the information for different resources.
> > > > > > > > > > > > > > > > - The TM will be responsible for assigning those individual resources to each
> > > > > > > > > > > > > > > >   operator according to their requested amount.
> > > > > > > > > > > > > > > > - The operators will be able to get the ResourceInfo from their
> > > > > > > > > > > > > > > >   RuntimeContext.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > If we agree this is a reasonable final state.
> > > > > > > > > > > > > > > > We can adapt the current FLIP
> > > > > > > > > > > > > > > > to it. In fact it does not sound like a big change to me. All the proposed
> > > > > > > > > > > > > > > > configuration can be as is, it is just that Flink itself won't care about
> > > > > > > > > > > > > > > > them, instead a GPUInfoProvider implementing the ResourceInfoProvider will
> > > > > > > > > > > > > > > > use them.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Jiangjie (Becket) Qin
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Mon, Mar 23, 2020 at 1:47 AM Stephan Ewen <se...@apache.org> wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hi all!
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > The main point I wanted to throw into the discussion is the following:
> > > > > > > > > > > > > > > > >   - With more and more use cases, more and more tools go into Flink
> > > > > > > > > > > > > > > > >   - If everything becomes a "core feature", it will make the project hard
> > > > > > > > > > > > > > > > >     to develop in the future. Thinking "library" / "plugin" / "extension"
> > > > > > > > > > > > > > > > >     style where possible helps.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - A good thought experiment is always: How > > many > > > > > > > future > > > > > > > > > > > > developers > > > > > > > > > > > > > > > have > > > > > > > > > > > > > > > > to > > > > > > > > > > > > > > > > > interact with this code (and possibly > > understand > > > > it > > > > > > > > > partially), > > > > > > > > > > > > > even if > > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > > > features they touch have nothing to do with > > GPU > > > > > > > support. If > > > > > > > > > > > many > > > > > > > > > > > > > > > > > contributors to unrelated features will have > > to > > > > touch > > > > > > > it > > > > > > > > > and > > > > > > > > > > > > > understand > > > > > > > > > > > > > > > > it, > > > > > > > > > > > > > > > > > then let's think if there is a different > > > > solution. > > > > > > > Maybe > > > > > > > > > there > > > > > > > > > > > is > > > > > > > > > > > > > not, > > > > > > > > > > > > > > > > but > > > > > > > > > > > > > > > > > then we should be sure why. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - That led me to raising this issue: If > > the GPU > > > > > > > manager > > > > > > > > > > > > becomes a > > > > > > > > > > > > > > > core > > > > > > > > > > > > > > > > > service in the TaskManager, Environment, > > > > > > > RuntimeContext, > > > > > > > > > etc. > > > > > > > > > > > > then > > > > > > > > > > > > > > > > everyone > > > > > > > > > > > > > > > > > developing TM and streaming tasks need to > > > > understand > > > > > > > the > > > > > > > > > GPU > > > > > > > > > > > > > manager. > > > > > > > > > > > > > > > > That > > > > > > > > > > > > > > > > > seems oddly specific, is my impression. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Access to configuration seems not the right > > > > reason to > > > > > > > do > > > > > > > > > that. 
> > > > > > > > > > > We > > > > > > > > > > > > > > > should > > > > > > > > > > > > > > > > > expose the Flink configuration from the > > > > > > RuntimeContext > > > > > > > > > anyways. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > If GPUs are sliced and assigned during > > > > scheduling, > > > > > > > there > > > > > > > > > may be > > > > > > > > > > > > > reason, > > > > > > > > > > > > > > > > > although it looks that it would belong to the > > > > slot > > > > > > > then. Is > > > > > > > > > > > that > > > > > > > > > > > > > what > > > > > > > > > > > > > > > we > > > > > > > > > > > > > > > > > are doing here? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > > > > > > > > > Stephan > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Fri, Mar 20, 2020 at 2:58 AM Xintong Song > > < > > > > > > > > > > > > > tonysong...@gmail.com> > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks for the feedback, Becket. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > IMO, eventually an operator should only see > > > > info of > > > > > > > GPUs > > > > > > > > > that > > > > > > > > > > > > are > > > > > > > > > > > > > > > > > dedicated > > > > > > > > > > > > > > > > > > for it, instead of all GPUs on the > > > > > > machine/container > > > > > > > in > > > > > > > > > the > > > > > > > > > > > > > current > > > > > > > > > > > > > > > > > design. > > > > > > > > > > > > > > > > > > It does not make sense to let the user who > > > > writes a > > > > > > > UDF > > > > > > > > > to > > > > > > > > > > > > worry > > > > > > > > > > > > > > > about > > > > > > > > > > > > > > > > > > coordination among multiple operators > > running > > > > on > > > > > > the > > > > > > > same > > > > > > > > > > > > > machine. 
> > > > > > > > > > > > > > > And > > > > > > > > > > > > > > > > if > > > > > > > > > > > > > > > > > > we want to limit the GPU info an operator > > > > sees, we > > > > > > > > > should not > > > > > > > > > > > > > let the > > > > > > > > > > > > > > > > > > operator to instantiate GPUManager, which > > > > means we > > > > > > > have > > > > > > > > > to > > > > > > > > > > > > expose > > > > > > > > > > > > > > > > > something > > > > > > > > > > > > > > > > > > through runtime context, either GPU info or > > > > some > > > > > > > kind of > > > > > > > > > > > > limited > > > > > > > > > > > > > > > access > > > > > > > > > > > > > > > > > to > > > > > > > > > > > > > > > > > > the GPUManager. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thank you~ > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Xintong Song > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Mar 19, 2020 at 5:48 PM Becket Qin > > < > > > > > > > > > > > > becket....@gmail.com > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > It probably make sense for us to first > > agree > > > > on > > > > > > the > > > > > > > > > final > > > > > > > > > > > > > state. > > > > > > > > > > > > > > > More > > > > > > > > > > > > > > > > > > > specifically, will the resource info be > > > > exposed > > > > > > > through > > > > > > > > > > > > runtime > > > > > > > > > > > > > > > > context > > > > > > > > > > > > > > > > > > > eventually? 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > If that is the final state and we have a > > > > seamless > > > > > > > > > migration > > > > > > > > > > > > > story > > > > > > > > > > > > > > > > from > > > > > > > > > > > > > > > > > > this > > > > > > > > > > > > > > > > > > > FLIP to that final state, Personally I > > think > > > > it > > > > > > is > > > > > > > OK > > > > > > > > > to > > > > > > > > > > > > > expose the > > > > > > > > > > > > > > > > GPU > > > > > > > > > > > > > > > > > > > info in the runtime context. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Jiangjie (Becket) Qin > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Mon, Mar 16, 2020 at 11:21 AM Xintong > > > > Song < > > > > > > > > > > > > > > > tonysong...@gmail.com > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > @Yangze, > > > > > > > > > > > > > > > > > > > > I think what Stephan means (@Stephan, > > > > please > > > > > > > correct > > > > > > > > > me > > > > > > > > > > > if > > > > > > > > > > > > > I'm > > > > > > > > > > > > > > > > wrong) > > > > > > > > > > > > > > > > > > is > > > > > > > > > > > > > > > > > > > > that, we might not need to hold and > > > > maintain > > > > > > the > > > > > > > > > > > GPUManager > > > > > > > > > > > > > as a > > > > > > > > > > > > > > > > > > service > > > > > > > > > > > > > > > > > > > in > > > > > > > > > > > > > > > > > > > > TaskManagerServices or RuntimeContext. 
> An alternative is to create / retrieve the GPUManager only in the operators that need it, e.g., with a static method `GPUManager.get()`.
>
> @Stephan,
> I agree with you on excluding GPUManager from TaskManagerServices.
>
> - For the first step, where we provide unified TM-level GPU information to all operators, it should be fine to have operators access / lazily initialize GPUManager by themselves.
> - In the future, we might have some more fine-grained GPU management, where we need to maintain GPUManager as a service and put GPU info in slot profiles. But at least for now it's not necessary to introduce such complexity.
>
> However, I have some concerns about excluding GPUManager from RuntimeContext and letting operators access it directly.
>
> - The configuration needed for creating the GPUManager is not always available to operators.
> - If later we want to have fine-grained control over GPUs (e.g., operators in each slot can only see GPUs reserved for that slot), the approach cannot be easily extended.
>
> I would suggest wrapping the GPUManager behind RuntimeContext and only exposing the GPUInfo to users. For now, we can declare a method `getGPUInfo()` in RuntimeContext, with a default definition that calls `GPUManager.get()` to get the lazily-created GPUManager.
> If later we want to create / retrieve GPUManager in a different way, we can simply change how `getGPUInfo` is implemented, without needing to change any public interfaces.
>
> Thank you~
>
> Xintong Song
>
> On Sat, Mar 14, 2020 at 10:09 AM Yangze Guo <karma...@gmail.com> wrote:
>
> @Stephan
> Do you mean Minicluster? Yes, it makes sense to share the GPU Manager in such a scenario.
> If that's what you worry about, I'm +1 for holding GPUManager (ExternalResourceManagers) in TaskExecutor instead of TaskManagerServices.
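As a rough illustration of the `getGPUInfo()` suggestion above, here is a minimal Java sketch. All types (`GPUInfo`, `GPUManager`, `RuntimeContextSketch`) are hypothetical stand-ins rather than Flink's actual classes, and the device indexes are placeholder data; a real manager would presumably obtain them from the discovery script.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical names throughout: these types only illustrate the shape
// discussed in the thread and are not Flink's actual API.

/** Read-only view handed to user functions. */
final class GPUInfo {
    private final List<String> indexes;

    GPUInfo(List<String> indexes) {
        this.indexes = Collections.unmodifiableList(indexes);
    }

    List<String> getIndexes() {
        return indexes;
    }
}

/** Process-wide manager, created lazily on the first call to get(). */
final class GPUManager {
    private final GPUInfo info;

    private GPUManager() {
        // Assumption: placeholder data instead of running a discovery script.
        this.info = new GPUInfo(Arrays.asList("0", "1"));
    }

    // Initialization-on-demand holder: the JVM loads Holder (and builds the
    // instance) only when get() is first called, and does so thread-safely.
    private static final class Holder {
        static final GPUManager INSTANCE = new GPUManager();
    }

    static GPUManager get() {
        return Holder.INSTANCE;
    }

    GPUInfo getGPUInfo() {
        return info;
    }
}

/** Stand-in for RuntimeContext: exposes GPUInfo only, never the manager. */
interface RuntimeContextSketch {
    default GPUInfo getGPUInfo() {
        return GPUManager.get().getGPUInfo();
    }
}
```

With this shape, switching to a slot-aware lookup later would only touch the default method body, which seems to be the point of the suggestion.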
>
> Regarding the RuntimeContext/FunctionContext, it just holds the GPU info instead of the GPU Manager. AFAIK, it's the only place we could pass GPU info to the RichFunction/UserDefinedFunction.
>
> Best,
> Yangze Guo
>
> On Sat, Mar 14, 2020 at 4:06 AM Isaac Godfried <is...@paddlesoft.net> wrote:
>
> ---- On Fri, 13 Mar 2020 15:58:20 +0000 se...@apache.org wrote ----
>
> > > Can we somehow keep this out of the TaskManager services
> > I fear that we could not.
> > IMO, the GPUManager (or ExternalServicesManagers in future) is conceptually one of the task manager services, just like MemoryManager before 1.10.
> > - It maintains/holds the GPU resource at TM level and all of the operators allocate the GPU resources from it. So, it should be exclusive to a single TaskExecutor.
> > - We could add a collection called ExternalResourceManagers to hold all managers of other external resources in the future.
>
> Can you help me understand why this needs the addition in TaskManagerServices or in the RuntimeContext?
> Are you worried about the case when multiple Task Executors run in the same JVM? That's not common, but wouldn't it actually be good in that case to share the GPU Manager, given that the GPU is shared?
>
> Thanks,
> Stephan
>
> ---------------------------
>
> > > What parts need information about this?
> > In this FLIP, operators need the information. Thus, we expose GPU information to the RuntimeContext/FunctionContext. The slot profile is not aware of GPU resources as GPU is a TM-level resource now.
> >
> > > Can the GPU Manager be a "self contained" thing that simply takes the configuration, and then abstracts everything internally?
> > Yes, we just pass the path/args of the discovery script and how many GPUs per TM to it. It takes the responsibility to get the GPU information and expose it to the RuntimeContext/FunctionContext of operators. Meanwhile, we'd better not allow operators to directly access GPUManager; they should get what they want from the Context. We could then decouple the interface/implementation of GPUManager and the Public API.
> >
> > Best,
> > Yangze Guo
> >
> > On Fri, Mar 13, 2020 at 7:26 PM Stephan Ewen <se...@apache.org> wrote:
> >
> > > It sounds fine to initially start with GPU specific support and think about generalizing this once we better understand the space.
> > >
> > > About the implementation suggested in FLIP-108:
> > > - Can we somehow keep this out of the TaskManager services? Anything we have to pull through all layers of the TM makes the TM components yet more complex and harder to maintain.
> > >
> > > - What parts need information about this?
> > > -> do the slot profiles need information about the GPU?
> > > -> Can the GPU Manager be a "self contained" thing that simply takes the configuration, and then abstracts everything internally? Operators can access it via "GPUManager.get()" or so?
> > >
> > > On Wed, Mar 4, 2020 at 4:19 AM Yangze Guo <karma...@gmail.com> wrote:
> > >
> > > Thanks for all the feedback.
> > >
> > > @Becket
> > > Regarding the WebUI and GPUInfo, you're right, I'll add them to the Public API section.
> > >
> > > @Stephan @Becket
> > > Regarding the general extended resource mechanism, I second Xintong's suggestion.
> > > - It's better to leverage ResourceProfile and ResourceSpec once we support fine-grained GPU scheduling. As a first-step proposal, I prefer not to include it in the scope of this FLIP.
> > > - Regarding the "Extended Resource Manager", if I understand correctly, it is just a code refactoring atm; we could extract the open/close/allocateExtendResources of GPUManager to that interface. If that is the case, +1 to do it during implementation.
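To make the refactoring mentioned in the last bullet concrete, here is a small Java sketch of pulling `open`/`close`/`allocateExtendResources` into a shared interface. Only those three method names come from the discussion; the interface name, the signatures, and the `gpu-0`/`gpu-1` identifiers are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface: name and signatures are assumptions, only the
// method names open/close/allocateExtendResources appear in the thread.
interface ExtendedResourceManager extends AutoCloseable {
    void open() throws Exception;

    List<String> allocateExtendResources(int amount);

    @Override
    void close();
}

/** GPU-specific implementation that hands out each device at most once. */
final class GPUManagerSketch implements ExtendedResourceManager {
    private final List<String> available = new ArrayList<>();
    private boolean opened;

    @Override
    public void open() {
        // Placeholder for discovery: pretend two devices were found.
        available.add("gpu-0");
        available.add("gpu-1");
        opened = true;
    }

    @Override
    public List<String> allocateExtendResources(int amount) {
        if (!opened) {
            throw new IllegalStateException("manager is not open");
        }
        if (amount < 0 || amount > available.size()) {
            throw new IllegalArgumentException("cannot allocate " + amount + " resources");
        }
        // Copy the granted devices, then remove them from the free pool.
        List<String> granted = new ArrayList<>(available.subList(0, amount));
        available.subList(0, amount).clear();
        return granted;
    }

    @Override
    public void close() {
        available.clear();
        opened = false;
    }
}
```

Another resource type would then only need its own implementation of the same interface, which is why this reads as a component-internal change.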
> > >
> > > @Xingbo
> > > As Xintong said, we looked into how Spark supports a general "Custom Resource Scheduling" before and decided to introduce a common resource configuration schema (`taskmanager.resource.{resourceName}.amount/discovery-script`) to make it more extensible. I think the "resource" is a proper level to contain all the configs of extended resources.
> > >
> > > Best,
> > > Yangze Guo
> > >
> > > On Wed, Mar 4, 2020 at 10:48 AM Xingbo Huang <hxbks...@gmail.com> wrote:
> > >
> > > Thanks a lot for the FLIP, Yangze.
> > >
> > > There is no doubt that GPU resource management support will greatly facilitate the development of AI-related applications by PyFlink users.
> > >
> > > I have only one comment about this wiki:
> > >
> > > Regarding the names of several GPU configurations, I think it is better to delete the resource field to make them consistent with the names of other resource-related configurations in TaskManagerOption.
> > >
> > > e.g.
> > > taskmanager.resource.gpu.discovery-script.path -> taskmanager.gpu.discovery-script.path
> > >
> > > Best,
> > >
> > > Xingbo
> > >
> > > On Wed, Mar 4, 2020 at 10:39 AM Xintong Song <tonysong...@gmail.com> wrote:
> > >
> > > @Stephan, @Becket,
> > >
> > > Actually, Yangze, Yang and I also had an offline discussion about making the "GPU Support" into some general "Extended Resource Support". We believe supporting extended resources in a general mechanism is definitely a good and extensible way.
> > > The reason we propose this FLIP with its scope narrowed down to GPU alone is mainly the concern about the extra effort and review capacity needed for a general mechanism.
> > >
> > > To come up with a good design for a general extended resource management mechanism, we would need to investigate more on how people use different kinds of resources in practice. For GPU, we learnt such knowledge from the experts, Becket and his team members.
> > > But for FPGA, or other potential extended resources, we don't have such convenient information sources, so the investigation would require more effort, which I tend to think is not necessary atm.
> > >
> > > On the other hand, we also looked into how Spark supports a general "Custom Resource Scheduling".
> > > Assuming we want to have a similar general extended resource mechanism in the future, we believe that the current GPU support design can be easily extended, in an incremental way without too much rework.
> > >
> > > - The most important part is probably user interfaces.
> > > Spark offers configuration options to define the amount, discovery script and vendor (on k8s) on a per-resource-type basis [1], which is very similar to what we proposed in this FLIP. I think it's not necessary to expose config options in the general way atm, since we do not have support for other resource types now.
> > > If later we decide to have per-resource-type config options, we can keep backwards compatibility with the currently proposed options through simple key mapping.
> > > - For the GPU Manager, if later needed we can change it to an "Extended Resource Manager" (or whatever it is called). That should be a pure component-internal refactoring.
> > > - For ResourceProfile and ResourceSpec, there are already fields for general extended resources.
We can of course leverage them when supporting fine-grained GPU scheduling. That is also not in the scope of this first-step proposal, and would require FLIP-56 to be finished first.

To sum up, I agree with Becket on having a separate FLIP for the general extended resource mechanism, and keeping it in mind when discussing and implementing the current one.
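The "simple key mapping" idea can be sketched as follows. This is only an illustration: both the GPU-specific key names and the generic per-resource key names below are hypothetical placeholders, not options defined by the FLIP, and the sketch is written in Python for brevity even though Flink's configuration code is Java.

```python
from typing import Dict

# Hypothetical key names, used only to illustrate the mapping idea.
LEGACY_TO_GENERIC = {
    "taskmanager.resource.gpu.amount": "external-resource.gpu.amount",
    "taskmanager.resource.gpu.discovery-script.path":
        "external-resource.gpu.discovery-script.path",
}


def resolve_options(conf: Dict[str, str]) -> Dict[str, str]:
    """Rewrite legacy GPU-specific keys to their generic counterparts.

    A generic key that is set explicitly takes precedence over the
    legacy key it replaces; unrelated keys are passed through unchanged.
    """
    # Keep every key that is not a legacy GPU key.
    resolved = {k: v for k, v in conf.items() if k not in LEGACY_TO_GENERIC}
    # Fill in generic keys from legacy ones only where nothing is set yet.
    for legacy, generic in LEGACY_TO_GENERIC.items():
        if legacy in conf and generic not in resolved:
            resolved[generic] = conf[legacy]
    return resolved


old_style = {"taskmanager.resource.gpu.amount": "2", "parallelism.default": "4"}
print(resolve_options(old_style))
```

With a mapping like this, existing configurations would keep working after a switch to per-resource-type options, which is the compatibility property argued for above.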

Thank you~

Xintong Song

[1] https://spark.apache.org/docs/3.0.0-preview/configuration.html#custom-resource-scheduling-and-configuration-overview

On Wed, Mar 4, 2020 at 9:18 AM Becket Qin <becket....@gmail.com> wrote:

That's a good point, Stephan. It makes total sense to generalize the resource management to support custom resources.
Having that allows users to add new resources by themselves. The general resource management may involve two different aspects:

1. The custom resource type definition. It is supported by the extended resources in ResourceProfile and ResourceSpec. This will likely cover the majority of the cases.

2. The custom resource allocation logic, i.e. how to assign the resources to different tasks, operators, and so on.
This may require two levels / steps:
a. Subtask level - make sure the subtasks are put into suitable slots. It is done by the global RM and is not customizable right now.
b. Operator level - map the exact resource to the operators in the TM, e.g. GPU 1 for operator A, GPU 2 for operator B. This step is needed assuming the global RM does not distinguish individual resources of the same type.
It is true for memory, but not for GPU.

The GPU manager is designed to do 2.b here. So it should discover the physical GPU information and bind/match it to each operator. Making this general will fill in the missing piece to support custom resource type definition. But I'd avoid calling it an "External Resource Manager" to avoid confusion with the RM; maybe something like "Operator Resource Assigner" would be more accurate.
So for each resource type users can have an optional "Operator Resource Assigner" in the TM. For memory, users don't need this, but for other extended resources, users may need that.

Personally I think a pluggable "Operator Resource Assigner" is achievable in this FLIP.
But I am also OK with having that in a separate FLIP because the interface between the "Operator Resource Assigner" and the operator may take a while to settle down if we want to make it generic. But I think our implementation should take this future work into consideration so that we don't need to break backwards compatibility once we have that.
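To make step 2.b more concrete, here is a minimal sketch of what an "Operator Resource Assigner" might do inside a TM. The class name, method names and device IDs are all hypothetical and not part of the FLIP; the sketch is in Python for brevity, whereas a real Flink component would be a Java interface.

```python
from typing import Dict, List


class OperatorResourceAssigner:
    """Hypothetical TM-side helper that binds individual devices of one
    resource type (e.g. GPU indexes) to operators."""

    def __init__(self, resource_name: str, device_ids: List[str]) -> None:
        self.resource_name = resource_name
        self._free: List[str] = list(device_ids)
        self._assigned: Dict[str, List[str]] = {}

    def assign(self, operator_id: str, amount: int) -> List[str]:
        """Reserve `amount` free devices for the operator and return them."""
        if amount > len(self._free):
            raise RuntimeError(
                f"not enough free {self.resource_name} devices for {operator_id}"
            )
        devices, self._free = self._free[:amount], self._free[amount:]
        self._assigned.setdefault(operator_id, []).extend(devices)
        return devices

    def release(self, operator_id: str) -> None:
        """Return all devices held by the operator to the free pool."""
        self._free.extend(self._assigned.pop(operator_id, []))


assigner = OperatorResourceAssigner("gpu", ["0", "1"])
print(assigner.assign("operator-A", 1))  # first free GPU goes to operator A
print(assigner.assign("operator-B", 1))  # next free GPU goes to operator B
```

Memory would not need such a component because its units are interchangeable; the sketch only matters for resources like GPUs, where each device has its own identity.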

Thanks,

Jiangjie (Becket) Qin

On Wed, Mar 4, 2020 at 12:27 AM Stephan Ewen <se...@apache.org> wrote:

Thank you for writing this FLIP.

I cannot really give much input into the mechanics of GPU-aware scheduling and GPU allocation, as I have no experience with that.

One thought I had when reading the proposal is whether it makes sense to look at the "GPU Manager" as an "External Resource Manager", with GPU being one such resource. The way I understand the ResourceProfile and ResourceSpec, that is how it is done there. It has the advantage that it looks more extensible.
Maybe there is a GPU Resource, a specialized NVIDIA GPU Resource, an FPGA Resource, an Alibaba TPU Resource, etc.

Best,
Stephan

On Tue, Mar 3, 2020 at 7:57 AM Becket Qin <becket....@gmail.com> wrote:

Thanks for the FLIP Yangze. GPU resource management support is a must-have for machine learning use cases.
Actually it is one of the most-asked questions from users who are interested in using Flink for ML.

Some quick comments / questions on the wiki.
1. The WebUI / REST API should probably also be mentioned in the public interface section.
2. Is the data structure that holds GPU info also a public API?

Thanks,

Jiangjie (Becket) Qin

On Tue, Mar 3, 2020 at 10:15 AM Xintong Song <tonysong...@gmail.com> wrote:

Thanks for drafting the FLIP and kicking off the discussion, Yangze.

Big +1 for this feature. Supporting the use of GPUs in Flink is significant, especially for the ML scenarios.
I've reviewed the FLIP wiki doc and it looks good to me. I think it's a very good first step for Flink's GPU support.

Thank you~

Xintong Song

On Mon, Mar 2, 2020 at 12:06 PM Yangze Guo <karma...@gmail.com> wrote:

Hi everyone,

We would like to start a discussion thread on "FLIP-108:
Add GPU support in Flink"[1].

This FLIP mainly discusses the following issues:

- Enable users to configure how many GPUs a task executor has and forward such requirements to the external resource managers (for Kubernetes/Yarn/Mesos setups).
- Provide information about available GPU resources to operators.

Key changes proposed in the FLIP are as follows:

- Forward GPU resource requirements to Yarn/Kubernetes.
- Introduce GPUManager as one of the task manager services to discover and expose GPU resource information to the context of functions.
- Introduce the default script for GPU discovery, in which we provide the privilege mode to help users achieve worker-level isolation in standalone mode.

Please find more details in the FLIP wiki document [1]. Looking forward to your feedback.

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-108%3A+Add+GPU+support+in+Flink

Best,
Yangze Guo
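
For readers wondering what a GPU discovery step could involve, the following is a rough sketch and not the script proposed in the FLIP, whose contents are not shown in this thread. It assumes NVIDIA GPUs and relies on the `nvidia-smi --query-gpu=index --format=csv,noheader` query; the function names are made up for illustration.

```python
import subprocess
from typing import List


def parse_gpu_indexes(nvidia_smi_output: str, amount: int) -> List[str]:
    """Pick the first `amount` GPU indexes from nvidia-smi's CSV output
    (one index per line). Raises if fewer GPUs are visible than requested."""
    indexes = [line.strip() for line in nvidia_smi_output.splitlines() if line.strip()]
    if len(indexes) < amount:
        raise RuntimeError(f"requested {amount} GPUs but only {len(indexes)} found")
    return indexes[:amount]


def discover_gpus(amount: int) -> List[str]:
    """Query the local driver for GPU indexes (requires nvidia-smi on PATH)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_indexes(result.stdout, amount)


# Parsing can be exercised without a GPU by feeding sample output:
print(parse_gpu_indexes("0\n1\n2\n", 2))
```

The worker-level isolation mentioned for standalone mode would need extra coordination between task executors on the same host, which this sketch does not attempt.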