Re: [DISCUSS] FLIP-108: Add GPU support in Flink
On Fri, 13 Mar 2020 15:58:20 + se...@apache.org wrote:

> > Can we somehow keep this out of the TaskManager services
> I fear that we could not. IMO, the GPUManager (or
> ExternalServicesManagers in future) is conceptually one of the task
> manager services, just like MemoryManager before 1.10.
> - It maintains/holds the GPU resource at TM level and all of the
> operators allocate the GPU resources from it. So, it should be
> exclusive to a single TaskExecutor.
> - We could add a collection called ExternalResourceManagers to hold
> all managers of other external resources in the future.
>

Can you help me understand why this needs the addition in
TaskManagerServices or in the RuntimeContext? Are you worried about the
case when multiple Task Executors run in the same JVM? That's not
common, but wouldn't it actually be good in that case to share the GPU
Manager, given that the GPU is shared?

Thanks,
Stephan

---

> What parts need information about this?
> In this FLIP, operators need the information. Thus, we expose GPU
> information to the RuntimeContext/FunctionContext. The slot profile is
> not aware of GPU resources as GPU is a TM-level resource now.
>
> > Can the GPU Manager be a "self contained" thing that simply takes the
> configuration, and then abstracts everything internally?
> Yes, we just pass the path/args of the discovery script and how many
> GPUs per TM to it. It takes the responsibility to get the GPU
> information and expose it to the RuntimeContext/FunctionContext of
> operators. Meanwhile, we'd better not allow operators to directly
> access GPUManager; they should get what they need from the Context. We
> could then decouple the interface/implementation of GPUManager from the
> Public API.
>
> Best,
> Yangze Guo
>
> On Fri, Mar 13, 2020 at 7:26 PM Stephan Ewen wrote:
> >
> > It sounds fine to initially start with GPU specific support and think about
> > generalizing this once we better understand the space.
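
The "self contained" manager described above (it receives the per-TM GPU amount and a discovery step, and operators only see the result through a context) could look roughly like the sketch below. This is only an illustration under assumptions: `GpuContextSketch`, `GpuManagerSketch`, and `getGpuIndexes` are hypothetical names, not Flink interfaces, and the discovery script is stood in for by a plain function so the example does not launch a process.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.IntFunction;

/** Hypothetical read-only view that operators would reach through their context. */
interface GpuContextSketch {
    List<String> getGpuIndexes();
}

/** Hypothetical stand-in for the GPUManager discussed in this thread. */
final class GpuManagerSketch implements GpuContextSketch {
    private final List<String> gpuIndexes;

    /**
     * @param amount    number of GPUs configured per TaskManager
     * @param discovery stand-in for the discovery script: receives the
     *                  requested amount and returns the GPU indexes it found
     */
    GpuManagerSketch(int amount, IntFunction<List<String>> discovery) {
        if (amount < 0) {
            throw new IllegalArgumentException("amount must be >= 0");
        }
        List<String> found = discovery.apply(amount);
        if (found.size() < amount) {
            throw new IllegalStateException(
                    "requested " + amount + " GPUs, discovered " + found.size());
        }
        // Keep only the requested amount and expose it as an immutable list,
        // so operators cannot change the manager's state.
        this.gpuIndexes =
                Collections.unmodifiableList(new ArrayList<>(found.subList(0, amount)));
    }

    @Override
    public List<String> getGpuIndexes() {
        return gpuIndexes;
    }
}
```

With this shape, operator code would depend only on the small context interface, which is one way to keep the manager's implementation separate from the public API.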
> >
> > About the implementation suggested in FLIP-108:
> > - Can we somehow keep this out of the TaskManager services? Anything we
> > have to pull through all layers of the TM makes the TM components yet more
> > complex and harder to maintain.
> >
> > - What parts need information about this?
> > -> do the slot profiles need information about the GPU?
> > -> Can the GPU Manager be a "self contained" thing that simply takes
> > the configuration, and then abstracts everything internally? Operators can
> > access it via "GPUManager.get()" or so?
> >
> > On Wed, Mar 4, 2020 at 4:19 AM Yangze Guo wrote:
> >
> > > Thanks for all the feedback.
> > >
> > > @Becket
> > > Regarding the WebUI and GPUInfo, you're right, I'll add them to the
> > > Public API section.
> > >
> > > @Stephan @Becket
> > > Regarding the general extended resource mechanism, I second Xintong's
> > > suggestion.
> > > - It's better to leverage ResourceProfile and ResourceSpec after we
> > > support fine-grained GPU scheduling. As a first step proposal, I
> > > prefer to not include it in the scope of this FLIP.
> > > - Regarding the "Extended Resource Manager", if I understand
> > > correctly, it is just a code refactoring at the moment; we could extract the
> > > open/close/allocateExtendResources of GPUManager to that interface. If
> > > that is the case, +1 to do it during implementation.
> > >
> > > @Xingbo
> > > As Xintong said, we looked into how Spark supports a general "Custom
> > > Resource Scheduling" before and decided to introduce a common resource
> > > configuration schema
> > > (taskmanager.resource.{resourceName}.amount/discovery-script)
> > > to make it more extensible. I think "resource" is a proper level
> > > to contain all the configs of extended resources.
> > >
> > > Best,
> > > Yangze Guo
> > >
> > > On Wed, Mar 4, 2020 at 10:48 AM Xingbo Huang wrote:
> > > >
> > > > Thanks a lot for the FLIP, Yangze.
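
To make the common configuration schema mentioned above concrete, here is a minimal sketch of how keys of the form taskmanager.resource.{resourceName}.amount could be collected generically, so that a new resource type needs no new option class. It is an assumption-laden illustration: `ExternalResourceConfigSketch` and `parseAmounts` are hypothetical names, and the configuration is modelled as a plain `Map<String, String>` rather than Flink's own configuration class.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper; not part of Flink. */
final class ExternalResourceConfigSketch {
    private static final String PREFIX = "taskmanager.resource.";
    private static final String AMOUNT_SUFFIX = ".amount";

    /**
     * Collects resourceName -> amount for every key shaped like
     * taskmanager.resource.{resourceName}.amount; other keys are ignored.
     */
    static Map<String, Long> parseAmounts(Map<String, String> config) {
        Map<String, Long> amounts = new TreeMap<>();
        for (Map.Entry<String, String> entry : config.entrySet()) {
            String key = entry.getKey();
            boolean matches = key.startsWith(PREFIX)
                    && key.endsWith(AMOUNT_SUFFIX)
                    // Require a non-empty {resourceName} between prefix and suffix.
                    && key.length() > PREFIX.length() + AMOUNT_SUFFIX.length();
            if (matches) {
                String resourceName = key.substring(
                        PREFIX.length(), key.length() - AMOUNT_SUFFIX.length());
                amounts.put(resourceName, Long.parseLong(entry.getValue().trim()));
            }
        }
        return amounts;
    }
}
```

Under this reading, the shared "resource" level is what lets one loop discover every configured resource, whereas a per-resource prefix would need one lookup per known resource name.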
> > > >
> > > > There is no doubt that GPU resource management support will greatly
> > > > facilitate the development of AI-related applications by PyFlink users.
> > > >
> > > > I have only one comment about this wiki:
> > > >
> > > > Regarding the names of several GPU configurations, I think it is better to
> > > > delete the resource field, which makes them consistent with the names of
> > > > other resource-related configurations in TaskManagerOption.
> > > >
> > > > e.g. taskmanager.resource.gpu.discovery-script.path ->
> > > > taskmanager.gpu.discovery-script.path
> > > >
> > > > Best,
> > > >
> > > > Xingbo
> > > >
> > > > On Wed, Mar 4, 2020 at 10:39 AM Xintong Song wrote:
> > > >
> > > > > @Stephan, @Becket,
> > > > >
> > > > > Actually, Yangze, Yang and I also had an offline discussion about making
> > > > > the "GPU Support" as some general "Extended Resource Support". We believe
> > > > > supporting extended resources in a general mechanism is definitely a good
> > > > > and extensible way. The reason we propose this FL