Thank you all for your participation! I'll start voting for this FLIP.
Best,
Yangze Guo
On Wed, Apr 1, 2020 at 4:55 PM Stephan Ewen wrote:
>
> Sounds good!
>
> On Tue, Mar 31, 2020 at 4:32 AM Yangze Guo wrote:
>
> > Hi everyone,
> > I've updated the FLIP accordingly. The key change is replacing
Sounds good!
On Tue, Mar 31, 2020 at 4:32 AM Yangze Guo wrote:
> Hi everyone,
> I've updated the FLIP accordingly. The key change is replacing two
> > resource allocation interfaces with config options.
>
> If there are no further comments, I would like to start a voting
> thread by tomorrow.
>
> Be
Hi everyone,
I've updated the FLIP accordingly. The key change is replacing two
resource allocation interfaces with config options.
If there are no further comments, I would like to start a voting
thread by tomorrow.
Best,
Yangze Guo
On Mon, Mar 30, 2020 at 9:15 PM Till Rohrmann wrote:
>
> If the
If there is no need for the ExternalResourceDriver on the RM side, then it
is always a good idea to keep it simple and not introduce it. One can
always change things once one realizes that there is a need for it.
Cheers,
Till
On Mon, Mar 30, 2020 at 12:00 PM Yangze Guo wrote:
> Hi @Till, @Xin
Hi @Till, @Xintong
I think even without the credential concerns, replacing the interfaces
with configuration options is a good idea from my side.
- Currently, I don't see any external resource that is not compatible
with this mechanism
- It reduces the burden on users of implementing a plugin themselves
I also agree that the pluggable ExternalResourceDriver should be loaded by
the cluster class loader. Although the plugin might be implemented by users,
external resources (as part of task executor resources) should be cluster
configurations, unlike job-level user code such as UDFs, because the task
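
For illustration, options along these lines could replace the two resource allocation interfaces discussed above. This is only a rough sketch using Flink's ConfigOptions builder; the class name and option keys are placeholders, not necessarily the final FLIP-108 names.

import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.ConfigOptions;

// Illustrative option definitions; the keys and class name are invented
// here and are not necessarily the ones chosen in FLIP-108.
public class ExternalResourceOptions {

    // How many units of the external resource (e.g. GPUs) each task
    // executor should be started with.
    public static final ConfigOption<Long> GPU_AMOUNT =
            ConfigOptions.key("external-resource.gpu.amount")
                    .longType()
                    .defaultValue(0L);

    // The resource key under which the request is forwarded to Yarn.
    public static final ConfigOption<String> GPU_YARN_CONFIG_KEY =
            ConfigOptions.key("external-resource.gpu.yarn.config-key")
                    .stringType()
                    .noDefaultValue();

    // The resource key under which the request is forwarded to Kubernetes.
    public static final ConfigOption<String> GPU_KUBERNETES_CONFIG_KEY =
            ConfigOptions.key("external-resource.gpu.kubernetes.config-key")
                    .stringType()
                    .noDefaultValue();
}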
At the moment the RM does not have a user code class loader and I agree
with Stephan that it should stay like this. This, however, does not mean
that we cannot support pluggable components in the RM. As long as the
plugins are on the system's class path, it should be fine for the RM to
load them. F
Hi, Stephan,
I see your concern and I totally agree with you.
The interface on the RM side is now `Map
getYarn/KubernetesExternalResource()`. The only valid information the RM
gets from it is the configuration key of that external resource in
Yarn/K8s. The "String/Long value" would be the same as the
exte
Maybe one final comment: It is probably not an issue, but let's try and
keep user code (via user code classloader) out of the ResourceManager, if
possible.
As background:
There were thoughts in the past to support setups where the RM must run
with "superuser" credentials, but we cannot run JM/TM
Thanks for the feedback, @Till and @Xintong.
Regarding separating the interface, I'm also +1 with it.
Regarding the resource allocation interface, true, it's dangerous to
give too much access to user code. Changing the return type to Map makes sense to me. AFAIK, it is compatible
with all the first-
Thanks for updating the FLIP, Yangze.
I agree with Till that we probably want to separate the K8s/Yarn decorator
calls. Users can still configure one driver class, and we can use
`instanceof` to check whether the driver implements K8s/Yarn-specific
interfaces.
Moreover, I'm not sure about exposi
Hi everyone,
I'm a bit late to the party. I think the current proposal looks good.
Concerning the ExternalResourceDriver interface defined in the FLIP [1], I
would suggest to not include the decorator calls for Kubernetes and Yarn in
the base interface. Instead I would suggest to segregate the de
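
As a rough sketch of the segregation and `instanceof` check discussed above: all interface and method names below are illustrative, and the `Map<String, Long>` value type is an assumption based on the `Map getYarn/KubernetesExternalResource()` signature and the "String/Long value" mentioned earlier in the thread.

import java.util.Map;

// Base driver: no Yarn/Kubernetes specifics in it (names illustrative).
interface ExternalResourceDriver {
    // common TM-side operations (e.g. discovering the assigned resources)
}

// Optional, deployment-specific extensions.
interface YarnExternalResourceDriver extends ExternalResourceDriver {
    Map<String, Long> getYarnExternalResource();
}

interface KubernetesExternalResourceDriver extends ExternalResourceDriver {
    Map<String, Long> getKubernetesExternalResource();
}

// On the Yarn side, only drivers that opted into the Yarn extension are
// asked to decorate the container request.
class YarnResourceDecorator {
    void decorate(ExternalResourceDriver driver /*, container request ... */) {
        if (driver instanceof YarnExternalResourceDriver) {
            Map<String, Long> resources =
                    ((YarnExternalResourceDriver) driver).getYarnExternalResource();
            // add `resources` to the Yarn container request here
        }
    }
}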
Nice, thanks a lot!
On Thu, Mar 26, 2020 at 10:21 AM Yangze Guo wrote:
> Thanks for the suggestion, @Stephan, @Becket and @Xintong.
>
> I've updated the FLIP accordingly. I do not add a
> ResourceInfoProvider. Instead, I introduce the ExternalResourceDriver,
> which takes responsibility for a
Thanks for the suggestion, @Stephan, @Becket and @Xintong.
I've updated the FLIP accordingly. I do not add a
ResourceInfoProvider. Instead, I introduce the ExternalResourceDriver,
which takes responsibility for all relevant operations on both the RM
and TM sides.
After a rethink about decoupling th
This sounds good to go ahead from my side.
I like the approach that Becket suggested - in that case the core
abstraction that everyone would need to understand would be "external
resource allocation" and the "ResourceInfoProvider", and the GPU specific
code would be a specific implementation only
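
A minimal sketch of what that shape could look like, assuming placeholder names for `ResourceInfoProvider` and `GPUInfo` (these are not the FLIP's actual definitions):

import java.util.Collections;
import java.util.Set;

// Generic abstraction: something that can describe the external resources
// assigned to this task executor (names are placeholders).
interface ResourceInfoProvider<T> {
    Set<T> getResourceInfos();
}

// Placeholder info type for one GPU device.
class GPUInfo {
    final int deviceIndex;

    GPUInfo(int deviceIndex) {
        this.deviceIndex = deviceIndex;
    }
}

// GPU support becomes just one implementation of the generic abstraction.
class GPUInfoProvider implements ResourceInfoProvider<GPUInfo> {
    @Override
    public Set<GPUInfo> getResourceInfos() {
        // A real implementation would run a discovery script or query the
        // driver plugin; empty here to keep the sketch minimal.
        return Collections.emptySet();
    }
}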
Thanks for the comments, Stephan & Becket.
@Stephan
I see your concern, and I completely agree with you that we should first
think about the "library" / "plugin" / "extension" style if possible.
> If GPUs are sliced and assigned during scheduling, there may be reason,
> although it looks that it w
Thanks for the comment, Stephan.
> - If everything becomes a "core feature", it will make the project hard
> to develop in the future. Thinking "library" / "plugin" / "extension" style
> where possible helps.
Completely agree. It is much more important to design a mechanism than
focusing on a sp
Hi all!
The main point I wanted to throw into the discussion is the following:
- With more and more use cases, more and more tools go into Flink
- If everything becomes a "core feature", it will make the project hard
to develop in the future. Thinking "library" / "plugin" / "extension" style
w
Thanks for the feedback, Becket.
IMO, eventually an operator should only see info about the GPUs that are dedicated
to it, instead of all GPUs on the machine/container as in the current design.
It does not make sense to make the user who writes a UDF worry about
coordination among multiple operators run
It probably makes sense for us to first agree on the final state. More
specifically, will the resource info be exposed through runtime context
eventually?
If that is the final state and we have a seamless migration story from this
FLIP to that final state, I personally think it is OK to expose the
@Yangze,
I think what Stephan means (@Stephan, please correct me if I'm wrong) is
that we might not need to hold and maintain the GPUManager as a service in
TaskManagerServices or RuntimeContext. An alternative is to create /
retrieve the GPUManager only in the operators that need it, e.g., with a
@Stephan
Do you mean the MiniCluster? Yes, it makes sense to share the GPUManager
in such a scenario.
If that's what you worry about, I'm +1 for holding the
GPUManager (ExternalResourceManagers) in the TaskExecutor instead of
TaskManagerServices.
Regarding the RuntimeContext/FunctionContext, it just holds the G
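
To illustrate the operator-level retrieval idea from the two messages above, here is a sketch from the user's point of view; the `getGPUInfos()` accessor on the RuntimeContext is hypothetical (not an existing API), and `GPUInfo` is the placeholder type from the sketch further up the thread.

import java.util.Set;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

// Sketch only: an operator that asks for GPU info when (and only when) it
// needs it, instead of relying on a TaskManager-wide service.
public class GpuAwareMapper extends RichMapFunction<Long, Long> {

    // GPUInfo: placeholder type (see the ResourceInfoProvider sketch above).
    private transient Set<GPUInfo> gpus;

    @Override
    public void open(Configuration parameters) {
        // Hypothetical accessor: retrieve the GPU info assigned to this subtask.
        gpus = getRuntimeContext().getGPUInfos();
    }

    @Override
    public Long map(Long value) {
        // Pick a device from `gpus` and run the GPU work here.
        return value;
    }
}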
On Fri, 13 Mar 2020 15:58:20 + se...@apache.org wrote
> > Can we somehow keep this out of the TaskManager services
> I fear that we could not. IMO, the GPUManager(or
> ExternalServicesManagers in future) is conceptually one of the task
> manager services, just like MemoryManage
> > Can we somehow keep this out of the TaskManager services
> I fear that we could not. IMO, the GPUManager(or
> ExternalServicesManagers in future) is conceptually one of the task
> manager services, just like MemoryManager before 1.10.
> - It maintains/holds the GPU resource at TM level and all
Thanks for the feedback, Stephan.
> Can we somehow keep this out of the TaskManager services
I fear that we could not. IMO, the GPUManager(or
ExternalServicesManagers in future) is conceptually one of the task
manager services, just like MemoryManager before 1.10.
- It maintains/holds the GPU reso
It sounds fine to initially start with GPU specific support and think about
generalizing this once we better understand the space.
About the implementation suggested in FLIP-108:
- Can we somehow keep this out of the TaskManager services? Anything we
have to pull through all layers of the TM mak
Thanks for all the feedback.
@Becket
Regarding the WebUI and GPUInfo, you're right, I'll add them to the
Public API section.
@Stephan @Becket
Regarding the general extended resource mechanism, I second Xintong's
suggestion.
- It's better to leverage ResourceProfile and ResourceSpec after we
sup
Thanks a lot for the FLIP, Yangze.
There is no doubt that GPU resource management support will greatly
facilitate the development of AI-related applications by PyFlink users.
I have only one comment about this wiki:
Regarding the names of several GPU configurations, I think it is better to
delet
@Stephan, @Becket,
Actually, Yangze, Yang and I also had an offline discussion about turning
the "GPU Support" into a more general "Extended Resource Support". We believe
supporting extended resources through a general mechanism is definitely a good
and extensible approach. The reason we propose this FLIP narrow
That's a good point, Stephan. It makes total sense to generalize the
resource management to support custom resources. Having that allows users
to add new resources by themselves. The general resource management may
involve two different aspects:
1. The custom resource type definition. It is suppor
Thank you for writing this FLIP.
I cannot really give much input into the mechanics of GPU-aware scheduling
and GPU allocation, as I have no experience with that.
One thought I had when reading the proposal is whether it makes sense to look at
the "GPU Manager" as an "External Resource Manager", and G
Thanks for the FLIP, Yangze. GPU resource management support is a must-have
for machine learning use cases. Actually, it is one of the most frequently asked
questions from users who are interested in using Flink for ML.
Some quick comments / questions on the wiki.
1. The WebUI / REST API should probably a
Thanks for drafting the FLIP and kicking off the discussion, Yangze.
Big +1 for this feature. Supporting the use of GPUs in Flink is significant,
especially for ML scenarios.
I've reviewed the FLIP wiki doc and it looks good to me. I think it's a
very good first step for Flink's GPU support.
Th
Hi everyone,
We would like to start a discussion thread on "FLIP-108: Add GPU
support in Flink"[1].
This FLIP mainly discusses the following issues:
- Enable users to configure how many GPUs a task executor has and
forward such requirements to the external resource managers (for
Kubernetes/Yarn/Me