Re: Review Request 75074: [cgroups2] Support DeviceManager in GPU isolator

Benjamin Mahler Sat, 03 Aug 2024 12:39:25 -0700

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/75074/#review226771
-----------------------------------------------------------



Fix it, then Ship it!





src/slave/containerizer/mesos/containerizer.cpp
Line 526 (original), 526 (patched)
<https://reviews.apache.org/r/75074/#comment315062>

    nit: preserve space between ] (



src/slave/containerizer/mesos/isolators/gpu/isolator.hpp
Lines 40 (patched)
<https://reviews.apache.org/r/75074/#comment315061>

    nit: alphabetize



src/slave/containerizer/mesos/isolators/gpu/isolator.cpp
Lines 109-118 (patched)
<https://reviews.apache.org/r/75074/#comment315063>

    nit: indentation is off here?



src/slave/containerizer/mesos/isolators/gpu/isolator.cpp
Lines 160-162 (original), 199-203 (patched)
<https://reviews.apache.org/r/75074/#comment315064>

    this looks wrong? it can call hierarchy.error() when hierarchy isSome() or 
isNone()? as long as `!(*cgroups2Mounted)` it will call it?
    
    did you want && here?



src/slave/containerizer/mesos/isolators/gpu/isolator.cpp
Lines 308-310 (patched)
<https://reviews.apache.org/r/75074/#comment315065>

    this can be solved btw, but it seems ok to get the entire state anyway



src/slave/containerizer/mesos/isolators/gpu/isolator.cpp
Lines 372 (patched)
<https://reviews.apache.org/r/75074/#comment315067>

    on the surface this looks wrong, we're passing larger and larger vectors in 
on each call since we're assigning back to the future variable
    
    the `__recover` call is going to copy the vector and append an entry? 
there's no reason to pass the vector at all then? instead the caller can do:
    
    ```
    futures.push_back(__recover(containerId, containerGpus));
    ```



src/slave/containerizer/mesos/isolators/gpu/isolator.cpp
Lines 451-459 (patched)
<https://reviews.apache.org/r/75074/#comment315068>

    we can simplify this by changing controlDeviceEntries to a hashmap or 
linkedhashmap and calling .values() to get a vector out



src/slave/containerizer/mesos/isolators/gpu/isolator.cpp
Lines 528-534 (original), 668-674 (patched)
<https://reviews.apache.org/r/75074/#comment315070>

    looks like a helper would shorten this and the same conversion in 
`_update`? I'll leave as is, but would be good to follow up



src/slave/containerizer/mesos/isolators/gpu/isolator.cpp
Line 534 (original), 674 (patched)
<https://reviews.apache.org/r/75074/#comment315069>

    do we still want mknod here? I'll leave it but seems worth removing it in 
v1 too?



src/slave/containerizer/mesos/isolators/gpu/isolator.cpp
Lines 566-572 (original), 716-722 (patched)
<https://reviews.apache.org/r/75074/#comment315071>

    (see note above, this could be more concise with a helper)


- Benjamin Mahler


On Aug. 3, 2024, 1:20 a.m., Jason Zhou wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/75074/
> -----------------------------------------------------------
> 
> (Updated Aug. 3, 2024, 1:20 a.m.)
> 
> 
> Review request for mesos and Benjamin Mahler.
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently, the GPU isolator assumes we are only using cgroups v1, and
> makes use of the cgroups::devices::allow and deny functions to control
> GPU access.
> 
> In Cgroups2, we need to attach ebpf programs for the specific cgroups,
> which is done for us in the DeviceManager. Hence, we need to use the
> DeviceManager in the GPU isolator depending on whether cgroups v1 or v2
> is currently mounted
> 
> 
> Diffs
> -----
> 
>   src/slave/containerizer/mesos/containerizer.cpp 
> 9bafb137ba67bfed5cfb24c1a3af9203fde9e821 
>   src/slave/containerizer/mesos/isolators/gpu/isolator.hpp 
> e4f221d5f63ed1db044acabbbd427a30a0f69ced 
>   src/slave/containerizer/mesos/isolators/gpu/isolator.cpp 
> 99119f938e2eb5ca6a8b64d073c87ca5032a00b8 
> 
> 
> Diff: https://reviews.apache.org/r/75074/diff/8/
> 
> 
> Testing
> -------
> 
> Existing GPU isolator tests pass, with the exception of the 
> DefaultExecutorVerifyDeviceAccess test which is because we currently don't 
> support nested containers.
> 
> 
> Thanks,
> 
> Jason Zhou
> 
>

Re: Review Request 75074: [cgroups2] Support DeviceManager in GPU isolator

Reply via email to