----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/75074/#review226759 -----------------------------------------------------------
src/slave/containerizer/mesos/isolators/gpu/isolator.hpp Lines 87 (patched) <https://reviews.apache.org/r/75074/#comment315015> use a ref here btw this is no longer an Owned since we're definitely holding it in multiple places, but it's a shared pointer so this works src/slave/containerizer/mesos/isolators/gpu/isolator.hpp Lines 119 (patched) <https://reviews.apache.org/r/75074/#comment315016> ditto here, use a ref src/slave/containerizer/mesos/isolators/gpu/isolator.cpp Lines 95 (patched) <https://reviews.apache.org/r/75074/#comment315017> use a ref here rather than taking a copy s/gpuToNonWildcard/asDeviceEntry/ src/slave/containerizer/mesos/isolators/gpu/isolator.cpp Lines 107 (patched) <https://reviews.apache.org/r/75074/#comment315018> nit: two newlines between top level definitions src/slave/containerizer/mesos/isolators/gpu/isolator.cpp Lines 177 (patched) <https://reviews.apache.org/r/75074/#comment315019> maybe return Error on error here same as if cgroups::hierarchy fails? src/slave/containerizer/mesos/isolators/gpu/isolator.cpp Line 159 (original), 179 (patched) <https://reviews.apache.org/r/75074/#comment315020> what does hierarchy == none here mean? src/slave/containerizer/mesos/isolators/gpu/isolator.cpp Line 244 (original), 266 (patched) <https://reviews.apache.org/r/75074/#comment315021> do you need a TODO here? src/slave/containerizer/mesos/isolators/gpu/isolator.cpp Line 252 (original), 274 (patched) <https://reviews.apache.org/r/75074/#comment315022> nit: two newlines between top level definitions src/slave/containerizer/mesos/isolators/gpu/isolator.cpp Lines 276 (patched) <https://reviews.apache.org/r/75074/#comment315023> nit: open parens wrap at 4 spaces src/slave/containerizer/mesos/isolators/gpu/isolator.cpp Lines 284-285 (patched) <https://reviews.apache.org/r/75074/#comment315024> assigning to itself..? why not use your entry creation helper here? should we make an overload of is_access_granted for non-whitelist entry? src/slave/containerizer/mesos/isolators/gpu/isolator.cpp Lines 294-298 (patched) <https://reviews.apache.org/r/75074/#comment315025> nit: use the same indentation as the cgroups v1 logic below here src/slave/containerizer/mesos/isolators/gpu/isolator.cpp Lines 285-311 (original), 345-370 (patched) <https://reviews.apache.org/r/75074/#comment315026> this logic looks the same, can you not re-use it? the difference is how to check if access is granted? v1: if gpu in devices.list v2: if gpu in DeviceManager::CgroupDeviceAccess so couldn't you have something like: ``` // if v2: return deviceManager->state(cgroup) .then(defer(self(), [=](const DeviceManager::CgroupDeviceAccess&) { // compute set<Gpu> containerGpus return containerGpus; }) .then(deviceManager->state(cgroup).then(defer( PID<NvidiaGpuIsolatorProcess>(this), &NvidiaGpuIsolatorProcess::_recover, containerId, available, lambda::_1)); // if v1: // compute set<Gpu> containerGpus return _recover(containerId, available, containerGpus)); ``` src/slave/containerizer/mesos/isolators/gpu/isolator.cpp Lines 411 (patched) <https://reviews.apache.org/r/75074/#comment315027> nit: indentation src/slave/containerizer/mesos/isolators/gpu/isolator.cpp Lines 412-440 (patched) <https://reviews.apache.org/r/75074/#comment315028> can't all this code just be a simple loop to convert to the needed format via a helper? ``` vector<DeviceManager::NonWildcardEntry> entries; foreachvalue (const Entry& entry, controlDeviceEntries) { entries.push_back(toNonWildcardEntry(entry)); } ``` src/slave/containerizer/mesos/isolators/gpu/isolator.cpp Line 351 (original), 454 (patched) <https://reviews.apache.org/r/75074/#comment315029> // Cgroups v1: src/slave/containerizer/mesos/isolators/gpu/isolator.cpp Lines 629-635 (patched) <https://reviews.apache.org/r/75074/#comment315031> let's keep the same naming to avoid confusion, so: deallocated deallocated_entries (rather than gpus_for_deallocation) src/slave/containerizer/mesos/isolators/gpu/isolator.cpp Line 525 (original), 644 (patched) <https://reviews.apache.org/r/75074/#comment315030> // Cgroups v1: src/slave/containerizer/mesos/isolators/gpu/isolator.cpp Lines 683 (patched) <https://reviews.apache.org/r/75074/#comment315035> nit: add newline src/slave/containerizer/mesos/isolators/gpu/isolator.cpp Lines 684 (patched) <https://reviews.apache.org/r/75074/#comment315034> ditto naming, maybe just allocation_entries src/slave/containerizer/mesos/isolators/gpu/isolator.cpp Lines 685 (patched) <https://reviews.apache.org/r/75074/#comment315033> nit: remove extra newline src/slave/containerizer/mesos/isolators/gpu/isolator.cpp Line 565 (original), 698 (patched) <https://reviews.apache.org/r/75074/#comment315032> // Cgroups v1: - Benjamin Mahler On Aug. 1, 2024, 7:42 p.m., Jason Zhou wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/75074/ > ----------------------------------------------------------- > > (Updated Aug. 1, 2024, 7:42 p.m.) > > > Review request for mesos and Benjamin Mahler. > > > Repository: mesos > > > Description > ------- > > Currently, the GPU isolator assumes we are only using cgroups v1, and > makes use of the cgroups::devices::allow and deny functions to control > GPU access. > > In Cgroups2, we need to attach ebpf programs for the specific cgroups, > which is done for us in the DeviceManager. Hence, we need to use the > DeviceManager in the GPU isolator depending on whether cgroups v1 or v2 > is currently mounted > > > Diffs > ----- > > src/slave/containerizer/mesos/containerizer.cpp > 9bafb137ba67bfed5cfb24c1a3af9203fde9e821 > src/slave/containerizer/mesos/isolators/gpu/isolator.hpp > e4f221d5f63ed1db044acabbbd427a30a0f69ced > src/slave/containerizer/mesos/isolators/gpu/isolator.cpp > 99119f938e2eb5ca6a8b64d073c87ca5032a00b8 > > > Diff: https://reviews.apache.org/r/75074/diff/3/ > > > Testing > ------- > > Existing GPU isolator tests pass, with the exception of the > DefaultExecutorVerifyDeviceAccess test which is because we currently don't > support nested containers. > > > Thanks, > > Jason Zhou > >
