[GitHub] [mesos] jblache commented on pull request #454: Support for nvidia MIG in Mesos containerizer

via GitHub Tue, 31 Jan 2023 08:31:58 -0800


jblache commented on PR #454:
URL: https://github.com/apache/mesos/pull/454#issuecomment-1410699238


   Pushed a more granular version of the branch, which at least splits the 
discovery and isolation bits. The original postback was squashed, mostly for 
internal review reasons.
   
   On the approach, the NVML wrapper was updated to add support for the calls 
needed to discover/query MIG GPUs, and a couple of those added interfaces have 
a bit of logic included that keeps the calling code cleaner and overall makes 
sense.
   
   GPU enumeration will check for MIG and enumerate MIG devices if present, 
instead of the underlying GPU.
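   
   As a rough illustration of that enumeration rule (not the code in the PR), 
here is a minimal Python sketch. The `Gpu` record and `enumerate_schedulable` 
are hypothetical names; the real code goes through the NVML wrapper. The 
sketch assumes a GPU without MIG devices is listed as a whole GPU. How a 
MIG-enabled GPU with no instances is treated is not stated here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Gpu:
    """Hypothetical stand-in for what device discovery would report."""
    uuid: str
    mig_devices: List[str] = field(default_factory=list)  # MIG device IDs


def enumerate_schedulable(gpus: List[Gpu]) -> List[str]:
    """List MIG devices where present, otherwise the GPU itself."""
    devices: List[str] = []
    for gpu in gpus:
        if gpu.mig_devices:
            # MIG devices replace the underlying GPU in the listing.
            devices.extend(gpu.mig_devices)
        else:
            devices.append(gpu.uuid)
    return devices
```

   With one plain GPU and one GPU split into two MIG devices, the listing 
holds three entries and the split GPU's own identifier does not appear.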
   
   The isolator now has to include all the nvidia-caps device nodes, which is a 
lot of them, but anything more granular seems pretty involved and brittle (IIRC 
one of the reasons I didn't look into it more is that it would likely get in 
the way of a dynamic configuration).
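   
   To give an idea of what "all the nvidia-caps device nodes" means in 
practice, here is a hedged Python sketch that collects every capability node 
in a directory. `list_caps_nodes` is a hypothetical helper, and the default 
path assumes the nodes live under `/dev/nvidia-caps` with names like 
`nvidia-capN`; it is not the isolator code from the PR.

```python
import os
import re
from typing import List

# Assumed naming scheme for capability nodes: nvidia-cap<N>.
_CAP_NODE = re.compile(r"nvidia-cap\d+")


def list_caps_nodes(caps_dir: str = "/dev/nvidia-caps") -> List[str]:
    """Return the paths of all capability nodes found in caps_dir."""
    if not os.path.isdir(caps_dir):
        # No capability nodes on this host.
        return []
    return sorted(
        os.path.join(caps_dir, name)
        for name in os.listdir(caps_dir)
        if _CAP_NODE.fullmatch(name)
    )
```

   Listing everything rather than picking nodes per MIG device keeps the 
isolator independent of how the GPU happens to be partitioned.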
   
   The tricky bit is reconciling device allocations upon restart with running 
jobs. It turned out to be not too bad, but the code grew accordingly there. I 
forget the details, but there's a bit more data that needs to be kept around to 
be able to match everything up, and the matching logic needs to look for MIG 
vs. not MIG, basically.
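   
   The general shape of such a matching step might look like the Python 
sketch below. It is purely illustrative: the `Device` record, the idea of 
keying on a MIG flag plus an identifier, and `reconcile` itself are 
assumptions made for the example, not a description of the data the PR 
actually checkpoints.

```python
from typing import Dict, List, NamedTuple, Set, Tuple


class Device(NamedTuple):
    """Hypothetical record kept per allocated device."""
    is_mig: bool   # MIG device vs. whole GPU
    ident: str     # assumed stable identifier, e.g. a UUID


def reconcile(
    checkpointed: Dict[str, List[Device]],
    discovered: List[Device],
) -> Tuple[Dict[str, List[Device]], Dict[str, List[Device]], Set[Device]]:
    """Match checkpointed allocations against freshly discovered devices.

    Returns (recovered, unmatched, still_available).
    """
    available = set(discovered)
    recovered: Dict[str, List[Device]] = {}
    unmatched: Dict[str, List[Device]] = {}
    for container_id, devices in checkpointed.items():
        for dev in devices:
            # A match requires both the MIG flag and the identifier to agree.
            if dev in available:
                available.remove(dev)
                recovered.setdefault(container_id, []).append(dev)
            else:
                unmatched.setdefault(container_id, []).append(dev)
    return recovered, unmatched, available
```

   Devices left in the third return value are free to hand out; anything in 
the second would need whatever error handling the agent applies on recovery.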
   
   This was a tactical patch for us and, as you can see, it sat on my TODO list 
for a bit before I could get it out to you. I don't currently have the 
bandwidth to engage in a full-blown review process, but I wanted to get the 
code out for anyone who might have similar needs, and as groundwork for more 
evolved capabilities around MIG.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
