jblache commented on PR #454: URL: https://github.com/apache/mesos/pull/454#issuecomment-1410699238
Pushed a more granular version of the branch, which at least splits the discovery vs. isolation bits. The original postback was squashed, mostly for internal review reasons.

On the approach, the NVML wrapper was updated to add support for the calls needed to discover/query MIG GPUs. A couple of those added interfaces include a bit of logic that keeps the calling code cleaner and overall makes sense. GPU enumeration will check for MIG and enumerate MIG devices if present, instead of the underlying GPU.

The isolator now has to include all the nvidia-caps device nodes, which is a lot of them, but anything more granular seems pretty involved and brittle (IIRC one of the reasons I didn't look into it more is that it would likely get in the way of a dynamic configuration).

The tricky bit is reconciling device allocations upon restart with running jobs. It turned out to be not too bad, but the code grew accordingly there. I forget the details, but there's a bit more data that needs to be kept around to be able to match everything up, and the matching logic basically needs to distinguish MIG from non-MIG devices.

This was a tactical patch for us and, as you can see, it sat on my TODO list for a bit before I could get it out to you. I don't have bandwidth currently to engage in a full-blown review process, but wanted to get the code out for anyone who might have similar needs, and as groundwork for more evolved capabilities around MIG.
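
For anyone skimming, the enumeration rule described above (advertise MIG devices in place of the underlying GPU when they are present) can be sketched roughly as follows. This is only an illustration in Python, not the code in the branch, which is C++ on top of the NVML wrapper. `PhysicalGpu`, `MigInstance`, and `enumerate_schedulable_devices` are hypothetical names, and the fallback to the physical GPU when MIG is enabled but no instances exist is an assumption rather than something taken from the patch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class MigInstance:
    # Hypothetical stand-in for one MIG device carved out of a physical GPU.
    uuid: str


@dataclass
class PhysicalGpu:
    # Hypothetical stand-in for what an NVML wrapper might report per GPU.
    uuid: str
    mig_enabled: bool = False
    mig_instances: List[MigInstance] = field(default_factory=list)


def enumerate_schedulable_devices(gpus: List[PhysicalGpu]) -> List[str]:
    """Return the device UUIDs to advertise, preferring MIG devices."""
    devices: List[str] = []
    for gpu in gpus:
        if gpu.mig_enabled and gpu.mig_instances:
            # MIG devices replace the underlying GPU in the listing.
            devices.extend(instance.uuid for instance in gpu.mig_instances)
        else:
            # Assumption: with no MIG devices present, list the GPU itself.
            devices.append(gpu.uuid)
    return devices
```

With one plain GPU and one GPU split into two MIG instances, the function returns the plain GPU's UUID followed by the two MIG UUIDs, and the split GPU's own UUID is not listed.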
