MIG is interesting. It's something we partly devised to deal with sharing interactive workloads on the same GPU. Time-slicing is very difficult on a GPU because the state is so large. You wind up having to drain the workload, which means you're holding on to resources until the longest job ends. Utilization becomes very poor, and worse still if your application runs a long (read: non-terminating) compute task. So you want some way to partition the GPU along spatial lines instead, giving each process a subset of the hardware resources. For some workloads this scheduling is easy: you give each workload enough front-end resources, memory, and threads to do its job. But I think we're still a long way off from the GPU being able to make these allocations dynamically and efficiently. There are some really interesting non-uniform co-scheduling constraints that I've not seen addressed in the literature.
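[Editor's note: a minimal sketch of the spatial-partitioning idea described above. This is not NVIDIA's MIG API; the class, job names, and resource totals are all hypothetical, chosen to resemble an A100-class part.]

```python
# Illustrative sketch only: partition a GPU's compute units and memory
# spatially among concurrent jobs instead of time-slicing. Each job owns
# its slice for its whole lifetime, so no other job's state has to be
# drained -- at the cost of static fragmentation. All numbers hypothetical.
from dataclasses import dataclass

@dataclass
class Partition:
    job: str
    sms: int      # streaming multiprocessors granted to this job
    mem_gb: int   # dedicated memory slice

class SpatialGPU:
    def __init__(self, total_sms=108, total_mem_gb=80):
        self.free_sms = total_sms
        self.free_mem_gb = total_mem_gb
        self.partitions = []

    def allocate(self, job, sms, mem_gb):
        # Admission control: refuse rather than preempt a running job.
        if sms > self.free_sms or mem_gb > self.free_mem_gb:
            return None
        self.free_sms -= sms
        self.free_mem_gb -= mem_gb
        p = Partition(job, sms, mem_gb)
        self.partitions.append(p)
        return p

    def release(self, p):
        self.partitions.remove(p)
        self.free_sms += p.sms
        self.free_mem_gb += p.mem_gb

gpu = SpatialGPU()
a = gpu.allocate("interactive-ui", sms=14, mem_gb=10)
b = gpu.allocate("long-inference", sms=56, mem_gb=40)
c = gpu.allocate("too-big", sms=80, mem_gb=40)  # exceeds remaining SMs
print(a is not None, b is not None, c is None)
```

Note how the non-terminating job above never blocks the interactive one; the hard part the text alludes to is doing this allocation dynamically, under co-scheduling constraints, rather than with fixed slices.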
Paul

On Sat, Dec 28, 2024, 4:19 p.m. Skip Tavakkolian <skip.tavakkol...@gmail.com> wrote:

> I think there is a conflict between two different types of usage that
> GPU architectures need to support. One is full-speed performance,
> where the resource is fully owned and utilized for a single purpose
> for a large chunk of time; the other is where the GPU is a scarce
> resource that needs to be shared in short intervals (e.g. ML/AI
> inference).
>
> If I understand correctly, NIX's approach addresses the first case (at
> least for cores). Nvidia's Multi-Instance GPU (MIG) architecture
> seems to help handle the second type of usage. The first case
> requires maximizing data transfer rate, and the second needs clever
> scheduling to minimize swapping (large/huge) model parameters in and out.
>
> On Fri, Dec 27, 2024 at 8:33 AM Paul Lalonde <paul.a.lalo...@gmail.com> wrote:
> >
> > GPUs have been my bread and butter for 20+ years.
> >
> > The best introductory source continues to be Kayvon Fatahalian and Mike
> > Houston's 2008 CACM paper: https://dl.acm.org/doi/10.1145/1400181.1400197
> >
> > It says little about the software interface to the GPU, but does a very
> > good job of motivating and describing the architecture.
> >
> > The in-depth resource for modern GPU architecture is the Nvidia A100
> > tensor architecture paper:
> > https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf
> > It's a slog, but it clearly shows how compute has changed. In particular,
> > much of the success comes from turning branchy workloads with scattered
> > memory accesses into much more bulk-oriented data streams that match the
> > "natural" structure of the tensor cores. The performance gains can be
> > astronomical. I've personally made 1000x - yes, that's *times*, not
> > percentages - speedups with some workloads.
> > There is very little compute
> > that's "cpu-limited" at multi-second scales that can't benefit from these
> > approaches, hence the death of non-GPU supercomputing.
> >
> > Paul
> >
> > On Fri, Dec 27, 2024 at 7:13 AM <tlaro...@kergis.com> wrote:
> >>
> >> On Thu, Dec 26, 2024 at 10:24:23PM -0800, Ron Minnich wrote:
> >> [very interesting stuff]
> >> >
> >> > Finally, why did something like this not ever happen? Because GPUs
> >> > came along a few years later and that's where all the parallelism in
> >> > HPC is nowadays. NIX was a nice idea, but it did not survive in the
> >> > GPU era.
> >>
> >> GPUs are actually wreaking havoc on other kernels; in the Unix
> >> world, X11 is in bad shape for several reasons, one being that
> >> GPUs are not limited to graphical display---though this tends to be
> >> anecdotal in some sense.
> >>
> >> Can you elaborate on the GPU paradigm break? I tend to think that
> >> there is a main difference between "equals" sharing the same address
> >> space via an MMU and auxiliary processors that use a separate address
> >> space. A GPU, as far as I know (which is not much), is an auxiliary
> >> processor when it is discrete, and a specialized processor sharing the
> >> same address space when integrated (though I guess HPC machines have
> >> discrete GPUs, perhaps with a specialized interconnect).
> >>
> >> Do you know good references about:
> >>
> >> - organizing processors depending on memory connection---I found
> >> mainly M. J.
> >> Flynn's paper(s) about this, but nothing more
> >> recent---and the impact on an OS design;
> >>
> >> - IPC vs threads---from your description, it seems that your solution
> >> multiplied processes, hence IPC, instead of multiplying threads---but
> >> the sharing of differing memories nonetheless remains, and is easier
> >> to solve with IPC than with threads;
> >>
> >> - present GPU architecture (supposing it is documented; it seems not
> >> entirely, judging from "General-Purpose Graphics Processor
> >> Architectures", Aamodt, Lun Fung, Rogers, Springer-Verlag) and the
> >> RISC-V approach of composing hardware by connecting dedicated
> >> elements, and vectors vs SIMT.
> >>
> >> Thanks for sharing (what can be shared)!
> >> --
> >> Thierry Laronde <tlaronde +AT+ kergis +dot+ com>
> >> http://www.kergis.com/
> >> http://kertex.kergis.com/
> >> Key fingerprint = 0FF7 E906 FBAF FE95 FD89 250D 52B1 AE95 6006 F40C

------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/T7692a612f26c8ec5-M5030d3a8098ce71249fd4d89
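[Editor's note: a small illustration of the "branchy workloads into bulk-oriented data streams" restructuring described in the thread. This pure-Python sketch is editorial, not from any of the messages; the functions and data are hypothetical stand-ins for GPU kernels.]

```python
# Illustrative sketch: a per-element branch causes divergence on SIMT
# hardware, since both sides of the branch execute serially within a warp.
# Regrouping elements by branch outcome turns the work into two uniform
# bulk passes, each of which maps well onto wide hardware.

def branchy(xs):
    # Divergent formulation: a data-dependent branch on every element.
    out = []
    for x in xs:
        if x < 0:
            out.append(x * x)
        else:
            out.append(x + 1)
    return out

def bulk(xs):
    # Bulk formulation: partition once, then run each "kernel" over a
    # contiguous, uniform batch. Results come back grouped rather than in
    # input order; a real port would carry indices along and scatter back.
    neg = [x for x in xs if x < 0]
    pos = [x for x in xs if x >= 0]
    squared = [x * x for x in neg]       # uniform kernel 1
    incremented = [x + 1 for x in pos]   # uniform kernel 2
    return squared + incremented

xs = [3, -2, 5, -1, 0]
print(sorted(branchy(xs)) == sorted(bulk(xs)))  # same multiset of results
```

The speedup on real hardware comes from the second form's uniform control flow and contiguous memory traffic, which is the shape tensor cores and wide SIMD units are built for.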