Re: [9fans] NIX experience

2024-12-28 Thread tlaronde
On Fri, Dec 27, 2024 at 01:31:24PM -0800, Paul Lalonde wrote:
>
> That said, now that NVDA has moved a bunch of their "resource manager"
> (read, OS) to the GPU itself and simplified the linux DRM module, the
> driver layer has simplified significantly. I'm not sure I have anywhere
> near the ban

Re: [9fans] NIX experience

2024-12-28 Thread Stuart Morrow
On Sat, 28 Dec 2024 at 04:44, Cyber Fonic wrote:
> Whilst there are many HPC workloads that are well supported by GPGPUs, we
> also have multi-core systems such as Ampere One and AMD EPYC with 192 cores
> (and soon even more).
> I would think that some of the Blue Gene and NIX work might be rele

Re: [9fans] NIX experience

2024-12-28 Thread Frank D. Engel, Jr.
Apple Silicon chips may be an interesting counter-example to your view of the architecture.  They work directly from system memory; data is not copied between different sets of memory or different areas in memory to make it available to the GPU.  Consequently the CPU and GPU work together much

Re: [9fans] NIX experience

2024-12-28 Thread Paul Lalonde
This data-shuttling is one of the things that GPU vendors have been working on. Most of the data the GPU needs is never touched by the CPU, except to move it to GPU memory. This is wasteful. But the GPU already sits on the PCIe bus, as does the storage device. Why not move the data directly from

Re: [9fans] NIX experience

2024-12-28 Thread Skip Tavakkolian
I think there is a conflict between two different types of usage that GPU architectures need to support. One is full-speed performance, where the resource is fully owned and utilized for a single purpose for a large chunk of time, and the other is where the GPU is a rare resource that needs to be s

Re: [9fans] NIX experience

2024-12-28 Thread Paul Lalonde
MIG is interesting. It's something we partly devised to deal with sharing interactive workloads on the same GPU. Time-slicing is very difficult on a GPU because the state is so large. You wind up having to drain the workload and that means you're holding on to resources until the longest job ends.
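
To put rough numbers on the drain problem described above, here is a toy sketch. It is not how any real GPU scheduler works; it only assumes, as the message says, that a switch has to wait for every in-flight job to finish and that resources stay held until the longest one ends. The function names and the run times are made up for illustration.

```python
def drain_wait(remaining_ms):
    """Time before the GPU can be handed over when the only option is to
    drain: every in-flight job has to run to completion first."""
    return max(remaining_ms, default=0)


def held_idle_time(remaining_ms):
    """Total time that resources of already-finished jobs sit held but
    idle while the drain waits for the longest job (assumption: nothing
    is released early)."""
    wait = drain_wait(remaining_ms)
    return sum(wait - r for r in remaining_ms)


# Hypothetical remaining run times of three in-flight jobs, in ms.
in_flight = [2, 5, 40]

print(drain_wait(in_flight))      # 40: the longest job sets the switch latency
print(held_idle_time(in_flight))  # 73: (40-2) + (40-5) + (40-40)
```

One long job among many short ones is enough to dominate both figures, which is roughly why a fixed partition of the hardware can look more attractive than time-slicing for interactive sharing.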