Stefan Monnier wrote:
> > manufacturers in different memory banks, but since it's always
> > possible to power down, replace or just remove memory, and power
> > up again,
>
> Hmm... "always"? What about long running computations like that
> simulation (or LLM training) launched a month ago and that's expected to
> finish in another month or so?

If the job is that big, it's being run on multiple machines. This
machine's current chunk is corrupt, so you can't use it anyway. The
orchestrator stops using this machine, and someone comes in to replace
the RAM. Later the machine is re-added to the pool.

> Some mainframes have supported hot (un)plugging RAM modules as well and
> I wouldn't be surprised if some x86 servers also support it nowadays.

https://www.kernel.org/doc/html/latest/admin-guide/mm/memory-hotplug.html

That said, you won't find this feature without specifying it when you
buy it, and very few people have a use case for it.

-dsr-