[slurm-users] Prevent CLOUD node from being shutdown after startup

2023-05-12 Thread Xaver Stiensmeier
Dear slurm-users, I am currently looking into how I can deactivate suspending for nodes. I am interested in both the general case: allowing all nodes to be powered up, but without automatic suspending for any node unless a power down is triggered manually. And the special case: Allow

Re: [slurm-users] Prevent CLOUD node from being shutdown after startup

2023-05-12 Thread Brian Andrus
Xaver, Your description of the cases is a bit difficult to understand. It seems you want to have exceptions for power_up. That could be done by filtering the list of nodes yourself with any script/method you like and then doing power_up on the remaining list. For excluding nodes from being suspend
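
The filtering approach suggested above could look roughly like the following sketch. It is not taken from the thread: the function names, the example node names, and the `dry_run` switch are all hypothetical. It assumes that power-up is requested with `scontrol update NodeName=<list> State=POWER_UP`, which should be checked against the scontrol documentation for the installed Slurm version.

```python
import subprocess


def nodes_to_power_up(all_nodes, excluded):
    """Return the nodes in all_nodes that are not in excluded, keeping order."""
    excluded_set = set(excluded)
    return [node for node in all_nodes if node not in excluded_set]


def build_power_up_command(nodes):
    """Build the scontrol argument list that requests power-up for nodes."""
    # Assumption: State=POWER_UP is accepted by the local scontrol version.
    return ["scontrol", "update", "NodeName=" + ",".join(nodes), "State=POWER_UP"]


def power_up(nodes, dry_run=True):
    """Request power-up for nodes; with dry_run=True only return the command."""
    if not nodes:
        return None  # nothing left after filtering, so do not call scontrol
    cmd = build_power_up_command(nodes)
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Hypothetical node names, for illustration only.
    candidates = ["cloud1", "cloud2", "cloud3"]
    exceptions = ["cloud2"]
    print(power_up(nodes_to_power_up(candidates, exceptions)))
```

Keeping the filter separate from the scontrol call means the exception list can be checked without touching the cluster. Passing `dry_run=False` runs the command.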

[slurm-users] Invalid device ordinal

2023-05-12 Thread Henderson, Cornelius J. (GSFC-606.2)[InuTeq, LLC]
Hello - I'm trying to get GPU container jobs working on virtual nodes. The jobs fail with "Test CUDA failure common.cu:893 'invalid device ordinal'" in the output file and "slurmstepd: error: mpi/pmix_v3: _errhandler: n4 [0]: pmixp_client_v2.c:211: Error handler invoked: status = -25, source =