Thanks for the suggestion Ole - I tried this out yesterday on RHEL 9.4 with
two slightly different setups.
1) Using the stock ice driver that ships with RHEL 9.4 for the card, I still
saw the issue.
2) There was not a pre-built version of the ice driver on the Intel download
site, so I built the driver from source.
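(For anyone comparing notes, a quick way to confirm which ice driver and
firmware a node is actually running - the interface name below is just a
placeholder:)

# report the driver, its version, and the NIC firmware for an interface
ethtool -i ens1f0        # ens1f0 is a placeholder interface name
# report the ice module the kernel will load (in-tree builds may omit
# a version: line here)
modinfo ice | grep -E '^(filename|version):'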
Thanks for the suggestion Ole - I'll see if I can get that in the mix to try
over the next few days.
I can report that the 23.02.7 tree had the same issues, so going backwards on
the Slurm bits did not have any impact.
Brent
good way to tickle
the real issue. Over the next few days, I'll try to roll everything back to
RHEL 8.9 and see how that goes.
Brent
From: Henderson, Brent via slurm-users [mailto:slurm-users@lists.schedmd.com]
Sent: Thursday, May 2, 2024 11:32 AM
To: slurm-users@lists.schedmd.com
Subject: RE: [slurm-users] srun launched mpi job occasionally core dumps
suggestions.
Thanks,
Brent
From: Henderson, Brent via slurm-users [mailto:slurm-users@lists.schedmd.com]
Sent: Wednesday, May 1, 2024 11:21 AM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] srun launched mpi job occasionally core dumps
Greetings Slurm gurus --
I've been having an issue where, very occasionally, an srun-launched OpenMPI
job will die during startup within MPI_Init(). E.g. srun -N 8
--ntasks-per-node=1 ./hello_world_mpi. The same binary launched with mpirun
does not experience the issue. E.g. mpirun -n 64 ./hello_world_mpi.
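(For reference, a minimal sketch of the sort of hello world used here -
assuming the standard MPI boilerplate, not the exact source:)

# hypothetical reproducer: a minimal MPI hello world built against OpenMPI
cat > hello_world_mpi.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);               /* the reported failures happen here */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF
mpicc -o hello_world_mpi hello_world_mpi.c
srun -N 8 --ntasks-per-node=1 ./hello_world_mpi   # occasionally dies in MPI_Init()
mpirun -n 64 ./hello_world_mpi                    # does not hit the issue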
To process the epilog a Bash process must be created, so perhaps look at
.bashrc. Try timing the epilog yourself on a compute node. I presume it is
owned by an account local to the compute nodes, not a directory service account?
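(Something along these lines, assuming the epilog is whatever script the
Epilog setting in slurm.conf points at - the path below is only a placeholder:)

# find the configured epilog script
scontrol show config | grep -i epilog
# time it the way slurmd would run it (placeholder path)
time bash /etc/slurm/epilog.sh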
William
On Fri, 1 Apr 2022, 17:25 Henderson, Brent wrote:
I've hit an issue with binding using slurm 21.08.5 that I'm hoping someone
might be able to help with. I took a scan through the e-mail list but didn't
see this one - apologies if I missed it. Maybe I just need a better
understanding of why this is happening, but it feels like a bug.
The issue is
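(A quick way to see the binding srun actually applies, for anyone who wants to
compare - these are standard srun options and /proc fields:)

# have srun report the CPU mask it binds each task to, and cross-check
# with what each task itself sees in /proc
srun -N 1 --ntasks-per-node=2 --cpu-bind=verbose \
    grep Cpus_allowed_list /proc/self/status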
Hi slurm experts -
I've gotten temporary access to a cluster with 1k nodes - so of course I set
up Slurm on it (v20.11.8). :) Small jobs are fine and go back to idle rather
quickly. Jobs that use all the nodes will have some nodes linger in the
completing state for over a minute, while others may
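(For anyone who wants to look along with me, these are the standard places I'm
watching - stock squeue/sinfo/scontrol invocations and real slurm.conf settings:)

# jobs stuck in the COMPLETING state
squeue --states=COMPLETING
# nodes still flagged as completing
sinfo -t completing
# settings that commonly govern how long job cleanup may take
scontrol show config | grep -Ei 'killwait|unkillable|epilog'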