[slurm-users] Re: srun launched mpi job occasionally core dumps

2024-05-10 Thread Henderson, Brent via slurm-users
Thanks for the suggestion Ole - I tried this out yesterday on RHEL 9.4 with two slightly different setups. 1) With the stock ice driver that comes with RHEL 9.4 for the card, I still saw the issue. 2) There was not a pre-built version of the ice driver on the Intel download site, so I buil

[slurm-users] Re: srun launched mpi job occasionally core dumps

2024-05-08 Thread Henderson, Brent via slurm-users
Thanks for the suggestion Ole - I'll see if I can get that in the mix to try over the next few days. I can report that 23.02.7 tree had the same issues, so going backwards on the slurm bits did not have any impact. Brent

[slurm-users] Re: srun launched mpi job occasionally core dumps

2024-05-07 Thread Henderson, Brent via slurm-users
ood way to tickle the real issue. Over the next few days, I'll try to roll everything back to RHEL 8.9 and see how that goes. Brent

[slurm-users] Re: srun launched mpi job occasionally core dumps

2024-05-02 Thread Henderson, Brent via slurm-users
uggestions. Thanks, Brent

[slurm-users] srun launched mpi job occasionally core dumps

2024-05-01 Thread Henderson, Brent via slurm-users
Greetings Slurm gurus -- I've been having an issue where, very occasionally, an srun-launched OpenMPI job will die during startup within MPI_Init(). E.g. srun -N 8 --ntasks-per-node=1 ./hello_world_mpi. The same binary launched with mpirun does not experience the issue. E.g. mpirun -n 64
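
Since the failure is only occasional, one way to put a number on it is to repeat the launch many times and count the non-zero exits. The following is a minimal sketch, not something taken from the thread: count_failures is a hypothetical helper, the run count is arbitrary, and it assumes that srun exits non-zero when a task dies during startup.

```shell
#!/usr/bin/env bash
# count_failures is a hypothetical helper (not part of Slurm or Open MPI).
# Usage: count_failures RUNS COMMAND [ARGS...]
# Runs COMMAND the given number of times and prints how many runs
# exited with a non-zero status.
count_failures() {
    local runs=$1
    shift
    local failures=0
    local i
    for ((i = 1; i <= runs; i++)); do
        # Output is discarded; only the exit status is inspected.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
    done
    echo "$failures"
}

# Example, run inside an allocation (100 is an arbitrary run count):
#   count_failures 100 srun -N 8 --ntasks-per-node=1 ./hello_world_mpi
```

Running the same helper against the mpirun launch line would give a like-for-like failure count for the two launchers.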

Re: [slurm-users] nodes lingering in completion

2023-03-27 Thread Henderson, Brent
ocess the epilog a Bash process must be created so perhaps look at .bashrc. Try timing running the epilog yourself on a compute node. I presume it is owned by an account local to the compute nodes, not a directory service account? William On Fri, 1 Apr 2022, 17:25 Henderson,

[slurm-users] odd binding interaction with hint=nomultithread

2022-08-08 Thread Henderson, Brent
I've hit an issue with binding using slurm 21.08.5 that I'm hoping someone might be able to help with. I took a scan through the e-mail list but didn't see this one - apologies if I missed it. Maybe I just need a better understanding of why this is happening, but it feels like a bug. The issue is

[slurm-users] nodes lingering in completion

2022-04-01 Thread Henderson, Brent
Hi slurm experts - I've gotten temporary access to a cluster with 1k nodes - so of course I set up slurm on it (v20.11.8). :) Small jobs are fine and go back to idle rather quickly. Jobs that use all the nodes will have some 'linger' in the completing state for over a minute while others may