[slurm-users] /etc/passwd sync?

2025-02-10 Thread mark.w.moorcroft--- via slurm-users
If you set up Slurm elastic cloud in EC2 without LDAP, what is the recommended method for syncing the passwd/group files? Is this necessary to get Open MPI jobs to run? I would swear I had this working last week without synced passwd on two nodes. But thinking about it now I'm not sure how this c
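Whatever sync method is chosen, it may help to check first whether the nodes actually disagree. The following is a minimal sketch, not a recommended Slurm procedure: it assumes you have already collected the text of /etc/passwd from each node by some means of your own, and the function names `parse_passwd` and `find_mismatches` are hypothetical helpers invented for this illustration.

```python
def parse_passwd(text):
    """Map user name -> (uid, gid) from passwd-format text."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        # A passwd entry is name:password:uid:gid:gecos:home:shell.
        if len(fields) < 4 or not fields[2].isdigit() or not fields[3].isdigit():
            continue  # skip malformed or non-numeric entries
        entries[fields[0]] = (int(fields[2]), int(fields[3]))
    return entries


def find_mismatches(passwd_by_node):
    """Return {user: {node: (uid, gid) or None}} for users that differ across nodes.

    passwd_by_node maps a node name (hypothetical labels) to that node's
    passwd file contents. None means the user is missing on that node.
    """
    parsed = {node: parse_passwd(text) for node, text in passwd_by_node.items()}
    users = set().union(*(p.keys() for p in parsed.values()))
    mismatches = {}
    for user in sorted(users):
        seen = {node: parsed[node].get(user) for node in parsed}
        if len(set(seen.values())) > 1:
            mismatches[user] = seen
    return mismatches
```

An empty result means every user has the same UID and GID on every node supplied; it says nothing about the group file, which would need a similar check.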

[slurm-users] Force a refresh for sreport

2025-02-10 Thread Jackson, Gary L. via slurm-users
As I understand it, slurmdbd will compile statistics for sreport once an hour. Is there any way I can force that to happen immediately? Restarting slurmdbd doesn’t seem to do anything. I don’t want to use this for anything operational. Just testing. -- Gary

[slurm-users] Re: jobs getting stuck in CG

2025-02-10 Thread Christopher Samuel via slurm-users
On 2/10/25 7:05 am, Michał Kadlof via slurm-users wrote: I observed similar symptoms when we had issues with the shared Lustre file system. When the file system couldn't complete an I/O operation, the process in Slurm remained in the CG state until the file system became responsive again. An a

[slurm-users] Re: jobs getting stuck in CG

2025-02-10 Thread John Hearns via slurm-users
ps -eaf --forest is your friend with Slurm. On Mon, Feb 10, 2025, 12:08 PM Michał Kadlof via slurm-users < slurm-users@lists.schedmd.com> wrote: > I observed similar symptoms when we had issues with the shared Lustre file > system. When the file system couldn't complete an I/O operation, the > pro

[slurm-users] Re: jobs getting stuck in CG

2025-02-10 Thread Michał Kadlof via slurm-users
I observed similar symptoms when we had issues with the shared Lustre file system. When the file system couldn't complete an I/O operation, the process in Slurm remained in the CG state until the file system became responsive again. An additional symptom was that the blocking process was stuck

[slurm-users] Re: jobs getting stuck in CG

2025-02-10 Thread John Hearns via slurm-users
Belay that reply. Different issue. In that case salloc works OK but srun says the user has no job on the node. On Mon, Feb 10, 2025, 9:24 AM John Hearns wrote: > I have had something similar. > The fix was to run a > scontrol reconfig > Which causes a reread of the slurmd config > Give that a try > >

[slurm-users] Re: jobs getting stuck in CG

2025-02-10 Thread John Hearns via slurm-users
I have had something similar. The fix was to run scontrol reconfig, which causes a reread of the slurmd config. Give that a try. It might be scontrol reread. Use the manual. On Mon, Feb 10, 2025, 8:32 AM Ricardo Román-Brenes via slurm-users < slurm-users@lists.schedmd.com> wrote: > Hello everyone.

[slurm-users] jobs getting stuck in CG

2025-02-10 Thread Ricardo Román-Brenes via slurm-users
Hello everyone. I have a cluster composed of 16 nodes, 4 of which have GPUs, with no particular configuration to manage them. The filesystem is gluster, authentication via slapd/munge. My problem is that very frequently, at least one job a day, a job gets stuck in CG. I have no idea why th
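As a starting point for tracking down which jobs and nodes are affected, the sketch below lists jobs currently in the COMPLETING (CG) state. It is an illustration under assumptions, not a diagnosis of the problem described above: it assumes `squeue` is on the PATH, and the helper names `parse_completing` and `list_completing_jobs` are hypothetical. The squeue options used (`--noheader`, `--states`, `--format` with `%i` for job ID and `%N` for node list) are standard ones.

```python
import subprocess


def parse_completing(squeue_output):
    """Parse 'JOBID NODELIST' lines into a list of (job_id, nodelist) tuples."""
    jobs = []
    for line in squeue_output.splitlines():
        parts = line.split(None, 1)
        if not parts:
            continue  # skip blank lines
        nodes = parts[1].strip() if len(parts) > 1 else ""
        jobs.append((parts[0], nodes))
    return jobs


def list_completing_jobs():
    """Run squeue and return jobs in the COMPLETING state (requires Slurm)."""
    result = subprocess.run(
        ["squeue", "--noheader", "--states=COMPLETING", "--format=%i %N"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_completing(result.stdout)
```

Running `list_completing_jobs()` periodically and noting which nodes recur would narrow down where to look, for example with `ps -eaf --forest` on those nodes as suggested in the replies.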