If you set up Slurm elastic cloud nodes in EC2 without LDAP, what is the
recommended method for syncing the passwd/group files? Is this necessary to
get Open MPI jobs to run? I would swear I had this working last week without
synced passwd files on two nodes, but thinking about it now I'm not sure how
this could have worked.
As I understand it, slurmdbd compiles the statistics for sreport once an hour.
Is there any way to force that to happen immediately? Restarting slurmdbd
doesn't seem to do anything. I don't want to use this for anything
operational; I'm just testing.
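For context, the queries I'm testing are along these lines (the dates are made up):

  # utilization for an explicit window; only shows data that has already been rolled up
  sreport cluster utilization start=2025-02-01 end=2025-02-10 -t hours
  sreport user top start=2025-02-01 end=2025-02-10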
--
Gary
On 2/10/25 7:05 am, Michał Kadlof via slurm-users wrote:
I observed similar symptoms when we had issues with the shared Lustre
file system. When the file system couldn't complete an I/O operation,
the process in Slurm remained in the CG state until the file system
became responsive again.
ps -eaf --forest is your friend with Slurm
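For example, on the node with the stuck job, something along these lines (illustrative, not the only way):

  # show the slurmstepd process tree and whatever is still hanging under it
  ps -eaf --forest | grep -B1 -A5 slurmstepd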
On Mon, Feb 10, 2025, 12:08 PM Michał Kadlof via slurm-users <
slurm-users@lists.schedmd.com> wrote:
> I observed similar symptoms when we had issues with the shared Lustre file
> system. When the file system couldn't complete an I/O operation, the
> process in Slurm remained in the CG state until the file system became
> responsive again.
I observed similar symptoms when we had issues with the shared Lustre
file system. When the file system couldn't complete an I/O operation,
the process in Slurm remained in the CG state until the file system
became responsive again. An additional symptom was that the blocking
process was stuck.
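A quick way to spot such a process, assuming it is blocked in uninterruptible sleep on I/O (the usual case with a hung Lustre mount), is something like:

  # list processes in uninterruptible sleep (state D), typically blocked on I/O
  ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'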
Belay that reply; different issue.
In that case salloc works OK, but srun says the user has no job on the node.
On Mon, Feb 10, 2025, 9:24 AM John Hearns wrote:
> I have had something similar.
> The fix was to run
> scontrol reconfig
> which causes a re-read of the slurmd config.
> Give that a try.
I have had something similar.
The fix was to run
scontrol reconfig
which causes a re-read of the slurmd config.
Give that a try.
It might be scontrol reconfigure in full; check the manual.
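For reference, the full subcommand and a quick sanity check (the grep pattern is just an example parameter):

  # ask slurmctld to re-read slurm.conf and push the change out to the slurmds
  scontrol reconfigure
  # confirm the value the daemons are actually running with
  scontrol show config | grep -i ReturnToService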
On Mon, Feb 10, 2025, 8:32 AM Ricardo Román-Brenes via slurm-users <
slurm-users@lists.schedmd.com> wrote:
> Hello everyone.
Hello everyone.
I have a cluster of 16 nodes, 4 of which have GPUs with no particular
configuration to manage them.
The filesystem is Gluster; authentication is via slapd/munge.
My problem is that very frequently, at least one job daily, a job gets
stuck in CG. I have no idea why this happens.
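A quick way to enumerate the stuck jobs and their nodes is something like this (the format string is arbitrary):

  # list jobs in the completing state and where they are
  squeue --states=CG -o "%.10i %.9T %.20R"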