[slurm-users] Re: Randomly draining nodes

2024-10-11 Thread laddaoui--- via slurm-users
Hi Laura, Thank you for your reply. Indeed, Prolog is not configured on my machine:

    $ scontrol show config | grep -i prolog
    Prolog = (null)
    PrologEpilogTimeout = 65534
    PrologSlurmctld = (null)
    PrologFlags = Alloc,Contain
    ResvProlog = (null)
    Sr

[slurm-users] Re: Randomly draining nodes

2024-10-21 Thread laddaoui--- via slurm-users
You were right: I found that the slurm.conf file differed between the controller node and the compute nodes, so I've synchronized it now. I was also considering setting up an epilogue script to help debug what happens after the job finishes. Do you happen to have any examples of what an epilogue
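
As a rough illustration of the kind of debugging epilog asked about above, here is a minimal sketch, not a tested production script. It assumes the script is referenced by `Epilog=` in slurm.conf, that `SLURM_JOB_ID`, `SLURM_JOB_USER` and `SLURMD_NODENAME` are set in the epilog environment, and that the log path (a hypothetical location) is writable by the user running slurmd. The function name `log_job_end` and the variable `EPILOG_LOG` are made up for this example.

```shell
#!/bin/bash
# Hypothetical debugging epilog. The log location is an assumption;
# override it by setting EPILOG_LOG before the script runs.
EPILOG_LOG="${EPILOG_LOG:-/var/log/slurm/epilog_debug.log}"

log_job_end() {
    # Append one line per finished job: timestamp, job id, user, node.
    # Unset variables are reported as "unknown" instead of failing.
    printf '%s job=%s user=%s node=%s\n' \
        "$(date '+%Y-%m-%dT%H:%M:%S')" \
        "${SLURM_JOB_ID:-unknown}" \
        "${SLURM_JOB_USER:-unknown}" \
        "${SLURMD_NODENAME:-$(hostname -s)}" >> "$EPILOG_LOG"
}

# A non-zero exit status from an epilog causes Slurm to drain the node,
# so a logging failure is swallowed and the script ends with status 0.
log_job_end 2>/dev/null || true
```

Keeping the epilog to logging only, and making sure it cannot return a failure, avoids adding a new cause of drained nodes while investigating the existing one.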

[slurm-users] Re: [EXTERNAL] enforce Qos to users

2025-06-24 Thread laddaoui--- via slurm-users
Thanks! It works perfectly now. Nacereddine

[slurm-users] enforce Qos to users

2025-06-24 Thread laddaoui--- via slurm-users
Hello everyone, I'm trying to use QoS to enforce resource limits on an association, but I'm having trouble with proper enforcement. I created a QoS with resource limits:
```
sacctmgr add qos qos_gpus flags=denyonlimit,overpartqos maxjobsperuser=4 maxtresperjob=gres/gpu=1
```
Then I assigned it

[slurm-users] Setting memory by assigned node with a plugin

2025-07-09 Thread laddaoui--- via slurm-users
Hello everyone, I'm writing a job_submit/lua plugin to set the memory allocated to a job depending on which node is assigned by Slurm. However, it appears that the node where the job will run is not available at this step of the submission process. Would a SPANK plugin be able to access this in

[slurm-users] Randomly draining nodes

2024-10-07 Thread Nacereddine Laddaoui via slurm-users
Hello everyone, I’ve recently encountered an issue where some nodes in our cluster enter a drain state randomly, typically after completing long-running jobs. Below is the output from the sinfo command showing the reason “Prolog error”:

    root@controller-node:~# sinfo -R
    REASON USER TIME