Hello everyone,
I’ve recently encountered an issue where some nodes in our cluster enter
a drain state randomly, typically after completing long-running jobs.
Below is the output from the sinfo command showing the reason “Prolog error”:

root@controller-node:~# sinfo -R
REASON              USER      TIME
Would appreciate any leads on the above query. Thanks in advance.
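For reference, this is roughly how the drain reason and the prolog failure can
be inspected, and the node returned to service once the cause is fixed (the
node name and slurmd log path below are placeholders for our setup, adjust as
needed):

    scontrol show node node01 | grep -i reason
    grep -i prolog /var/log/slurmd.log              # on the affected compute node
    scontrol update NodeName=node01 State=RESUME    # clear the drain afterwards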
On Fri, 20 Sept 2024 at 14:31, Minulakshmi S wrote:
> Hello,
>
> Issue 1:
> I am using Slurm version 24.05.1. My setup has a single node running slurmd,
> where I attach multiple GRES by enabling the oversubscribe feature.
> I am able to use th
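For anyone following the thread, the oversubscribe setup mentioned above is
typically configured along these lines. This is only a minimal sketch; the
node name, partition name, and GPU count are assumptions, not values from the
actual configuration:

    # slurm.conf (single-node cluster)
    GresTypes=gpu
    NodeName=node01 CPUs=32 Gres=gpu:4 State=UNKNOWN
    PartitionName=main Nodes=node01 Default=YES OverSubscribe=YES State=UP

    # gres.conf on node01
    Name=gpu File=/dev/nvidia[0-3]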
Something odd is going on on our cluster. User has a lot of pending jobs in a
job array (a few thousand).
squeue -u kmnx005 -r -t PD | head -5
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3045324_875 core run_scp_ kmnx005 PD 0:00 1
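The pending reason and priority for each task can also be pulled out
explicitly with squeue's long-format fields, e.g. (these are the standard
--Format field names):

    squeue -u kmnx005 -r -t PD -O ArrayJobID,ArrayTaskID,Reason,Priority | head -5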
Hi Tim,
On 10/7/24 11:13, Cutts, Tim via slurm-users wrote:
> Something odd is going on on our cluster. User has a lot of pending jobs
> in a job array (a few thousand).
>
> squeue -u kmnx005 -r -t PD | head -5
> JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
I should be clear: the JobArrayTaskLimit isn’t the issue (the user submitted
with %1, which is why we’re getting that reason). What I don’t understand is
why the jobs remaining in the queue have no priority at all associated with
them. It’s as though the scheduler has forgotten the job array exists.
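For anyone who wants to check, the priority (or lack of one) that the
scheduler has assigned can be inspected per task with a couple of stock
commands, using the job ID from the squeue output above:

    sprio -j 3045324                  # per-factor priority breakdown for the array job
    scontrol show job 3045324_875 | grep -E 'Priority|Reason|JobState'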
On 10/7/24 12:28, Cutts, Tim wrote:
> I should be clear: the JobArrayTaskLimit isn’t the issue (the user submitted
> with %1, which is why we’re getting that reason). What I don’t understand is
> why the jobs remaining in the queue have no priority at all associated with
> them. It’s as though the scheduler has forgotten the job array exists.