On 10/7/24 12:28, Cutts, Tim wrote:
I should be clear, the JobArrayTaskLimit isn’t the issue (the user submitted with %1, which is why we’re getting that reason). What I don’t understand is why the jobs remaining in the queue have no priority at all associated with them. It’s as though the scheduler has forgotten the job array exists.
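For reference, one way to see what priority (if any) the scheduler has recorded for the pending array tasks is to query the array job directly. The job ID below is a placeholder, and sprio only reports factors when the multifactor priority plugin is in use:

# Per-factor priority breakdown for the pending job (multifactor priority only)
sprio -j <array_jobid>

# Full job record; for arrays this includes Priority= and ArrayTaskThrottle=
scontrol show job <array_jobid> | grep -Ei 'priority|arraytaskthrottle'

If the throttle really were the limiter, it could be raised on the fly with "scontrol update JobId=<array_jobid> ArrayTaskThrottle=<N>", though as noted above that would not explain the missing priority.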
Hello everyone,

I’ve recently encountered an issue where some nodes in our cluster enter a drain state randomly, typically after completing long-running jobs. Below is the output from the sinfo command showing the reason “Prolog error”:

root@controller-node:~# sinfo -R
REASON USER TIME
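In case it helps, a rough way to dig into a “Prolog error” drain is sketched below; the node name is a placeholder and the slurmd log path depends on SlurmdLogFile in your slurm.conf:

# Full node record, including the drain Reason and when it was set
scontrol show node <nodename>

# Which prolog scripts are configured on this cluster
scontrol show config | grep -i prolog

# Look for the actual prolog failure in the slurmd log on the affected node
grep -i prolog /var/log/slurm/slurmd.log

# After fixing the underlying prolog problem, return the node to service
scontrol update NodeName=<nodename> State=RESUME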
Hi Tim,
On 10/7/24 11:13, Cutts, Tim via slurm-users wrote:
Something odd is going on on our cluster. User has a lot of pending jobs in a job array (a few thousand).

squeue -u kmnx005 -r -t PD | head -5
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3045324_875 core run_scp_ kmnx005 PD 0:00 1
Would appreciate any leads on the above query. Thanks in advance.
On Fri, 20 Sept 2024 at 14:31, Minulakshmi S wrote:
> Hello,
>
> *Issue 1:*
> I am using Slurm version 24.05.1; my slurmd runs on a single node to which I
> attach multiple GRES devices, with the oversubscribe feature enabled.
> I am able to use th
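For reference, a single-node setup with several GRES devices and oversubscription enabled might look roughly like the sketch below; the node name, GPU count, CPU count, and device paths are illustrative and not taken from the message above:

# slurm.conf (fragment)
GresTypes=gpu
NodeName=node01 CPUs=32 Gres=gpu:4 State=UNKNOWN
PartitionName=debug Nodes=node01 Default=YES OverSubscribe=FORCE:4 State=UP

# gres.conf
Name=gpu File=/dev/nvidia[0-3]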