Dear Slurm-user list,

Sadly, my question has received no answers so far. If the question is unclear
and you have ideas on how I could improve it, please let me know. We will soon
try updating Slurm to see whether the unwanted behavior disappears with that.

Best regards,
Xaver Stiensmeier

On 11/18/24 12:03, Xaver Stiensmeier wrote:

Dear Slurm-user list,

when a job fails because the node startup fails (cloud scheduling),
the job should be requeued, as described in the slurm.conf documentation
for ResumeTimeout:

    Resume Timeout
    Maximum time permitted (in seconds) between when a node resume
    request is issued and when the node is actually available for use.
    Nodes which fail to respond in this time frame will be marked DOWN
    and the jobs scheduled on the node requeued.

However, instead of being requeued, the job is killed:

[2024-11-18T10:41:52.003] node bibigrid-worker-wubqboa1z2kkgx0-0 not
resumed by ResumeTimeout(1200) - marking down and power_save
[2024-11-18T10:41:52.003] Killing JobId=1 on failed node
bibigrid-worker-wubqboa1z2kkgx0-0
[2024-11-18T10:41:52.046] update_node: node
bibigrid-worker-wubqboa1z2kkgx0-0 reason set to: FailedStartup
[2024-11-18T10:41:52.046] power down request repeating for node
bibigrid-worker-wubqboa1z2kkgx0-0

Our ResumeProgram does not change the state of the underlying workers.
I think we should set the nodes to DOWN explicitly if the startup fails,
given that the documentation also says:

    If *ResumeProgram* is unable to restore a node to service with a
    responding slurmd and an updated BootTime, it should set the node
    state to DOWN, which will result in a requeue of any job
    associated with the node - this will happen automatically if the
    node doesn't register within ResumeTimeout

But in any case, as the log shows, the job should be requeued simply
because ResumeTimeout was reached, and I am unsure why that is not
happening. The power down request is issued by our ResumeFailProgram.
We have SlurmctldParameters=idle_on_node_suspend activated, but that
should not affect resume handling, I would guess.
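
To make the idea concrete, here is a minimal sketch of a ResumeFailProgram
that sets the failed nodes to DOWN explicitly. This is not our actual script;
the file name and the reason string are placeholders. Slurm passes the
hostlist of the nodes that failed to resume as the program's argument.

```python
#!/usr/bin/env python3
"""Sketch of a ResumeFailProgram that marks failed nodes DOWN (placeholder, not our real script)."""
import subprocess
import sys


def main() -> int:
    if len(sys.argv) < 2:
        print("usage: resume_fail.py <hostlist>", file=sys.stderr)
        return 1

    # Slurm hands over the failed nodes as a hostlist expression,
    # e.g. "bibigrid-worker-wubqboa1z2kkgx0-0".
    hostlist = sys.argv[1]

    # Setting the state to DOWN is what the quoted documentation recommends;
    # it should make slurmctld requeue any job still associated with the nodes.
    subprocess.run(
        ["scontrol", "update", f"NodeName={hostlist}",
         "State=DOWN", "Reason=FailedStartup"],
        check=True,
    )

    # The cloud power-down request for the instances would follow here.
    return 0


if __name__ == "__main__":
    sys.exit(main())
```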

My Slurm version is 23.11.5.
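
For reference, the power-saving parameters relevant to this question look
roughly like this in our slurm.conf; the program paths are placeholders,
while ResumeTimeout and SlurmctldParameters match the values mentioned above:

```
# Program paths are placeholders, not our real paths.
ResumeProgram=/path/to/resume_program
ResumeFailProgram=/path/to/resume_fail_program
# These two values match the log and the text above.
ResumeTimeout=1200
SlurmctldParameters=idle_on_node_suspend
```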

Best regards,
Xaver

# More context

## Slurmctld log from job submission to failure

[2024-11-18T10:21:45.490] sched: _slurm_rpc_allocate_resources JobId=1
NodeList=bibigrid-worker-wubqboa1z2kkgx0-0 usec=1221
[2024-11-18T10:21:45.499] debug:  sackd_mgr_dump_state: saved state of
0 nodes
[2024-11-18T10:21:58.387] debug:  sched/backfill: _attempt_backfill:
beginning
[2024-11-18T10:21:58.387] debug:  sched/backfill: _attempt_backfill:
no jobs to backfill
[2024-11-18T10:22:20.003] debug:  sched: Running job scheduler for
full queue.
[2024-11-18T10:23:20.003] debug:  sched: Running job scheduler for
full queue.
[2024-11-18T10:23:20.009] debug:  sackd_mgr_dump_state: saved state of
0 nodes
[2024-11-18T10:23:23.003] debug:  Spawning ping agent for
bibigrid-master-wubqboa1z2kkgx0
[2024-11-18T10:23:23.398] debug:  sched/backfill: _attempt_backfill:
beginning
[2024-11-18T10:23:23.398] debug:  sched/backfill: _attempt_backfill:
no jobs to backfill
[2024-11-18T10:23:53.398] debug:  sched/backfill: _attempt_backfill:
beginning
[2024-11-18T10:23:53.398] debug:  sched/backfill: _attempt_backfill:
no jobs to backfill
[2024-11-18T10:24:21.000] debug:  sched: Running job scheduler for
full queue.
[2024-11-18T10:24:21.484] slurmscriptd: error: _run_script: JobId=0
resumeprog exit status 1:0
[2024-11-18T10:25:20.003] debug:  sched: Running job scheduler for
full queue.
[2024-11-18T10:26:02.000] debug:  Spawning ping agent for
bibigrid-master-wubqboa1z2kkgx0
[2024-11-18T10:26:02.417] debug:  sched/backfill: _attempt_backfill:
beginning
[2024-11-18T10:26:02.417] debug:  sched/backfill: _attempt_backfill:
no jobs to backfill
[2024-11-18T10:26:20.007] debug:  sched: Running job scheduler for
full queue.
[2024-11-18T10:26:32.417] debug:  sched/backfill: _attempt_backfill:
beginning
[2024-11-18T10:26:32.417] debug:  sched/backfill: _attempt_backfill:
no jobs to backfill
[2024-11-18T10:27:20.003] debug:  sched: Running job scheduler for
full queue.
[2024-11-18T10:28:20.003] debug:  Updating partition uid access list
[2024-11-18T10:28:20.003] debug:  sched: Running job scheduler for
full queue.
[2024-11-18T10:28:20.008] debug:  sackd_mgr_dump_state: saved state of
0 nodes
[2024-11-18T10:29:20.003] debug:  sched: Running job scheduler for
full queue.
[2024-11-18T10:29:22.000] debug:  Spawning ping agent for
bibigrid-master-wubqboa1z2kkgx0
[2024-11-18T10:29:22.448] debug:  sched/backfill: _attempt_backfill:
beginning
[2024-11-18T10:29:22.448] debug:  sched/backfill: _attempt_backfill:
no jobs to backfill
[2024-11-18T10:30:20.007] debug:  sched: Running job scheduler for
full queue.
[2024-11-18T10:31:20.003] debug:  sched: Running job scheduler for
full queue.
[2024-11-18T10:32:21.000] debug:  sched: Running job scheduler for
full queue.
[2024-11-18T10:32:42.000] debug:  Spawning ping agent for
bibigrid-master-wubqboa1z2kkgx0
[2024-11-18T10:32:42.478] debug:  sched/backfill: _attempt_backfill:
beginning
[2024-11-18T10:32:42.478] debug:  sched/backfill: _attempt_backfill:
no jobs to backfill
[2024-11-18T10:33:12.479] debug:  sched/backfill: _attempt_backfill:
beginning
[2024-11-18T10:33:12.479] debug:  sched/backfill: _attempt_backfill:
no jobs to backfill
[2024-11-18T10:33:20.003] debug:  sched: Running job scheduler for
full queue.
[2024-11-18T10:33:20.010] debug:  sackd_mgr_dump_state: saved state of
0 nodes
[2024-11-18T10:34:20.003] debug:  sched: Running job scheduler for
full queue.
[2024-11-18T10:35:20.007] debug:  sched: Running job scheduler for
full queue.
[2024-11-18T10:36:01.004] debug:  Spawning ping agent for
bibigrid-master-wubqboa1z2kkgx0
[2024-11-18T10:36:01.504] debug:  sched/backfill: _attempt_backfill:
beginning
[2024-11-18T10:36:01.504] debug:  sched/backfill: _attempt_backfill:
no jobs to backfill
[2024-11-18T10:36:20.003] debug:  sched: Running job scheduler for
full queue.
[2024-11-18T10:36:31.505] debug:  sched/backfill: _attempt_backfill:
beginning
[2024-11-18T10:36:31.505] debug:  sched/backfill: _attempt_backfill:
no jobs to backfill
[2024-11-18T10:37:21.000] debug:  sched: Running job scheduler for
full queue.
[2024-11-18T10:38:20.008] debug:  Updating partition uid access list
[2024-11-18T10:38:20.008] debug:  sched: Running job scheduler for
full queue.
[2024-11-18T10:38:20.017] debug:  sackd_mgr_dump_state: saved state of
0 nodes
[2024-11-18T10:39:20.003] debug:  sched: Running job scheduler for
full queue.
[2024-11-18T10:39:21.003] debug:  Spawning ping agent for
bibigrid-master-wubqboa1z2kkgx0
[2024-11-18T10:39:21.530] debug:  sched/backfill: _attempt_backfill:
beginning
[2024-11-18T10:39:21.530] debug:  sched/backfill: _attempt_backfill:
no jobs to backfill
[2024-11-18T10:39:51.531] debug:  sched/backfill: _attempt_backfill:
beginning
[2024-11-18T10:39:51.531] debug:  sched/backfill: _attempt_backfill:
no jobs to backfill
[2024-11-18T10:40:21.000] debug:  sched: Running job scheduler for
full queue.
[2024-11-18T10:41:20.003] debug:  sched: Running job scheduler for
full queue.
[2024-11-18T10:41:52.003] node bibigrid-worker-wubqboa1z2kkgx0-0 not
resumed by ResumeTimeout(1200) - marking down and power_save
[2024-11-18T10:41:52.003] Killing JobId=1 on failed node
bibigrid-worker-wubqboa1z2kkgx0-0
[2024-11-18T10:41:52.046] update_node: node
bibigrid-worker-wubqboa1z2kkgx0-0 reason set to: FailedStartup
[2024-11-18T10:41:52.046] power down request repeating for node
bibigrid-worker-wubqboa1z2kkgx0-0
[2024-11-18T10:41:52.047] debug:  sackd_mgr_dump_state: saved state of
0 nodes
[2024-11-18T10:41:52.549] debug:  sched/backfill: _attempt_backfill:
beginning
[2024-11-18T10:41:52.549] debug:  sched/backfill: _attempt_backfill:
no jobs to backfill
[2024-11-18T10:41:52.736] _slurm_rpc_complete_job_allocation: JobId=1
error Job/step already completing or completed
[2024-11-18T10:41:53.000] debug:  Spawning ping agent for
bibigrid-master-wubqboa1z2kkgx0
[2024-11-18T10:41:53.000] debug:  sched: Running job scheduler for
default depth.
[2024-11-18T10:41:53.014] update_node: node
bibigrid-worker-wubqboa1z2kkgx0-0 reason set to: FailedStartup
[2024-11-18T10:41:53.014] update_node: node
bibigrid-worker-wubqboa1z2kkgx0-0 state set to IDLE

