Hello Robert,

I've been having the same issue with BCM: CentOS 8.2, BCM 9.0, Slurm 20.02.3. It seems to have started when I enabled proctrack/cgroup and switched from select/linear to select/cons_tres.
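For reference, the change described above corresponds roughly to the following slurm.conf lines. This is a sketch only, since the actual file isn't quoted in this thread, and the TaskPlugin line is an assumption about a common pairing rather than something stated here:

    ProctrackType=proctrack/cgroup    # cgroup-based process tracking, as described above
    SelectType=select/cons_tres       # was select/linear before the change
    TaskPlugin=task/cgroup            # assumed: commonly set alongside cgroup tracking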
Are you using cgroup process tracking and have you manipulated the cgroup.conf file? Do jobs complete correctly when not cancelled?
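For comparison, a typical cgroup.conf on a setup like this might look roughly like the following; these are illustrative values, not anyone's actual file from this thread:

    CgroupAutomount=yes       # mount cgroup subsystems if not already mounted
    ConstrainCores=yes        # confine tasks to their allocated cores
    ConstrainRAMSpace=yes     # enforce the job's memory allocation
    ConstrainSwapSpace=yes    # enforce swap limits as well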
Regards,

Willy Markuske
HPC Systems Engineer
Research Data Services
P: (858) 246-5593

On 11/30/20 10:54 AM, Alex Chekholko wrote:
This may be more "cargo cult", but I've advised users to add a "sleep 60" to the end of their job scripts if they are I/O intensive. Sometimes they manage to generate I/O in a way that makes Slurm think the job is finished while the OS is still catching up on the I/O, and then Slurm tries to kill the job...

On Mon, Nov 30, 2020 at 10:49 AM Robert Kudyba <rkud...@fordham.edu> wrote:

Sure, I've seen that in some of the posts here, e.g., with a NAS. But in this case it's an NFS share to the local RAID10 storage. Aren't there any other settings that deal with this so a node doesn't get drained?

On Mon, Nov 30, 2020 at 1:02 PM Paul Edmon <ped...@cfa.harvard.edu> wrote:

That can help. Usually this happens because laggy storage used by the job takes time flushing the job's data. So making sure that your storage is up, responsive, and stable will also cut these down.

-Paul Edmon-

On 11/30/2020 12:52 PM, Robert Kudyba wrote:
> I've seen where this was a bug that was fixed
> (https://bugs.schedmd.com/show_bug.cgi?id=3941), but it still happens
> occasionally. A user cancels his/her job and a node gets drained.
> UnkillableStepTimeout=120 is set in slurm.conf.
>
> Slurm 20.02.3 on CentOS 7.9 running on Bright Cluster 8.2.
>
> Slurm Job_id=6908 Name=run.sh Ended, Run time 7-17:50:36, CANCELLED,
> ExitCode 0
> Resending TERMINATE_JOB request JobId=6908 Nodelist=node001
> update_node: node node001 reason set to: Kill task failed
> update_node: node node001 state set to DRAINING
> error: slurmd error running JobId=6908 on node(s)=node001: Kill task
> failed
>
> update_node: node node001 reason set to: hung
> update_node: node node001 state set to DOWN
> update_node: node node001 state set to IDLE
> error: Nodes node001 not responding
>
> scontrol show config | grep kill
> UnkillableStepProgram = (null)
> UnkillableStepTimeout = 120 sec
>
> Do we just increase the timeout value?
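To illustrate Alex's "sleep 60" workaround above: a minimal sketch of a batch script that pads the end of the job so in-flight writes can settle before Slurm tears the step down. The job name and the sync call are illustrative additions, not something prescribed in this thread:

    #!/bin/bash
    #SBATCH --job-name=myjob          # hypothetical job name
    #SBATCH --output=myjob.%j.out

    # ... the actual I/O-intensive work runs here ...

    sync       # ask the OS to flush dirty pages to storage
    sleep 60   # give laggy (e.g. NFS) storage time to catch up before the job ends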
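On the closing question: raising UnkillableStepTimeout is the usual first step, and UnkillableStepProgram can point at a script that captures state when a step can't be killed. A sketch of what that might look like in slurm.conf; the 300-second value and the script path are assumptions for illustration, not settings from this thread:

    UnkillableStepTimeout=300                                 # assumed value; default shown above was 120 sec
    UnkillableStepProgram=/usr/local/sbin/unkillable_debug.sh # hypothetical path; run when a step is unkillable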