Hello Robert,

I've been having the same issue with BCM: CentOS 8.2, BCM 9.0, Slurm 20.02.3. It seems to have started when I enabled proctrack/cgroup and switched from select/linear to select/cons_tres.
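For reference, the change described above corresponds roughly to the following slurm.conf lines. This is a sketch only, since the actual file isn't quoted in this thread, and the TaskPlugin line is an assumption about a common pairing rather than something stated here:

    ProctrackType=proctrack/cgroup    # cgroup-based process tracking, as described above
    SelectType=select/cons_tres       # was select/linear before the change
    TaskPlugin=task/cgroup            # assumed: commonly set alongside cgroup tracking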
Are you using cgroup process tracking and have you manipulated the cgroup.conf file? Do jobs complete correctly when not cancelled?
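For comparison, a typical cgroup.conf on a setup like this might look roughly like the following; these are illustrative values, not anyone's actual file from this thread:

    CgroupAutomount=yes       # mount cgroup subsystems if not already mounted
    ConstrainCores=yes        # confine tasks to their allocated cores
    ConstrainRAMSpace=yes     # enforce the job's memory allocation
    ConstrainSwapSpace=yes    # enforce swap limits as well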
Regards,

Willy Markuske
HPC Systems Engineer
Research Data Services
P: (858) 246-5593

On 11/30/20 10:54 AM, Alex Chekholko wrote:
This may be more "cargo cult", but I've advised users to add a "sleep 60" to the end of their job scripts if they are I/O intensive. Sometimes they manage to generate I/O in a way that makes Slurm think the job is finished while the OS is still catching up on the I/O, and then Slurm tries to kill the job...

On Mon, Nov 30, 2020 at 10:49 AM Robert Kudyba <rkud...@fordham.edu> wrote:

Sure, I've seen that in some of the posts here, e.g., with a NAS. But in this case it's an NFS share to the local RAID10 storage. Aren't there any other settings that deal with this so a node doesn't get drained?

On Mon, Nov 30, 2020 at 1:02 PM Paul Edmon <ped...@cfa.harvard.edu> wrote:

That can help. Usually this happens because laggy storage used by the job takes time flushing the job's data. So making sure that your storage is up, responsive, and stable will also cut these down.

-Paul Edmon-

On 11/30/2020 12:52 PM, Robert Kudyba wrote:
> I've seen where this was a bug that was fixed
> (https://bugs.schedmd.com/show_bug.cgi?id=3941), but it still happens
> occasionally. A user cancels his/her job and a node gets drained.
> UnkillableStepTimeout=120 is set in slurm.conf.
>
> Slurm 20.02.3 on CentOS 7.9 running on Bright Cluster 8.2.
>
> Slurm Job_id=6908 Name=run.sh Ended, Run time 7-17:50:36, CANCELLED,
> ExitCode 0
> Resending TERMINATE_JOB request JobId=6908 Nodelist=node001
> update_node: node node001 reason set to: Kill task failed
> update_node: node node001 state set to DRAINING
> error: slurmd error running JobId=6908 on node(s)=node001: Kill task
> failed
>
> update_node: node node001 reason set to: hung
> update_node: node node001 state set to DOWN
> update_node: node node001 state set to IDLE
> error: Nodes node001 not responding
>
> scontrol show config | grep kill
> UnkillableStepProgram = (null)
> UnkillableStepTimeout = 120 sec
>
> Do we just increase the timeout value?
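To illustrate Alex's "sleep 60" workaround above: a minimal sketch of a batch script that pads the end of the job so in-flight writes can settle before Slurm tears the step down. The job name and the sync call are illustrative additions, not something prescribed in this thread:

    #!/bin/bash
    #SBATCH --job-name=myjob          # hypothetical job name
    #SBATCH --output=myjob.%j.out

    # ... the actual I/O-intensive work runs here ...

    sync       # ask the OS to flush dirty pages to storage
    sleep 60   # give laggy (e.g. NFS) storage time to catch up before the job ends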
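On the closing question: raising UnkillableStepTimeout is the usual first step, and UnkillableStepProgram can point at a script that captures state when a step can't be killed. A sketch of what that might look like in slurm.conf; the 300-second value and the script path are assumptions for illustration, not settings from this thread:

    UnkillableStepTimeout=300                                 # assumed value; default shown above was 120 sec
    UnkillableStepProgram=/usr/local/sbin/unkillable_debug.sh # hypothetical path; run when a step is unkillable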