Hello all,

Two users on my system experience job failures every time they submit a job via sbatch. When I run their exact submission script, or when I create a local system user and launch from there, the jobs run fine. Here is an example of what I see in the slurmd log:
[2020-07-06T15:02:41.284] task_p_slurmd_batch_request: 1421
[2020-07-06T15:02:41.284] task/affinity: job 1421 CPU input mask for node: 0x00000F0000
[2020-07-06T15:02:41.284] task/affinity: job 1421 CPU final HW mask for node: 0x00000F0000
[2020-07-06T15:02:41.295] _run_prolog: prolog with lock for job 1421 ran for 0 seconds
[2020-07-06T15:02:41.295] error: [job 1421] prolog failed status=1:0
[2020-07-06T15:02:41.295] Job 1421 already killed, do not launch batch job

The prolog file is simply:

#!/bin/bash
loginctl enable-linger $SLURM_JOB_USER

There seems to be some reason why certain users always encounter this, but I can't figure out what it is. Their accounts are no different from anyone else's (they are not in a different group, etc.), so I don't think permissions are the issue. Anyway, the job failure immediately puts the node into a DRAINED/DRAINING state, which is expected, but for now these users cannot submit any jobs at all.

Any insights would be welcomed!

Warmest regards,
Jason

--
*Jason L. Simms, Ph.D., M.P.H.*
Manager of Research and High-Performance Computing
XSEDE Campus Champion
Lafayette College Information Technology Services
710 Sullivan Rd | Easton, PA 18042
Office: 112 Skillman Library
p: (610) 330-5632
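
P.S. In case it helps frame the question: the next thing I plan to try is an instrumented variant of the prolog that captures loginctl's stderr and exit status per user. This is only a rough sketch of what I have in mind, not what is currently deployed, and the log path is just a placeholder I picked:

#!/bin/bash
# Debug variant of the prolog: append all output to a log so per-user failures are visible
exec >> /var/log/slurm/prolog_debug.log 2>&1
echo "$(date) prolog for job ${SLURM_JOB_ID:-unknown} user ${SLURM_JOB_USER:-unknown}"
loginctl enable-linger "$SLURM_JOB_USER"
rc=$?
echo "loginctl exit status: $rc"
exit $rc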