Now that is interesting. If I do:

loginctl enable-linger weissp
Then I get the following error:

Failed to look up user weissp: No such process

This is one of the users whose jobs always fail. But if I run it for myself with:

loginctl enable-linger simmsj

everything works (as expected). Any thoughts?

Warmest regards,
Jason

On Tue, Jul 7, 2020 at 8:47 PM Sean Crosby <scro...@unimelb.edu.au> wrote:

> Hi Jason,
>
> What happens when you try to run that command on the node? Is the exit
> status of the command 0?
>
> e.g. for my servers, where lingering is masked, I get
>
> [root@thespian-gpgpu001 ~]# loginctl enable-linger scrosby
> Could not enable linger: Unit is masked.
> [root@thespian-gpgpu001 ~]# echo $?
> 1
>
> Sean
>
> --
> Sean Crosby | Senior DevOps/HPC Engineer and HPC Team Lead
> Research Computing Services | Business Services
> The University of Melbourne, Victoria 3010 Australia
>
>
> On Wed, 8 Jul 2020 at 01:14, Jason Simms <sim...@lafayette.edu> wrote:
>
>> Hello all,
>>
>> Two users on my system experience job failures every time they submit a
>> job via sbatch. When I run their exact submission script, or when I
>> create a local system user and launch from there, the jobs run fine.
>> Here is an example of what I see in the slurmd log:
>>
>> [2020-07-06T15:02:41.284] task_p_slurmd_batch_request: 1421
>> [2020-07-06T15:02:41.284] task/affinity: job 1421 CPU input mask for node: 0x00000F0000
>> [2020-07-06T15:02:41.284] task/affinity: job 1421 CPU final HW mask for node: 0x00000F0000
>> [2020-07-06T15:02:41.295] _run_prolog: prolog with lock for job 1421 ran for 0 seconds
>> [2020-07-06T15:02:41.295] error: [job 1421] prolog failed status=1:0
>> [2020-07-06T15:02:41.295] Job 1421 already killed, do not launch batch job
>>
>> The prolog file is simply:
>>
>> #!/bin/bash
>> loginctl enable-linger $SLURM_JOB_USER
>>
>> There seems to be some reason why certain users always encounter this,
>> but I can't figure out why. Their accounts are no "different" from
>> anyone else's (not in a different group, etc.), so I don't think
>> permissions are an issue.
>>
>> Anyway, the job failure immediately puts the node into a
>> DRAINED/DRAINING state (which is expected). But for now, these users
>> cannot submit any jobs at all.
>>
>> Any insights would be welcomed!
>>
>> Warmest regards,
>> Jason

--
Jason L. Simms, Ph.D., M.P.H.
Manager of Research and High-Performance Computing
XSEDE Campus Champion
Lafayette College
Information Technology Services
710 Sullivan Rd | Easton, PA 18042
Office: 112 Skillman Library
p: (610) 330-5632
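
For readers hitting the same error: loginctl resolves the account through the system's NSS lookup (getpwnam), so "Failed to look up user ... No such process" usually means the node could not resolve that user at that moment, for example because a directory service (LDAP/SSSD) did not return the account. A quick check to run on the failing node, using weissp from this thread:

# If either command fails, the node cannot resolve the account via
# NSS (local files, LDAP, SSSD, ...), which matches the loginctl error.
getent passwd weissp
echo "getent exit status: $?"
id weissp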
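
If the lookup only fails for certain accounts or only transiently (e.g., a cold directory-service cache when the prolog fires), one workaround is a more defensive prolog that retries the lookup before calling loginctl, so a failed lookup does not immediately fail the job and drain the node. A minimal sketch, assuming the same SLURM_JOB_USER environment as the prolog quoted above; the retry count and sleep interval are arbitrary:

#!/bin/bash
# Retry the NSS lookup a few times before enabling lingering, so a
# slow or cold directory-service lookup does not fail the prolog.
user="$SLURM_JOB_USER"

for attempt in 1 2 3 4 5; do
    getent passwd "$user" > /dev/null && break
    sleep 1
done

if ! getent passwd "$user" > /dev/null; then
    echo "prolog: cannot resolve user '$user' via NSS" >&2
    exit 1    # prolog failure still kills the job, as in the log above
fi

loginctl enable-linger "$user"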