Re: [slurm-users] [EXT] Jobs Immediately Fail for Certain Users

2020-07-07 Thread Christopher Samuel
On 7/7/20 5:57 pm, Jason Simms wrote:
> Failed to look up user weissp: No such process

That looks like the user isn't known to the node. What do these say:

    id weissp
    getent passwd weissp

Which version of Slurm is this?

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Ber...
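A minimal way to run those checks on the node itself (the node name cn01 below is only a placeholder; weissp is the affected user from the thread):

    # On the compute node: both commands should print the user's uid/gid and
    # passwd entry; "no such user" here confirms the node cannot resolve weissp.
    ssh cn01 'id weissp; getent passwd weissp'

    # Same checks on the login node, for comparison:
    id weissp
    getent passwd weissp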

Re: [slurm-users] [EXT] Jobs Immediately Fail for Certain Users

2020-07-07 Thread Jason Simms
Now that is interesting. If I do:

    loginctl enable-linger weissp

then I get the following error:

    Failed to look up user weissp: No such process

This is one of the users that always fails. But if I run it for myself with:

    loginctl enable-linger simmsj

everything works (as expected). Any though...
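That "Failed to look up user" message comes from the host's name-service lookup, so a few NSS checks narrow it down. A rough sketch, assuming the site resolves users through sssd (that part is an assumption, not stated in the thread):

    getent passwd weissp                # does NSS resolve the user at all?
    id weissp                           # same question via the uid/gid lookup path
    systemctl status sssd               # is the lookup daemon running on this host?
    grep '^passwd' /etc/nsswitch.conf   # is sss (or ldap) listed for passwd lookups?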

Re: [slurm-users] [EXT] Jobs Immediately Fail for Certain Users

2020-07-07 Thread Sean Crosby
Hi Jason,

What happens when you try to run that command on the node? Is the exit status of the command 0? E.g. for my servers, where lingering is masked, I get:

    [root@thespian-gpgpu001 ~]# loginctl enable-linger scrosby
    Could not enable linger: Unit is masked.
    [root@thespian-gpgpu001 ~]# echo $?
    ...
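For reference, a quick way to capture the exit status being asked about (run on the node, with the affected user from the thread):

    loginctl enable-linger weissp
    echo $?    # 0 = success; anything else (e.g. masked unit, unknown user) = failure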

Re: [slurm-users] [EXT] Weird issues with slurm's Priority

2020-07-07 Thread Sean Crosby
On Wed, 8 Jul 2020 at 00:47, zaxs84 wrote:
> Hi Sean,
> thank you very much for your reply.
>
> > If a lower priority job can start AND finish before the resources a ...

Re: [slurm-users] Allow certain users to run over partition limit

2020-07-07 Thread Sebastian T Smith
Hi,

We use Job QOS and Resource Reservations for this purpose. QOS is a good option for a "permanent" change to a user's resource limits. We use reservations similarly to how you're currently using partitions: to "temporarily" provide a resource boost without the complexities of re-partitioning...
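A rough sketch of both approaches; the names and limits below are placeholders, and the PartitionTimeLimit flag is what lets the QOS wall time override the partition's limit (check the sacctmgr man page for your Slurm version):

    # 1) "Permanent" bump via a QOS:
    sacctmgr add qos longrun
    sacctmgr modify qos where name=longrun set MaxWall=72:00:00 Flags=PartitionTimeLimit
    sacctmgr modify user where name=bob set qos+=longrun
    # users then submit with:  sbatch --qos=longrun --time=72:00:00 job.sh

    # 2) "Temporary" boost via a reservation on a few nodes:
    scontrol create reservation reservationname=boost users=bob \
        starttime=now duration=7-00:00:00 nodes=node[01-04]
    # users then submit with:  sbatch --reservation=boost job.sh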

[slurm-users] Allow certain users to run over partition limit

2020-07-07 Thread Matthew BETTINGER
Hello, We have a Slurm system with partitions set for a max runtime of 24 hours. What would be the proper way to allow a certain set of users to run jobs on the current partitions beyond the partition limits? In the past we would isolate some nodes based on their job requirements, make a new pa...

[slurm-users] Automatically stop low priority jobs when submitting high priority jobs

2020-07-07 Thread zaxs84
Hi all. Is there a scheduler option that allows low-priority jobs to be immediately paused (or even stopped) when jobs with higher priority are submitted? Related to this, I am also a bit confused about how "scontrol suspend" works; my understanding is that a job that gets suspended rece...
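Slurm's answer to the first question is preemption; a minimal slurm.conf sketch (values are illustrative, not from the poster's config), plus the manual commands for a single job:

    # slurm.conf (illustrative):
    PreemptType=preempt/qos        # or preempt/partition_prio
    PreemptMode=SUSPEND,GANG       # suspend preempted jobs; GANG lets them resume later

    # Manual equivalent for one job (the job id is a placeholder):
    scontrol suspend 12345
    scontrol resume 12345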

[slurm-users] Jobs Immediately Fail for Certain Users

2020-07-07 Thread Jason Simms
Hello all,

Two users on my system experience job failures every time they submit a job via sbatch. When I run their exact submission script, or when I create a local system user and launch from there, the jobs run fine. Here is an example of what I see in the slurmd log:

    [2020-07-06T15:02:41.284] ...
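The full slurmd message on the node usually names the failing lookup; the log path below is only a common default, so confirm it first:

    scontrol show config | grep -i SlurmdLogFile    # where this node actually logs
    sudo grep -i weissp /var/log/slurmd.log         # path is an assumption; use the value above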

Re: [slurm-users] [EXT] Weird issues with slurm's Priority

2020-07-07 Thread zaxs84
Hi Sean,

thank you very much for your reply.

> If a lower priority job can start AND finish before the resources a higher priority job requires are available, the backfill scheduler will start the lower priority job.

That's very interesting, but how can the scheduler predict how long a low-priori...
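For what it's worth, backfill planning is based on each job's requested time limit rather than any prediction, so the accuracy of --time matters. For example:

    # The scheduler plans around the limit requested at submit time:
    sbatch --time=00:30:00 --wrap="srun ./short_task"   # ./short_task is a placeholder

    # Show each job's requested time limit and remaining time:
    squeue -o "%.10i %.9P %.8u %.10l %.10L %R"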

Re: [slurm-users] [EXT] Weird issues with slurm's Priority

2020-07-07 Thread Sean Crosby
Hi,

What you have described is how the backfill scheduler works. If a lower priority job can start AND finish before the resources a higher priority job requires are available, the backfill scheduler will start the lower priority job. Your high priority job requires 24 cores, whereas the lower pr...
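To see (and tune) how backfill is configured on a given cluster, the relevant settings are SchedulerType and SchedulerParameters; a read-only check plus an illustrative config line (the parameter values are examples, not recommendations):

    scontrol show config | grep -Ei 'schedulertype|schedulerparameters'

    # slurm.conf (illustrative):
    SchedulerType=sched/backfill
    SchedulerParameters=bf_window=1440,bf_interval=30,bf_continue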

[slurm-users] Weird issues with slurm's Priority

2020-07-07 Thread zaxs84
Hi all. We want to achieve a simple thing with Slurm: launch "normal" jobs, and be able to launch "high priority" jobs that run as soon as possible; that's all. However, we cannot achieve this reliably: our current config sometimes works, sometimes not, and this is driving us c...
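One common pattern for this (a sketch under assumptions, not the poster's actual setup) is a dedicated high-priority QOS that is allowed to preempt the normal one:

    sacctmgr add qos high
    sacctmgr modify qos where name=high set priority=1000 preempt=normal
    # with PreemptType=preempt/qos in slurm.conf, urgent work is then submitted as:
    sbatch --qos=high job.sh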