[slurm-users] Grid engine slaughtering parallel jobs when any one of them fails (copy)

2021-04-16 Thread Robert Peck
Excuse me, I am trying to run some software on a cluster which uses the SLURM grid engine. IT support at my institution have exhausted their knowledge of SLURM in trying to debug this rather nasty bug with a specific feature of the grid engine and suggested I try here for tips. I am using jobs of ...

Re: [slurm-users] Jobs that may still be running at X time?

2021-04-16 Thread Ryan Novosielski
I knew we weren’t alone! Thanks, Juergen! If the scheduling engine were slightly better for reservations (e.g. “Third Tuesday” type stuff), it would probably happen a little less often. I know it’s sort of getting there.

Re: [slurm-users] Jobs that may still be running at X time?

2021-04-16 Thread Juergen Salk
* Ryan Novosielski [210416 21:33]: > Does anyone have a particularly clever way, either built-in or > scripted, to find out which jobs will still be running at > such-and-such time? Hi Ryan, coincidentally, I just did this today, for exactly the same reason. squeue does have a "%L" format option ...
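
A minimal sketch of one way to turn that into a report, using squeue's %e (expected end time) field rather than the %L (time left) field mentioned above; the cutoff value and field choice are illustrative, not taken from the original mail:

    # List running jobs whose expected end time (based on their time limit)
    # falls after a chosen cutoff; ISO-8601 timestamps compare correctly as strings.
    CUTOFF="2021-04-20T08:00:00"   # placeholder cutoff
    squeue --noheader --states=RUNNING --format="%i %u %e" \
        | awk -v c="$CUTOFF" '$3 > c'

Jobs with no time limit have no meaningful expected end time and need to be handled separately.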

Re: [slurm-users] Grid engine slaughtering parallel jobs when any one of them fails (copy)

2021-04-16 Thread Renfro, Michael
I can't speak to what happens on node failure, but I can at least get you a greatly simplified pair of scripts that will run only one copy on each node allocated:

    #!/bin/bash
    # notarray.sh
    #SBATCH --nodes=28
    #SBATCH --ntasks-per-node=1
    #SBATCH --no-kill
    echo "notarray.sh is running on $(hostname)" ...
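
Not part of the quoted message, but for completeness: the payload inside such a script is typically launched with srun, and srun's --kill-on-bad-exit=0 keeps the surviving tasks running when one of them exits non-zero, complementing sbatch's --no-kill handling of failed nodes. worker.sh below is a placeholder:

    # hypothetical payload line: one task per allocated node,
    # and do not tear down the step if a single task fails
    srun --kill-on-bad-exit=0 ./worker.sh

The whole thing is then submitted once with "sbatch notarray.sh" instead of as a job array.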

[slurm-users] Jobs that may still be running at X time?

2021-04-16 Thread Ryan Novosielski
Hi there, Does anyone have a particularly clever way, either built-in or scripted, to find out which jobs will still be running at such-and-such time? I bet anyone who’s made the mistake of not entering a maintenance reservation soon enough knows the feeling. I know that jobs /may/ end earlier ...
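
For readers landing here from a search, a rough sketch of entering such a maintenance reservation ahead of time with scontrol; the name, start time and duration are placeholders:

    # Reserve all nodes for a 12-hour maintenance window; the scheduler will
    # then hold back jobs that cannot finish before the window opens.
    scontrol create reservation reservationname=maint_example \
        starttime=2021-05-18T08:00:00 duration=12:00:00 \
        nodes=ALL users=root flags=maint,ignore_jobs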

Re: [slurm-users] Grp* Resource Limits on User Associations

2021-04-16 Thread Juergen Salk
* Matthias Leopold [210416 19:35]: > can someone please explain to me why it's possible to set Grp* resource > limits on user associations? What's the use for this? Hi Matthias, this probably does not fully answer your question, but Grp* limits on user associations provide the ability to impose ...
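
As a concrete illustration (the user, account and limit values are invented), a Grp* limit placed on a single user/account association with sacctmgr looks like this:

    # Cap the aggregate resources of all of alice's running jobs under
    # project1, independent of how many individual jobs she submits.
    sacctmgr modify user where name=alice account=project1 \
        set GrpTRES=cpu=64,mem=256G GrpJobs=20

Because the limit sits on that one association rather than on the account, it caps this user's aggregate usage without touching other users in the same account.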

[slurm-users] Grp* Resource Limits on User Associations

2021-04-16 Thread Matthias Leopold
Hi, can someone please explain to me why it's possible to set Grp* resource limits on user associations? What's the use for this? As far as I understood the documentation, accounts can have children, but users cannot. I'm still a newbie exploring Slurm in a test environment, so please excuse possibly stupid questions ...

[slurm-users] Slurm reservation for migrating user home directories

2021-04-16 Thread Ole Holm Nielsen
I need to migrate several sets of user home directories from an old NFS file server to a new NFS file server. Each group of users belongs to specific Slurm accounts organized in a hierarchical tree. I want to make the migration while the cluster is in full production mode for all the other accounts ...

Re: [slurm-users] Slurm reservation for migrating user home directories

2021-04-16 Thread Ole Holm Nielsen
Hi Niels Carl, On 16-04-2021 14:41, Niels Carl Hansen wrote: For each account do "sacctmgr modify account name=<account> set GrpJobs=0". After sync'ing, resume with "sacctmgr modify account name=<account> set GrpJobs=-1". Yes, but this would block all jobs from <account> immediately. If this account had a week- ...
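
Spelled out as a sketch (account names are placeholders, and -i just makes sacctmgr commit without prompting):

    # stop the affected accounts from starting new jobs during the copy
    for acct in dept_a dept_b; do
        sacctmgr -i modify account name=$acct set GrpJobs=0
    done

    # ... migrate the home directories, then clear the limit again
    for acct in dept_a dept_b; do
        sacctmgr -i modify account name=$acct set GrpJobs=-1
    done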

Re: [slurm-users] Slurm reservation for migrating user home directories

2021-04-16 Thread Tina Friedrich
Had to do home directory migrations a couple of times without 'full' downtimes. Similar process, only I don't think we ever bothered disabling users in LDAP or blocking their jobs. Generally, we told them we'd move their directory at time X and would they please log out everywhere; at time X, we ...

Re: [slurm-users] Slurm reservation for migrating user home directories

2021-04-16 Thread Ward Poelmans
Hi Ole, On 16/04/2021 14:23, Ole Holm Nielsen wrote: > Question: Does anyone have experience with this type of scenario? Any > good ideas or suggestions for other methods for data migration? We once did something like that. Basically it worked like this: - Process is kicked off per user ...

Re: [slurm-users] AutoDetect=nvml throwing an error message

2021-04-16 Thread Stephan Roth
Hi Cristóbal, under Debian Stretch/Buster I had to set LDFLAGS=-L/usr/lib/x86_64-linux-gnu/nvidia/current for configure to find the NVML shared library. Best, Stephan On 15.04.21 19:46, Cristóbal Navarro wrote: Hi Michael, thanks, indeed I don't have it; Slurm must not have detected it. I do ...
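
A rough sketch of that build step (the library path is the Debian one quoted above; other distributions ship libnvidia-ml.so elsewhere):

    # point the linker at the driver's NVML library before configuring Slurm
    export LDFLAGS="-L/usr/lib/x86_64-linux-gnu/nvidia/current"
    ./configure
    # confirm NVML was actually found before building
    grep -i nvml config.log

Only a slurmd built with NVML support will honour AutoDetect=nvml in gres.conf.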

Re: [slurm-users] derived counters

2021-04-16 Thread Ole Holm Nielsen
Hi Jürgen, On 4/13/21 6:29 PM, Juergen Salk wrote: * Heckes, Frank [210413 12:04]: This results from a management question: how long do jobs have to wait (in s, min, h, days) before they get executed, and how many jobs are waiting (queued) for each partition in a certain time interval. The f ...
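
One hedged way to pull the raw numbers for such a report out of the accounting database (the date range and field list are only an example) is sacct; the wait time is simply Start minus Submit, which can then be aggregated per partition:

    # per-job submit and start times for a window, parsable and header-free
    sacct -a -X -P -n -S 2021-04-01 -E 2021-04-14 \
        -o JobID,Partition,Submit,Start,State

How many jobs were queued in a partition at any given moment can be derived from the same data, since each job is pending between its Submit and Start timestamps.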