Re: [slurm-users] Very large job getting starved out

2019-03-21 Thread Cyrus Proctor
Hi David, You might have a look at the thread "Large job starvation on cloud cluster" that started on Feb 27; there are some good tidbits in there. Off the top of my head, without more information, I would venture that settings you have in slurm.conf end up backfilling the smaller jobs at the expense of sch…
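For reference, the backfill behavior is governed by SchedulerParameters in slurm.conf; a minimal sketch with illustrative values, not anything taken from David's actual configuration:

    SchedulerType=sched/backfill
    # Look far enough into the future to reserve a start slot for the
    # large job (bf_window is in minutes; match your longest walltime):
    SchedulerParameters=bf_window=20160,bf_resolution=300,bf_max_job_test=500,bf_continue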

Re: [slurm-users] problems with slurm and openmpi

2019-03-12 Thread Cyrus Proctor
Both your Slurm and OpenMPI config.logs would be helpful in debugging here. Throw in your slurm.conf as well for good measure. Also, what type of system are you running, what type of high-speed fabric are you trying to run on, and what does your driver stack look like? I know the feeling and will…
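As a first pass at collecting that, something along these lines on the cluster in question (standard Slurm and Open MPI utilities, nothing exotic):

    # Which MPI plugins this Slurm build knows about:
    srun --mpi=list
    scontrol show config | grep -i -e mpi -e switch
    # How Open MPI itself was configured:
    ompi_info | head -n 40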

Re: [slurm-users] sacct end time for failed jobs

2019-03-06 Thread Cyrus Proctor
Hi Brian, Others probably have better suggestions before going the route I'm about to detail. If you do go this route, be warned: you definitely have the ability to irrevocably lose data or destroy your Slurm accounting database. Do so at your own risk. I got here with Google-foo after being ou…
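Before going anywhere near that route, take a dump of the accounting database so there is a way back; a sketch, assuming the default database name slurm_acct_db and a MySQL/MariaDB backend:

    # Safety net first -- direct edits to the accounting DB are unforgiving:
    mysqldump slurm_acct_db > slurm_acct_db_$(date +%F).sql
    # Then inspect the suspect records read-only before changing anything:
    sacct --state=FAILED --starttime=2019-03-01 -o JobID,State,Start,End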

Re: [slurm-users] NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ???

2019-02-08 Thread Cyrus Proctor
Xiang, From what I've seen of the original question, gres.conf may be another place to verify the setup so that only one core is being allocated per GPU request: https://slurm.schedmd.com/gres.conf.html Seeing the job submission line and gres.conf might help others give you further advice. To Jeff…
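For illustration, a gres.conf stanza of the sort in question (node name, device paths, and core ranges are hypothetical):

    # gres.conf on the compute node: bind each GPU to a specific core range
    NodeName=gpu01 Name=gpu File=/dev/nvidia0 Cores=0-7
    NodeName=gpu01 Name=gpu File=/dev/nvidia1 Cores=8-15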

Re: [slurm-users] Configuration recommendations for heterogeneous cluster

2019-01-23 Thread Cyrus Proctor
…as soon as it goes into the pending state, they scancel it, change the partition name to a less utilized partition, and resubmit it in the hopes it will start running immediately. Yes, there needs to be a lot of user training, and there's a lot I can do to improve the en…
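Worth noting in passing: the scancel/resubmit dance is avoidable, since a pending job's partition can be changed in place, and a job can be submitted to several partitions at once and run in whichever starts it first (job ID and partition names below are made up):

    # Move a pending job to another partition without resubmitting:
    scontrol update JobId=12345 Partition=short
    # Or submit to multiple partitions up front:
    sbatch --partition=normal,short job.sh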

Re: [slurm-users] Configuration recommendations for heterogeneous cluster

2019-01-22 Thread Cyrus Proctor
Hi Prentice, Have you considered Slurm features and constraints at all? You define features (arbitrary strings in your slurm.conf) describing what your hardware provides ("amd", "ib", "FAST", "whatever"). A user then lists constraints using the usual and/or notation (--constraint="amd&ib").
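A minimal sketch of the two halves, reusing the example strings above (node names and counts are placeholders):

    # slurm.conf: advertise what each node offers
    NodeName=amd[01-16] CPUs=32 Feature=amd,ib
    # user side: only run on nodes satisfying the expression
    sbatch --constraint="amd&ib" job.sh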

[slurm-users] Held jobs age priority accrual

2018-10-06 Thread Cyrus Proctor
Hello, We've recently made the transition from version 17.11.5 up to 18.08.0. Anecdotally, we think we're seeing a change in behavior regarding the priority of held (user or admin) jobs. For discussion, take the example where a user submits a job, it waits for a day in the queue, the user doe…
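For anyone comparing notes, the knob that governs this in the multifactor plugin is PriorityFlags; a sketch of the relevant slurm.conf lines (whether this restores the 17.11 behavior is exactly the open question here):

    PriorityType=priority/multifactor
    # Keep accruing age priority even while a job is held or dependent:
    PriorityFlags=ACCRUE_ALWAYS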

Re: [slurm-users] Transparently assign different walltime limit to a group of nodes ?

2018-08-13 Thread Cyrus Proctor
…the real-time partition becomes busy. I think this would be a nice solution that does not involve job preemption. Cheers, Cyrus

Re: [slurm-users] Transparently assign different walltime limit to a group of nodes ?

2018-08-13 Thread Cyrus Proctor
Hi Jens, Check out https://slurm.schedmd.com/reservations.html specifically the "Reservations Floating Through Time" section. In your case, set a walltime limit of 14 days for the partition that contains n[01-10]. Then, create a floating reservation on nodes n[06-10] for n + 1 day, where "n" is always…
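A sketch of that floating reservation, following the pattern in the documentation linked above (reservation name and user are placeholders):

    # TIME_FLOAT keeps the start time one day ahead of "now" indefinitely,
    # so jobs on n[06-10] can never request more than ~1 day of walltime:
    scontrol create reservation reservationname=walltime_cap \
        users=operator nodes=n[06-10] \
        starttime=now+1day duration=UNLIMITED flags=TIME_FLOAT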