Hi David,
You might have a look at the thread "Large job starvation on cloud cluster"
that started on Feb 27; there are some good tidbits in there. Off the top of
my head, without more information, I would venture that the settings you have
in slurm.conf end up backfilling the smaller jobs at the expense of sch
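For reference, the knobs that usually matter here live in the scheduler and
priority sections of slurm.conf. A minimal sketch, with illustrative values
that are assumptions on my part rather than recommendations:

    SchedulerType=sched/backfill
    # widen the backfill window to cover your longest job walltime (minutes)
    SchedulerParameters=bf_window=10080,bf_resolution=600,bf_continue,bf_max_job_user=50
    PriorityType=priority/multifactor
    # don't bias toward small jobs; give job size and age some weight
    PriorityFavorSmall=NO
    PriorityWeightJobSize=1000
    PriorityWeightAge=1000

If bf_window is shorter than your large jobs' walltimes, backfill can keep
slipping small jobs in ahead of them.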
Both your Slurm and OpenMPI config.logs would be helpful in debugging
here. Throw in your slurm.conf as well for good measure. Also, what type
of system are you running, what type of high-speed fabric are you trying
to run on, and what does your driver stack look like?
I know the feeling and will
Hi Brian,
Others probably have better suggestions to try before going the route I'm about
to detail. If you do go this route, be warned: you can irrevocably lose data or
destroy your Slurm accounting database. Proceed at your own risk. I got here
with Google-fu after being ou
Xiang,
From what I've read of the original question, gres.conf may be another place
to verify the setup, given that only one core is being allocated per GPU
request: https://slurm.schedmd.com/gres.conf.html
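As a sketch of what to look for (the device paths, core ranges, and GPU type
below are assumptions, not your actual config):

    # gres.conf on the GPU node: bind each GPU to a block of cores
    Name=gpu Type=v100 File=/dev/nvidia0 Cores=0-7
    Name=gpu Type=v100 File=/dev/nvidia1 Cores=8-15

    # and request more than the default single core alongside the GPU
    srun --gres=gpu:1 --cpus-per-task=8 ./my_app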
Seeing the job submission line and gres.conf might help others give you further
advice.
To Jeff
> as soon as it goes into the pending state, they
> scancel it, change the partition name to a less utilized partition,
> and resubmit it in the hopes it will start running immediately.
>
> Yes, there needs to be a lot of user training, and there's a lot I can
> do to improve the en
Hi Prentice,
Have you considered Slurm features and constraints at all? You define
features (arbitrary strings in your slurm.conf) describing what your hardware
can provide ("amd", "ib", "FAST", whatever). A user then lists
constraints using the usual and/or notation ( --constraint="amd&ib" ).
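A quick sketch of how that looks (node names and feature strings here are made
up for illustration):

    # slurm.conf: tag nodes with arbitrary feature strings
    NodeName=cn[01-10] CPUs=64 Feature=amd,ib
    NodeName=cn[11-20] CPUs=64 Feature=intel,ib

    # job submission: ask for nodes that have both features
    sbatch --constraint="amd&ib" job.sh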
Hello,
We've recently made the transition from version 17.11.5 up to 18.08.0.
Anecdotally, we think we're seeing a change in behavior regarding the
priority of held (user or admin) jobs. For discussion, take the example
where a user submits a job, it waits for a day in the queues, the user
doe
e the real time partition
becomes busy. I think this would be a nice solution that does not
involve job preemption.
Cheers,
Cyrus
On 08/13/2018 11:20 AM, Jens Dreger wrote:
Hi Cyrus!
On Mon, Aug 13, 2018 at 08:44:15AM -0500, Cyrus Proctor wrote:
Hi Jens,
Check out https://slurm.schedmd.com/reservations.html, specifically the
"Reservations Floating Through Time" section. In your case, set a
walltime of 14 days for the partition that contains n[01-10]. Then,
create a floating reservation on nodes n[06-10] for n + 1 day, where "n"
is alw
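If it helps, the general shape of such a reservation looks something like the
following (the exact names, times, and duration are assumptions on my part;
check the scontrol man page for your version):

    scontrol create reservation reservationname=short_only \
        users=root nodes=n[06-10] flags=time_float \
        starttime=now+1days duration=14-00:00:00

With the TIME_FLOAT flag the start time keeps sliding forward relative to
"now", so jobs whose walltime would overlap the reservation (here, anything
longer than one day) never start on n[06-10], while shorter jobs still can.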