I don’t think you should expect this from overlapping nodes in partitions, but 
rather when you’re allowing the hardware itself to be oversubscribed.

Was your upgrade in this window?

I would suggest looking for runaway jobs, which you’ve already done; beyond 
that, I’m not sure what else to check.

--
#BlackLivesMatter
____
|| \\UTGERS,     |---------------------------*O*---------------------------
||_// the State  |         Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
     `'

On Sep 18, 2024, at 23:25, Sajesh Singh via slurm-users 
<slurm-users@lists.schedmd.com> wrote:

OS: CentOS 8.5
Slurm: 22.05

Recently upgraded to 22.05. Upgrade was successful, but after a while I started 
to see the following messages in the slurmdbd.log file:

error: We have more time than is possible (9344745+7524000+0)(16868745) > 
12362400 for cluster CLUSTERNAME(3434) from 2024-09-18T13:00:00 - 
2024-09-18T14:00:00 tres 1 (this may happen if oversubscription of resources is 
allowed without Gang)
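The numbers in that message appear to be consistent with a simple capacity check: the bound slurmdbd compares against looks like the cluster’s TRES (CPU) count times the length of the rollup window. This is a sketch of that arithmetic under the assumption that 3434 in “CLUSTERNAME(3434)” is the CPU count; the exact breakdown of the three summed terms varies by Slurm version, so the labels below are only an interpretation:

```python
# Sketch: the "more time than is possible" bound for the 13:00-14:00 hour.
# Assumption: 3434 (from "CLUSTERNAME(3434)") is the cluster's CPU/TRES count.
cpus = 3434
window_seconds = 3600            # 2024-09-18T13:00:00 -> 2024-09-18T14:00:00
possible = cpus * window_seconds
print(possible)                  # 12362400, matching the log's upper bound

# The three summed terms from the log (their exact meaning is an assumption;
# they are time components tracked by the hourly rollup):
recorded = 9344745 + 7524000 + 0
print(recorded)                  # 16868745 > 12362400, hence the error
```

If the recorded total exceeds what the hardware could physically provide in that window, slurmdbd emits this error; that is why it suggests oversubscription without gang scheduling as a likely cause.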

We do have partitions with overlapping nodes, but we do not have “Suspend,Gang” 
set as the global PreemptMode; it is currently set to requeue.
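For reference, a minimal sketch of the slurm.conf settings involved here; the partition and node names are hypothetical, and only PreemptMode/OverSubscribe reflect what is described above:

```
# slurm.conf (sketch; partition/node names are hypothetical)
PreemptMode=REQUEUE              # current global setting, per the above

# Overlapping partitions sharing the same nodes, oversubscription disabled
PartitionName=short Nodes=node[01-10] OverSubscribe=NO
PartitionName=long  Nodes=node[01-10] OverSubscribe=NO
```

With OverSubscribe=NO on every partition, overlapping node lists alone should not allow two jobs to hold the same CPU simultaneously, which is what makes the accounting error surprising in this setup.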

I have also checked sacct and there are no runaway jobs listed.

Oversubscription is not enabled on any of the queues either.

Do I need to modify my Slurm config to address this, or is this an error 
condition caused by the upgrade?

Thank you,

SS




--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com

