The upgrade was a couple of hours prior to the messages appearing in the logs.
SS
From: Ryan Novosielski
Sent: Thursday, September 19, 2024 12:08:42 AM
To: Sajesh Singh
Cc: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] SlurmDBD errors
OS: CentOS 8.5
Slurm: 22.05
Recently upgraded to 22.05. Upgrade was successful, but after a while I started
to see the following messages in the slurmdbd.log file:
error: We have more time than is possible (9344745+7524000+0)(16868745) >
12362400 for cluster CLUSTERNAME(3434) from 2024-09-18T13
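For readers hitting the same message: a common first check (an assumption on my part, not a step confirmed in this thread) is to look for runaway jobs, since jobs the database still considers running can inflate the rolled-up usage that slurmdbd compares against the cluster's wall-clock capacity:
# List jobs the accounting database still thinks are running; if any are
# found, sacctmgr offers to fix them, which triggers a usage re-rollup.
sacctmgr show runawayjobs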
Sajesh Singh wrote:
Some additional information after enabling debug3 on the slurmctld daemon:
Logs show that there are enough usable nodes for the job:
[2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-11
[2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-71
But then the following line is in the log as well:
debug3: select_nodes: JobId=67171529 required nodes not avail
--
-Sajesh-
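For reference, the debug3 output quoted above can be captured without restarting the controller; a minimal sketch using the standard scontrol debug levels:
# Raise slurmctld logging to debug3 at runtime, reproduce the submission,
# then drop back to the default level.
scontrol setdebug debug3
scontrol setdebug info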
From: slurm-users On Behalf Of Sajesh Singh
Sent: Thursday, March 25, 2021 9:02 AM
To: Slurm User Community List
Subject: Re: [slurm-users] Limit on number of nodes user able to request
Sent: Wednesday, March 24, 2021 11:02 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] Limit on number of nodes user able to request
Do 'sinfo -R' and see if you have any down or drained nodes.
Brian Andrus
On 3/24/2021 6:31 PM, Sajesh Singh wrote:
Slurm 20.02
CentOS 8
I recently noticed a strange behavior when using the powersave plugin for
bursting to AWS. I have a queue configured with 60 nodes, but if I submit a job
to use all of the nodes I get the error:
(Nodes required for job are DOWN, DRAINED or reserved for jobs in higher
priority partitions)
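Tying this to the 'sinfo -R' suggestion above, a minimal sketch of checking and clearing stuck cloud nodes (the node name range is hypothetical):
# Show down/drained nodes and the recorded reason.
sinfo -R
# Return a stuck cloud node to service so the powersave plugin can resume it.
scontrol update NodeName=node-[01-60] State=RESUME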
Brian,
Thank you for the reply. Quite strange, as I installed from RPM, but the shell
is set to /sbin/nologin. I will change it and see if the cloud scheduling works
as expected.
--
Sajesh Singh
From: slurm-users On Behalf Of Brian Andrus
Sent: Tuesday, March 9, 2021 8:45 PM
To: slurm-users, Sajesh Singh
public IP address of the controller, it may
be simpler to use only the public IP for the controller, but I don't know how
your routing is set up.
HTH
- Michael
On Fri, Jan 22, 2021 at 11:26 AM Sajesh Singh <ssi...@amnh.org> wrote:
How would I deal with the address of the head node?
Sent: Friday, January 22, 2021 1:45 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] Cluster nodes on multiple cluster networks
You would need to have a direct connect/vpn so the cloud nodes can connect to
your head node.
Brian Andrus
On 1/22/2021 10:37 AM, Sajesh Singh wrote:
We are looking at rolling out cloud bursting to our on-prem Slurm cluster and I
am wondering how to deal with the slurm.conf variable SlurmctldHost. It is
currently configured with the private cluster network address that the on-prem
nodes use to contact it. The nodes in the cloud would contact it via a
different, public-facing address.
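Picking up Michael's point about using the public IP: slurm.conf allows an explicit address next to the controller's hostname, so one option (a sketch only; the hostname and address are placeholders, and it assumes the on-prem nodes can also route to that address) is:
# slurm.conf: the optional address in parentheses is what slurmd and the
# client commands use to reach slurmctld.
SlurmctldHost=headnode(203.0.113.10)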
they are making, so they aren't surprised. Also, avoid network mounts on
nodes. Performance takes a big hit when you have that going over a
direct-connect or VPN.
Brian Andrus
On 12/15/2020 12:02 PM, Sajesh Singh wrote:
We are currently investigating the use of the cloud scheduling features within
an on-site Slurm installation, and I was wondering if anyone had any experiences
they wish to share from trying to use this feature. In particular, I am
interested to know:
https://slurm.schedmd.com/elastic_computing.
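For anyone evaluating the same feature, the knobs involved live in slurm.conf; a minimal sketch (program paths, timings, node names, and sizes are all placeholders, not values from this site):
# Power-saving / cloud-bursting basics.
SuspendProgram=/usr/local/sbin/node_suspend.sh   # tears down the cloud instance
ResumeProgram=/usr/local/sbin/node_resume.sh     # provisions and boots it
SuspendTime=600                                  # seconds idle before suspend
ResumeTimeout=900                                # seconds allowed for boot
NodeName=cloud-[001-060] State=CLOUD CPUs=16 RealMemory=63000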
On 10/8/20 3:48 pm, Sajesh Singh wrote:
>Thank you. Looks like the fix is indeed the missing file
> /etc/slurm/cgroup_allowed_devices_file.conf
No, you don't want that; that will allow access to all GPUs whether people have
requested them or not.
What you want is i
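The reply is cut off here; assuming it was heading toward cgroup-based device confinement (my reading, not confirmed by the quoted text), the relevant setting would look like:
# cgroup.conf (assumes TaskPlugin=task/cgroup is enabled in slurm.conf):
# jobs only see the GPU device files they were allocated.
ConstrainDevices=yes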
Subject: Re: [slurm-users] CUDA environment variable not being set
Hi Sajesh,
On 10/8/20 11:57 am, Sajesh Singh wrote:
> debug: common_gres_set_env: unable to set env vars, no device files
> configured
I suspect the clue is here - what does your gres.conf look like?
Does it list the devices in question?
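For comparison, a gres.conf that does list the device files might look like this (node name and device paths are illustrative, not taken from the thread):
# gres.conf on the GPU node: tie the gpu GRES to its /dev entries so
# Slurm can set CUDA_VISIBLE_DEVICES for the allocated devices.
NodeName=gpunode01 Name=gpu File=/dev/nvidia[0-1]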
nodes also?
Brian Andrus
On 10/8/2020 11:57 AM, Sajesh Singh wrote:
Slurm 18.08
CentOS 7.7.1908
I have 2 M500 GPUs in a compute node which is defined in the slurm.conf and
gres.conf of the cluster, but if I launch a job requesting GPUs the environment
variable CUDA_VISIBLE_DEVICES is never set.
I only get a line returned for “Gres=”, but this is the same behavior on
another cluster that has GPUs, and the variable gets set on that cluster.
-Sajesh-
--
Sajesh Singh
Manager, Systems and Scientific Computing
American Museum of Natural History
there's no driver installed.
Relu
On 2020-10-08 14:57, Sajesh Singh wrote:
Slurm 18.08
CentOS 7.7.1908
I have 2 M500 GPUs in a compute node which is defined in the slurm.conf and
gres.conf of the cluster, but if I launch a job requesting GPUs the environment
variable CUDA_VISIBLE_DEVICES is never set, and I see the following messages in
the slurmd.log file:
debug: common_gres_set_env: unable to set env vars, no device files configured
We are deploying 2 compute nodes with Nvidia v100 GPUs and would like to use
the CUDA MPS feature. I am not sure where to get the number to use for
mps when defining the node in slurm.conf.
Any advice would be greatly appreciated.
Regards,
SS
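As I understand the MPS plumbing, the number is not read from the hardware at all: it is an arbitrary share count (commonly 100 shares per GPU) that jobs then request fractions of. A minimal sketch, with the node name and counts purely illustrative:
# slurm.conf
GresTypes=gpu,mps
NodeName=gpunode01 Gres=gpu:2,mps=200 CPUs=32 RealMemory=192000 State=UNKNOWN
# gres.conf on the node (200 shares spread evenly across the two GPUs)
Name=gpu File=/dev/nvidia[0-1]
Name=mps Count=200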
CentOS 7.7
Slurm 20.02
Having a bit of a time with jobs that are configured with a walltime of more
than 365 days. The job is accepted and runs, but the squeue -l output shows the
TIME_LIMIT as INVALID.
If I look at the job through scontrol it shows the correct TimeLimit.
Any ideas as to what could be causing this?
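A quick way to compare the two views mentioned above (the job ID is a placeholder):
# What slurmctld has stored vs. what squeue formats for display.
scontrol show job 12345 | grep -o "TimeLimit=[^ ]*"
squeue -j 12345 -O jobid,timelimit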
CentOS 7.7.1908
Slurm 18.08.8
When trying to run an interactive job I am getting the following error:
srun: error: task 0 launch failed: Slurmd could not connect IO
Checking the log file on the compute node I see the following error:
[2020-03-25T01:42:08.262] launch task 13.0 request from UID:1
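One common cause of "Slurmd could not connect IO" is the compute node being unable to open a connection back to srun on the submit host (firewall or name resolution); that is a general observation, not a confirmed diagnosis here. If a firewall is in the path, srun's callback ports can be pinned to a range the firewall permits:
# slurm.conf: restrict the ports srun listens on for I/O connections
# from slurmd (the range is illustrative).
SrunPortRange=60001-63000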
Barbara
On 5/6/19 5:52 PM, Sajesh Singh wrote:
Good day fellow Slurm users.
Coming from a PBSpro system which has the following variables to limit a job to
the resources it requested:
$enforce mem
$enforce cpuaverage
$enforce cpuburst
If a user exceeded any of the above limits their job would be terminated.
Looking through the slurm.conf man page, I have not been able to find equivalent settings.
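For what it's worth, the closest Slurm analogue to those PBS enforcement knobs is usually cgroup containment rather than after-the-fact job killing; a minimal sketch (my suggestion, not from the truncated reply above):
# slurm.conf
TaskPlugin=task/cgroup
# cgroup.conf: keep each job inside its requested memory and cores.
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
ConstrainCores=yes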
Fixed the problem. I had an incorrect config; the slurm.conf needed the
following entry, and all now works as expected:
AccountingStoragePort=7031
--
-SS-
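For context, that port has to match the one slurmdbd itself listens on; a sketch of the two sides (the host name is a placeholder, the port mirrors the entry above):
# slurm.conf
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=dbdhost
AccountingStoragePort=7031
# slurmdbd.conf
DbdPort=7031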
-Original Message-
From: Sajesh Singh
Sent: Thursday, January 17, 2019 12:26 PM
To: Slurm User Community List
Subject: RE: [slurm
for table names in mariadb, or used to be.
On 1/17/19, 11:07 AM, "slurm-users on behalf of Sajesh Singh" wrote:
Trying to set up accounting using the MySQL backend, and I am getting errors from
the slurmctld and slurm tools when trying to interact with the accounting
database. I tried starting in debug as well, but could not see anything else that
could point to what could be causing this issue. I have follow
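The message is cut off above; for reference, a minimal slurmdbd.conf sketch for a MySQL/MariaDB backend (every value here is a placeholder, not taken from this setup), plus running the daemon in the foreground to see the errors directly:
# slurmdbd.conf
DbdHost=localhost
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=CHANGEME
StorageLoc=slurm_acct_db
# Run in the foreground with extra verbosity while debugging.
slurmdbd -D -vvv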
We are currently investigating a switch to SLURM from PBS and I have a question
on the interoperability of two features. Would the implementation of federation
in SLURM affect any of the clusters' ability to burst to the cloud?
-SS-