[slurm-users] Re: SlurmDBD errors

2024-09-18 Thread Sajesh Singh via slurm-users
The upgrade was a couple of hours prior to the messages appearing in the logs. SS From: Ryan Novosielski Sent: Thursday, September 19, 2024 12:08:42 AM To: Sajesh Singh Cc: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] SlurmDBD errors EXTERNAL

[slurm-users] SlurmDBD errors

2024-09-18 Thread Sajesh Singh via slurm-users
OS: CentOS 8.5 Slurm: 22.05 Recently upgraded to 22.05. Upgrade was successful, but after a while I started to see the following messages in the slurmdbd.log file: error: We have more time than is possible (9344745+7524000+0)(16868745) > 12362400 for cluster CLUSTERNAME(3434) from 2024-09-18T13

Re: [slurm-users] Limit on number of nodes user able to request

2021-04-01 Thread Sajesh Singh
, Sajesh Singh wrote: Some additional information after enabling debug3 on slurmctld daemon: Logs show that there are enough usable nodes for the job: [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-11 [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config

Re: [slurm-users] Limit on number of nodes user able to request

2021-04-01 Thread Sajesh Singh
usable nodes from config containing node-71 But then the following line is in the log as well: debug3: select_nodes: JobId=67171529 required nodes not avail -- -Sajesh- From: slurm-users On Behalf Of Sajesh Singh Sent: Thursday, March 25, 2021 9:02 AM To: Slurm User Community List Subject: Re

Re: [slurm-users] Limit on number of nodes user able to request

2021-03-25 Thread Sajesh Singh
: Wednesday, March 24, 2021 11:02 PM To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] Limit on number of nodes user able to request EXTERNAL SENDER Do 'sinfo -R' and see if you have any down or drained nodes. Brian Andrus On 3/24/2021 6:31 PM, Sajesh Singh wrote: Slurm 20.02 C
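As the reply suggests, `sinfo -R` lists any down or drained nodes along with the recorded reason, and a drained node can be returned to service with scontrol. An illustrative session (node names and reason are placeholders, not taken from this thread):

```
$ sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Not responding       slurm     2021-03-24T18:31:02 node-[11-13]
$ scontrol update NodeName=node-[11-13] State=RESUME
```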

[slurm-users] Limit on number of nodes user able to request

2021-03-24 Thread Sajesh Singh
Slurm 20.02 CentOS 8 I just recently noticed a strange behavior when using the powersave plugin for bursting to AWS. I have a queue configured with 60 nodes, but if I submit a job to use all of the nodes I get the error: (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher
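For context, a minimal power-saving/cloud-bursting setup in slurm.conf looks roughly like the following sketch; node names, counts, and script paths are placeholders, not the poster's actual configuration:

```conf
# slurm.conf -- illustrative cloud-bursting fragment
SuspendProgram=/usr/local/sbin/slurm_suspend.sh
ResumeProgram=/usr/local/sbin/slurm_resume.sh
SuspendTime=300
ResumeTimeout=600
TreeWidth=60
NodeName=cloud-[01-60] CPUs=8 RealMemory=30000 State=CLOUD
PartitionName=cloud Nodes=cloud-[01-60] MaxTime=INFINITE State=UP
```

Nodes defined with `State=CLOUD` show up as powered-down (`idle~`) until resumed; a job asking for the whole partition can still be rejected if any of those nodes is stuck DOWN or DRAINED, which `sinfo -R` will reveal.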

Re: [slurm-users] Using Cloud bursting/ powersave options

2021-03-10 Thread Sajesh Singh
Brian, Thank you for the reply. Quite strange as I installed from RPM, but the shell is set to /sbin/nologin. I will change it and see if the cloud scheduling works as expected -- Sajesh Singh From: slurm-users On Behalf Of Brian Andrus Sent: Tuesday, March 9, 2021 8:45 PM To: slurm-users


[slurm-users] Using Cloud bursting/ powersave options

2021-03-09 Thread Sajesh Singh
, Sajesh Singh

Re: [slurm-users] Cluster nodes on multiple cluster networks

2021-01-22 Thread Sajesh Singh
public IP address of the controller it may be simpler to use only the public IP for the controller, but I don't know how your routing is set up. HTH - Michael On Fri, Jan 22, 2021 at 11:26 AM Sajesh Singh <ssi...@amnh.org> wrote: How would I deal with the address of the head

Re: [slurm-users] Cluster nodes on multiple cluster networks

2021-01-22 Thread Sajesh Singh
22, 2021 1:45 PM To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] Cluster nodes on multiple cluster networks EXTERNAL SENDER You would need to have a direct connect/vpn so the cloud nodes can connect to your head node. Brian Andrus On 1/22/2021 10:37 AM, Sajesh Singh wrote: We are

[slurm-users] Cluster nodes on multiple cluster networks

2021-01-22 Thread Sajesh Singh
We are looking at rolling out cloud bursting to our on-prem Slurm cluster and I am wondering how to deal with the slurm.conf variable SlurmctldHost. It is currently configured with the private cluster network address that the on-prem nodes use to contact it. The nodes in the cloud would contact
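One common way to handle a controller reachable over two networks is split-horizon name resolution: slurm.conf names the controller once, and each network resolves that name to the address it can actually reach. A sketch (hostname and addresses are placeholders):

```conf
# slurm.conf -- single controller name for all nodes
SlurmctldHost=headnode
# on-prem nodes resolve "headnode" to the private address (e.g. 10.0.0.1)
# via /etc/hosts or private DNS; cloud nodes resolve the same name to the
# public or VPN address of the controller
```

slurm.conf also supports `SlurmctldHost=name(address)` when one explicit address works for everyone, but with two disjoint networks the per-network resolution approach above avoids having to pick a single address.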

Re: [slurm-users] Burst to AWS cloud

2020-12-15 Thread Sajesh Singh
ey are making so aren't surprised. Also, avoid network mounts on nodes. Performance takes a big hit when you have that going over a direct-connect or VPN. Brian Andrus On 12/15/2020 12:02 PM, Sajesh Singh wrote: We are currently investigating the use of the cloud scheduling featu

[slurm-users] Burst to AWS cloud

2020-12-15 Thread Sajesh Singh
We are currently investigating the use of the cloud scheduling features within an on-site Slurm installation and was wondering if anyone had any experiences that they wish to share of trying to use this feature. In particular I am interested to know: https://slurm.schedmd.com/elastic_computing.

Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Sajesh Singh
EXTERNAL SENDER On 10/8/20 3:48 pm, Sajesh Singh wrote: >Thank you. Looks like the fix is indeed the missing file > /etc/slurm/cgroup_allowed_devices_file.conf No, you don't want that, that will allow all access to GPUs whether people have requested them or not. What you want is i
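The approach the reply is steering toward (instead of a wide-open cgroup_allowed_devices_file.conf) is letting the cgroup plugin restrict device access to the GRES a job actually requested. A sketch of the relevant pieces, assuming the task/cgroup plugin is in use:

```conf
# cgroup.conf
ConstrainDevices=yes

# slurm.conf
TaskPlugin=task/cgroup
```

With `ConstrainDevices=yes`, a job that did not request `--gres=gpu` cannot open the GPU device files at all, which is the behavior the wide-open allowed-devices file would have defeated.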

Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Sajesh Singh
] CUDA environment variable not being set EXTERNAL SENDER Hi Sajesh, On 10/8/20 11:57 am, Sajesh Singh wrote: > debug: common_gres_set_env: unable to set env vars, no device files > configured I suspect the clue is here - what does your gres.conf look like? Does it list the devices i

Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Sajesh Singh
nodes also? Brian Andrus On 10/8/2020 11:57 AM, Sajesh Singh wrote: Slurm 18.08 CentOS 7.7.1908 I have 2 M500 GPUs in a compute node which is defined in the slurm.conf and gres.conf of the cluster, but if I launch a job requesting GPUs the environment variable CUDA_VISIBLE_DEVICES is never set

Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Sajesh Singh
I only get a line returned for “Gres=”, but this is the same behavior on another cluster that has GPUs and the variable gets set on that cluster. -Sajesh- -- _ Sajesh Singh Manager, Systems and Scientific Computing American Museum of Natural

Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Sajesh Singh
's no driver installed. Relu On 2020-10-08 14:57, Sajesh Singh wrote: Slurm 18.08 CentOS 7.7.1908 I have 2 M500 GPUs in a compute node which is defined in the slurm.conf and gres.conf of the cluster, but if I launch a job requesting GPUs the environment variable CUDA_VISIBLE_DEVICES is never set

[slurm-users] CUDA environment variable not being set

2020-10-08 Thread Sajesh Singh
Slurm 18.08 CentOS 7.7.1908 I have 2 M500 GPUs in a compute node which is defined in the slurm.conf and gres.conf of the cluster, but if I launch a job requesting GPUs the environment variable CUDA_VISIBLE_DEVICES is never set and I see the following messages in the slurmd.log file: debug: co
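For CUDA_VISIBLE_DEVICES to be set, gres.conf must point at the actual device files and slurm.conf must advertise the GRES on the node. A minimal sketch for a two-GPU node (node name is a placeholder):

```conf
# gres.conf on the compute node
NodeName=gpu-node01 Name=gpu File=/dev/nvidia[0-1]

# slurm.conf
GresTypes=gpu
NodeName=gpu-node01 Gres=gpu:2 State=UNKNOWN
```

The `unable to set env vars, no device files configured` message quoted later in this thread is consistent with gres.conf lacking `File=` entries for the GPUs.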

[slurm-users] Specifying MPS when using GPUs

2020-07-22 Thread Sajesh Singh
We are deploying 2 compute nodes with Nvidia v100 GPUs and would like to use the CUDA MPS feature. I am not sure as to where to get the number to use for mps when defining the node in the slurm.conf? Any advise would be greatly appreciated. Regards, SS
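With the gres/mps plugin, the MPS count is an abstract number of shares divided across the node's GPUs; 100 shares per GPU is the convention used in the Slurm documentation's examples. A sketch for a node with two V100s (node name is a placeholder):

```conf
# slurm.conf
GresTypes=gpu,mps
NodeName=gpu-node01 Gres=gpu:2,mps:200

# gres.conf on the compute node
Name=gpu File=/dev/nvidia[0-1]
Name=mps Count=200
```

Jobs then request a fraction of a GPU with, e.g., `--gres=mps:50` (half of one GPU under the 100-per-GPU convention).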

[slurm-users] MaxTime and partition config

2020-03-30 Thread Sajesh Singh
CentOS 7.7 Slurm 20.02 Having a bit of a time with jobs that are configured with a walltime of more than 365 days. The job is accepted for run, but the squeue -l output shows the TIME_LIMIT is INVALID. If I look at the job through scontrol it shows the correct TimeLimit. Any ideas as to what c
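For reference, partition time limits in slurm.conf take either `days-hours:minutes:seconds` or `UNLIMITED`; if no finite upper bound is actually needed, `UNLIMITED` sidesteps very large literal values entirely. A sketch (partition and node names are placeholders):

```conf
# slurm.conf -- partition with no upper wall-time bound
PartitionName=long Nodes=node-[01-10] MaxTime=UNLIMITED State=UP
# or a finite limit of 400 days:
# PartitionName=long Nodes=node-[01-10] MaxTime=400-00:00:00 State=UP
```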

[slurm-users] Cannot run interactive jobs

2020-03-24 Thread Sajesh Singh
CentOS 7.7.1908 Slurm 18.08.8 When trying to run an interactive job I am getting the following error: srun: error: task 0 launch failed: Slurmd could not connect IO Checking the log file on the compute node I see the following error: [2020-03-25T01:42:08.262] launch task 13.0 request from UID:1
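"Slurmd could not connect IO" typically means the compute node could not open a connection back to the srun process on the submit host, which often points at a firewall or routing problem between them. If a firewall is in the path, the ephemeral ports srun listens on can be pinned in slurm.conf so a matching firewall rule can be written (the range below is an arbitrary example):

```conf
# slurm.conf -- pin the ports srun listens on for task I/O
SrunPortRange=60001-63000
```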

Re: [slurm-users] Slurm resource limits on jobs

2019-05-07 Thread Sajesh Singh
ved=0> Barbara On 5/6/19 5:52 PM, Sajesh Singh wrote: Good day fellow Slurm users. Coming from a PBSpro system which has the following variables to limit a job to the resources it requested: $enforce mem $enforce cpuaverage $enforce cpuburst If a user exceeded any of the above limits their job w

[slurm-users] Slurm resource limits on jobs

2019-05-06 Thread Sajesh Singh
Good day fellow Slurm users. Coming from a PBSpro system which has the following variables to limit a job to the resources it requested: $enforce mem $enforce cpuaverage $enforce cpuburst If a user exceeded any of the above limits their job would be terminated. Looking through the slurm.conf m
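Slurm's closest analogue to PBSpro's $enforce settings is cgroup-based enforcement: rather than killing a job after it exceeds its request, the job is confined to the cores and memory it asked for. A sketch of the usual pieces (this is the cgroup approach, one of several enforcement options in Slurm):

```conf
# slurm.conf
TaskPlugin=task/cgroup
ProctrackType=proctrack/cgroup
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

# cgroup.conf
ConstrainCores=yes
ConstrainRAMSpace=yes
```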

Re: [slurm-users] SlurmDBD setup with mysql

2019-01-18 Thread Sajesh Singh
Fixed the problem. I had an incorrect config; the slurm.conf needed the following entry and all now works as expected: AccountingStoragePort=7031 -- -SS- -Original Message- From: Sajesh Singh Sent: Thursday, January 17, 2019 12:26 PM To: Slurm User Community List Subject: RE: [slurm
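The fix described here is part of the usual slurmdbd wiring in slurm.conf. The default AccountingStoragePort is 6819, so it only needs to be set explicitly when slurmdbd listens elsewhere, as it did here (7031). A sketch with a placeholder hostname:

```conf
# slurm.conf -- point accounting at slurmdbd
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=dbd-host
AccountingStoragePort=7031
```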

Re: [slurm-users] SlurmDBD setup with mysql

2019-01-17 Thread Sajesh Singh
for table names in mariadb, or used to be. On 1/17/19, 11:07 AM, "slurm-users on behalf of Sajesh Singh" wrote: Trying to setup accounting using the MySQL backend and I am getting errors from the slurmctld and slurm tools when trying to interact with the accounting database. Tried

[slurm-users] SlurmDBD setup with mysql

2019-01-17 Thread Sajesh Singh
Trying to setup accounting using the MySQL backend and I am getting errors from the slurmctld and slurm tools when trying to interact with the accounting database. Tried starting in debug as well, but could not see anything else that could point to what could be causing this issue. I have follow

[slurm-users] Federation and bursting to cloud

2018-12-04 Thread Sajesh Singh
We are currently investigating a switch to SLURM from PBS and I have a question on the interoperability of two features. Would the implementation of federation in SLURM affect any of the clusters' ability to burst to the cloud? -SS-