[slurm-users] Re: Slurm not running on a warewulf node

2024-12-03 Thread Jeffrey R. Lang via slurm-users
Steve, try running the failing process from the command line with the -D option. Per the man page: Run slurmd in the foreground. Error and debug messages will be copied to stderr. Jeffrey R. Lang Advanced Research Computing Center University of Wyoming, Information Technology Center 1000 E
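
A minimal sketch of the suggestion above, assuming slurmd is on the PATH and normally runs under systemd. The wrapper name debug_slurmd and the default log path are illustrative and not part of Slurm; -D and -v are documented slurmd options.

```shell
# Hypothetical helper: run slurmd in the foreground (-D) with extra
# verbosity (-vvv) and keep a copy of everything it prints in a log file.
debug_slurmd() {
    log=${1:-/tmp/slurmd-debug.log}   # assumed log location
    # In foreground mode the messages go to stderr, so fold stderr into the pipe.
    slurmd -D -vvv 2>&1 | tee "$log"
}
```

It is probably worth stopping the running service first (for example with systemctl stop slurmd) so that two daemons do not compete for the same port, then calling debug_slurmd as root on the failing node.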

[slurm-users] Re: [EXT] RE: [EXT] RE: [EXT] RE: [EXT] RE: Nodes required for job are down, drained or reserved

2024-04-09 Thread Jeffrey R. Lang via slurm-users
Alison I’m glad I was able to help. Good luck. Jeff From: Alison Peterson Sent: Tuesday, April 9, 2024 4:09 PM To: Jeffrey R. Lang Cc: slurm-users@lists.schedmd.com Subject: Re: [EXT] RE: [EXT] RE: [EXT] RE: [EXT] RE: [slurm-users] Nodes required for job are down, drained or reserved

[slurm-users] Re: [EXT] RE: [EXT] RE: [EXT] RE: Nodes required for job are down, drained or reserved

2024-04-09 Thread Jeffrey R. Lang via slurm-users
Use scontrol update node=head state=resume and then check the status again. Hopefully the node will show idle, meaning that it should be ready to accept jobs. Jeff From: Alison Peterson Sent: Tuesday, April 9, 2024 3:40 PM To: Jeffrey R. Lang Cc: slurm-users
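
The two steps above can be sketched as one small function. The name resume_node is hypothetical; the node name head comes from the thread, and the sinfo format string is just one reasonable choice for printing the node and its state.

```shell
# Hypothetical helper: return a down/drained node to service, then show its state.
resume_node() {
    node=$1
    # Ask slurmctld to resume the node (same request as in the message above).
    scontrol update NodeName="$node" State=RESUME
    # Print "<node> <state>"; a healthy, empty node should report "idle".
    sinfo --nodes="$node" --noheader --format='%N %T'
}
```

Calling resume_node head as a Slurm administrator should be enough here. If the state does not change to idle, scontrol show node head reports a Reason field that usually explains why.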

[slurm-users] Re: [EXT] RE: [EXT] RE: Nodes required for job are down, drained or reserved

2024-04-09 Thread Jeffrey R. Lang via slurm-users
. I need to see what’s in the test.sh file to get an idea of how your job is set up. Jeff From: Alison Peterson Sent: Tuesday, April 9, 2024 3:15 PM To: Jeffrey R. Lang Cc: slurm-users@lists.schedmd.com Subject: Re: [EXT] RE: [EXT] RE: [slurm-users] Nodes required for job are down, drained or

[slurm-users] Re: [EXT] RE: Nodes required for job are down, drained or reserved

2024-04-09 Thread Jeffrey R. Lang via slurm-users
Alison, can you provide the output of the following commands: * sinfo * scontrol show node name=head and the job command that you're trying to run? From: Alison Peterson Sent: Tuesday, April 9, 2024 3:03 PM To: Jeffrey R. Lang Cc: slurm-users@lists.schedmd.com Subject: Re: [EXT] RE

[slurm-users] Re: Nodes required for job are down, drained or reserved

2024-04-09 Thread Jeffrey R. Lang via slurm-users
Alison, the error message indicates that there are no resources to execute jobs. Since you haven’t defined any compute nodes, you will get this error. I would suggest that you create at least one compute node. Once you do that, this error should go away. Jeff From: Alison Peterson via slurm-

[slurm-users] Cleanup of old clusters in database

2024-01-10 Thread Jeffrey R. Lang
We have shuttered two clusters and need to remove them from the database. To do this, do we remove the table spaces associated with the cluster names from the Slurm database? Thanks, Jeff

Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-10-30 Thread Jeffrey R. Lang
The service is available in RHEL 8 via the EPEL package repository as systemd-networkd, i.e. systemd-networkd.x86_64 253.4-1.el8epel -Original Message- From: slurm-users On Behalf Of Ole Holm Nielsen Sent: Monday, October 30, 2023 1:56 PM T

Re: [slurm-users] Per-user TRES summary?

2022-11-28 Thread Jeffrey R. Lang
You might try the slurmuserjobs command as part of the Slurm_tools package found here https://github.com/OleHolmNielsen/Slurm_tools From: slurm-users On Behalf Of Djamil Lakhdar-Hamina Sent: Monday, November 28, 2022 5:49 PM To: Slurm User Community List Subject: Re: [slurm-users] Per-user T

[slurm-users] How to open a slurm support case

2022-03-24 Thread Jeffrey R. Lang
Can someone provide me with instructions on how to open a support case with SchedMD? We have a support contract, but nowhere on their website can I find a link to open a case with them. Thanks, Jeff

[slurm-users] Help with failing job execution

2022-03-24 Thread Jeffrey R. Lang
My site recently updated to Slurm 21.08.6 and for the most part everything went fine. Two Ubuntu nodes, however, are having issues. Slurmd cannot execve the jobs on the nodes. As an example: [jrlang@tmgt1 ~]$ salloc -A ARCC --nodes=1 --ntasks=20 -t 1:00:00 --bell --nodelist=mdgx01 --partitio

[slurm-users] Where is the documentation for saving batch script

2022-03-17 Thread Jeffrey R. Lang
Hello, I want to look into the new feature of saving job scripts in the Slurm database but have been unable to find documentation on how to do it. Can someone please point me in the right direction for the documentation or Slurm configuration changes that need to be implemented? Thanks, Jeff

Re: [slurm-users] systemctl enable slurmd.service Failed to execute operation: No such file or directory

2022-01-27 Thread Jeffrey R. Lang
The missing file error has nothing to do with Slurm. The systemctl command is part of the system's service management. The error message indicates that you haven’t copied the slurmd.service file on your compute node to /etc/systemd/system or /usr/lib/systemd/system. /etc/systemd/system is usua
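
A minimal sketch of the fix described above, assuming you already have a slurmd.service file from your Slurm build or package. The function name install_slurmd_unit and its arguments are hypothetical; the systemctl subcommands are standard systemd.

```shell
# Hypothetical helper: put slurmd.service where systemd looks for unit files,
# then make systemd re-read its units and enable the service.
install_slurmd_unit() {
    unit_src=$1                          # path to your slurmd.service (assumed)
    unit_dir=${2:-/etc/systemd/system}   # directory for locally installed units
    cp "$unit_src" "$unit_dir/slurmd.service" || return 1
    systemctl daemon-reload              # pick up the newly copied unit file
    systemctl enable slurmd.service      # the step that failed in the thread
}
```

Run it as root on the compute node, passing the path to wherever your slurmd.service file actually lives.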

Re: [slurm-users] Fwd: useradd: group 'slurm' does not exist

2022-01-25 Thread Jeffrey R. Lang
Looking at what you provided in your email, the groupadd commands are failing because the requested GIDs 991 and 992 are already assigned on the system you're installing on. Check the /etc/group file and find two GID numbers lower than 991 that are unused and use those instead. Keep them in the
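
One way to do the /etc/group check above is sketched below. The function name free_gids is hypothetical, and it only sees groups listed in the file it reads, so groups that come from LDAP or another directory service would need a separate check.

```shell
# Hypothetical helper: print COUNT unused GIDs below CEILING, highest first,
# based on the third field of a group(5)-format file.
# usage: free_gids CEILING COUNT [GROUP_FILE]
free_gids() {
    ceiling=$1
    count=$2
    group_file=${3:-/etc/group}
    awk -F: -v ceiling="$ceiling" -v count="$count" '
        { used[$3] = 1 }                      # remember every GID already taken
        END {
            for (gid = ceiling - 1; gid > 0 && count > 0; gid--) {
                if (!(gid in used)) { print gid; count-- }
            }
        }
    ' "$group_file"
}
```

For the case in the thread, free_gids 991 2 prints two candidates, which can then be passed to groupadd -g in place of 991 and 992.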

Re: [slurm-users] How to avoid a feature?

2021-07-02 Thread Jeffrey R. Lang
How about using node weights? Weight the non-GPU nodes so that they are scheduled first. The GPU nodes could have a very high weight so that the scheduler would consider them last for allocation. This would allow the non-GPU nodes to be filled first and, when they are full, the GPU nodes to be scheduled. Us

[slurm-users] Question about determining pre-empted jobs

2020-02-28 Thread Jeffrey R. Lang
I need your help. We have had a request to generate a report showing the number of jobs by date, including pre-empted jobs. We used sacct to try to gather the data but we only found a few jobs with the state "PREEMPTED". Scanning the slurmd logs we find there are a lot of jobs that show pre-empte

[slurm-users] question about partition definition

2019-12-09 Thread Jeffrey R. Lang
I need to set up a partition that limits the number of jobs allowed to run at one time. Looking at the slurm.conf page for partition definitions I don't see a MaxJobs option. Is there a way to limit the number of jobs in a partition? Thanks, Jeff

Re: [slurm-users] scontrol for a heterogenous job appears incorrect

2019-04-24 Thread Jeffrey R. Lang
On 23/4/19 3:02 pm, Jeffrey R. Lang wrote: > Looking at the nodelist and the NumNodes they are both incorrect. They > should show the first node an

[slurm-users] scontrol for a heterogenous job appears incorrect

2019-04-23 Thread Jeffrey R. Lang
I'm testing heterogeneous jobs for a user on our cluster, but I'm seeing what I think is incorrect output from "scontrol show job XXX" for the job. The cluster is currently using Slurm 18.08. So my job script looks like this: #!/bin/sh ### This is a general SLURM script. You'll need to make modificat

[slurm-users] Why is this command not working

2019-01-16 Thread Jeffrey R. Lang
I'm trying to set a maxjobs limit on a specific user in my cluster, but following the example in the sacctmgr man page I keep getting this error. sacctmgr -v modify user where name=jrlang cluster=teton account=microbiome set maxjobs=30 sacctmgr: Accounting storage SLURMDBD plugin loaded with Au

Re: [slurm-users] Simple question but I can't find the answer

2019-01-10 Thread Jeffrey R. Lang
Is it following a host name, or a partition name? If the latter, it just means that it's the default partition. ____ From: Jeffrey R. Lang <mailto:jrl...@u

[slurm-users] Simple question but I can't find the answer

2019-01-10 Thread Jeffrey R. Lang
Guys, when I run sinfo some of the nodes in the list show their hostname followed by an asterisk. I've looked through the man pages and what I can find on the web but nothing provides an answer. So what does the asterisk after the hostname mean? Jeff