[slurm-users] Re: Elastic Computing: Is it possible to incentivize grouping power_up calls?

2024-04-09 Thread Xaver Stiensmeier via slurm-users
Thank you, Brian. While ResumeRate might be able to keep the CPU usage within an acceptable margin, it's not really a fix but a workaround. I would prefer a solution that groups resume requests and therefore makes use of a single Ansible playbook run per second instead of <=ResumeRate runs. As we com
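A minimal sketch of the pieces involved, assuming a hypothetical playbook start_instances.yml and illustrative paths and values; the resume script simply expands whatever hostlist expression Slurm passes to it and forwards that set of nodes to a single ansible-playbook run:

    # slurm.conf (illustrative values)
    ResumeProgram=/usr/local/sbin/slurm_resume.sh
    ResumeRate=4              # nodes per minute; 0 removes the limit
    ResumeTimeout=600
    SuspendProgram=/usr/local/sbin/slurm_suspend.sh
    SuspendTime=300

    #!/bin/bash
    # /usr/local/sbin/slurm_resume.sh (sketch); $1 is the hostlist expression
    nodes=$(scontrol show hostnames "$1" | paste -sd, -)
    ansible-playbook /etc/slurm/start_instances.yml --limit "$nodes"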

[slurm-users] Re: Avoiding fragmentation

2024-04-09 Thread Cutts, Tim via slurm-users
Agree with that. Plus, of course, even if the jobs run a bit slower by not having all the cores on a single node, they will be scheduled sooner, so the overall turnaround time for the user will be better, and ultimately that's what they care about. I've always been of the view, for any schedu

[slurm-users] DefCpuPerGPU and multiple partitions

2024-04-09 Thread Ansgar Esztermann-Kirchner via slurm-users
Hello List, does anyone have experience with DefCpuPerGPU and jobs requesting multiple partitions? I would expect Slurm to select a partition from those requested by the job, then assign CPUs based on that partition's DefCpuPerGPU. But according to my observations, it appears that (at least someti
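For illustration, a hedged sketch of the setup being described; partition names, node lists and values are placeholders, and the open question in the thread is which partition's default ends up applied:

    # slurm.conf (illustrative)
    PartitionName=gpu_a Nodes=gpunode[01-04] DefCpuPerGPU=8
    PartitionName=gpu_b Nodes=gpunode[05-08] DefCpuPerGPU=4

    # job submitted to both partitions with one GPU and no explicit --cpus-per-gpu
    sbatch --partition=gpu_a,gpu_b --gres=gpu:1 job.sh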

[slurm-users] Re: Avoiding fragmentation

2024-04-09 Thread Paul Edmon via slurm-users
I wrote a little blog post on this topic a few years back: https://www.rc.fas.harvard.edu/blog/cluster-fragmentation/ It's a vexing problem, but as noted by the other responders, it is something that depends on your cluster policy and job performance needs. Well-written MPI code should be able

[slurm-users] Trouble Running Slurm C Extension Plugin

2024-04-09 Thread Glen MacLachlan via slurm-users
Hi, We have a plugin in Lua that mostly does what we want, but there are features available in the C extension that are not available to Lua. For that reason, we are attempting to convert to C using the guidance found here: https://slurm.schedmd.com/job_submit_plugins.html#building. We arrived here

[slurm-users] Re: Avoiding fragmentation

2024-04-09 Thread Juergen Salk via slurm-users
Hi Gerhard, I am not sure if this counts as an administrative measure, but we do highly encourage our users to always explicitly specify --nodes=n together with --ntasks-per-node=m (rather than just --ntasks=n*m and omitting the --nodes option, which may lead to cores allocated here and there and eve
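As a concrete illustration of that recommendation (the job script name and task counts are made up):

    # pins the allocation shape: exactly 2 nodes, 24 tasks on each
    sbatch --nodes=2 --ntasks-per-node=24 mpi_job.sh

    # leaves placement entirely to the scheduler, which may scatter
    # the 48 tasks across many partially used nodes
    sbatch --ntasks=48 mpi_job.sh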

[slurm-users] single node configuration

2024-04-09 Thread Alison Peterson via slurm-users
Hi everyone, I'm conducting some tests. I've just set up SLURM on the head node and haven't added any compute nodes yet. I'm trying to test it to ensure it's working, but I'm encountering an error: 'Nodes required for the job are DOWN, DRAINED, or reserved for jobs in higher priority partitions. *

[slurm-users] Nodes required for job are down, drained or reserved

2024-04-09 Thread Alison Peterson via slurm-users
Hi everyone, I'm conducting some tests. I've just set up SLURM on the head node and haven't added any compute nodes yet. I'm trying to test it to ensure it's working, but I'm encountering an error: 'Nodes required for the job are DOWN, DRAINED, or reserved for jobs in higher priority partitions. A

[slurm-users] Re: Nodes required for job are down, drained or reserved

2024-04-09 Thread Jeffrey R. Lang via slurm-users
Alison, the error message indicates that there are no resources to execute jobs. Since you haven’t defined any compute nodes, you will get this error. I would suggest that you create at least one compute node. Once you do that, this error should go away. Jeff

[slurm-users] Re: [EXT] RE: Nodes required for job are down, drained or reserved

2024-04-09 Thread Alison Peterson via slurm-users
Hi Jeffrey, I'm sorry, I did add the head node in the compute nodes configuration. This is the slurm.conf:
# COMPUTE NODES
NodeName=head CPUs=24 RealMemory=184000 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN
PartitionName=lab Nodes=ALL Default=YES MaxTime=INFINITE State=UP OverSubscr

[slurm-users] Re: [EXT] RE: Nodes required for job are down, drained or reserved

2024-04-09 Thread Jeffrey R. Lang via slurm-users
Alison, can you provide the output of the following commands:
* sinfo
* scontrol show node name=head
and the job command that you're trying to run?

[slurm-users] Re: [EXT] RE: [EXT] RE: Nodes required for job are down, drained or reserved

2024-04-09 Thread Alison Peterson via slurm-users
Yes! Here is the information:
[stsadmin@head ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
lab*         up   infinite      1  down* head
[stsadmin@head ~]$ scontrol show node name=head
Node name=head not found
[stsadmin@head ~]$ sbatch ~/Downloads/test.sh
Submitted batch job 7
[st

[slurm-users] Re: [EXT] RE: [EXT] RE: Nodes required for job are down, drained or reserved

2024-04-09 Thread Jeffrey R. Lang via slurm-users
Alison, the sinfo shows that your head node is down due to some configuration error. Are you running slurmd on the head node? If slurmd is running, find the log file for it and pass along the entries from it. Can you redo the scontrol command? “node name” should be “nodename”, one word.
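For reference, one form of the query that sidesteps the argument-parsing issue entirely is to pass the node name as a plain argument (using the node name from this thread):

    scontrol show node head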

[slurm-users] Re: [EXT] RE: [EXT] RE: [EXT] RE: Nodes required for job are down, drained or reserved

2024-04-09 Thread Alison Peterson via slurm-users
Aha! That is probably the issue: slurmd! I know slurmd runs on the compute nodes. I need to deploy this for a lab, but I only have one of the servers with me. I will be adding them 1 by 1 after the first one is set up, so as not to disrupt their current setup. I want to be able to use the resources from t

[slurm-users] Re: Trouble Running Slurm C Extension Plugin

2024-04-09 Thread Ryan Cox via slurm-users
Glen, I don't think I see it in your message, but are you pointing to the plugin in slurm.conf with JobSubmitPlugins=? I assume you are, but it's worth checking. Ryan
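For context, a minimal sketch of what that line might look like for a custom C plugin; the plugin name my_filter is purely illustrative, and the built shared object (job_submit_my_filter.so) has to sit in the directory given by PluginDir:

    # slurm.conf
    JobSubmitPlugins=my_filter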

[slurm-users] Re: [EXT] RE: [EXT] RE: [EXT] RE: Nodes required for job are down, drained or reserved

2024-04-09 Thread Jeffrey R. Lang via slurm-users
Alison, in your case, since you are using head as both a Slurm management node and a compute node, you’ll need to set up slurmd on the head node. Once slurmd is running, use “sinfo” to see what the status of the node is. Most likely down, hopefully without an asterisk. If that’s the case then
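A short sketch of those steps, assuming systemd units are already installed; the final RESUME step is the usual follow-up for clearing a lingering down state, not a quote from the (truncated) message:

    systemctl enable --now slurmd                 # start slurmd on the head node
    sinfo                                         # check the node's state
    scontrol update NodeName=head State=RESUME    # clear a lingering down state if needed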

[slurm-users] Re: [EXT] RE: [EXT] RE: [EXT] RE: [EXT] RE: Nodes required for job are down, drained or reserved

2024-04-09 Thread Alison Peterson via slurm-users
Thank you so much!!! I have installed slurmd on the head node, started and enabled the service, and restarted slurmctld. I sent 2 jobs and they are running!
[stsadmin@head ~]$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   10       lab    test_

[slurm-users] Re: [EXT] RE: [EXT] RE: [EXT] RE: [EXT] RE: Nodes required for job are down, drained or reserved

2024-04-09 Thread Jeffrey R. Lang via slurm-users
Alison, I’m glad I was able to help. Good luck. Jeff

[slurm-users] Jobs of a user are stuck in Completing stage for a long time and cannot cancel them

2024-04-09 Thread archisman.pathak--- via slurm-users
We are running a Slurm cluster with version `slurm 22.05.8`. One of our users has reported that their jobs have been stuck at the completion stage for a long time. Referring to the Slurm Troubleshooting Guide, we found that indeed the BatchHost for the job was removed from th
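The commonly cited way to clear jobs stuck in COMPLETING when their batch host has disappeared is the sequence below; the job ID and node name are placeholders, and this reflects the troubleshooting guide's general advice rather than anything stated later in this (truncated) message:

    scontrol show job <jobid>                      # confirm the BatchHost and the CG state
    scontrol update NodeName=<node> State=DOWN Reason="stuck completing"
    scontrol update NodeName=<node> State=RESUME   # slurmctld then cleans up the completing jobs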