Thank you so much!!! I have installed slurmd on the head node, started and
enabled the service, and restarted slurmctld. I sent two jobs and they are
running!

[stsadmin@head ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES
NODELIST(REASON)
                10       lab test_slu stsadmin  R       0:01      1 head
                 9       lab test_slu stsadmin  R       0:09      1 head
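
In case it helps anyone else who finds this thread, the fix was roughly the
following (unit names as shipped by the standard systemd-based Slurm
packages; adjust if yours differ):

    sudo systemctl enable --now slurmd    # run slurmd on the head node too
    sudo systemctl restart slurmctld      # let the controller re-check the node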

On Tue, Apr 9, 2024 at 1:54 PM Jeffrey R. Lang <jrl...@uwyo.edu> wrote:

> Alison
>
>
>
>   In your case, since you are using head as both a Slurm management node
> and a compute node, you'll need to set up slurmd on the head node.
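>
> If the slurmd package isn't installed yet, something along these lines
> should do it (the package name varies by distro; on EL-type systems it is
> typically slurm-slurmd):
>
>                 sudo dnf install slurm-slurmd
>                 sudo systemctl enable --now slurmd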
>
>
>
> Once slurmd is running, use "sinfo" to see what the status of the node
> is.  Most likely it will be down, hopefully without an asterisk.  If that's
> the case, then use
>
>
>
>                 scontrol update node=head state=resume
>
>
>
> and then check the status again.  Hopefully the node will show idle,
> meaning that it should be ready to accept jobs.
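>
> Roughly what you want sinfo to report afterwards (same format as the
> output you sent earlier, but with the node idle):
>
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> lab*         up   infinite      1   idle head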
>
>
>
>
>
> Jeff
>
>
>
> *From:* Alison Peterson <apeters...@sdsu.edu>
> *Sent:* Tuesday, April 9, 2024 3:40 PM
> *To:* Jeffrey R. Lang <jrl...@uwyo.edu>
> *Cc:* slurm-users@lists.schedmd.com
> *Subject:* Re: [EXT] RE: [EXT] RE: [EXT] RE: [slurm-users] Nodes required
> for job are down, drained or reserved
>
>
>
> Aha! That is probably the issue: slurmd! I know slurmd runs on the compute
> nodes. I need to deploy this for a lab, but I only have one of the servers
> with me. I will be adding them one by one after the first one is set up, so
> as not to disrupt their current setup. I want to be able to use the
> resources from the head node and also the compute nodes once it's completed.
>
>
>
> [stsadmin@head ~]$ sudo systemctl status slurmd
> Unit slurmd.service could not be found.
>
>
> [stsadmin@head ~]$ scontrol show node head
> NodeName=head CoresPerSocket=6
>    CPUAlloc=0 CPUEfctv=24 CPUTot=24 CPULoad=0.00
>    AvailableFeatures=(null)
>    ActiveFeatures=(null)
>    Gres=(null)
>    NodeAddr=head NodeHostName=head
>    RealMemory=184000 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
>    State=DOWN+NOT_RESPONDING ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A
> MCS_label=N/A
>    Partitions=lab
>    BootTime=None SlurmdStartTime=None
>    LastBusyTime=2024-04-09T13:20:04 ResumeAfterTime=None
>    CfgTRES=cpu=24,mem=184000M,billing=24
>    AllocTRES=
>    CapWatts=n/a
>    CurrentWatts=0 AveWatts=0
>    ExtSensorsJoules=n/a ExtSensorsWatts=0 ExtSensorsTemp=n/a
>    Reason=Not responding [slurm@2024-04-09T10:14:10]
>
> [stsadmin@head ~]$ cat ~/Downloads/test.sh
> #!/bin/bash
> #SBATCH --job-name=test_slurm
> #SBATCH --nodes=1
> #SBATCH --ntasks=1
> #SBATCH --time=01:00:00
> #SBATCH --output=test_slurm_output.txt
>
> echo "Starting the SLURM test job on: $(date)"
> echo "Running on hostname: $(hostname)"
> echo "SLURM_JOB_ID: $SLURM_JOB_ID"
> echo "SLURM_JOB_NODELIST: $SLURM_JOB_NODELIST"
> echo "SLURM_NTASKS: $SLURM_NTASKS"
>
> # Here you can place the commands you want to run on the compute node,
> # for example a simple sleep command or any application that needs to
> # be tested
> sleep 60
>
> echo "SLURM test job completed on: $(date)"
>
>
>
> On Tue, Apr 9, 2024 at 1:21 PM Jeffrey R. Lang <jrl...@uwyo.edu> wrote:
>
> Alison
>
>
>
>   The sinfo output shows that your head node is down due to some
> configuration error.
>
>
>
>   Are you running slurmd on the head node?  If slurmd is running, find
> the log file for it and pass along the entries from it.
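>
> On a systemd-based install, something like these should turn up the
> entries (SlurmdLogFile is whatever your slurm.conf points at):
>
>                 journalctl -u slurmd --since today
>                 scontrol show config | grep -i SlurmdLogFile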
>
>
>
> Can you redo the scontrol command?  "node name" should be "nodename", one
> word.
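>
> e.g.:
>
>                 scontrol show node head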
>
>
>
> I need to see what's in the test.sh file to get an idea of how your job is
> set up.
>
>
>
> jeff
>
>
>
> *From:* Alison Peterson <apeters...@sdsu.edu>
> *Sent:* Tuesday, April 9, 2024 3:15 PM
> *To:* Jeffrey R. Lang <jrl...@uwyo.edu>
> *Cc:* slurm-users@lists.schedmd.com
> *Subject:* Re: [EXT] RE: [EXT] RE: [slurm-users] Nodes required for job
> are down, drained or reserved
>
>
>
> Yes! Here is the information:
>
>
>
> [stsadmin@head ~]$ sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> lab*         up   infinite      1  down* head
>
>
> [stsadmin@head ~]$ scontrol show node name=head
> Node name=head not found
>
>
> [stsadmin@head ~]$ sbatch ~/Downloads/test.sh
> Submitted batch job 7
>
>
> [stsadmin@head ~]$ squeue
>              JOBID PARTITION     NAME     USER ST       TIME  NODES
> NODELIST(REASON)
>                  7       lab test_slu stsadmin PD       0:00      1
> (ReqNodeNotAvail, UnavailableNodes:head)
>
>
>
> On Tue, Apr 9, 2024 at 1:07 PM Jeffrey R. Lang <jrl...@uwyo.edu> wrote:
>
> Alison
>
>
>
> Can you provide the output of the following commands:
>
>
>
> - sinfo
>
> - scontrol show node name=head
>
>
>
> and the job command that you're trying to run?
>
>
>
>
>
>
>
> *From:* Alison Peterson <apeters...@sdsu.edu>
> *Sent:* Tuesday, April 9, 2024 3:03 PM
> *To:* Jeffrey R. Lang <jrl...@uwyo.edu>
> *Cc:* slurm-users@lists.schedmd.com
> *Subject:* Re: [EXT] RE: [slurm-users] Nodes required for job are down,
> drained or reserved
>
>
>
> Hi Jeffrey,
>
>  I'm sorry, I did add the head node to the compute node configuration;
> this is the slurm.conf:
>
>
>
> # COMPUTE NODES
> NodeName=head CPUs=24 RealMemory=184000 Sockets=2  CoresPerSocket=6
> ThreadsPerCore=2 State=UNKNOWN
> PartitionName=lab  Nodes=ALL Default=YES MaxTime=INFINITE State=UP
> OverSubscribe=Force
>
>
>
>
>
> On Tue, Apr 9, 2024 at 12:57 PM Jeffrey R. Lang <jrl...@uwyo.edu> wrote:
>
> Alison
>
>
>
> The error message indicates that there are no resources to execute jobs.
> Since you haven't defined any compute nodes, you will get this error.
>
>
>
> I would suggest that you create at least one compute node.  Once you do
> that, this error should go away.
>
>
>
> Jeff
>
>
>
> *From:* Alison Peterson via slurm-users <slurm-users@lists.schedmd.com>
> *Sent:* Tuesday, April 9, 2024 2:52 PM
> *To:* slurm-users@lists.schedmd.com
> *Subject:* [slurm-users] Nodes required for job are down, drained or
> reserved
>
>
>
> Hi everyone, I'm conducting some tests. I've just set up Slurm on the head
> node and haven't added any compute nodes yet. I'm trying to test it to
> ensure it's working, but I'm encountering an error: "Nodes required for the
> job are DOWN, DRAINED, or reserved for jobs in higher priority partitions."
>
>
>
> Any guidance will be appreciated, thank you!
>
>
>


-- 
*Alison Peterson*
IT Research Support Analyst
*Information Technology*
apeters...@sdsu.edu <mfar...@sdsu.edu>
O: 619-594-3364
*San Diego State University | SDSU.edu <http://sdsu.edu/>*
5500 Campanile Drive | San Diego, CA 92182-8080