I have a new Slurm setup in AWS GovCloud that is not quite working. I will 
list a few observations and maybe someone can suggest where to look next. The 
Troubleshooting page really has nothing relevant for elastic cloud deployments. 
The nodes are getting set to DOWN+CLOUD+POWERED_DOWN. Running a job does not 
launch a node in this state. I can force the nodes to launch with scontrol 
POWER_UP. The jobs will claim to run, then re-queue, but never complete. When 
the nodes boot I see them register with slurmctld, but it soon reports the 
connection lost. The slurmd on the node claims to be healthy, but the 
controller eventually just terminates the instances. Ping works in both 
directions using the hostnames. I've built 4 clusters before this. The first 
was Torque/Maui and the rest were Slurm, but all were bare metal. This is my 
first attempt at cloud. We have ITAR data, so I can't use the AWS Parallel 
Computing Service because it is not offered in GovCloud.
https://cluster-in-the-cloud.readthedocs.io/en/latest/running.html
I had to fork this project because so much of it is obsolete, but it's mostly 
working for me now:
https://github.com/mntbighker
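
Concretely, "force with scontrol POWER_UP" means something like the following, 
with a State=RESUME afterwards to clear the DOWN flag (node name as in the 
scontrol output further down):

[root@mgmt ~]# scontrol update NodeName=many-antelope-c5n-2xlarge-0001 State=POWER_UP
[root@mgmt ~]# scontrol update NodeName=many-antelope-c5n-2xlarge-0001 State=RESUME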

[root@mgmt ~]# journalctl -fu slurmctld
Jan 29 19:16:15 mgmt.many-antelope.citc.local slurmctld[3403]: slurmctld: debug:  sched/backfill: _attempt_backfill: 1 jobs to backfill
Jan 29 19:16:45 mgmt.many-antelope.citc.local slurmctld[3403]: slurmctld: debug:  Updating partition uid access list
Jan 29 19:16:45 mgmt.many-antelope.citc.local slurmctld[3403]: slurmctld: debug:  purge_old_job: job file deletion is falling behind, 1 left to remove
Jan 29 19:16:45 mgmt.many-antelope.citc.local slurmctld[3403]: slurmctld: debug:  sched: Running job scheduler for full queue.
Jan 29 19:16:45 mgmt.many-antelope.citc.local slurmctld[3403]: slurmctld: debug:  sched/backfill: _attempt_backfill: beginning
Jan 29 19:16:45 mgmt.many-antelope.citc.local slurmctld[3403]: slurmctld: debug:  sched/backfill: _attempt_backfill: 1 jobs to backfill
Jan 29 19:16:50 mgmt.many-antelope.citc.local slurmctld[3403]: slurmctld: debug:  sackd_mgr_dump_state: saved state of 0 nodes
Jan 29 19:17:15 mgmt.many-antelope.citc.local slurmctld[3403]: slurmctld: debug:  sched/backfill: _attempt_backfill: beginning
Jan 29 19:17:15 mgmt.many-antelope.citc.local slurmctld[3403]: slurmctld: debug:  sched/backfill: _attempt_backfill: 1 jobs to backfill
Jan 29 19:17:26 mgmt.many-antelope.citc.local slurmctld[3403]: slurmctld: POWER: Power save mode: 4 nodes
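
One thing I haven't fully ruled out is the security groups blocking the Slurm 
ports; ping working both ways doesn't prove TCP 6817/6818 (my defaults) are 
open. Something like this should confirm it, run from each side:

[root@mgmt ~]# nc -zv many-antelope-c5n-2xlarge-0001 6818
[root@many-antelope-c5n-2xlarge-0001 ~]# nc -zv mgmt.many-antelope.citc.local 6817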

[root@mgmt ~]# scontrol show node many-antelope-c5n-2xlarge-0001
NodeName=many-antelope-c5n-2xlarge-0001 CoresPerSocket=4
   CPUAlloc=0 CPUEfctv=8 CPUTot=8 CPULoad=0.00
   AvailableFeatures=shape=c5n.2xlarge,ad=None,arch=x86_64
   ActiveFeatures=shape=c5n.2xlarge,ad=None,arch=x86_64
   Gres=(null)
   NodeAddr=many-antelope-c5n-2xlarge-0001 NodeHostName=many-antelope-c5n-2xlarge-0001
   RealMemory=20034 AllocMem=0 FreeMem=N/A Sockets=1 Boards=1
   State=DOWN+CLOUD+POWERED_DOWN ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=production,debug,batch,long
   BootTime=None SlurmdStartTime=None
   LastBusyTime=Unknown ResumeAfterTime=None
   CfgTRES=cpu=8,mem=20034M,billing=8
   AllocTRES=
   CurrentWatts=0 AveWatts=0

   Reason=Not responding [slurm@2025-01-29T17:49:26]
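
In case the power-save config itself is the problem, the relevant part of my 
slurm.conf looks roughly like this. The Resume/Suspend script paths and the 
exact timeout values below are approximations from my cluster-in-the-cloud 
fork, so treat them as assumptions rather than a verbatim paste:

# Elastic cloud power management (values approximate)
ResumeProgram=/usr/local/bin/startnode    # hypothetical path
SuspendProgram=/usr/local/bin/stopnode    # hypothetical path
ResumeTimeout=600
SuspendTime=300
TreeWidth=65533
SlurmctldParameters=cloud_dns             # nodes resolve by hostname, as ping suggests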

[root@mgmt ~]# scontrol show jobs
JobId=30 JobName=test.sl
   UserId=mwmoorcroft(1106) GroupId=nssam(1101) MCS_label=N/A
   Priority=1 Nice=0 Account=nssam QOS=(null)
   JobState=PENDING Reason=Nodes_required_for_job_are_DOWN,_DRAINED_or_reserved_for_jobs_in_higher_priority_partitions Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2025-01-29T18:58:26 EligibleTime=2025-01-29T18:58:26
   AccrueTime=2025-01-29T18:58:26
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-01-29T19:19:15 Scheduler=Backfill:*
   Partition=production AllocNode:Sid=ip-172-16-2-14.us-gov-east-1.compute.internal:7090
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=
   NumNodes=1-1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=20034M,node=1,billing=1
   AllocTRES=(null)
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/mnt/shared/home/mwmoorcroft/test.sl
   WorkDir=/mnt/shared/home/mwmoorcroft
   StdErr=/mnt/shared/home/mwmoorcroft/slurm-30.out
   StdIn=/dev/null
   StdOut=/mnt/shared/home/mwmoorcroft/slurm-30.out
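
For completeness, "slurmd claims to be healthy" is based on checks along 
these lines on a booted node (clean slurmd journal, hardware matching 
slurm.conf, controller reachable):

[root@many-antelope-c5n-2xlarge-0001 ~]# journalctl -fu slurmd
[root@many-antelope-c5n-2xlarge-0001 ~]# slurmd -C
[root@many-antelope-c5n-2xlarge-0001 ~]# scontrol ping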
