Dear All,
May I have your suggestions on an issue I am facing?
When I launch a job with "salloc -N4 --mem 4000 -p active", the job runs on only one compute node while the other three machines stay free; the work is not distributed evenly across the allocation.
squeue and scontrol show the job spanning all four machines, but when I check the respective machines I do not find the job running there; a single machine takes the whole load.
Is there an issue in my conf file, or is there something else I need to do? May I have your suggestions, please.
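For reference, this is roughly how I launch the allocation and then check it (a sketch; <jobid> stands for the actual job id, and the srun line is only a quick test of which nodes answer):

salloc -N4 --mem 4000 -p active          # interactive allocation of 4 nodes
squeue -u $USER -o "%i %T %D %N"         # job id, state, node count, node list
scontrol show job <jobid> | grep -i nodelist
srun -N4 hostname                        # should print one hostname per allocated node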

FYI, salloc -N4 --mem 4000 -p active
[screenshot: terminal output of the salloc command]
When I check with top, I find that only Debussy is heavily loaded; the job is not evenly distributed across the nodes. May I have your guidance, please.


[screenshot: top output]
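The per-node check is just top run on each machine, roughly like this (a sketch; <other-node> is any other node in the allocation):

ssh debussy 'top -b -n 1 | head -15'         # debussy carries essentially all of the load
ssh <other-node> 'top -b -n 1 | head -15'    # the remaining allocated nodes sit idle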




Regards,
KumaranS


#
# See the slurm.conf man page for more information.
#
# Legacy configuration
#ControlMachine=wagner
#ControlAddr=10.218.28.8
#BackupController=brahms
#BackupAddr=10.218.28.7

# New configuration
#SlurmctldHost=wagner
#ControlAddr=wagner:10.218.28.8
#SlurmctldHost=brahms
#ControlAddr=brahms:10.218.28.7
#SlurmctldHost=ravel
#ControlAddr=ravel:10.218.28.73
#SlurmctldHost=verdi
#ControlAddr=verdi:10.218.28.74

# New configuration
SlurmctldHost=wagner(10.218.28.8)
SlurmctldHost=brahms(10.218.28.7)
#SlurmctldHost=ravel(10.218.28.73)
#SlurmctldHost=verdi(10.218.28.74)
#SlurmctldHost=debussy(10.218.28.208)
#SlurmctldHost=schubert(10.218.28.207)
#SlurmctldHost=vivaldi(10.218.28.205)


AuthType=auth/munge
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/true
MaxJobCount=10000
MaxStepCount=40000
MaxTasksPerNode=512   # Maximum tasks per node (this is a count, not a memory unit)
#MaxTasksPerNode=128  # Maximum tasks per node (this is a count, not a memory unit)
#MpiDefault=pmix
MpiDefault=pmi2
#MpiParams=ports=#-#
PluginDir=/usr/local/lib:/usr/local/lib/slurm:/usr/lib:/lib
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
PrologFlags=x11
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
PropagateResourceLimitsExcept=MEMLOCK
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/d
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurm/ctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
# Resource Limits and Defaults
DefMemPerCPU=2048   # 2048 MB = 2 GB per CPU
MaxMemPerCPU=8192   # 8192 MB = 8 GB per CPU
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
#SelectType=select/linear
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
#OverSubscribe=FORCE:5
LaunchParameters=use_interactive_step
#
#
# JOB PRIORITY
#PriorityFlags=
PriorityType=priority/multifactor
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
# This next group determines the weighting of each of the
# components of the Multifactor Job Priority Plugin.
# The default value for each of the following is 1.
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=1000
PriorityWeightPartition=1000
PriorityWeightQOS=0 # don't use the qos factor
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
#AccountingStoreJobComment=YES
ClusterName=cluster
DebugFlags=NO_CONF_HASH
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm/slurmctld_wagner.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm/slurmd_wagner.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=wagner NodeAddr=10.218.28.8 CPUs=192 Boards=1 SocketsPerBoard=2 CoresPerSocket=48 ThreadsPerCore=2 RealMemory=772966 State=UNKNOWN Features=MultiThreading
NodeName=brahms NodeAddr=10.218.28.7 CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=514943 State=UNKNOWN Features=High_Performance
NodeName=ravel NodeAddr=10.218.28.73 CPUs=72 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=2 RealMemory=515224 State=UNKNOWN Features=MemoryOptimized1
NodeName=verdi NodeAddr=10.218.28.74 CPUs=72 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=2 RealMemory=515224 State=UNKNOWN Features=MemoryOptimized2
NodeName=vivaldi NodeAddr=10.218.28.205 CPUs=72 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=2 RealMemory=1030436 State=UNKNOWN Features=MemoryOptimized3
NodeName=debussy NodeAddr=10.218.28.208 CPUs=72 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=2 RealMemory=1030436 State=UNKNOWN Features=MemoryOptimized4
NodeName=schubert NodeAddr=10.218.28.207 CPUs=72 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=2 RealMemory=1030436 State=UNKNOWN Features=MemoryOptimized5

PartitionName=active Nodes="ALL" Default=YES MaxTime=INFINITE State=UP Shared=FORCE
PreemptMode=Suspend,Gang
PreemptType=preempt/partition_prio