Dear All,

May I have your suggestions on an issue I am facing? When I launch a job with "salloc -N4 --mem 4000 -p active", it runs on only one compute node and the other three machines stay idle; the work is not distributed evenly. squeue and scontrol show all four machines allocated to the job, but when I check the respective machines the job is actually running on only one of them, and that single node carries the whole load. Is there an issue in my conf file, or does something else need to be done? Your suggestions would be much appreciated.
FYI, the command used was:

  salloc -N4 --mem 4000 -p active

While watching top, I find that only debussy is heavily loaded; the job is not evenly distributed across the nodes. May I have your guidance, please?

Regards,
KumaranS
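P.S. My understanding is that salloc only creates the allocation and gives me a shell on the first node, and that work is spread across the allocated nodes only when it is launched as job steps with srun (or through MPI). A rough sketch of how I believe the steps should be launched, using the same request as above (please correct me if this is not the right way):

  salloc -N4 --mem 4000 -p active   # request 4 nodes, 4000 MB per node, partition "active"
  srun -N4 -n4 hostname             # inside the allocation: one task per node as a job step
  # I would expect four different hostnames here, one per allocated node

My slurm.conf is below.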
#
# See the slurm.conf man page for more information.
#
# Legacy configuration
#ControlMachine=wagner
#ControlAddr=10.218.28.8
#BackupController=brahms
#BackupAddr=10.218.28.7
# New configuration
#SlurmctldHost=wagner
#ControlAddr=wagner:10.218.28.8
#SlurmctldHost=brahms
#ControlAddr=brahms:10.218.28.7
#SlurmctldHost=ravel
#ControlAddr=ravel:10.218.28.73
#SlurmctldHost=verdi
#ControlAddr=verdi:10.218.28.74
# New configuration
SlurmctldHost=wagner(10.218.28.8)
SlurmctldHost=brahms(10.218.28.7)
#SlurmctldHost=ravel(10.218.28.73)
#SlurmctldHost=verdi(10.218.28.74)
#SlurmctldHost=debussy(10.218.28.208)
#SlurmctldHost=schubert(10.218.28.207)
#SlurmctldHost=vivaldi(10.218.28.205)
AuthType=auth/munge
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/true
MaxJobCount=10000
MaxStepCount=40000
MaxTasksPerNode=512    # Maximum tasks per node (this is a count, not a memory unit)
#MaxTasksPerNode=128   # Maximum tasks per node (this is a count, not a memory unit)
#MpiDefault=pmix
MpiDefault=pmi2
#MpiParams=ports=#-#
PluginDir=/usr/local/lib:/usr/local/lib/slurm:/usr/lib:/lib
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
PrologFlags=x11
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
PropagateResourceLimitsExcept=MEMLOCK
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/d
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurm/ctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
# Resource Limits and Defaults
DefMemPerCPU=2048      # 2048 MB = 2 GB per CPU
MaxMemPerCPU=8192      # 8192 MB = 8 GB per CPU
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
#SelectType=select/linear
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
#OverSubscribe=FORCE:5
LaunchParameters=use_interactive_step
#
#
# JOB PRIORITY
#PriorityFlags=
PriorityType=priority/multifactor
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
# This next group determines the weighting of each of the
# components of the Multifactor Job Priority Plugin.
# The default value for each of the following is 1.
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=1000
PriorityWeightPartition=1000
PriorityWeightQOS=0    # don't use the qos factor
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
#AccountingStoreJobComment=YES
ClusterName=cluster
DebugFlags=NO_CONF_HASH
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm/slurmctld_wagner.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm/slurmd_wagner.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=wagner NodeAddr=10.218.28.8 CPUs=192 Boards=1 SocketsPerBoard=2 CoresPerSocket=48 ThreadsPerCore=2 RealMemory=772966 State=UNKNOWN Features=MultiThreading
NodeName=brahms NodeAddr=10.218.28.7 CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=514943 State=UNKNOWN Features=High_Performance
NodeName=ravel NodeAddr=10.218.28.73 CPUs=72 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=2 RealMemory=515224 State=UNKNOWN Features=MemoryOptimized1
NodeName=verdi NodeAddr=10.218.28.74 CPUs=72 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=2 RealMemory=515224 State=UNKNOWN Features=MemoryOptimized2
NodeName=vivaldi NodeAddr=10.218.28.205 CPUs=72 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=2 RealMemory=1030436 State=UNKNOWN Features=MemoryOptimized3
NodeName=debussy NodeAddr=10.218.28.208 CPUs=72 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=2 RealMemory=1030436 State=UNKNOWN Features=MemoryOptimized4
NodeName=schubert NodeAddr=10.218.28.207 CPUs=72 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=2 RealMemory=1030436 State=UNKNOWN Features=MemoryOptimized5
PartitionName=active Nodes=ALL Default=YES MaxTime=INFINITE State=UP Shared=FORCE PreemptMode=Suspend,Gang PreemptType=preempt/partition_prio
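For completeness, these are the checks I can run and share the output of, to compare the node definitions above with what the hardware actually reports and to see which nodes the job was given (a rough sketch; the job id 1234 is only an example):

  slurmd -C                                       # on each compute node: print the hardware layout slurmd detects,
                                                  # to compare against the NodeName lines above
  scontrol show job 1234                          # on the controller: show the NodeList allocated to the job
  scontrol show hostnames "$SLURM_JOB_NODELIST"   # inside the allocation: expand the node list, one hostname per line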