Try setting RawShares to something greater than 1. I've seen it be the
case that when you set it to 1 it creates weirdness like this.
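For example (just a sketch; "covid" is one of the accounts from the sshare
output below, and 100 is an arbitrary value), the shares could be raised
with sacctmgr and then re-checked:

$ sacctmgr modify account where name=covid set fairshare=100
$ sshare -l -A covid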
-Paul Edmon-
On 7/9/2020 1:12 PM, Dumont, Joey wrote:
Hi,
We recently set up fair tree scheduling (we have 19.05 running), and
are trying to use sshare to see usage information. Unfortunately,
sshare reports all zeros, even though there seems to be data in the
backend DB. Here's an example output:
$ sshare -l
Account              User  RawShares  NormShares  RawUsage  NormUsage  EffectvUsage  FairShare  LevelFS  GrpTRESMins  TRESRunMins
----------------------------------------------------------------------------------------------------------------------------------
root               0  0.000000  0.000000  cpu=0,mem=0,energy=0,node=0,b+
covid           1  0  0.000000  0.000000  cpu=0,mem=0,energy=0,node=0,b+
covid-01        1  0  0.000000  0.000000  cpu=0,mem=0,energy=0,node=0,b+
covid-02        1  0  0.000000  0.000000  cpu=0,mem=0,energy=0,node=0,b+
group1          1  0  0.000000  0.000000  cpu=0,mem=0,energy=0,node=0,b+
subgroup1       1  0  0.000000  0.000000  cpu=0,mem=0,energy=0,node=0,b+
othersubgroups  1  0  0.000000  0.000000  cpu=0,mem=0,energy=0,node=0,b+
othersubgroups  1  0  0.000000  0.000000  cpu=0,mem=0,energy=0,node=0,b+
othersubgroups  1  0  0.000000  0.000000  cpu=0,mem=0,energy=0,node=0,b+
othersubgroups  1  0  0.000000  0.000000  cpu=0,mem=0,energy=0,node=0,b+
othersubgroups  1  0  0.000000  0.000000  cpu=0,mem=0,energy=0,node=0,b+
othersubgroups  1  0  0.000000  0.000000  cpu=0,mem=0,energy=0,node=0,b+
othersubgroups  1  0  0.000000  0.000000  cpu=0,mem=0,energy=0,node=0,b+
othersubgroups  1  0  0.000000  0.000000  cpu=0,mem=0,energy=0,node=0,b+
othersubgroups  1  0  0.000000  0.000000  cpu=0,mem=0,energy=0,node=0,b+
othersubgroups  1  0  0.000000  0.000000  cpu=0,mem=0,energy=0,node=0,b+
subgroups       1  0  0.000000  0.000000  cpu=0,mem=0,energy=0,node=0,b+
subgroups       4  0  0.000000  0.000000  cpu=0,mem=0,energy=0,node=0,b+
subgroups       1  0  0.000000  0.000000  cpu=0,mem=0,energy=0,node=0,b+
SUBGROUP        1  0  0.000000  0.000000  cpu=0,mem=0,energy=0,node=0,b+
SUBGROUP        1  0  0.000000  0.000000  cpu=0,mem=0,energy=0,node=0,b+
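For what it's worth, the usage sitting in the backend DB can be
cross-checked with sreport (a sketch only; the dates below are just an
example window):

$ sreport cluster AccountUtilizationByUser cluster=trixie start=2020-06-01 end=2020-07-09 -t Hours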
And the slurm.conf config:
ClusterName=trixie
SlurmctldHost=trixie(10.10.0.11)
SlurmctldHost=hn2(10.10.0.12)
GresTypes=gpu
SlurmUser=slurm
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
StateSaveLocation=/gpfs/share/slurm/
SlurmdSpoolDir=/var/spool/slurm/d
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/cgroup
ReturnToService=2
PrologFlags=x11
TaskPlugin=task/cgroup
# TIMERS
SlurmctldTimeout=60
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
FastSchedule=1
SchedulerParameters=bf_interval=60,bf_continue,bf_resolution=600,bf_window=2880,bf_max_job_test=5000,bf_max_job_part=1000,bf_max_job_user=10,bf_max_job_start=100
PriorityType=priority/multifactor
PriorityDecayHalfLife=14-0
PriorityWeightFairshare=100000
PriorityWeightAge=1000
PriorityWeightPartition=10000
PriorityWeightJobSize=1000
PriorityMaxAge=1-0
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
JobCompType=jobcomp/none
# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=hn2
AccountingStorageTRES=gres/gpu
# COMPUTE NODES
NodeName=cn[101-136] Procs=32 Gres=gpu:4 RealMemory=192782
# Partitions
PartitionName=JobTesting Nodes=cn[135-136] MaxTime=02:00:00 DefaultTime=00:30:00 MaxMemPerNode=192782 AllowGroups=DT-AI4DCluster-All State=UP
PartitionName=TrixieMain Nodes=cn[106-134] MaxTime=48:00:00 DefaultTime=08:00:00 MaxMemPerNode=192782 AllowGroups=DT-AI4DCluster-All State=UP Default=YES
PartitionName=ItOpsTests Nodes=cn[102-105] MaxTime=INFINITE MaxMemPerNode=192782 AllowGroups=Admin-Access,Manager-Access State=UP
PartitionName=ItOpsImage Nodes=cn101 MaxTime=INFINITE MaxMemPerNode=192782 AllowGroups=Admin-Access State=UP
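Since we use priority/multifactor, I assume the fairshare factor should
also show up in the per-job priority breakdown of pending jobs; a quick
sanity check (just a sketch) would be:

$ sprio -l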
Is there anything that would explain why sshare returns only zeros?
The only peculiarity I can think of is that I don't think we restarted
slurmctld; we only reconfigured it.
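If a full restart turns out to be needed, I assume it would be something
like the following (assuming a systemd-managed slurmctld; adjust to your
setup), followed by a check that the priority settings are actually active:

$ systemctl restart slurmctld
$ scontrol show config | grep -i priority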
Cheers,
Joey Dumont
Technical Advisor, Knowledge, Information, and Technology Services
National Research Council Canada / Government of Canada
joey.dum...@nrc-cnrc.gc.ca / Tel: 613-990-8152 / Cell: 438-340-7436