* Christopher Samuel <ch...@csamuel.org> [210514 15:47]:
> > Usage reported in Percentage of Total
> > --------------------------------------------------------------------------------
> >   Cluster      TRES Name    Allocated     Down PLND Dow        Idle Reserved     Reported
> > --------- -------------- ------------ -------- -------- ----------- -------- ------------
> >       oph            cpu       81.93%    0.00%    0.00%      15.85%    2.22%      100.00%
> >       oph            mem       80.60%    0.00%    0.00%      19.40%    0.00%      100.00%
>
> The "Reserved" column is the one you're interested in, it's indicating that
> for the 13th some jobs were waiting for CPUs, not memory.

Hi Chris,

the wording in the documentation is somewhat nebulous, but my understanding is that the "Reserved" column in sreport indicates the amount of resources that were actually idle but reserved by Slurm for scheduling purposes and thus unavailable for immediate job allocations. I assume this includes, for example, the time the scheduler needs to free sufficient resources for the highest-priority job that is waiting for its requested number of nodes to become available. I think there might be more reasons for Slurm to mark resources as reserved (but not resource reservations created with scontrol, as these are reported as "Allocated" resources by sreport unless created with the MAINT or IGNORE_JOBS flags).

Anyway, as far as I understand the documentation, the sreport "Reserved" column by itself does not necessarily indicate the degree of (over-)utilization of the cluster, as it does not take into account the jobs in the queue for which Slurm has not yet started blocking idle resources.

So, confusingly, there is a difference between "Reserved" in sreport and in sacct. In sreport, "Reserved" refers to idle but reserved cluster resources, whereas in sacct "Reserved" means the waiting time of jobs. Or am I misunderstanding this?

However, there is also "Overcommited" in the sreport man page, which looks promising by its description, although its exact definition is not completely clear to me either:

--- snip ---
Overcommited
    Time of eligible jobs waiting in the queue over the Reserved time.
    Unlike Reserved, this has no limit. It is typically useful to
    determine whether your system is overloaded and by how much.
--- snip ---

This field is not included in the report by default but can be added with the Format option, e.g.

sreport -t percent -T ALL cluster utilization Format=TRESName,Allocated,PlannedDown,Down,Idle,Reserved,Overcommitted,Reported

(Note: There seems to be a typo in the sreport man page. It should read "Overcommitted" rather than "Overcommited".)

Best regards
Jürgen
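
P.S. A quick arithmetic check of the numbers quoted above, as a sketch only: the values are copied from the quoted sreport output, while the dictionary layout and the helper name total() are my own and not anything Slurm provides. The five state columns add up to the Reported 100% for both TRES, so "Reserved" is a bounded share of total capacity. That would fit the man page remark that Overcommitted, unlike Reserved, has no limit.

```python
# Percentages copied from the quoted sreport output (cluster "oph").
# "PLND Down" stands for the column that sreport prints truncated as "PLND Dow".
rows = {
    "cpu": {"Allocated": 81.93, "Down": 0.00, "PLND Down": 0.00,
            "Idle": 15.85, "Reserved": 2.22},
    "mem": {"Allocated": 80.60, "Down": 0.00, "PLND Down": 0.00,
            "Idle": 19.40, "Reserved": 0.00},
}


def total(row):
    """Sum all state columns of one TRES row, rounded to two decimals."""
    return round(sum(row.values()), 2)


for tres, row in rows.items():
    # Each row should account for the full "Reported" share of 100%.
    print(f"{tres}: states sum to {total(row):.2f}%")
```

If the states always sum to the reported total, queue pressure beyond the reserved share cannot show up in "Reserved" at all, which is why the Overcommitted field looks like the more useful indicator to me.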