[slurm-users] Re: Print Slurm Stats on Login

2024-08-27 Thread Simon Andrews via slurm-users
Those pieces of information are available from squeue / sacct as long as you’re 
happy to have a wrapper which does the aggregation part for you.  The commands 
I parse for our stat summaries are:

scontrol show nodes

squeue -r -O jobid,username,minmemory,numcpus,nodelist

sacct -a -S [one_month_ago] -o jobid,jobname,alloccpus,cputime%15,reqmem,account,submit,elapsed,state
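
For the aggregation part, a rough, untested sketch along these lines sums CPU time per account from a similar sacct query (it assumes the CPUTimeRAW field is available; adjust to taste):

# Rough sketch (untested): CPU-hours per account over the last month.
# -X = allocations only, -n = no header, -P = pipe-separated output.
sacct -a -X -n -P -S "$(date -d '1 month ago' +%F)" -o account,cputimeraw |
awk -F'|' '{ s[$1] += $2 }
    END { for (a in s) printf "%-20s %10.1f CPU-hours\n", a, s[a]/3600 }'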

The only thing I can't find an easy way to get is the total requested memory 
for a job.  You'd think this would be simple with squeue minmemory, except that 
for some jobs that value applies to the whole job, while for others it is per 
CPU, so to get the total you have to multiply by the number of requested CPUs.  
The only place I've managed to find that setting is from

scontrol show jobid -d [jobid]

where you can examine the "MinMemoryCPU" value. However, this is really slow if 
you're doing it for thousands of jobs. If anyone knows how to get this to show 
up correctly in squeue/sacct, that would be super helpful.
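
For what it's worth, a rough, untested sketch of that slow per-job lookup is below; it derives a total from MinMemoryCPU (per allocated CPU) or MinMemoryNode (per node/job) in the scontrol output. Field names can vary between Slurm versions, so treat it as an assumption rather than a recipe:

#!/bin/bash
# Rough sketch: print the memory request basis for one job from scontrol.
# MinMemoryCPU is per allocated CPU (multiply by NumCPUs for a total);
# MinMemoryNode applies to the whole node/job.
jobid="$1"
scontrol show job -d "$jobid" | tr ' ' '\n' | awk -F= '
    /^MinMemoryCPU=/  { percpu  = $2 }
    /^MinMemoryNode=/ { pernode = $2 }
    /^NumCPUs=/       { cpus    = $2 }
    END {
        if (percpu != "") print "per-cpu: " percpu " x " cpus " CPUs"
        else              print "per-node/job: " pernode
    }'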

Simon.


From: Davide DelVento 
Sent: 21 August 2024 00:14
To: Kevin Broch; Simon Andrews
Cc: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] Re: Print Slurm Stats on Login





Thanks Kevin and Simon,

The full thing that you do is indeed overkill for me, but I was able to learn 
how to collect/parse some of the information I need.

What I am still unable to get is:

- utilization by queue (or list of node names), to track actual use of 
expensive resources such as GPUs, high memory nodes, etc
- statistics about wait-in-queue for jobs, due to unavailable resources

hopefully both in an sreport-like format, broken down by user and for the overall system

I suspect this information is available in sacct, but needs some 
massaging/consolidation to become useful for what I am looking for. Perhaps 
either (or both) of your scripts already do that somewhere that I did not find? 
That would be terrific, and I'd appreciate it if you could point me to it.
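
For reference, a rough, untested sketch of that kind of massaging for the wait-in-queue part might look like this; it assumes SLURM_TIME_FORMAT=%s makes sacct emit epoch seconds (check your version), and computes a mean Submit-to-Start wait per partition:

# Rough sketch (untested): mean queue wait per partition over the last month.
# SLURM_TIME_FORMAT=%s is assumed to make sacct print epoch seconds.
SLURM_TIME_FORMAT=%s sacct -a -X -n -P \
    -S "$(date -d '1 month ago' +%F)" \
    -o partition,submit,start |
awk -F'|' '$2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ && $3 >= $2 {
        wait[$1] += $3 - $2; n[$1]++
    }
    END {
        for (p in wait)
            printf "%-15s jobs=%-6d mean_wait=%.0fs\n", p, n[p], wait[p]/n[p]
    }'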

Thanks again!

On Tue, Aug 20, 2024 at 9:09 AM Kevin Broch via slurm-users 
<slurm-users@lists.schedmd.com> wrote:
Heavyweight solution (although a little less so if you already have Grafana and 
Prometheus running): https://github.com/rivosinc/prometheus-slurm-exporter

On Tue, Aug 20, 2024 at 12:40 AM Simon Andrews via slurm-users 
<slurm-users@lists.schedmd.com> wrote:
Possibly a bit more elaborate than you want, but I wrote a web-based monitoring 
system for our cluster.  It mostly uses standard Slurm commands for job 
monitoring, but I've also added storage monitoring, which requires a separate 
cron job to run every night.  It was written for our cluster, but probably 
wouldn't take much work to adapt to another cluster with a similar structure.

You can see the code and some screenshots at:

 https://github.com/s-andrews/capstone_monitor

..and there's a video walk through at:

https://vimeo.com/982985174

We've also got friendlier scripts for monitoring current and past jobs on the 
command line.  These are in a private repository, as some of the other 
information there is more sensitive, but I'm happy to share those scripts.  You 
can see the scripts being used in 
https://vimeo.com/982986202

Simon.

-----Original Message-----
From: Paul Edmon via slurm-users 
<slurm-users@lists.schedmd.com>
Sent: 09 August 2024 16:12
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Print Slurm Stats on Login

We are working to make our users more aware of their usage. One of the ideas we 
came up with was to have some basic usage stats printed at login (usage over 
the past day, fairshare, job efficiency, etc.). Does anyone have any scripts or 
methods that they use to do this? Before baking my own I was curious what other 
sites do and whether they would be willing to share their scripts and methodology.

-Paul Edmon-


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Best practices for tracking jobs started across multiple clusters for accounting purposes.

2024-08-27 Thread Di Bernardini, Fabio via slurm-users
I need to account for jobs composed of multiple jobs launched on multiple 
federated (and non-federated) clusters, which therefore have different job IDs. 
What are the best practices to prevent users from bypassing this tracking?






-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: REST API - get_user_environment

2024-08-27 Thread jpuerto--- via slurm-users
For those that are interested, I was able to resolve this by doing the 
following:

- Remove the "get_user_environment" attribute from the "job" object.
- Add an attribute titled "SLURM_GET_USER_ENV" to the "environment" object and 
set it equal to 1.

Example before change:

{
 "job": {
  "get_user_environment": 1
 }
}

Example after change:
{
 "job": {
  "environment": {
   "SLURM_GET_USER_ENV": 1
  }
 }
}
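
(Purely for illustration, a hypothetical curl call carrying a payload shaped like the "after" example against a v0.0.40 slurmrestd might look like the sketch below; the endpoint path, port, and JWT headers are assumptions and will differ per deployment.)

# Hypothetical sketch only - adjust endpoint, port, and auth for your site.
curl -s -X POST "http://localhost:6820/slurm/v0.0.40/job/submit" \
  -H "X-SLURM-USER-NAME: $USER" \
  -H "X-SLURM-USER-TOKEN: $SLURM_JWT" \
  -H "Content-Type: application/json" \
  -d '{
        "script": "#!/bin/bash\nhostname",
        "job": {
          "current_working_directory": "/tmp",
          "environment": { "SLURM_GET_USER_ENV": 1 }
        }
      }'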

Is anyone in contact with the development team? I feel that this is pretty 
basic functionality that was removed from the REST API without warning. 
Considering that this was a "patch" release (based on traditional semantic 
versioning guidelines), this type of modification shouldn't have happened and 
makes me worry about upgrading in the future.

Best regards,

Juan

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Print Slurm Stats on Login

2024-08-27 Thread Paul Edmon via slurm-users
This thread went in a bunch of different directions. However, I ran with 
Jeffrey's suggestion and wrote up a profile.d script along with other 
supporting scripts to pull the data. The setup I put together is here 
for the community to use as they see fit:


https://github.com/fasrc/puppet-slurm_stats

While this is written as a Puppet module, the scripts therein can be used by 
anyone, as it's a pretty straightforward setup and the templates have obvious 
places to do a find and replace.


Naturally I'm happy to take additional merge requests. Thanks for all 
the interesting conversation about this. Lots of great ideas.


-Paul Edmon-

On 8/9/24 12:04 PM, Jeffrey T Frey wrote:

You'd have to do this within e.g. the system's bashrc infrastructure.  The 
simplest idea would be to add something like /etc/profile.d/zzz-slurmstats.sh 
and have some canned commands/scripts run there.  That does introduce load on 
the system and Slurm at every login, though, and slows the startup of login 
shells depending on how responsive slurmctld/slurmdbd are at that moment.

Another option would be to run the commands/scripts for all users on some timed 
schedule, e.g. produce per-user stats every 30 minutes.  So long as the stats 
are publicly visible anyway, put those summaries in a shared file system with 
open read access.  Name the files by uid number.  Now your /etc/profile.d 
script just cats ${STATS_DIR}/$(id -u).
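
A minimal, untested sketch of that second approach (the path, script name, and cron schedule below are purely illustrative):

# /etc/profile.d/zzz-slurmstats.sh - illustrative sketch, not a drop-in file.
# A cron job elsewhere regenerates one summary file per uid in STATS_DIR,
# e.g. "*/30 * * * * root /usr/local/sbin/gen-slurm-stats.sh" (hypothetical),
# so login shells only pay the cost of a cat.
STATS_DIR=/shared/slurm-stats    # hypothetical shared, world-readable path
f="${STATS_DIR}/$(id -u)"
[ -r "$f" ] && cat "$f"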





On Aug 9, 2024, at 11:11, Paul Edmon via slurm-users wrote:

We are working to make our users more aware of their usage. One of the ideas we 
came up with was to have some basic usage stats printed at login (usage over 
the past day, fairshare, job efficiency, etc.). Does anyone have any scripts or 
methods that they use to do this? Before baking my own I was curious what other 
sites do and whether they would be willing to share their scripts and methodology.

-Paul Edmon-


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Slurm versions 24.05.3 and 23.11.10 are now available

2024-08-27 Thread Marshall Garey via slurm-users
We are pleased to announce the availability of Slurm versions 24.05.3 
and 23.11.10.


Version 24.05.3 fixes a potential database problem when deleting a qos. 
This bug only existed in 24.05.


Both versions have fixes for jobs potentially being stuck when using 
cloud nodes when some nodes are powered down, a regression in 23.11.9 
and 24.05.2 that caused sattach to crash, and some other minor issues.


Slurm can be downloaded from https://www.schedmd.com/downloads.php .

--
Marshall Garey
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support


* Changes in Slurm 24.05.3
==
 -- data_parser/v0.0.40 - Added field descriptions
 -- slurmrestd - Avoid creating new slurmdbd connection per request to
'* /slurm/slurmctld/*/*' endpoints.
 -- Fix compilation issue with switch/hpe_slingshot plugin.
 -- Fix gres per task allocation with threads-per-core.
 -- data_parser/v0.0.41 - Added field descriptions
 -- slurmrestd - Change back generated OpenAPI schema for
`DELETE /slurm/v0.0.40/jobs/` to RequestBody instead of using parameters
for request. slurmrestd will continue to accept endpoint requests via
RequestBody or HTTP query.
 -- topology/tree - Fix issues with switch distance optimization.
 -- Fix potential segfault of secondary slurmctld when falling back to the
primary when running with a JobComp plugin.
 -- Enable --json/--yaml=v0.0.39 options on client commands to dump data using
data_parser/v0.0.39 instead of outputting nothing.
 -- switch/hpe_slingshot - Fix issue that could result in a 0 length state file.
 -- Fix unnecessary message protocol downgrade for unregistered nodes.
 -- Fix unnecessarily packing alias addrs when terminating jobs with a mix of
non-cloud/dynamic nodes and powered down cloud/dynamic nodes.
 -- accounting_storage/mysql - Fix issue when deleting a qos that could remove
too many commas from the qos and/or delta_qos fields of the assoc table.
 -- slurmctld - Fix memory leak when using RestrictedCoresPerGPU.
 -- Fix allowing access to reservations without MaxStartDelay set.
 -- Fix regression introduced in 24.05.0rc1 breaking srun --send-libs parsing.
 -- Fix slurmd vsize memory leak when using job submission/allocation commands
that implicitly or explicitly use --get-user-env.
 -- slurmd - Fix node going into invalid state when using CPUSpecList and
setting CPUs to the # of cores on a multithreaded node
 -- Fix reboot asap nodes being considered in backfill after a restart.
 -- Fix --clusters/-M queries for clusters outside of a federation when
fed_display is configured.
 -- Fix scontrol allowing updating job with bad cpus-per-task value.
 -- sattach - Fix regression from 24.05.2 security fix leading to crash.
 -- mpi/pmix - Fix assertion when built under --enable-debug.



* Changes in Slurm 23.11.10
===
 -- switch/hpe_slingshot - Fix issue that could result in a 0 length state file.
 -- Fix unnecessary message protocol downgrade for unregistered nodes.
 -- Fix unnecessarily packing alias addrs when terminating jobs with a mix of
non-cloud/dynamic nodes and powered down cloud/dynamic nodes.
 -- Fix allowing access to reservations without MaxStartDelay set.
 -- Fix scontrol allowing updating job with bad cpus-per-task value.
 -- sattach - Fix regression from 23.11.9 security fix leading to crash.


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: REST API - get_user_environment

2024-08-27 Thread Chris Samuel via slurm-users

On 15/8/24 10:55 am, jpuerto--- via slurm-users wrote:


Any ideas on whether there's a way to mirror this functionality in v0.0.40?


Sorry for not seeing this sooner. I don't, I'm afraid!

All the best,
Chris

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: REST API - get_user_environment

2024-08-27 Thread Chris Samuel via slurm-users

On 22/8/24 11:18 am, jpuerto--- via slurm-users wrote:


Do you have a link to that code? Haven't had any luck finding that repo


It's here (on the 23.11 branch):

https://github.com/SchedMD/slurm/tree/slurm-23.11/src/slurmrestd/plugins/openapi/dbv0.0.38

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: REST API - get_user_environment

2024-08-27 Thread Chris Samuel via slurm-users

On 27/8/24 10:26 am, jpuerto--- via slurm-users wrote:


Is anyone in contact with the development team?


Folks with a support contract can submit bugs at 
https://support.schedmd.com/



I feel that this is pretty basic functionality that was removed from the REST API without 
warning. Considering that this was a "patch" release (based on traditional 
semantic versioning guidelines), this type of modification shouldn't have happened and 
makes me worry about upgrading in the future.


Slurm hasn't used semantic versioning for a long time; it moved to a 
year.month.minor version scheme years ago. Major releases are 
(now) every 6 months, so the most recent ones have been:


* 23.02.0
* 23.11.0 (old 9 month system)
* 24.05.0 (new 6 month system)

Next major release should be in November:

* 24.11.0

All the best,
Chris

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Spread a multistep job across clusters

2024-08-27 Thread Chris Samuel via slurm-users

On 26/8/24 8:40 am, Di Bernardini, Fabio via slurm-users wrote:

Hi everyone, for accounting reasons, I need to create only one job 
across two or more federated clusters with two or more srun steps.


The limitations for heterogeneous jobs say:

https://slurm.schedmd.com/heterogeneous_jobs.html#limitations

> In a federation of clusters, a heterogeneous job will execute
> entirely on the cluster from which the job is submitted. The
> heterogeneous job will not be eligible to migrate between clusters
> or to have different components of the job execute on different
> clusters in the federation.

However, from your script it's not clear to me that that's what you mean, 
because you include multiple --cluster options. I'm not sure whether that 
works, since, as you mention, the docs don't cover that case. They do say 
(however) that:


> If a heterogeneous job is submitted to run in multiple clusters not
> part of a federation (e.g. "sbatch --cluster=alpha,beta ...") then
> the entire job will be sent to the cluster expected to be able to
> start all components at the earliest time.

My gut instinct is that this isn't going to work; my feeling is that launching 
a heterogeneous job like this would require the slurmctlds on each cluster to 
coordinate, and I'm not aware of that being possible currently.


All the best,
Chris

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com