[slurm-users] Re: Print Slurm Stats on Login

2024-08-20 Thread Simon Andrews via slurm-users
Possibly a bit more elaborate than you want, but I wrote a web-based monitoring 
system for our cluster.  It mostly uses standard Slurm commands for job 
monitoring, but I've also added storage monitoring, which requires a separate 
cron job to run every night.  It was written for our cluster, but it probably 
wouldn't take much work to adapt to another cluster with a similar structure.

You can see the code and some screenshots at:

https://github.com/s-andrews/capstone_monitor

...and there's a video walkthrough at:

https://vimeo.com/982985174

We've also got friendlier scripts for monitoring current and past jobs on 
the command line.  These are in a private repository because some of the other 
information there is more sensitive, but I'm happy to share those scripts.  You 
can see the scripts being used in https://vimeo.com/982986202

Simon.

-----Original Message-----
From: Paul Edmon via slurm-users  
Sent: 09 August 2024 16:12
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Print Slurm Stats on Login

We are working to make our users more aware of their usage. One of the ideas we 
came up with was to have some basic usage stats printed at login (usage over 
the past day, fairshare, job efficiency, etc.). Does anyone have any scripts or 
methods that they use to do this? Before baking my own, I was curious what other 
sites do and whether they would be willing to share their scripts and methodology.

-Paul Edmon-
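
For the "job efficiency" part of the question, here is a minimal sketch of
one possible approach, not anything taken from the scripts mentioned in
this thread. It assumes sacct is asked for the TotalCPU and CPUTimeRAW
fields in --parsable2 format, and it takes efficiency to be used CPU time
divided by allocated CPU time. The function names are made up for
illustration, and whether TotalCPU is populated for every job depends on
how accounting is configured at your site.

```python
import subprocess
from typing import Optional


def slurm_duration_to_seconds(text: str) -> float:
    """Convert '1-02:03:04', '02:03:04' or '03:04.500' to seconds."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    parts = [float(p) for p in text.split(":")]
    while len(parts) < 3:  # pad missing hours (and minutes) with zeros
        parts.insert(0, 0.0)
    hours, minutes, seconds = parts
    return days * 86400 + hours * 3600 + minutes * 60 + seconds


def cpu_efficiency(sacct_text: str) -> Optional[float]:
    """Used CPU time / allocated CPU time over all rows, or None if no data."""
    used = allocated = 0.0
    for line in sacct_text.splitlines():
        if not line.strip():
            continue
        total_cpu, cputime_raw = line.split("|")
        if total_cpu:
            used += slurm_duration_to_seconds(total_cpu)
        allocated += float(cputime_raw or 0)
    return used / allocated if allocated else None


def fetch_last_day(user: str) -> str:
    """Ask sacct for one row per job of `user` started in the last day."""
    cmd = [
        "sacct", f"--user={user}", "--starttime=now-1days",
        "--allocations", "--noheader", "--parsable2",
        "--format=TotalCPU,CPUTimeRAW",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

Calling cpu_efficiency(fetch_last_day("someuser")) on a machine with sacct
available would give a single fraction to print in a login message.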


--
slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send 
an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Print Slurm Stats on Login

2024-08-20 Thread Kevin Broch via slurm-users
Heavyweight solution (although if you have grafana and prometheus going
already a little less so):
https://github.com/rivosinc/prometheus-slurm-exporter


[slurm-users] Slurm hanging behavior

2024-08-20 Thread Richard Yang via slurm-users
Hello Slurm community,

We are using Slurm to deploy training jobs on a large GPU cluster, but we have 
encountered some strange behavior. As newcomers, we wonder whether this is a 
known issue. Some more information:

  *   We are running a relatively old version, 22.0.5.
  *   At relatively high load, jobs hang. What is particularly puzzling is 
this: assume we have nodelist1 with 6 hosts and nodelist2 with 7 hosts, and we 
run a simple ‘hostname’. Deploying on nodelist1 alone or on nodelist2 alone 
works fine, but with all 13 hosts the debug messages show that execution hangs 
after reporting that the last task is done. It then hangs for exactly 180 
seconds.

Does anyone know what the issue might be? We would be happy to post more config 
details or debug messages.
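
To make the comparison described above repeatable, the three runs could be
timed with a small harness along these lines. This is only a sketch for
collecting numbers to post to the list and does not diagnose anything. The
node names in the usage note are hypothetical, and the helper names are
made up for illustration. The srun options used (--nodes,
--ntasks-per-node, --nodelist) are standard ones.

```python
import subprocess
import time
from typing import Callable, Dict, List, Sequence


def srun_hostname_cmd(nodes: Sequence[str]) -> List[str]:
    """Build an srun command running one `hostname` task on each listed node."""
    return [
        "srun",
        f"--nodes={len(nodes)}",
        "--ntasks-per-node=1",
        f"--nodelist={','.join(nodes)}",
        "hostname",
    ]


def time_command(cmd: List[str], run=subprocess.run, clock=time.monotonic) -> float:
    """Run the command and return its wall-clock duration in seconds."""
    start = clock()
    run(cmd, check=True, capture_output=True)
    return clock() - start


def compare_node_sets(
    node_sets: Dict[str, Sequence[str]],
    timer: Callable[[List[str]], float] = time_command,
) -> Dict[str, float]:
    """Time `srun hostname` once per labelled node set."""
    return {label: timer(srun_hostname_cmd(nodes)) for label, nodes in node_sets.items()}
```

For example, compare_node_sets({"list1": list1, "list2": list2, "both":
list1 + list2}), with list1 and list2 holding your own host names, would
show whether only the combined run picks up the extra 180 seconds.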

Thank you so much!
Richard


[slurm-users] Re: Print Slurm Stats on Login

2024-08-20 Thread Davide DelVento via slurm-users
Thanks Kevin and Simon,

The full thing that you do is indeed overkill; however, I was able to learn
how to collect/parse some of the information I need.

What I am still unable to get is:

- utilization by queue (or list of node names), to track actual use of
expensive resources such as GPUs, high-memory nodes, etc.
- statistics about wait-in-queue for jobs, due to unavailable resources

Ideally both would be in an sreport-like format, per user and for the system
overall.
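
For the first item, one possible starting point is to sum GPU-hours from
sacct output. The sketch below assumes the output was produced with
--parsable2 --noheader --allocations and
--format=User,Partition,ElapsedRaw,AllocTRES. The function names and the
sample rows in the usage note are invented for illustration. Note that it
reports consumption only. Turning that into a utilization percentage still
requires dividing by the GPU-hours available in each partition, which this
sketch does not attempt.

```python
from collections import defaultdict
from typing import Dict


def gpus_in_tres(alloc_tres: str) -> int:
    """Return the generic GPU count from e.g. 'cpu=8,gres/gpu=2,mem=32G'."""
    for item in alloc_tres.split(","):
        key, _, value = item.partition("=")
        # Only the untyped entry is counted, so that typed entries such as
        # 'gres/gpu:a100=2' are not counted a second time.
        if key == "gres/gpu":
            return int(value)
    return 0


def gpu_hours(sacct_text: str, column: int) -> Dict[str, float]:
    """Sum GPU-hours keyed by the chosen column (0 = user, 1 = partition)."""
    totals: Dict[str, float] = defaultdict(float)
    for line in sacct_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("|")
        elapsed_seconds = int(fields[2])
        totals[fields[column]] += gpus_in_tres(fields[3]) * elapsed_seconds / 3600
    return dict(totals)
```

With three made-up rows (alice using 2 GPUs for 2 hours and bob using 1 GPU
for 1 hour in a "gpu" partition, plus a CPU-only job), gpu_hours(text, 1)
groups by partition and gpu_hours(text, 0) groups by user.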

I suspect this information is available in sacct, but needs some
massaging/consolidation to become useful for what I am looking for. Perhaps
either (or both) of your scripts already do that in some place that I did
not find? That would be terrific, and I'd appreciate it if you can point me
to its place.
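
As an illustration of the kind of massaging this would take for the
wait-in-queue numbers, here is a rough sketch. It assumes sacct output
produced with --parsable2 --noheader --allocations and
--format=User,Partition,Submit,Start, with timestamps in the default
YYYY-MM-DDTHH:MM:SS form. The function names are made up. Jobs that never
started (Start shown as "Unknown" or "None") are skipped. Measuring from
Submit also counts time a job spent held or waiting on dependencies, so the
Eligible field may be a better starting point at some sites.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median
from typing import Dict, List


def wait_seconds_by(sacct_text: str, column: int) -> Dict[str, List[float]]:
    """Collect queue waits in seconds, keyed by column (0 = user, 1 = partition)."""
    waits: Dict[str, List[float]] = defaultdict(list)
    for line in sacct_text.splitlines():
        fields = line.split("|")
        if len(fields) != 4:
            continue
        try:
            submit = datetime.fromisoformat(fields[2])
            start = datetime.fromisoformat(fields[3])
        except ValueError:
            # Start is not a timestamp for jobs that have not started.
            continue
        waits[fields[column]].append((start - submit).total_seconds())
    return dict(waits)


def summarize(waits: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Reduce each list of waits to a job count, median and maximum."""
    return {
        key: {"jobs": len(values), "median_s": median(values), "max_s": max(values)}
        for key, values in waits.items()
    }
```

summarize(wait_seconds_by(text, 1)) gives per-partition figures, and
passing 0 instead of 1 gives the per-user view.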

Thanks again!


[slurm-users] Re: Print Slurm Stats on Login

2024-08-20 Thread Ole Holm Nielsen via slurm-users

Hi Davide,

Did you already check out what the slurmacct script can do for you?  See 
https://github.com/OleHolmNielsen/Slurm_tools/blob/master/slurmacct/slurmacct


What you're asking for seems like a pretty heavy task in terms of system 
resources and Slurm database requests.  Surely you don't intend this to run 
every time a user starts a login shell?  Some users might run "bash -l" 
inside jobs to emulate a login session, causing a heavy load on your servers.
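
One way to address both concerns, sketched here under some assumptions, is
to have a cron job pre-compute a small per-user summary file and let the
login hook do nothing more than print that file. The hook can also skip
shells that are not interactive or that run inside a job, where Slurm sets
SLURM_JOB_ID in the environment. The cache location mentioned in the usage
note and the function names are hypothetical.

```python
import os
import time
from pathlib import Path
from typing import Mapping, Optional


def should_show_stats(env: Mapping[str, str], stdout_is_tty: bool) -> bool:
    """Show stats only for interactive shells that are not inside a Slurm job."""
    return stdout_is_tty and "SLURM_JOB_ID" not in env


def read_cached_summary(
    path: Path, max_age_s: float = 86400, now: Optional[float] = None
) -> Optional[str]:
    """Return the pre-computed summary, or None if it is missing or stale."""
    now = time.time() if now is None else now
    try:
        if now - path.stat().st_mtime > max_age_s:
            return None
        return path.read_text()
    except OSError:
        return None
```

A login script would then call should_show_stats(os.environ,
os.isatty(1)) and, if that is true, print the result of
read_cached_summary() for a file such as
/var/cache/slurm-login-stats/<user>.txt, so that no Slurm database query
happens at login time.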


/Ole
