[slurm-users] Re: visualisation of JobComp and JobacctGather data with Grafana - screenshots, ideas?

2024-04-12 Thread Carsten Beyer via slurm-users

Hi Josef,

we use ClusterCockpit for that purpose. Users can monitor their 
running jobs or look at finished ones.


https://www.clustercockpit.org/

Best Regards,
Carsten

--
Carsten Beyer
Systems Department

Deutsches Klimarechenzentrum GmbH (DKRZ)
Bundesstraße 45a * D-20146 Hamburg * Germany

Phone:  +49 40 460094-221
Fax:    +49 40 460094-270
Email:  be...@dkrz.de
URL:    http://www.dkrz.de

Managing Director: Prof. Dr. Thomas Ludwig
Registered office: Hamburg
Hamburg District Court (Amtsgericht Hamburg), HRB 39784


On 10.04.24 at 13:44, Josef Dvoracek via slurm-users wrote:
Does anybody here have a nice visualization of JobComp and JobacctGather 
data in Grafana?


I save JobComp data in Elasticsearch and JobacctGather data in InfluxDB, 
and I am thinking about how to provide meaningful insights to $users.


Things I'd like to show: especially memory and CPU utilization, job 
result, and possible harmful effects such as OOMs.
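
One possible starting point for such panels is a pair of per-job ratios: CPU time used versus CPU time allocated, and peak memory versus requested memory. The following is only a minimal sketch, assuming the values have already been pulled out of the two data stores. The `JobUsage` record, its field names, and the helper functions are hypothetical and do not correspond to any Slurm, Elasticsearch, or InfluxDB schema.

```python
from dataclasses import dataclass


@dataclass
class JobUsage:
    # Hypothetical record layout, for illustration only.
    elapsed_s: float    # wall-clock runtime in seconds
    alloc_cpus: int     # number of CPUs allocated to the job
    total_cpu_s: float  # CPU time actually consumed (user + system), seconds
    max_rss_mb: float   # peak resident memory in MB
    req_mem_mb: float   # requested memory in MB
    state: str          # final job state, e.g. "COMPLETED" or "OUT_OF_MEMORY"


def cpu_efficiency(job: JobUsage) -> float:
    """Fraction of the allocated core-seconds that the job really used."""
    core_seconds = job.elapsed_s * job.alloc_cpus
    return job.total_cpu_s / core_seconds if core_seconds > 0 else 0.0


def mem_efficiency(job: JobUsage) -> float:
    """Peak memory as a fraction of the requested memory."""
    return job.max_rss_mb / job.req_mem_mb if job.req_mem_mb > 0 else 0.0


def is_oom(job: JobUsage) -> bool:
    """True if the final state marks the job as killed for running out of memory."""
    return job.state.startswith("OUT_OF_MEMORY")
```

Ratios like these could be stored as extra fields next to the raw values, so that a dashboard can sort or colour jobs by them instead of recomputing them per query.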


Any screenshots, ideas, or experience are welcome!

cheers

Josef





--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] slurmrestd connect to 192.168.87.113:6819 Connection refused

2024-04-12 Thread shaobo liu via slurm-users
Hi, Slurm is configured with a primary and a backup controller. Requesting
the slurmrestd API produces the errors below. What could be the reason?

# scontrol ping
Slurmctld(primary) at node003 is UP
Slurmctld(backup) at node113 is UP


# systemctl status slurmrestd.service
● slurmrestd.service - Slurm REST daemon
     Loaded: loaded (/lib/systemd/system/slurmrestd.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2024-04-12 17:07:08 CST; 21min ago
   Main PID: 705425 (slurmrestd)
      Tasks: 21 (limit: 629145)
     Memory: 20.3M
     CGroup: /system.slice/slurmrestd.service
             └─705425 /usr/sbin/slurmrestd -f /etc/slurm/slurm.conf unix:/var/spool/slurm/slurmrestd.socket 0.0.0.0:6820 -vvv

Apr 12 17:08:46 node003 slurmrestd[705425]: debug2: _slurm_connect: failed to connect to 192.168.87.113:6819: Connection refused
Apr 12 17:08:46 node003 slurmrestd[705425]: debug2: Error connecting slurm stream socket at 192.168.87.113:6819: Connection refused
Apr 12 17:08:46 node003 slurmrestd[705425]: slurmrestd: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:node113:6819: Connection refused
Apr 12 17:08:46 node003 slurmrestd[705425]: slurmrestd: error: Sending PersistInit msg: Connection refused
Apr 12 17:08:46 node003 slurmrestd[705425]: slurmrestd: error: slurm_rest_auth_p_get_db_conn: unable to connect to slurmdbd: Connection refused
Apr 12 17:08:46 node003 slurmrestd[705425]: slurmrestd: error: init_connection[v0.0.39]:[[2.0.1.191]:50652] rc[7000]=Unable to connect to database -> openapi_get_db_conn() failed to open slurmdb connecti>
Apr 12 17:08:46 node003 slurmrestd[705425]: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:node113:6819: Connection refused
Apr 12 17:08:46 node003 slurmrestd[705425]: error: Sending PersistInit msg: Connection refused
Apr 12 17:08:46 node003 slurmrestd[705425]: error: slurm_rest_auth_p_get_db_conn: unable to connect to slurmdbd: Connection refused
Apr 12 17:08:46 node003 slurmrestd[705425]: error: init_connection[v0.0.39]:[[2.0.1.191]:50652] rc[7000]=Unable to connect to database -> openapi_get_db_conn() failed to open slurmdb connection



[slurm-users] Re: slurmrestd connect to 192.168.87.113:6819 Connection refused

2024-04-12 Thread Nico Derl via slurm-users
Hey,
Are slurmctld and slurmrestd on separate machines? Can you reach them manually? 
Could there be a firewall or a closed port in the way?
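
A quick way to test reachability by hand is a plain TCP connect from the slurmrestd host. The sketch below assumes Python 3 is available there; `can_connect` is a hypothetical helper name, not part of Slurm. It only tells you whether something accepts connections on the port, not whether that service is healthy.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "Connection refused", timeouts, and unreachable or unknown hosts.
        return False
```

For example, `can_connect("192.168.87.113", 6819)` uses the address and port from the log in the original message. 6819 is the default slurmdbd port, so a `False` here on a host that otherwise responds suggests that nothing is listening on that port or that a firewall is rejecting the connection.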

12 Apr 2024, 11:36, from slurm-users@lists.schedmd.com:

> Hi, Slurm is configured with a primary and a backup controller. Requesting 
> the slurmrestd API produces the errors below. What could be the reason?
>
> # scontrol ping
> Slurmctld(primary) at node003 is UP
> Slurmctld(backup) at node113 is UP
>
>
> # systemctl status slurmrestd.service
> ● slurmrestd.service - Slurm REST daemon
>      Loaded: loaded (/lib/systemd/system/slurmrestd.service; enabled; vendor preset: enabled)
>      Active: active (running) since Fri 2024-04-12 17:07:08 CST; 21min ago
>    Main PID: 705425 (slurmrestd)
>       Tasks: 21 (limit: 629145)
>      Memory: 20.3M
>      CGroup: /system.slice/slurmrestd.service
>              └─705425 /usr/sbin/slurmrestd -f /etc/slurm/slurm.conf unix:/var/spool/slurm/slurmrestd.socket 0.0.0.0:6820 -vvv
>
> Apr 12 17:08:46 node003 slurmrestd[705425]: debug2: _slurm_connect: failed to connect to 192.168.87.113:6819: Connection refused
> Apr 12 17:08:46 node003 slurmrestd[705425]: debug2: Error connecting slurm stream socket at 192.168.87.113:6819: Connection refused
> Apr 12 17:08:46 node003 slurmrestd[705425]: slurmrestd: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:node113:6819: Connection refused
> Apr 12 17:08:46 node003 slurmrestd[705425]: slurmrestd: error: Sending PersistInit msg: Connection refused
> Apr 12 17:08:46 node003 slurmrestd[705425]: slurmrestd: error: slurm_rest_auth_p_get_db_conn: unable to connect to slurmdbd: Connection refused
> Apr 12 17:08:46 node003 slurmrestd[705425]: slurmrestd: error: init_connection[v0.0.39]:[[2.0.1.191]:50652] rc[7000]=Unable to connect to database -> openapi_get_db_conn() failed to open slurmdb connecti>
> Apr 12 17:08:46 node003 slurmrestd[705425]: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:node113:6819: Connection refused
> Apr 12 17:08:46 node003 slurmrestd[705425]: error: Sending PersistInit msg: Connection refused
> Apr 12 17:08:46 node003 slurmrestd[705425]: error: slurm_rest_auth_p_get_db_conn: unable to connect to slurmdbd: Connection refused
> Apr 12 17:08:46 node003 slurmrestd[705425]: error: init_connection[v0.0.39]:[[2.0.1.191]:50652] rc[7000]=Unable to connect to database -> openapi_get_db_conn() failed to open slurmdb connection
>

