Re: [slurm-users] How to print a user's creation timestamp from the Slurm database?

2020-01-20 Thread Marcus Wagner
Hi Juergen, THAT's a cool feature, I never knew about that. But this leads to other questions. Short excerpt:
2020-01-19T00:00:18  Modify Clusters  slurmadm  name='rcc' control_host='134.6+
2020-01-19T03:24:16  Modify Clusters  slurmadm  name='rcc' control_host='134.6+
2020-01-19T03:34…

Re: [slurm-users] How to print a user's creation timestamp from the Slurm database?

2020-01-20 Thread Juergen Salk
* Marcus Wagner [200120 09:17]:
> I was astonished about the "Modify Clusters" transactions, so I looked a bit
> further:
> $> sacctmgr list transactions Action="Modify Clusters" -p
> 2020-01-15T00:00:12|Modify Clusters|slurmadm|name='rcc'|control_host='134.61.193.19',
> control_port=6750, last…
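The pipe-delimited output that `sacctmgr list transactions ... -p` produces (as in the excerpt above) is easy to post-process. A minimal sketch, using a sample line in the same shape as the thread's excerpt rather than a live slurmdbd query:

```shell
# Parse 'sacctmgr list transactions -p' style output and print the
# timestamp plus acting user for each "Add Users" record.
# The printf line is sample data mimicking the thread's excerpt.
printf '2020-01-15T00:00:12|Add Users|slurmadm|name=xxx|\n' \
  | awk -F'|' '$2 == "Add Users" {print $1, $3}'
```

On a live system you would pipe the real `sacctmgr ... -p` output into the same `awk` filter.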

Re: [slurm-users] How to print a user's creation timestamp from the Slurm database?

2020-01-20 Thread Ole Holm Nielsen
Hi Jürgen,
On 1/19/20 2:38 PM, Juergen Salk wrote:
* Ole Holm Nielsen [200118 12:06]: When we have created a new Slurm user with "sacctmgr create user name=xxx", I would like to inquire at a later date about the timestamp of the user creation. As far as I can tell, the sacctmgr command cannot…

Re: [slurm-users] How to print a user's creation timestamp from the Slurm database?

2020-01-20 Thread Ole Holm Nielsen
Thanks to input from Niels Carl W. Hansen I have been able to write a new slurmusertable tool, available from https://github.com/OleHolmNielsen/Slurm_tools/tree/master/slurmaccounts This slurmusertable tool reads your current Slurm database user_table and prints out a list of usernames, creation…

Re: [slurm-users] How to print a user's creation timestamp from the Slurm database?

2020-01-20 Thread Ole Holm Nielsen
Hi, I have been exploring how to list Slurm "Add Users" transactions for a specified number of days/weeks/months into the past. The "date" command is very flexible in printing days in the past. Here are some examples: # sacctmgr list transactions Action="Add Users" Start=`date -d "-1 month"
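The `date -d` arithmetic described above can be checked on its own, since `Start=` just takes a formatted timestamp. A small sketch with a fixed reference date so the output is stable (GNU date assumed):

```shell
# GNU date computes timestamps in the past; sacctmgr accepts these for
# the Start= filter. A fixed reference date keeps this reproducible.
date -u -d "2020-01-20 1 month ago" +%Y-%m-%dT%H:%M:%S
```

On a live system the same pattern appears as, e.g., `sacctmgr list transactions Action="Add Users" Start=$(date -d "-1 month" +%Y-%m-%dT%H:%M:%S)`.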

Re: [slurm-users] Job completed but child process still running

2020-01-20 Thread Youssef Eldakar
Thanks for the pointer to proctrack/cgroup! On Mon, Jan 13, 2020 at 6:46 PM Juergen Salk wrote:
> Are you saying that there is absolutely no need to take care
> of potential leftover/stray processes in the epilog script any
> more with proctrack/cgroup enabled?
I tried it, and proctrack/cgr…
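For reference, enabling cgroup-based process tracking involves settings along these lines (a minimal sketch of the commonly documented setup; exact options and defaults depend on your Slurm version):

```
# slurm.conf
ProctrackType=proctrack/cgroup

# cgroup.conf
ConstrainCores=yes
ConstrainRAMSpace=yes
```

With proctrack/cgroup, all processes a job step spawns are placed in the step's cgroup, so slurmd can reliably find and kill stragglers at step end.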

[slurm-users] Node node00x has low real_memory size & slurm_rpc_node_registration node=node003: Invalid argument

2020-01-20 Thread Robert Kudyba
I've posted about this previously here, and here, so I'm trying to get to…

Re: [slurm-users] Node node00x has low real_memory size & slurm_rpc_node_registration node=node003: Invalid argument

2020-01-20 Thread Brian Andrus
Try using "nodename=node003" in the slurm.conf on your nodes. Also, make sure the slurm.conf on the nodes is the same as on the head. Somewhere in there, you have "node=node003" (as well as the other node names). That may even do it, as they may be trying to register generically, so their c…
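The registration error in the subject line usually points at a mismatch between a node's definition in slurm.conf and what slurmd actually reports. A minimal sketch of the relevant lines (hostnames, counts, and sizes here are illustrative, not taken from the thread):

```
# slurm.conf (must be identical on the head node and all compute nodes)
# RealMemory must not exceed what 'slurmd -C' reports on the node itself,
# or registration fails with "low real_memory size".
NodeName=node[001-003] CPUs=24 RealMemory=191000 State=UNKNOWN
PartitionName=defq Nodes=node[001-003] Default=YES State=UP
```

Running `slurmd -C` on the affected node prints the values slurmd will report, which is the quickest way to spot the mismatch.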

Re: [slurm-users] Node node00x has low real_memory size & slurm_rpc_node_registration node=node003: Invalid argument

2020-01-20 Thread Robert Kudyba
We are on a Bright Cluster and their support says the head node controls this. Here you can see the sym links:
[root@node001 ~]# file /etc/slurm/slurm.conf
/etc/slurm/slurm.conf: symbolic link to `/cm/shared/apps/slurm/var/etc/slurm.conf'
[root@ourcluster myuser]# file /etc/slurm/slurm.conf
/etc/…
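The symlink check shown above can be reproduced anywhere. A small sketch using a scratch directory (the paths are made up for illustration, not the Bright Cluster ones):

```shell
# Create a symlinked config in a temp directory and resolve it, mirroring
# the 'file /etc/slurm/slurm.conf' check from the thread.
tmp=$(mktemp -d)
touch "$tmp/shared-slurm.conf"
ln -s "$tmp/shared-slurm.conf" "$tmp/slurm.conf"
readlink "$tmp/slurm.conf"   # prints the symlink target
rm -rf "$tmp"
```

When all nodes symlink to one shared file, editing that file changes the config everywhere, which is why Bright manages it from the head node.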

[slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-20 Thread Dean Schulze
I've got a node running on CentOS 7.7, built from the recent 20.02.0pre1 code base. Its behavior is strange, to say the least. The controller was built from the same code base, but on Ubuntu 19.10. The controller reports the node's state with sinfo, but can't run a simple job with srun because it…

Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-20 Thread Carlos Fenoy
Hi, The * next to the idle status in sinfo means that the node is unreachable/not responding. Check the status of the slurmd on the node and check the connectivity from the slurmctld host to the compute node (telnet may be enough). You can also check the slurmctld logs for more information. Regar
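The trailing `*` described here can be spotted mechanically in sinfo output. A sketch that filters for non-responding nodes (the input line is sample data modeled on output quoted later in this thread, not a live query):

```shell
# A '*' suffix on the STATE column means slurmctld cannot reach the
# node's slurmd. Flag such nodes from sinfo-style output.
# The printf line is sample data, not a live cluster.
printf 'debug* up infinite 1 idle* liqidos-dean-node1\n' \
  | awk '$5 ~ /\*$/ {print $6, "not responding"}'
```

On a live system, `sinfo -h` piped into the same `awk` filter lists every node the controller considers unreachable.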

Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-20 Thread Dean Schulze
If I run sinfo on the node itself it shows an asterisk. How can the node be unreachable from itself?
On Mon, Jan 20, 2020 at 1:50 PM Carlos Fenoy wrote:
> Hi,
>
> The * next to the idle status in sinfo means that the node is
> unreachable/not responding. Check the status of the slurmd on the no…

Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-20 Thread Dean Schulze
If I restart slurmd the asterisk goes away. Then I can run the job once and the asterisk is back, and the node remains in comp*:
[liqid@liqidos-dean-node1 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1   idle liqidos-dean-node1
[liqid@liqidos-dean-no…

Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-20 Thread Brian Andrus
Check the slurmd log file on the node. Ensure slurmd is still running. It sounds possible that the OOM killer or similar may be killing slurmd.
Brian Andrus
On 1/20/2020 1:12 PM, Dean Schulze wrote:
> If I restart slurmd the asterisk goes away. Then I can run the job once and the asterisk is back, and t…

Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-20 Thread Carlos Fenoy
It seems to me that the problem is between the slurmctld and slurmd. When slurmd starts it sends a message to the slurmctld, that's why it appears idle. Every now and then the slurmctld will try to ping the slurmd to check if it's still alive. This ping doesn't seem to be working, so as I mentioned

[slurm-users] Downgraded to slurm 19.05.4 and now slurmctld won't start because of incompatible state

2020-01-20 Thread Dean Schulze
This is what I get from systemctl status slurmctld:
fatal: Can not recover last_tres state, incompatible version, got 8960 need >= 8192 <= 8704, start with '-i' to ignore this
Starting it with the -i option doesn't do anything. Where does Slurm store this state so I can get rid of it? Thanks.

Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-20 Thread Dean Schulze
There's either a problem with the source code I cloned from GitHub, or there is a problem when the controller runs on Ubuntu 19 and the node runs on CentOS 7.7. I'm downgrading to a stable 19.05 build to see if that solves the problem.
On Mon, Jan 20, 2020 at 3:41 PM Carlos Fenoy wrote:
> It se…

Re: [slurm-users] Downgraded to slurm 19.05.4 and now slurmctld won't start because of incompatible state

2020-01-20 Thread Ryan Novosielski
Check slurm.conf for StateSaveLocation. https://slurm.schedmd.com/slurm.conf.html
--
|| \\UTGERS      |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922)
~*~
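The state files slurmctld refuses to read live under whatever StateSaveLocation points to. A minimal sketch of extracting it from a config (the inline config and path here are illustrative; on a live system `scontrol show config | grep StateSaveLocation` answers the same question):

```shell
# Pull StateSaveLocation out of slurm.conf-style text. The saved state
# (last_tres, node_state, job_state, ...) sits in that directory.
# The printf line is a sample config fragment, not a real slurm.conf.
printf 'SlurmctldPort=6817\nStateSaveLocation=/var/spool/slurmctld\n' \
  | awk -F= '$1 == "StateSaveLocation" {print $2}'
```

Note that removing files from that directory discards running/pending job state, so it is a last resort after a failed downgrade.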

Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-20 Thread Ryan Novosielski
The node is not getting the status from itself; it's querying the slurmctld to ask for its status.

Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-20 Thread Chris Samuel
On 20/1/20 3:00 pm, Dean Schulze wrote:
> There's either a problem with the source code I cloned from github, or there is a problem when the controller runs on Ubuntu 19 and the node runs on CentOS 7.7. I'm downgrading to a stable 19.05 build to see if that solves the problem.
I've run the ma…

Re: [slurm-users] Node node00x has low real_memory size & slurm_rpc_node_registration node=node003: Invalid argument

2020-01-20 Thread Marcus Wagner
Dear Robert,
On 1/20/20 7:37 PM, Robert Kudyba wrote:
I've posted about this previously here, and here…