Re: [slurm-users] Exclude Slurm packages from the EPEL yum repository

2021-02-02 Thread Philip Kovacs
Lots of mixed reactions here: many in favor of (and grateful for) the addition to EPEL, many much less enthusiastic. I cannot rename an EPEL package that is now in the wild without providing an upgrade path to the new name. Such an upgrade path would defeat the purpose of the rename and won't help at a
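
The rename question aside, sites that build Slurm themselves can keep the EPEL packages from shadowing a local build with a per-repository exclude; a minimal sketch against the stock EPEL repo file:

    # /etc/yum.repos.d/epel.repo -- add under the [epel] section
    exclude=slurm*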

Re: [slurm-users] [External] Re: salloc: error: Error on msg accept socket: Too many open files

2021-02-02 Thread Prentice Bisbal
Yes, I agree. The user gave me a minimal description of the problem, so I thought he was seeing the error as soon as he called salloc. After pressing him for more information, it turns out he had been working in the salloc session for several hours before the problem occurred, so I think this just

[slurm-users] salloc: error: Error on msg accept socket: Too many open files

2021-02-02 Thread Prentice Bisbal
Has anyone seen this error message before? A user just reported it. A Google search doesn't turn up anything useful. I mean, I understand what "too many open files" means, but I'm surprised to see it in the context of salloc. salloc: error: Error on msg accept socket: Too many open files -- Pre
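
A quick way to check whether salloc really is exhausting its descriptor limit (the PID placeholder is illustrative):

    ulimit -n                            # soft limit in the shell that ran salloc
    ls /proc/<salloc-pid>/fd | wc -l     # descriptors the salloc process holds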

[slurm-users] Slurm - sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:localhost:6819: Connection refused (Zainul Abiddin)

2021-02-02 Thread Michael Smith

Re: [slurm-users] salloc: error: Error on msg accept socket: Too many open files

2021-02-02 Thread Andy Riebs
Run salloc with a smaller number of nodes or tasks, then take a look at lsof (or some other favorite means of finding IP connections). IIRC, each srun/node in the allocation needs 70-80 IP connections with the node running salloc, so a large node count can overwhelm the default allocation of fi
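
A sketch of the check Andy describes, assuming lsof is available and the salloc process is still alive (the PID placeholder is illustrative):

    lsof -p <salloc-pid> | wc -l          # everything salloc holds open
    lsof -i -a -p <salloc-pid> | wc -l    # just its IP connections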

Re: [slurm-users] salloc: error: Error on msg accept socket: Too many open files

2021-02-02 Thread Patrick Goetz
That sounds like a Linux issue. You probably need to raise the maximum number of file descriptors somewhere. Maybe start here: https://rtcamp.com/tutorials/linux/increase-open-files-limit/ On 2/2/21 11:50 AM, Prentice Bisbal wrote: Has anyone seen this error message before? A user just reported it
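
A minimal sketch of raising the limit persistently, assuming a stock pam_limits setup; the values are illustrative:

    # /etc/security/limits.conf
    *    soft    nofile    8192
    *    hard    nofile    65536

    # or just for the current shell, before running salloc:
    ulimit -n 8192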

Re: [slurm-users] Slurm - Munge configuration details

2021-02-02 Thread Benson Muite
On 2/2/21 4:00 PM, Zainul Abiddin wrote: Hi Benson, I am not able to do passwordless ssh between the master and compute nodes using the Munge service. When I run the command below, it asks for a password for the compute node. Am I configuring properly or not? I need clarity on thi

[slurm-users] Slurm - sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:localhost:6819: Connection refused

2021-02-02 Thread Zainul Abiddin
Hi All, I have finished the slurmdbd configuration, and when I try to run the accounting command *sacct* I get the error below. [root@smaster ~]# sacct sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:localhost:6819: Connection refused sacct: error: S
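
Connection refused on port 6819 usually just means slurmdbd is not running or not listening there; a quick sanity check, assuming the stock systemd unit name and config path:

    systemctl status slurmdbd                  # is the daemon up?
    ss -tlnp | grep 6819                       # is anything listening on DbdPort?
    grep -i dbdport /etc/slurm/slurmdbd.conf   # 6819 is the default if unset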

Re: [slurm-users] Slurm - Munge configuration details

2021-02-02 Thread Zainul Abiddin
Hi Benson, I am not able to do passwordless ssh between the master and compute nodes using the Munge service. When I run the command below, it asks for a password for the compute node. *Am I configuring properly or not? I need clarity on this.* [root@smaster ~]# munge -n | ssh snode un
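
Worth separating the two things being tested here: munge authenticates Slurm messages, while passwordless ssh needs an SSH key pair and is independent of munge. A sketch of both checks, using the snode hostname from the thread:

    # munge round trip: should print STATUS: Success (0) on snode
    munge -n | ssh snode unmunge

    # passwordless ssh itself comes from SSH keys, not munge:
    ssh-keygen -t ed25519
    ssh-copy-id root@snode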

Re: [slurm-users] Slurm - Munge configuration details

2021-02-02 Thread Benson Muite
On 2/2/21 3:40 PM, Benson Muite wrote: On 2/2/21 3:30 PM, Zainul Abiddin wrote: Hi All, I am new to Slurm and trying to set up Slurm 20.11.2 on CentOS 7. My environment is a master node (smaster) plus a compute node (snode), and I am using https://www.slothparadise.com/how-to-install-slurm-on-centos-7-clu

[slurm-users] Slurm : compute node status is UNKNOWN and Reason=NO NETWORK ADDRESS FOUND

2021-02-02 Thread Zainul Abiddin
Hi All, please help me resolve this issue. My compute node (snode) status is UNKNOWN and Reason=NO NETWORK ADDRESS FOUND. Master node (smaster): [root@smaster ~]# cat /etc/slurm/slurm.conf # slurm.conf file generated by configurator easy.html. # Put this file on all nodes of your cluster. # S
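
NO NETWORK ADDRESS FOUND generally means slurmctld cannot resolve the node's hostname; a sketch of the usual fixes, with an illustrative IP address:

    getent hosts snode        # does the master resolve the node name?

    # if not, add the node to /etc/hosts on every node:
    192.168.1.196   snode

    # or pin the address explicitly in slurm.conf:
    NodeName=snode NodeAddr=192.168.1.196 ...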

Re: [slurm-users] Slurm - Munge configuration details

2021-02-02 Thread Benson Muite
On 2/2/21 3:30 PM, Zainul Abiddin wrote: Hi All, I am new to Slurm and trying to set up Slurm 20.11.2 on CentOS 7. My environment is a master node (smaster) plus a compute node (snode), and I am using https://www.slothparadise.com/how-to-install-slurm-on-centos-7-cluster/

Re: [slurm-users] Slurm - Munge configuration details

2021-02-02 Thread Zainul Abiddin
Hi, [root@smaster ~]# munge -n | unmunge STATUS: Success (0) ENCODE_HOST: smaster.calligotech.com (192.168.1.195) ENCODE_TIME: 2021-02-01 13:58:04 +0530 (1612168084) DECODE_TIME: 2021-02-01 13:58:04 +0530 (1612168084) TTL: 300 CIPHER: aes128 (4) MAC:
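
A successful local unmunge only proves the key works on smaster; both nodes must share the same /etc/munge/munge.key with restrictive, munge-owned permissions. A sketch of distributing it:

    scp /etc/munge/munge.key root@snode:/etc/munge/
    ssh snode 'chown munge:munge /etc/munge/munge.key && chmod 400 /etc/munge/munge.key'
    ssh snode 'systemctl restart munge'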

[slurm-users] Slurm - Munge configuration details

2021-02-02 Thread Zainul Abiddin
Hi All, I am new to Slurm and trying to set up Slurm 20.11.2 on CentOS 7. My environment is a master node (smaster) plus a compute node (snode), and I am using the https://www.slothparadise.com/how-to-install-slurm-on-centos-7-cluster/ link to set up Slurm on the master and compute nodes. I have tried installing Mung
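
On CentOS 7 the munge packages come from EPEL; a minimal install sketch, assuming the create-munge-key helper shipped by the EPEL munge package (it generates /etc/munge/munge.key on the master):

    yum install -y epel-release
    yum install -y munge munge-libs munge-devel
    create-munge-key
    systemctl enable --now munge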