[slurm-dev] srun: error: _server_read: fd 18 error reading header: Connection reset by peer

2017-10-17 Thread Chaofeng Zhang
I hit the following errors when using Slurm:

srun: error: _server_read: fd 18 error reading header: Connection reset by peer
srun: error: step_launch_notify_io_failure: aborting, io error with slurmstepd on node 1
0: slurmstepd: error: execve(): singularity: No such file or directory
srun: error: master: ta

[slurm-dev] Re: srun: error: _server_read: fd 18 error reading header: Connection reset by peer

2017-10-17 Thread Wensheng Deng
Could it be that it cannot find the singularity executable? On Tue, Oct 17, 2017 at 8:04 AM, Chaofeng Zhang wrote: > I met the error when using slurm. > > srun: error: _server_read: fd 18 error reading header: Connection reset by > peer > srun: error: step_launch_notify_io_failure: ab
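A quick way to confirm that, assuming a two-node allocation (the image name below is only a placeholder):

    # does 'singularity' resolve on the compute nodes' PATH?
    srun --nodes=2 --ntasks-per-node=1 which singularity
    # if not, call it by its full install path instead, e.g.
    srun --nodes=2 /usr/local/bin/singularity exec container.img hostname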

[slurm-dev] slurmctld causes slurmdbd to seg fault

2017-10-17 Thread Loris Bennett
Hi, We have been having some issues with NFS mounts via InfiniBand getting dropped by nodes. We ended up switching our main admin server, which provides NFS and Slurm, from one machine to another. Now, however, if slurmdbd is started, as soon as slurmctld starts, slurmdbd seg faults. In the slurmdbd.l

[slurm-dev] Re: slurmctld causes slurmdbd to seg fault

2017-10-17 Thread Douglas Jacobsen
You probably have a core file in the directory that slurmdbd logs to; a backtrace from gdb would be most telling. On Oct 17, 2017 08:17, "Loris Bennett" wrote: > > Hi, > > We have been having some issues with NFS mounts via Infiniband getting dropped > by nodes. We ended up switching our main admin s
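A minimal sketch of pulling that backtrace (the binary and log-directory paths are assumptions; substitute your own LogFile location):

    # find the core dump where slurmdbd was running, then:
    gdb /usr/sbin/slurmdbd /var/log/slurm/core.12345
    (gdb) bt full                 # full backtrace of the crashing thread
    (gdb) thread apply all bt     # backtraces for all threads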

[slurm-dev] Head Node Hardware Requirements

2017-10-17 Thread Daniel Barker
Hi, All, I am gathering hardware requirements for head nodes for my next cluster. The new cluster will have ~1500 nodes. We ran 5 million jobs last year. I plan to run the slurmctld on one node and the slurmdbd on another. I also plan to write the StateSaveLocation to an NFS appliance. Does the f

[slurm-dev] Re: Head Node Hardware Requirements

2017-10-17 Thread Paul Edmon
From my experience that should all be sufficient. The real key is a high clock rate CPU and sufficient memory. SSDs for the Slurm spool directory are good too. You will want to put at least CentOS 7 on this, and run the latest version of MariaDB as well. -Paul Edmon- On 10/17/2017 5:13 PM
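For reference, the paths in question are set in slurm.conf; a small excerpt along the lines the original poster describes (the values are illustrative, not recommendations):

    # slurm.conf (excerpt)
    StateSaveLocation=/nfs/slurm/state        # controller state, here on the NFS appliance
    SlurmdSpoolDir=/var/spool/slurmd          # local spool on each compute node
    AccountingStorageType=accounting_storage/slurmdbd
    AccountingStorageHost=dbd-host.example.org  # the separate slurmdbd machine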

[slurm-dev] Qos limits associations and AD auth

2017-10-17 Thread Nadav Toledo
Hey everyone, I work at a university and we are trying to set up a Slurm cluster for courses and research. For the courses we would like to enforce QOS on users who connect via pbis-open auth, meaning they authenticate against an AD server. There are a lot of users and ea
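Assuming the limits live in the Slurm accounting database, the usual sacctmgr steps look roughly like this (the account, QOS, and user names are placeholders):

    # create a QOS carrying the course limits and attach it to an account/user
    sacctmgr add qos course_qos
    sacctmgr modify qos course_qos set MaxJobsPerUser=4 MaxTRESPerUser=cpu=8
    sacctmgr add account courses Description="teaching"
    sacctmgr add user alice Account=courses
    sacctmgr modify user alice set QOS=course_qos DefaultQOS=course_qos
    # and in slurm.conf:
    #   AccountingStorageEnforce=associations,limits,qos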

[slurm-dev] Re: Qos limits associations and AD auth

2017-10-17 Thread Benjamin LIPERE
You can use AD, but it is bothersome in many ways. Use a web portal to manage your users. On 18 Oct 2017 07:25, "Nadav Toledo" wrote: > Hey everyone, > I am working at a university and we are trying to set up a slurm cluster for > courses and research. > for the courses we would like to enforce

[slurm-dev] Re: Qos limits associations and AD auth

2017-10-17 Thread Nadav Toledo
Sorry for all the weird symbols, I was copying the code from a Linux terminal. Here is the clean code (I hope): if ((accounting_enforce & ACCOUNTING_ENFORCE_QOS) && assoc_ptr && !admin && (!assoc_ptr->usage->valid_qos || !bit_test(assoc_ptr->
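That check appears to fire when the requested QOS is not among the QOS values valid for the user's association, so it is worth confirming what the association actually carries; a quick look (the user name is a placeholder):

    # which QOS values are attached to the user's association?
    sacctmgr show assoc where user=alice format=Cluster,Account,User,QOS,DefaultQOS
    # then compare with what is requested at submission, e.g.
    sbatch --qos=course_qos job.sh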

[slurm-dev] Re: Qos limits associations and AD auth

2017-10-17 Thread Benjamin LIPERE
Yo. Put up a freaking web portal; if you add this to the cluster, you and your students will have to manage it, and they will get bad habits from it. Or install a Singularity cluster. You can code all this in an afternoon, easily. On 18 Oct 2017 07:35, "Nadav Toledo" wrote: > Sorry for all the weird symbol

[slurm-dev] Re: Qos limits associations and AD auth

2017-10-17 Thread Nadav Toledo
Can you elaborate on what exactly you mean by a web portal? At the moment users log in to the login server via ssh with their AD credentials; these creds are authenticated against AD via pbis-open. What do you suggest I add to these me

[slurm-dev] Re: Qos limits associations and AD auth

2017-10-17 Thread Benjamin LIPERE
Well, on security: that is the wrong starting point. HPC is not secure, unless you have a team of ~10 people. I hope that at least you put the cluster behind a router firewall in a demilitarized zone; if you did not, that is a second problem. Also, the third issue is that you let ssh access to not trust

[slurm-dev] Re: Head Node Hardware Requirements

2017-10-17 Thread Benjamin Redling
On 17 October 2017 23:12:35 MESZ, Daniel Barker wrote: >Hi, All, > >I am gathering hardware requirements for head nodes for my next >cluster. >The new cluster will have ~1500 nodes. We ran 5 million jobs last year. >I >plan to run the slurmctld on one node and the slurmdbd on another. I >also