Re: [slurm-users] Multi-node job failure

2019-12-12 Thread Chris Samuel
On 11/12/19 8:05 am, Chris Woelkers - NOAA Federal wrote: Partial progress. The scientist who developed the model took a look at the output and found that instead of one model run being run in parallel, srun had run multiple instances of the model, one per thread, which for this test was 110 threads.

Re: [slurm-users] Multi-node job failure

2019-12-11 Thread Chris Woelkers - NOAA Federal
Partial progress. The scientist who developed the model took a look at the output and found that instead of one model run being run in parallel, srun had run multiple instances of the model, one per thread, which for this test was 110 threads. I have a feeling this just verified the same thing that
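[Editor's note: a sketch of that failure mode, assuming the job asked for one task per hardware thread; the task count and executable name are illustrative, not taken from the thread.]

    #SBATCH --ntasks=110        # one task per hardware thread (assumed)

    # srun starts one copy of the step per task. If the MPI library cannot
    # reach Slurm's PMI layer, the 110 copies never join a single MPI
    # world, and each one runs the whole model on its own.
    srun ./model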

Re: [slurm-users] Multi-node job failure

2019-12-11 Thread Chris Woelkers - NOAA Federal
I tried a simple thing of swapping out mpirun in the sbatch script for srun. Nothing more, nothing less. The model is now working on at least two nodes; I will have to test again on more, but this is progress. Thanks, Chris Woelkers IT Specialist National Oceanic and Atmospheric Administration Great Lakes
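[Editor's note: a minimal sketch of the swap described here; the resource requests and executable name are placeholders, not from the thread.]

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=20   # placeholder rank count

    # Before: mpirun had to discover the allocated hosts on its own.
    # mpirun -np "$SLURM_NTASKS" ./model

    # After: srun launches exactly the tasks Slurm allocated, where it
    # allocated them.
    srun ./model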

Re: [slurm-users] Multi-node job failure

2019-12-11 Thread Chris Woelkers - NOAA Federal
Thanks all for the ideas and possibilities. I will answer all in turn. Paul: Neither of the switches in use, Ethernet and InfiniBand, has any form of broadcast storm protection enabled. Chris: I have passed on your question to the scientist who created the sbatch script. I will also look into

Re: [slurm-users] Multi-node job failure

2019-12-11 Thread Zacarias Benta
I had a similar issue; please check whether the home directory, or wherever the data should be stored, is mounted on the nodes. On Tue, 2019-12-10 at 14:49 -0500, Chris Woelkers - NOAA Federal wrote: > I have a 16-node HPC that is in the process of being upgraded from > CentOS 6 to 7. All nodes are diskless
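[Editor's note: a quick way to test that suggestion across an allocation; the node count matches the original post, but the path is a placeholder.]

    # One task per node, so every node reports its own view of the mount.
    srun --nodes=16 --ntasks-per-node=1 df -h /home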

Re: [slurm-users] Multi-node job failure

2019-12-10 Thread Ree, Jan-Albert van
Sent: December 11, 2019 01:11 To: Slurm User Community List Subject: Re: [slurm-users] Multi-node job failure Thanks for the reply and the things to try. Here are the answers to your questions/tests in order: - I tried mpiexec and the same issue occurred. - While the job is listed as running I checked all the nodes
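[Editor's note: a sketch of checks along those lines; the job ID, node name, and process name are placeholders.]

    squeue -j 12345                       # the state Slurm reports for the job
    scontrol show job 12345               # the node list and task layout it was given
    ssh node01 'ps -ef | grep -i model'   # whether any ranks actually started on a node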

Re: [slurm-users] Multi-node job failure

2019-12-10 Thread Chris Samuel
Hi Chris, On Tuesday, 10 December 2019 11:49:44 AM PST Chris Woelkers - NOAA Federal wrote: > Test jobs, submitted via sbatch, are able to run on one node with no problem > but will not run on multiple nodes. The jobs are using mpirun and mvapich2 > is installed. Is there a reason why you aren't using srun?
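[Editor's note: for context on the srun suggestion, a sketch that assumes mvapich2 was built with Slurm PMI2 support, which the thread does not confirm.]

    # Check which MPI plugin types this Slurm build offers:
    srun --mpi=list
    # If mvapich2 was configured with --with-pm=slurm --with-pmi=pmi2,
    # srun can launch the ranks directly:
    srun --mpi=pmi2 ./model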

Re: [slurm-users] Multi-node job failure

2019-12-10 Thread Paul Kenyon
>> Jan-Albert van Ree | Linux System Administrator | Digital Services >> MARIN | T +31 317 49 35 48 | j.a.v@marin.nl | www.marin.nl

Re: [slurm-users] Multi-node job failure

2019-12-10 Thread Chris Woelkers - NOAA Federal
<https://www.marin.nl/flare-holds-first-general-assembly-meeting-in-bremen-germany> > -- > From: slurm-users on behalf of Chris Woelkers - NOAA Federal > Sent: Tuesday, December 10, 2019 20:49 > To: slurm-users@lists.schedmd.com > Subject: [slurm-users] Multi-node job failure

Re: [slurm-users] Multi-node job failure

2019-12-10 Thread Ree, Jan-Albert van
From: slurm-users on behalf of Chris Woelkers - NOAA Federal Sent: Tuesday, December 10, 2019 20:49 To: slurm-users@lists.schedmd.com Subject: [slurm-users] Multi-node job failure I have a 16-node HPC that is in the process of being upgraded from CentOS 6 to 7. All nodes are diskless

[slurm-users] Multi-node job failure

2019-12-10 Thread Chris Woelkers - NOAA Federal
I have a 16-node HPC that is in the process of being upgraded from CentOS 6 to 7. All nodes are diskless and connected via 1Gbps Ethernet and FDR InfiniBand. I am using Bright Cluster Manager to manage it and their support has not found a solution to this problem. For the most part the cluster is
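[Editor's note: a minimal multi-node sanity check for a setup like this; the script is an assumption, not the poster's actual test job.]

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=1

    # If even this hangs across nodes, the problem is in the launch path
    # or the fabric, not in the MPI application.
    srun hostname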