Re: [OMPI users] IMB-MPI1 hangs after 30 minutes with Open MPI 3.0.0 (was: Openmpi 1.10.4 crashes with 1024 processes)

2017-12-01 Thread Götz Waschk
On Thu, Nov 30, 2017 at 6:32 PM, Jeff Squyres (jsquyres) wrote:
> Ah, I was misled by the subject.
>
> Can you provide more information about "hangs", and your environment?
>
> You previously cited:
>
> - E5-2697A v4 CPUs and Mellanox ConnectX-3 FDR InfiniBand
> - SLURM
> - Open MPI v3.0.0
> - IMB

2017-12-01 Thread Götz Waschk
On Fri, Dec 1, 2017 at 10:13 AM, Götz Waschk wrote:
> I have attached my slurm job script, it will simply do an mpirun
> IMB-MPI1 with 1024 processes. I haven't set any mca parameters, so for
> instance, vader is enabled.

I have tested again with mpirun --mca btl "^vader" IMB-MPI1; it made no

2017-12-01 Thread Noam Bernstein
> On Dec 1, 2017, at 8:10 AM, Götz Waschk wrote:
>
> On Fri, Dec 1, 2017 at 10:13 AM, Götz Waschk wrote:
>> I have attached my slurm job script, it will simply do an mpirun
>> IMB-MPI1 with 1024 processes. I haven't set any mca parameters, so for
>> instance, vader is enabled.
> I have tested a

2017-12-01 Thread Gilles Gouaillardet
FWIW, pstack is a gdb wrapper that displays the stack trace.

PADB http://padb.pittman.org.uk is a great OSS tool that automatically collects the stack traces of all the MPI tasks (and can do some grouping similar to dshbak).

Cheers,
Gilles

Noam Bernstein wrote:
>
>
>On Dec 1, 2017, at 8:10 AM, Göt
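The grouping of stack traces mentioned above (in the spirit of dshbak) can be illustrated with a small sketch. This is not padb's implementation; the function names `group_ranks_by_trace` and `format_ranks` and the input layout (a dict mapping rank to a list of frame names) are assumptions made only for this example, and the sample traces are made up.

```python
from collections import defaultdict


def group_ranks_by_trace(traces):
    """Map each distinct stack trace to the sorted list of ranks showing it.

    `traces` is a hypothetical dict {rank: [frame, frame, ...]}, for example
    built from per-process pstack or gdb output collected by other means.
    """
    groups = defaultdict(list)
    for rank, frames in traces.items():
        groups[tuple(frames)].append(rank)
    return {trace: sorted(ranks) for trace, ranks in groups.items()}


def format_ranks(ranks):
    """Compress a sorted, non-empty rank list into ranges, e.g. '0-2,5'."""
    parts = []
    start = prev = ranks[0]
    for r in ranks[1:]:
        if r == prev + 1:
            prev = r
            continue
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = r
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


if __name__ == "__main__":
    # Made-up sample data: three ranks in one call, one rank elsewhere.
    sample = {
        0: ["main", "MPI_Alltoall"],
        1: ["main", "MPI_Alltoall"],
        2: ["main", "MPI_Barrier"],
        3: ["main", "MPI_Alltoall"],
    }
    for trace, ranks in group_ranks_by_trace(sample).items():
        print(f"ranks {format_ranks(ranks)}: {' <- '.join(reversed(trace))}")
```

With 1024 processes, collapsing identical traces this way tends to make the odd ranks stand out, which is usually what one is looking for in a hang.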

2017-12-01 Thread Götz Waschk
Thanks, I've tried padb first to get stack traces. This is from IMB-MPI1 hanging after one hour; the last output was:

# Benchmarking Alltoall
# #processes = 1024
# #bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
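When a run hangs, the last benchmark header in the job output shows where it stopped. A minimal sketch for pulling that out of a saved log follows, assuming the header lines look like the "# Benchmarking ..." and "# #processes = ..." lines quoted above; the function name `last_benchmark` is hypothetical and this is not part of IMB.

```python
import re


def last_benchmark(output):
    """Return (benchmark_name, process_count) for the last header seen.

    Returns None if no "# Benchmarking <name>" line is present; the process
    count is None if no "#processes" line follows the last header.
    """
    name = None
    nprocs = None
    for raw in output.splitlines():
        line = raw.strip()
        m = re.match(r"#\s*Benchmarking\s+(\S+)", line)
        if m:
            # A new header starts; forget the count from the previous one.
            name, nprocs = m.group(1), None
            continue
        m = re.match(r"#\s*#processes\s*=\s*(\d+)", line)
        if m and name is not None:
            nprocs = int(m.group(1))
    return None if name is None else (name, nprocs)


if __name__ == "__main__":
    log = "# Benchmarking Alltoall\n# #processes = 1024\n"
    print(last_benchmark(log))
```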