[OMPI users] orte has lost communication

2016-04-12 Thread Stefan Friedel
Good Morning List, we have a problem on our cluster with bigger jobs (~> 200 nodes) - almost every job ends with a message like:
### Starting at Mon Apr 11 15:54:06 CEST 2016
Running on hosts: stek[034-086,088-201,203-247,249-344,346-379,381-388]
Running on 350 nodes.
Current work
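As a quick sanity check on the host list quoted above, the bracketed ranges can be expanded and counted. This is only a minimal sketch: `expand_hostlist` is a hypothetical helper written for this note, it handles just the single `prefix[a-b,...]` form shown in the message, and it is not Slurm's own parser (on a Slurm system, `scontrol show hostnames` does this job).

```python
import re


def expand_hostlist(expr):
    """Expand one 'prefix[a-b,c,...]' expression into a list of host names.

    Hypothetical helper for illustration; it does not cover the full
    Slurm hostlist grammar (no nested or multiple bracket groups).
    """
    match = re.fullmatch(r"([^\[\]]+)\[([^\[\]]+)\]", expr)
    if match is None:
        return [expr]  # plain host name, nothing to expand
    prefix, ranges = match.groups()
    hosts = []
    for part in ranges.split(","):
        low, _, high = part.partition("-")
        high = high or low          # a single number is a range of one
        width = len(low)            # preserve zero padding, e.g. 034
        for number in range(int(low), int(high) + 1):
            hosts.append(f"{prefix}{number:0{width}d}")
    return hosts


hosts = expand_hostlist("stek[034-086,088-201,203-247,249-344,346-379,381-388]")
print(len(hosts))  # 350, matching "Running on 350 nodes."
```

The gaps in the list (087, 202, 248, 345, 380) are simply nodes that were not part of this allocation.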

Re: [OMPI users] orte has lost communication

2016-04-12 Thread Gilles Gouaillardet
Stefan, which version of OpenMPI are you using? When does the error occur? Is it before MPI_Init() completes, or is it in the middle of the job? If it is in the middle of the job, are you sure no task invoked MPI_Abort()? Also, you might want to check the system logs and make sure there was no OOM (Out Of Memory). a

Re: [OMPI users] orte has lost communication

2016-04-12 Thread Stefan Friedel
On Tue, Apr 12, 2016 at 05:11:59PM +0900, Gilles Gouaillardet wrote:
Dear Gilles,
> which version of OpenMPI are you using ?
as I wrote: openmpi-1.10.2, slurm-15.08.9; homes mounted via NFS/RDMA/ipoib, mpi
> when does the error occur ? is it before MPI_Init() completes ? is it in the middle o

Re: [OMPI users] orte has lost communication

2016-04-12 Thread Gilles Gouaillardet
Stefan, what if you run ulimit -c unlimited: does orted generate a core dump?
Cheers, Gilles
On Tuesday, April 12, 2016, Stefan Friedel <stefan.frie...@iwr.uni-heidelberg.de> wrote:
> On Tue, Apr 12, 2016 at 05:11:59PM +0900, Gilles Gouaillardet wrote:
> Dear Gilles,
>
>> which version of OpenMP

Re: [OMPI users] orte has lost communication

2016-04-12 Thread Stefan Friedel
On Tue, Apr 12, 2016 at 07:51:48PM +0900, Gilles Gouaillardet wrote:
> what if you run ulimit -c unlimited: does orted generate a core dump?
Hi Gilles, thanks for your support! Nope, no core, just the "orte has lost"... I have now tested with a simple hello-world MPI program - printf("rank, processor")

Re: [OMPI users] orte has lost communication

2016-04-12 Thread Stefan Friedel
On Tue, Apr 12, 2016 at 01:30:37PM +0200, Stefan Friedel wrote:
> thanks for your support! Nope, no core, just the "orte has lost"...
Dear list - the problem is _not_ related to openmpi. I compiled mvapich2 and I get communication errors, too. Probably this is a hardware problem. Sorry for the noi

Re: [OMPI users] orte has lost communication

2016-04-12 Thread Ralph Castain
My apologies for the tardy response - been stuck in meetings. I'm glad to hear that you are making progress tracking this down. FWIW: the error message you received indicates that the socket from that node unexpectedly reset during execution of the application. So it sounds like there is something
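To illustrate what an "unexpectedly reset" socket looks like from the receiving side, here is a minimal sketch using plain TCP sockets on the loopback interface. It is not ORTE code and makes no claim about how orted detects the condition; the function name `observe_reset` and the use of `SO_LINGER` with a zero timeout to force a reset are assumptions chosen purely for demonstration.

```python
import socket
import struct


def observe_reset():
    """Force a TCP reset on loopback and report what the peer observes."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))       # port 0: let the OS pick a free port
    server.listen(1)

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(server.getsockname())
    conn, _ = server.accept()
    conn.settimeout(5)                  # never hang if nothing arrives

    # SO_LINGER enabled with a zero timeout makes close() abort the
    # connection (RST) instead of performing an orderly shutdown (FIN).
    client.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                      struct.pack("ii", 1, 0))
    client.close()

    try:
        data = conn.recv(1024)
        # An empty read would mean the peer closed the connection cleanly.
        return "closed cleanly" if data == b"" else "data"
    except ConnectionResetError:
        return "reset"
    except socket.timeout:
        return "timeout"
    finally:
        conn.close()
        server.close()


print(observe_reset())
```

The point of the sketch is the distinction between an orderly close (an empty read) and a reset (an error on read): the latter is what a daemon sees when the remote end or something on the path tears the connection down abruptly.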

[OMPI users] Debugging help

2016-04-12 Thread dpchoudh .
Hello all, I am trying to set a breakpoint during the modex exchange process so I can see the data being passed for different transport types. I assume that this is being done in the context of orted since this is part of process launch. Here is what I did: (All of this pertains to the master branc

[OMPI users] Possible bug in MPI_Barrier() ?

2016-04-12 Thread dpchoudh .
Hi all, I have reported this issue before, but had then brushed it off as something that was caused by my modifications to the source tree. It looks like that is not the case. Just now, I did the following:
1. Cloned a fresh copy from master.
2. Configured with the following flags, built and inst

Re: [OMPI users] Debugging help

2016-04-12 Thread Jeff Squyres (jsquyres)
On Apr 12, 2016, at 2:38 PM, dpchoudh . wrote: > > Hello all > > I am trying to set a breakpoint during the modex exchange process so I can > see the data being passed for different transport type. I assume that this is > being done in the context of orted since this is part of process launch.

Re: [OMPI users] Possible bug in MPI_Barrier() ?

2016-04-12 Thread Gilles Gouaillardet
This is quite unlikely, and fwiw, your test program works for me. I suggest you check that your 3 TCP networks are usable, for example
$ mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp -mca pml ob1 --mca btl_tcp_if_include xxx ./mpitest
in which xxx is a [list of] interface name : eth0 eth1 ib
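The per-interface check suggested above can be scripted so that each network is tested on its own. The following is a sketch only: `tcp_test_commands` is a hypothetical helper, the interface names passed in are the two complete ones visible in the message, and the hostfile and program paths are taken from the quoted command line. It only builds the command lines; running them is left to the reader.

```python
def tcp_test_commands(interfaces, hostfile="~/hostfile", program="./mpitest"):
    """Build one mpirun command per interface, mirroring the quoted example.

    Hypothetical helper: it restricts the TCP BTL to a single interface
    per run so that a broken network shows up in isolation.
    """
    commands = []
    for ifname in interfaces:
        argv = [
            "mpirun", "-np", "2",
            "-hostfile", hostfile,
            "-mca", "btl", "self,tcp",
            "-mca", "pml", "ob1",
            "--mca", "btl_tcp_if_include", ifname,
            program,
        ]
        commands.append(argv)
    return commands


# Interface names are placeholders; substitute the ones on your nodes.
for argv in tcp_test_commands(["eth0", "eth1"]):
    print(" ".join(argv))
```

Printing the commands rather than executing them keeps the sketch side-effect free; paste each line into a shell so that `~/hostfile` is expanded as usual.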