Re: [OMPI users] ORTE has lost communication

2016-05-16 Thread Ralph Castain
run unless the firewalld daemon is disabled” for how to get around this from Gilles or Jeff. I thank you. -- Llolsten From: users [mailto:users-boun...@open-mpi.org

Re: [OMPI users] ORTE has lost communication

2016-05-16 Thread Zabiziz Zaz
unless the firewalld daemon is disabled*” for how to get around this from Gilles or Jeff. I thank you. -- Llolsten *From:* users [mailto:users-boun...@open-mpi.org] *O

Re: [OMPI users] ORTE has lost communication

2016-05-16 Thread Zabiziz Zaz
My application has a heartbeat that checks whether a node is alive and can redistribute a task to another node if the master loses communication with it. The application also has checkpoint/restart, but since I usually have hundreds of nodes for one job and it usually takes a long time to restart the jo

Re: [OMPI users] ORTE has lost communication

2016-05-16 Thread Ralph Castain
Llolsten From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Zabiziz Zaz Sent: Monday, May 16, 2016 10:46 AM To: us...@open-mpi.org

[OMPI users] ORTE has lost communication

2016-05-16 Thread Gilles Gouaillardet
What do you mean by a fault-tolerant application? From an OpenMPI point of view, if such a connection is lost, your application will no longer be able to communicate, so killing it is the best option. If your application has built-in checkpoint/restart, then you have to restart it with mpirun after th

Re: [OMPI users] ORTE has lost communication

2016-05-16 Thread Zabiziz Zaz
Behalf Of *Zabiziz Zaz *Sent:* Monday, May 16, 2016 10:46 AM *To:* us...@open-mpi.org *Subject:* [OMPI users] ORTE has lost communication Hi, I'm using openmpi-1.10.2 and sometimes I'm receiving the message below:

Re: [OMPI users] ORTE has lost communication

2016-05-16 Thread Llolsten Kaonga
” for how to get around this from Gilles or Jeff. I thank you. -- Llolsten From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Zabiziz Zaz Sent: Monday, May 16, 2016 10:46 AM To: us...@open-mpi.org Subject: [OMPI users] ORTE has lost communication Hi, I'm using openmpi-

[OMPI users] ORTE has lost communication

2016-05-16 Thread Zabiziz Zaz
Hi, I'm using openmpi-1.10.2 and sometimes I'm receiving the message below: -- ORTE has lost communication with its daemon located on node: hostname: This is usually due to either a failure of the TCP network connecti

Re: [OMPI users] orte has lost communication

2016-04-12 Thread Ralph Castain
My apologies for the tardy response - been stuck in meetings. I'm glad to hear that you are making progress tracking this down. FWIW: the error message you received indicates that the socket from that node unexpectedly reset during execution of the application. So it sounds like there is something

Re: [OMPI users] orte has lost communication

2016-04-12 Thread Stefan Friedel
On Tue, Apr 12, 2016 at 01:30:37PM +0200, Stefan Friedel wrote: -thanks for your support!- nope, no core, just the "orte has lost"... Dear list - the problem is _not_ related to openmpi. I compiled mvapich2 and I get communication errors, too. Probably this is a hardware problem. Sorry for the noi

Re: [OMPI users] orte has lost communication

2016-04-12 Thread Stefan Friedel
On Tue, Apr 12, 2016 at 07:51:48PM +0900, Gilles Gouaillardet wrote: what if you ulimit -c unlimited; does orted generate some core dump? Hi Gilles, -thanks for your support!- nope, no core, just the "orte has lost"... I now tested with a simple hello-world mpi program- printf("rank, processor")

Re: [OMPI users] orte has lost communication

2016-04-12 Thread Gilles Gouaillardet
Stefan, what if you ulimit -c unlimited; does orted generate some core dump? Cheers, Gilles On Tuesday, April 12, 2016, Stefan Friedel <stefan.frie...@iwr.uni-heidelberg.de> wrote: On Tue, Apr 12, 2016 at 05:11:59PM +0900, Gilles Gouaillardet wrote: Dear Gilles, which version of OpenMP

Re: [OMPI users] orte has lost communication

2016-04-12 Thread Stefan Friedel
On Tue, Apr 12, 2016 at 05:11:59PM +0900, Gilles Gouaillardet wrote: Dear Gilles, which version of OpenMPI are you using? as I wrote: openmpi-1.10.2, slurm-15.08.9; homes mounted via NFS/RDMA/ipoib, mpi when does the error occur? is it before MPI_Init() completes? is it in the middle o

Re: [OMPI users] orte has lost communication

2016-04-12 Thread Gilles Gouaillardet
Stefan, which version of OpenMPI are you using? When does the error occur? Is it before MPI_Init() completes? Is it in the middle of the job? If yes, are you sure no task invoked MPI_Abort()? Also, you might want to check the system logs and make sure there was no OOM (Out Of Memory). a

[OMPI users] orte has lost communication

2016-04-12 Thread Stefan Friedel
Good Morning List, we have a problem on our cluster with bigger jobs (~> 200 nodes) - almost every job ends with a message like: ### Starting at Mon Apr 11 15:54:06 CEST 2016 Running on hosts: stek[034-086,088-201,203-247,249-344,346-379,381-388] Running on 350 nodes. Current work