My application has a heartbeat that checks whether a node is alive and can
redistribute a task to another node if the master loses communication with
it. The application also has checkpoint/restart, but since I usually have
hundreds of nodes for one job and it usually takes a long time to restart
the job …
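For readers wondering what such an application-level heartbeat can look like, here is a minimal sketch (the tag, timeout, and message layout are my own assumptions, not the poster's code). The master posts a non-blocking receive and treats a worker as lost if nothing arrives before a wall-clock deadline; note that with a standard MPI library the runtime usually tears the whole job down anyway once a node dies, which is exactly the issue discussed in this thread.

    #include <mpi.h>
    #include <stdio.h>

    #define TAG_HEARTBEAT 77                    /* arbitrary tag for this sketch */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {                        /* master side */
            int payload = 0, arrived = 0;
            MPI_Request req;
            double deadline = MPI_Wtime() + 30.0;   /* assumed 30 s grace period */

            MPI_Irecv(&payload, 1, MPI_INT, MPI_ANY_SOURCE, TAG_HEARTBEAT,
                      MPI_COMM_WORLD, &req);
            while (!arrived && MPI_Wtime() < deadline)
                MPI_Test(&req, &arrived, MPI_STATUS_IGNORE);

            if (!arrived) {
                fprintf(stderr, "no heartbeat before the deadline - worker presumed lost\n");
                MPI_Cancel(&req);               /* do not leave the receive pending */
                MPI_Wait(&req, MPI_STATUS_IGNORE);
            }
        } else if (rank == 1) {                 /* only one worker sends in this tiny sketch */
            int alive = 1;
            MPI_Send(&alive, 1, MPI_INT, 0, TAG_HEARTBEAT, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }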
What do you mean by a fault tolerant application ?
From an Open MPI point of view, if such a connection is lost, your
application will no longer be able to communicate, so killing it is the best
option.
If your application has built-in checkpoint/restart, then you have to
restart it with mpirun after the failure.
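As a rough illustration of the application-level checkpoint/restart Gilles refers to (the file name, format, and interval below are assumptions for the sketch, not a known implementation): each rank periodically writes its progress to its own file, and after a failed run the same binary is simply launched again with mpirun and resumes from the last saved point.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* One checkpoint file per rank (hypothetical naming scheme). */
        char ckpt[64];
        snprintf(ckpt, sizeof(ckpt), "ckpt.rank%d", rank);

        long start = 0;
        FILE *f = fopen(ckpt, "r");
        if (f) {                      /* a previous run left a checkpoint: resume */
            if (fscanf(f, "%ld", &start) != 1)
                start = 0;
            fclose(f);
        }

        for (long it = start; it < 1000000; ++it) {
            /* ... the real computation would go here ... */

            if (it % 10000 == 0) {    /* periodically persist the progress counter */
                f = fopen(ckpt, "w");
                if (f) {
                    fprintf(f, "%ld\n", it);
                    fclose(f);
                }
            }
        }

        MPI_Finalize();
        return 0;
    }

After a crash, launching the same binary again with mpirun lets each rank pick up from its checkpoint file; this is independent of any fault tolerance inside the MPI library itself.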
“… run unless the firewalld daemon is disabled” for how to get around this from
Gilles or Jeff.
I thank you.
--
Llolsten
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Zabiziz Zaz
Sent: Monday, May 16, 2016 10:46 AM
To: us...@open-mpi.org
Subject: [OMPI users] ORTE has lost communication
Hi,
I'm using openmpi-1.10.2 and sometimes I'm receiving the message below:
--
ORTE has lost communication with its daemon located on node:
hostname:
This is usually due to either a failure of the TCP network
connection …
My apologies for the tardy response - been stuck in meetings. I'm glad to
hear that you are making progress tracking this down. FWIW: the error
message you received indicates that the socket from that node unexpectedly
reset during execution of the application. So it sounds like there is
something …
On Tue, Apr 12, 2016 at 01:30:37PM +0200, Stefan Friedel wrote:
> -thanks for your support!- nope, no core, just the "orte has lost"...
Dear list - the problem is _not_ related to openmpi. I compiled mvapich2 and I
get communication errors, too. Probably this is a hardware problem.
Sorry for the noise …
On Tue, Apr 12, 2016 at 07:51:48PM +0900, Gilles Gouaillardet wrote:
> what if you
> ulimit -c unlimited
> does orted generate a core dump ?
Hi Gilles,
-thanks for your support!- nope, no core, just the "orte has lost"...
I now tested with a simple hello-world MPI program - printf("rank, processor") …
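The "simple hello-world MPI program" mentioned here is presumably something along these lines (my reconstruction, not Stefan's actual code): every rank prints its rank and the processor name, which is a convenient way to check that all 350 nodes can at least start and get through MPI_Init and MPI_Finalize.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, len;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(name, &len);

        printf("rank %d of %d on processor %s\n", rank, size, name);

        MPI_Finalize();
        return 0;
    }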
Stefan,
what if you
ulimit -c unlimited
does orted generate a core dump ?
Cheers,
Gilles
On Tuesday, April 12, 2016, Stefan Friedel <
stefan.frie...@iwr.uni-heidelberg.de> wrote:
> On Tue, Apr 12, 2016 at 05:11:59PM +0900, Gilles Gouaillardet wrote:
> Dear Gilles,
>
>> which version of OpenMPI …
On Tue, Apr 12, 2016 at 05:11:59PM +0900, Gilles Gouaillardet wrote:
Dear Gilles,
> which version of OpenMPI are you using ?
as I wrote:
openmpi-1.10.2, slurm-15.08.9; homes mounted via NFS/RDMA/ipoib, mpi …
> when does the error occur ?
> is it before MPI_Init() completes ?
> is it in the middle of the job ? …
Stefan,
which version of OpenMPI are you using ?
when does the error occur ?
is it before MPI_Init() completes ?
is it in the middle of the job ? if yes, are you sure no task invoked
MPI_Abort() ?
also, you might want to check the system logs and make sure there was no
OOM (Out Of Memory).
…
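To illustrate the MPI_Abort() question: if any single rank hits an error path like the (made-up) one below, the whole job is terminated, which from a remote node can be hard to distinguish from a genuinely lost daemon.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Hypothetical fatal condition detected by one rank only. */
        FILE *f = fopen("input.dat", "r");
        if (f == NULL) {
            fprintf(stderr, "rank %d: cannot open input.dat, aborting job\n", rank);
            MPI_Abort(MPI_COMM_WORLD, 1);   /* kills every rank, not just this one */
        }
        fclose(f);

        MPI_Finalize();
        return 0;
    }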
Good Morning List,
we have a problem on our cluster with bigger jobs (~> 200 nodes) -
almost every job ends with a message like:
###
Starting at Mon Apr 11 15:54:06 CEST 2016
Running on hosts: stek[034-086,088-201,203-247,249-344,346-379,381-388]
Running on 350 nodes.
Current work…