Good Morning List,
we have a problem on our cluster with larger jobs (more than ~200 nodes) -
almost every job ends with a message like:
###
Starting at Mon Apr 11 15:54:06 CEST 2016
Running on hosts: stek[034-086,088-201,203-247,249-344,346-379,381-388]
Running on 350 nodes.
Current work
Stefan,
which version of OpenMPI are you using?
when does the error occur?
is it before MPI_Init() completes?
is it in the middle of the job? if yes, are you sure no task invoked
MPI_Abort()?
also, you might want to check the system logs and make sure there was no
OOM (Out Of Memory) kill.
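The OOM check can be scripted; a minimal sketch, assuming a Linux node (log locations vary by distro, and either command may need root):

```shell
# Search the kernel ring buffer and the journal for OOM-killer
# activity around the time the job died.
dmesg | grep -iE 'out of memory|oom-killer'
journalctl -k | grep -iE 'out of memory|oom-killer'
```

A typical hit looks like `Out of memory: Kill process 1234 (orted) score 987 or sacrifice child`; if a compute node's log shows that, the lost daemon is explained.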
On Tue, Apr 12, 2016 at 05:11:59PM +0900, Gilles Gouaillardet wrote:
Dear Gilles,
which version of OpenMPI are you using?
as I wrote:
openmpi-1.10.2, slurm-15.08.9; homes mounted via NFS/RDMA/ipoib, mpi
when does the error occur?
is it before MPI_Init() completes?
is it in the middle of the job? if yes, are you sure no task invoked
MPI_Abort()?
Stefan,
what if you
ulimit -c unlimited
does orted generate a core dump?
Cheers
Gilles
On Tuesday, April 12, 2016, Stefan Friedel <
stefan.frie...@iwr.uni-heidelberg.de> wrote:
> On Tue, Apr 12, 2016 at 05:11:59PM +0900, Gilles Gouaillardet wrote:
> Dear Gilles,
>
>> which version of OpenMPI are you using?
On Tue, Apr 12, 2016 at 07:51:48PM +0900, Gilles Gouaillardet wrote:
what if you
ulimit -c unlimited
does orted generate a core dump?
Hi Gilles,
-thanks for your support!- nope, no core, just the "orte has lost"...
I have now tested with a simple hello-world MPI program that just does printf("rank, processor")
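For reference, the core-dump check suggested earlier can be done along these lines (a sketch; where core files land depends on the kernel's core_pattern setting):

```shell
# Raise the core-size limit in the shell that launches the job; child
# processes (including any orted spawned under it) inherit the limit.
ulimit -c unlimited
ulimit -c                       # prints "unlimited" if it took effect
# The kernel decides where core files are written and how they are named:
cat /proc/sys/kernel/core_pattern
```

If the daemons really crash, a core should then appear where core_pattern points; no core at all is consistent with the daemon being killed externally (e.g. by the OOM killer) rather than crashing.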
On Tue, Apr 12, 2016 at 01:30:37PM +0200, Stefan Friedel wrote:
-thanks for your support!- nope, no core, just the "orte has lost"...
Dear list - the problem is _not_ related to openmpi. I compiled mvapich2 and I
get communication errors, too. Probably this is a hardware problem.
Sorry for the noise.
My apologies for the tardy response - been stuck in meetings. I'm glad to
hear that you are making progress tracking this down. FWIW: the error
message you received indicates that the socket from that node unexpectedly
reset during execution of the application. So it sounds like there is
something
Hello all
I am trying to set a breakpoint during the modex exchange process so I can
see the data being passed for different transport types. I assume that this
is being done in the context of orted, since this is part of process launch.
Here is what I did (all of this pertains to the master branch):
Hi all
I have reported this issue before, but back then brushed it off as something
that was caused by my modifications to the source tree. It looks like that
is not the case.
Just now, I did the following:
1. Cloned a fresh copy from master.
2. Configured with the following flags, built and installed.
On Apr 12, 2016, at 2:38 PM, dpchoudh . wrote:
>
> Hello all
>
> I am trying to set a breakpoint during the modex exchange process so I can
> see the data being passed for different transport types. I assume that this is
> being done in the context of orted since this is part of process launch.
This is quite unlikely, and fwiw, your test program works for me.
I suggest you check that your 3 TCP networks are usable, for example:
$ mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp -mca pml ob1 --mca
btl_tcp_if_include xxx ./mpitest
in which xxx is an interface name (or a comma-separated list of them), e.g.:
eth0
eth1
ib
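To find out which names to feed btl_tcp_if_include, listing each interface's state and addresses is usually enough; a sketch using iproute2 on Linux (the interface name eth0 below is only an example):

```shell
# One line per interface: name, operational state, addresses. Pick the
# names of the networks that are actually routable between compute nodes.
ip -brief addr show
# Then test each candidate on its own, one interface at a time:
# mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp -mca pml ob1 \
#        --mca btl_tcp_if_include eth0 ./mpitest
```

Testing one interface at a time narrows down which of the three TCP networks is the one resetting connections.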