[OMPI users] orte has lost communication

2016-04-12 Thread Stefan Friedel

Good Morning List,
we have a problem on our cluster with bigger jobs (~> 200 nodes) -
almost every job ends with a message like:

###
Starting at Mon Apr 11 15:54:06 CEST 2016
Running on hosts: stek[034-086,088-201,203-247,249-344,346-379,381-388]
Running on 350 nodes.
Current working directory is /export/homelocal/sfriedel/beff
--
ORTE has lost communication with its daemon located on node:

 hostname:  stek346

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.

--
--
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--
Program finished with exit code 0 at: Mon Apr 11 15:54:41 CEST 2016
##

I found a similar question on the list from Emyr James (2015-10-01), but it
has not been answered so far.

Cluster: dual Intel Xeon E5-2630 v3 (Haswell), Intel/QLogic TrueScale IB QDR,
Debian Jessie 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u2,
openmpi-1.10.2, slurm-15.08.9; home directories mounted via NFS/RDMA over IPoIB, MPI
messages over PSM/IB + 1G Ethernet (management, PXE boot, SSH, Open MPI OOB/TCP network, etc.)

Jobs are started via a Slurm sbatch script (mpirun --mca mtl psm ~/path/to/binary); a sketch of such a script is shown below.
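
For reference, a rough sketch of the kind of sbatch script we use (the SBATCH values,
node count and binary path here are only illustrative, not our exact script):

#!/bin/bash
#SBATCH --job-name=beff
#SBATCH --nodes=350
#SBATCH --ntasks-per-node=16

echo "Starting at $(date)"
echo "Running on hosts: $SLURM_JOB_NODELIST"
echo "Running on $SLURM_JOB_NUM_NODES nodes."
echo "Current working directory is $(pwd)"

# MPI traffic over PSM; ORTE/OOB traffic restricted to eth0
mpirun --mca mtl psm --mca oob_tcp_if_include eth0 ~/path/to/binary

echo "Program finished with exit code $? at: $(date)"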

Already tested:
* several MCA settings (in ...many... combinations)
mtl_psm_connect_timeout 600
oob_tcp_keepalive_time 600
oob_tcp_if_include eth0
oob_tcp_listen_mode listen_thread

* several network/sysctl settings (in ...many... combinations)
/sbin/sysctl -w net.core.somaxconn=2
/sbin/sysctl -w net.core.netdev_max_backlog=20
/sbin/sysctl -w net.ipv4.tcp_max_syn_backlog=102400
/sbin/sysctl -w net.ipv4.ip_local_port_range="15000 61000"
/sbin/sysctl -w net.ipv4.tcp_fin_timeout=10
/sbin/sysctl -w net.ipv4.tcp_tw_recycle=1
/sbin/sysctl -w net.ipv4.tcp_tw_reuse=1
/sbin/sysctl -w net.ipv4.tcp_mem="383865   511820   2303190"
echo 2500 > /proc/sys/fs/nr_open

* ulimit settings (roughly as sketched below)
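
The limits were roughly of this kind (values here are only illustrative, not our exact configuration):

ulimit -c unlimited     # allow core dumps
ulimit -l unlimited     # memlock, needed for IB/PSM registered memory
ulimit -n 65536         # max open files
ulimit -u 65536         # max user processes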

Routing on the nodes: two private networks, 10.203.0.0/22 on eth0 and 10.203.40.0/22 on
ib0, each with its own route; no default route.

If I start the job with debugging/logging enabled (--mca oob_tcp_debug 5 --mca
oob_base_verbose 8), it takes much longer until the error occurs; the job
starts on the nodes (producing some timesteps of output), but it will still fail
at some later point.

Any hints? PSM? Is there some kernel parameter that must be increased? Wrong network/routing
(which should not happen with --mca oob_tcp_if_include eth0)??

MfG/Sincerely
Stefan Friedel
--
IWR * 4.317 * INF205 * 69120 Heidelberg
T +49 6221 5414404 * F +49 6221 5414427
stefan.frie...@iwr.uni-heidelberg.de




Re: [OMPI users] orte has lost communication

2016-04-12 Thread Gilles Gouaillardet

Stefan,

which version of Open MPI are you using?

When does the error occur?
Is it before MPI_Init() completes?
Is it in the middle of the job? If yes, are you sure no task invoked
MPI_Abort()?


Also, you might want to check the system logs and make sure there was no
OOM (Out Of Memory) event.
A possible explanation is that some tasks caused an OOM, and the OOM
killer chose to kill orted instead of a.out.
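
For example, something like this (assuming pdsh or similar is available; the node list is just an example):

# look for OOM-killer activity on all nodes of the failed job
pdsh -w stek[034-086,088-201,203-247] "dmesg -T | grep -iE 'out of memory|oom-kill'" 2>/dev/null | sort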


If you cannot access your system logs, you can try with a large number of
nodes and one MPI task per node, and then increase the number of tasks
per node and see when the problem starts happening.


Of course, to be on the safe side, you can try
mpirun --mca oob_tcp_if_include eth0 ...

You can also try to run your application over TCP and see if that helps
(note that the issue might just be hidden, since TCP is much slower than native PSM):

mpirun --mca btl tcp,vader,self --mca btl_tcp_if_include eth0 ...
or
mpirun --mca btl tcp,vader,self --mca btl_tcp_if_include ib0 ...

/* feel free to replace vader with sm, if vader is not available on your 
system */


Cheers,

Gilles

On 4/12/2016 4:37 PM, Stefan Friedel wrote:

[...]



Re: [OMPI users] orte has lost communication

2016-04-12 Thread Stefan Friedel

On Tue, Apr 12, 2016 at 05:11:59PM +0900, Gilles Gouaillardet wrote:
Dear Gilles,

> which version of Open MPI are you using?

as I wrote:

>    openmpi-1.10.2, slurm-15.08.9; home directories mounted via NFS/RDMA over IPoIB, MPI



> When does the error occur?
> Is it before MPI_Init() completes?
> Is it in the middle of the job? If yes, are you sure no task invoked MPI_Abort()?

During the setup of the job (in most cases), and there is no output from the
application. I will build a minimal program to get some printf debugging... I'll
report back...


> Also, you might want to check the system logs and make sure there was no OOM
> (Out Of Memory) event.

No OOM messages from the nodes. No relevant messages at all from the
nodes... (remote syslog is running on all nodes and forwarded to a central system.)


> mpirun --mca oob_tcp_if_include eth0 ...

I already tested this.


> mpirun --mca btl tcp,vader,self --mca btl_tcp_if_include eth0 ...
> or
> mpirun --mca btl tcp,vader,self --mca btl_tcp_if_include ib0 ...

Just tested this on 350 nodes - two out of seven jobs started one after the
other were successful, but the other jobs failed again:

*tcp,vader,self eth0 failed
*tcp,sm,self eth0 failed
*tcp,vader,self ib0 failed
*tcp,sm,self ib0 success!
*tcp,sm,self ib0 failed :-/
*tcp,sm,self ib0 success again!
*tcp,sm,self ib0 failed...

Hmmm - tcp+sm is a little bit more reliable??

For the sake of completeness - I forgot the ompi_info output:

Package: Open MPI root@dyaus Distribution
   Open MPI: 1.10.2
 Open MPI repo revision: v1.10.1-145-g799148f
  Open MPI release date: Jan 21, 2016
   Open RTE: 1.10.2
 Open RTE repo revision: v1.10.1-145-g799148f
  Open RTE release date: Jan 21, 2016
   OPAL: 1.10.2
 OPAL repo revision: v1.10.1-145-g799148f
  OPAL release date: Jan 21, 2016
MPI API: 3.0.0
   Ident string: 1.10.2
 Prefix: /opt/openmpi/1.10.2/gcc/4.9.2
Configured architecture: x86_64-pc-linux-gnu
 Configure host: dyaus
  Configured by: root
  Configured on: Mon Apr 11 09:54:21 CEST 2016
 Configure host: dyaus
   Built by: root
   Built on: Mon Apr 11 10:12:25 CEST 2016
 Built host: dyaus
 C bindings: yes
   C++ bindings: yes
Fort mpif.h: yes (all)
   Fort use mpi: yes (full: ignore TKR)
  Fort use mpi size: deprecated-ompi-info-value
   Fort use mpi_f08: yes
Fort mpi_f08 compliance: The mpi_f08 module is available, but due to 
limitations in the gfortran compiler, does not support the following: array 
subsections, direct passthru (where possible) to underlying Open MPI's C 
functionality
 Fort mpi_f08 subarrays: no
  Java bindings: no
 Wrapper compiler rpath: runpath
 C compiler: gcc
C compiler absolute: /usr/bin/gcc
 C compiler family name: GNU
 C compiler version: 4.9.2
   C++ compiler: g++
  C++ compiler absolute: /usr/bin/g++
  Fort compiler: gfortran
  Fort compiler abs: /usr/bin/gfortran
Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::)
  Fort 08 assumed shape: yes
 Fort optional args: yes
 Fort INTERFACE: yes
   Fort ISO_FORTRAN_ENV: yes
  Fort STORAGE_SIZE: yes
 Fort BIND(C) (all): yes
 Fort ISO_C_BINDING: yes
Fort SUBROUTINE BIND(C): yes
  Fort TYPE,BIND(C): yes
Fort T,BIND(C,name="a"): yes
   Fort PRIVATE: yes
 Fort PROTECTED: yes
  Fort ABSTRACT: yes
  Fort ASYNCHRONOUS: yes
 Fort PROCEDURE: yes
Fort USE...ONLY: yes
  Fort C_FUNLOC: yes
Fort f08 using wrappers: yes
Fort MPI_SIZEOF: yes
C profiling: yes
  C++ profiling: yes
  Fort mpif.h profiling: yes
 Fort use mpi profiling: yes
  Fort use mpi_f08 prof: yes
 C++ exceptions: no
 Thread support: posix (MPI_THREAD_MULTIPLE: no, OPAL support: yes, 
OMPI progress: no, ORTE progress: yes, Event lib: yes)
  Sparse Groups: no
 Internal debug support: no
 MPI interface warnings: yes
MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
 dl support: yes
  Heterogeneous support: no
mpirun default --prefix: no
MPI I/O support: yes
  MPI_WTIME support: gettimeofday
Symbol vis. support: yes
  Host topology support: yes
 MPI extensions: 
  FT Checkpoint support: no (checkpoint thread: no)

  C/R Enabled Debugging: no
VampirTrace support: yes
 MPI_MAX_PROCESSOR_NAME: 256
   MPI_MAX_ERROR_STRING: 256
MPI_MAX_OBJECT_NAME: 64
   MPI_MAX_INFO_KEY: 36
   MPI_MAX_INFO_VAL: 256
  MPI_MAX_PORT_NAME: 1024
 MPI_MAX_DATAREP_STRING: 128
  MCA backtrace: execinfo (MCA v2.0.0, API v2.0.0, Component v1.10.2)
   MCA compress: gzip (MCA v2.0.0, API v2.0.0, Component v1.10.2)
   MCA compress: bzip (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA crs: none (MCA v2.0.0, API v2.0.0, Component v1.1

Re: [OMPI users] orte has lost communication

2016-04-12 Thread Gilles Gouaillardet
Stefan,

What happens if you set

ulimit -c unlimited

Does orted generate a core dump?
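
For example (the orted path below is taken from your ompi_info prefix, and the core
file name/location depends on your kernel.core_pattern, so treat both as examples):

# in the batch script, before mpirun; make sure slurm propagates the limit
# (e.g. PropagateResourceLimits in slurm.conf)
ulimit -c unlimited

# after a failure, inspect an orted core on the affected node
gdb /opt/openmpi/1.10.2/gcc/4.9.2/bin/orted /tmp/core.orted.12345 -ex bt -ex quit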

Cheers

Gilles

On Tuesday, April 12, 2016, Stefan Friedel <
stefan.frie...@iwr.uni-heidelberg.de> wrote:

> [...]

Re: [OMPI users] orte has lost communication

2016-04-12 Thread Stefan Friedel

On Tue, Apr 12, 2016 at 07:51:48PM +0900, Gilles Gouaillardet wrote:

> What happens if you set
>
> ulimit -c unlimited
>
> Does orted generate a core dump?

Hi Gilles,
-thanks for your support!- nope, no core, just the "orte has lost"...

I have now tested with a simple hello-world MPI program - a printf("rank, processor")
in the middle and a printf("before mpi_init") / printf("after mpi_init") around MPI_Init().

Starting in the batch script with

mpirun -verbose --mca mtl psm --mca btl vader,self --mca 
orte_base_help_aggregate 0 ~/mpihw/mpi_hello_world

Results:

Starting at Tue Apr 12 13:06:38 CEST 2016
Running on hosts: stek[090-189]
Running on 100 nodes.
Current working directory is /export/homelocal/sfriedel/mpihw
Hello world before mpi_init
[...]
Hello world from processor stek150, rank 971 out of 1600 processors
Program finished with exit code 0 at: Tue Apr 12 13:06:42 CEST 2016

Even with just 100 nodes some jobs are failing (roughly 50/50). Failing jobs produce _no
output_ and _no core dump_... only the "orte has lost" message.

Running on >=350 nodes: almost all jobs are failing, but some jobs succeeded
(similar output: only "orte has lost..." for failing jobs and the expected
output for the other jobs).

Weird.

MfG/Sincerely
Stefan Friedel
--
IWR * 4.317 * INF205 * 69120 Heidelberg
T +49 6221 5414404 * F +49 6221 5414427
stefan.frie...@iwr.uni-heidelberg.de




Re: [OMPI users] orte has lost communication

2016-04-12 Thread Stefan Friedel

On Tue, Apr 12, 2016 at 01:30:37PM +0200, Stefan Friedel wrote:

> -thanks for your support!- nope, no core, just the "orte has lost"...

Dear list - the problem is _not_ related to Open MPI. I compiled MVAPICH2 and I
get communication errors, too. Probably this is a hardware problem.
Sorry for the noise - I will report back on the real reason for the "orte has
lost..." message.
MfG/Sincerely
Stefan Friedel
--
IWR * 4.317 * INF205 * 69120 Heidelberg
T +49 6221 5414404 * F +49 6221 5414427
stefan.frie...@iwr.uni-heidelberg.de




Re: [OMPI users] orte has lost communication

2016-04-12 Thread Ralph Castain
My apologies for the tardy response - been stuck in meetings. I'm glad to
hear that you are making progress tracking this down. FWIW: the error
message you received indicates that the socket from that node unexpectedly
reset during execution of the application. So it sounds like there is
something flaky in the Ethernet.

One thing I've found that can cause that problem is two nodes having the
same IP address. This causes periodic random resets of the connections. So
you might want to just do an IP scan to ensure that all the addresses are
unique.
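
For example, something along these lines (assuming arp-scan is installed on a
management node; interface and subnet taken from the earlier mail):

# arp-scan appends a "(DUP: n)" marker when it gets multiple replies for the same IP
sudo arp-scan --interface=eth0 10.203.0.0/22 | grep 'DUP:'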

Let us know if we can be of help
Ralph


On Tue, Apr 12, 2016 at 7:22 AM, Stefan Friedel <
stefan.frie...@iwr.uni-heidelberg.de> wrote:

> [...]


[OMPI users] Debugging help

2016-04-12 Thread dpchoudh .
Hello all

I am trying to set a breakpoint during the modex exchange process so I can
see the data being passed for different transport types. I assume that this
is being done in the context of orted, since it is part of the process launch.

Here is what I did: (All of this pertains to the master branch and NOT the
1.10 release)

1. Built  and installed OpenMPI like this: (on two nodes)
./configure --enable-debug --enable-debug-symbols --disable-dlopen && make
&& sudo make install

2. Compiled a tiny hello-world MPI program, mpitest (on both nodes)

3. Since the modex exchange is a macro now (it used to be a function call
before), I have to put the breakpoint inside a line of code in the macro; I
chose the function mca_base_component_to_string(). I hoped that choosing
--enable-debug-symbols and --disable-dlopen would make this function
visible, but maybe I am wrong. Do I need to explicitly add a DLSPEC to
libtool?

4. I launched gdb like this:
gdb mpirun
set args -np 2 -H bigMPI,smallMPI -mca btl tcp,self ./mpitest
b mca_base_component_to_string
run

This told me that the breakpoint is not present in the executable and gdb
will try to load a shared object if needed; I chose 'yes'.
However, the breakpoint never triggers and the program runs to completion
and exits.

I have two requests:
1. Please help me understand what I am doing wrong.
2. Is there a switch (or perhaps a sequence of switches) to 'configure' that
will create the most debuggable image, throwing all performance
optimization out of the window? This would be a great thing for a developer.

Thank you
Durga

We learn from history that we never learn from history.


[OMPI users] Possible bug in MPI_Barrier() ?

2016-04-12 Thread dpchoudh .
Hi all

I have reported this issue before, but had then brushed it off as something
caused by my modifications to the source tree. It looks like that
is not the case.

Just now, I did the following:

1. Cloned a fresh copy from master.
2. Configured with the following flags, built and installed it in my
two-node "cluster".
--enable-debug --enable-debug-symbols --disable-dlopen
3. Compiled the following program, mpitest.c with these flags: -g3 -Wall
-Wextra
4. Ran it like this:
[durga@smallMPI ~]$ mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp
-mca pml ob1 ./mpitest

With this, the code hangs at MPI_Barrier() on both nodes, after generating
the following output:

Hello world from processor smallMPI, rank 0 out of 2 processors
Hello world from processor bigMPI, rank 1 out of 2 processors
smallMPI sent haha!
bigMPI received haha!

Attaching to the hung process at one node gives the following backtrace:

(gdb) bt
#0  0x7f55b0f41c3d in poll () from /lib64/libc.so.6
#1  0x7f55b03ccde6 in poll_dispatch (base=0x70e7b0, tv=0x7ffd1bb551c0)
at poll.c:165
#2  0x7f55b03c4a90 in opal_libevent2022_event_base_loop (base=0x70e7b0,
flags=2) at event.c:1630
#3  0x7f55b02f0144 in opal_progress () at runtime/opal_progress.c:171
#4  0x7f55b14b4d8b in opal_condition_wait (c=0x7f55b19fec40
, m=0x7f55b19febc0 ) at
../opal/threads/condition.h:76
#5  0x7f55b14b531b in ompi_request_default_wait_all (count=2,
requests=0x7ffd1bb55370, statuses=0x7ffd1bb55340) at request/req_wait.c:287
#6  0x7f55b157a225 in ompi_coll_base_sendrecv_zero (dest=1, stag=-16,
source=1, rtag=-16, comm=0x601280 )
at base/coll_base_barrier.c:63
#7  0x7f55b157a92a in ompi_coll_base_barrier_intra_two_procs
(comm=0x601280 , module=0x7c2630) at
base/coll_base_barrier.c:308
#8  0x7f55b15aafec in ompi_coll_tuned_barrier_intra_dec_fixed
(comm=0x601280 , module=0x7c2630) at
coll_tuned_decision_fixed.c:196
#9  0x7f55b14d36fd in PMPI_Barrier (comm=0x601280
) at pbarrier.c:63
#10 0x00400b0b in main (argc=1, argv=0x7ffd1bb55658) at mpitest.c:26
(gdb)

Thinking that this might be a bug in tuned collectives, since that is what
the stack shows, I ran the program like this (basically adding the ^tuned
part)

[durga@smallMPI ~]$ mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp
-mca pml ob1 -mca coll ^tuned ./mpitest

It still hangs, but now with a different stack trace:
(gdb) bt
#0  0x7f910d38ac3d in poll () from /lib64/libc.so.6
#1  0x7f910c815de6 in poll_dispatch (base=0x1a317b0, tv=0x7fff43ee3610)
at poll.c:165
#2  0x7f910c80da90 in opal_libevent2022_event_base_loop
(base=0x1a317b0, flags=2) at event.c:1630
#3  0x7f910c739144 in opal_progress () at runtime/opal_progress.c:171
#4  0x7f910db130f7 in opal_condition_wait (c=0x7f910de47c40
, m=0x7f910de47bc0 )
at ../../../../opal/threads/condition.h:76
#5  0x7f910db132d8 in ompi_request_wait_completion (req=0x1b07680) at
../../../../ompi/request/request.h:383
#6  0x7f910db1533b in mca_pml_ob1_send (buf=0x0, count=0,
datatype=0x7f910de1e340 , dst=1, tag=-16,
sendmode=MCA_PML_BASE_SEND_STANDARD,
comm=0x601280 ) at pml_ob1_isend.c:259
#7  0x7f910d9c3b38 in ompi_coll_base_barrier_intra_basic_linear
(comm=0x601280 , module=0x1b092c0) at
base/coll_base_barrier.c:368
#8  0x7f910d91c6fd in PMPI_Barrier (comm=0x601280
) at pbarrier.c:63
#9  0x00400b0b in main (argc=1, argv=0x7fff43ee3a58) at mpitest.c:26
(gdb)

The mpitest.c program is as follows:
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char** argv)
{
    int world_size, world_rank, name_len;
    char hostname[MPI_MAX_PROCESSOR_NAME], buf[8];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Get_processor_name(hostname, &name_len);
    printf("Hello world from processor %s, rank %d out of %d processors\n",
           hostname, world_rank, world_size);
    if (world_rank == 1)
    {
        MPI_Recv(buf, 6, MPI_CHAR, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("%s received %s\n", hostname, buf);
    }
    else
    {
        strcpy(buf, "haha!");
        MPI_Send(buf, 6, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
        printf("%s sent %s\n", hostname, buf);
    }
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}

The hostfile is as follows:
10.10.10.10 slots=1
10.10.10.11 slots=1

The two nodes are connected by three physical and three logical networks:
Physical: Gigabit Ethernet, 10G iWARP, 20G InfiniBand
Logical: IP (all three), PSM (QLogic InfiniBand), Verbs (iWARP and InfiniBand)

Please note again that this is a fresh, brand new clone.

Is this a bug (perhaps a side effect of --disable-dlopen) or something I am
doing wrong?

Thanks
Durga

We learn from history that we never learn from history.


Re: [OMPI users] Debugging help

2016-04-12 Thread Jeff Squyres (jsquyres)
On Apr 12, 2016, at 2:38 PM, dpchoudh .  wrote:
> 
> Hello all
> 
> I am trying to set a breakpoint during the modex exchange process so I can 
> see the data being passed for different transport type. I assume that this is 
> being done in the context of orted since this is part of process launch.
> 
> Here is what I did: (All of this pertains to the master branch and NOT the 
> 1.10 release)
> 
> 1. Built  and installed OpenMPI like this: (on two nodes)
> ./configure --enable-debug --enable-debug-symbols --disable-dlopen && make && 
> sudo make install

FWIW: You don't need to --disable-dlopen for this; using dlopen and plugins is 
very, very helpful (and a giant time-saver) when you're building/debugging a 
single BTL plugin, for example (because you can "cd opal/mca/btl/YOUR_BTL; make 
install" instead of a top-level install).

> 2. Compiled a tiny hello-world MPI program, mpitest (on both nodes)
> 
> 3. Since the modex exchange is a macro now (it used to be a function call 
> before), I have to put the breakpoint inside a line of code in the macro; I 
> chose the function mca_base_component_to_string(). I hoped that choosing 
> --enable-debug-symbols and --disable-dlopen would make this function visible, 
> but maybe I am wrong. Do I need to explicitly add a DLSPEC to libtool?

No, you don't need to add anything to libtool.

There are two parts to the modex:

1. each component sending its data
2. each component selectively/lazily reading data from peers

> 4. I launched gdb like this:
> gdb mpirun
> set args -np 2 -H bigMPI,smallMPI -mca btl tcp,self ./mpitest
> b mca_base_component_to_string
> run

That looks reasonable, but you are probably breaking in the wrong function.

Also, if your mpitest program doesn't do any MPI_Send/MPI_Recv functionality, 
the modex receive functionality may not be invoked.  It might be better to use 
examples/ring_c.c as your test program.
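
For example (run from the top of your Open MPI source tree; host names taken from your earlier command):

# ring_c sends/receives, so the lazy modex receive path actually gets exercised
mpicc examples/ring_c.c -o ring_c
mpirun -np 2 -H bigMPI,smallMPI -mca btl tcp,self ./ring_c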

If you upgrade your GDB to the latest version, you should be able to break on a 
macro.

> This told me that the breakpoint is not present in the executable and gdb 
> will try to load a shared object if needed; I chose 'yes'.
> However, the breakpoint never triggers and the program runs to completion and 
> exits.
> 
> I have two requests:
> 1. Please help me understand what I am doing wrong.
> 2. Is there a switch (or perhaps a sequence of switches) to 'configure' that will create 
> the most debuggable image, throwing all performance optimization out of 
> the window? This would be a great thing for a developer.

--enable-debug should do ya.

You might want to enable --enable-mem-debug and --enable-mem-profile, too, but 
those are supplementary.
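
I.e., something along these lines (the prefix and the explicit CFLAGS here are just an example):

# debug build with the optional memory debugging/profiling hooks
./configure --prefix=$HOME/ompi-debug \
    --enable-debug --enable-debug-symbols \
    --enable-mem-debug --enable-mem-profile \
    CFLAGS="-O0 -g3"
make -j 8 && make install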

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] Possible bug in MPI_Barrier() ?

2016-04-12 Thread Gilles Gouaillardet

This is quite unlikely, and fwiw, your test program works for me.

I suggest you check that your 3 TCP networks are usable, for example:

$ mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp -mca pml ob1 --mca 
btl_tcp_if_include xxx ./mpitest


in which xxx is a [list of] interface name(s):
eth0
eth1
ib0
eth0,eth1
eth0,ib0
...
eth0,eth1,ib0

and see where the problem starts occurring.

BTW, are your 3 interfaces in 3 different subnets? Is routing required
between two interfaces of the same type?
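
For example, compare on both nodes:

# show IPv4 addresses/subnets and the routing table on each node
ip -4 addr show
ip route show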


Cheers,

Gilles
On 4/13/2016 7:15 AM, dpchoudh . wrote:

[...]