[OMPI users] orte has lost communication
Good morning list,

we have a problem on our cluster with bigger jobs (roughly > 200 nodes) - almost every job ends with a message like:

###
Starting at Mon Apr 11 15:54:06 CEST 2016
Running on hosts: stek[034-086,088-201,203-247,249-344,346-379,381-388]
Running on 350 nodes.
Current working directory is /export/homelocal/sfriedel/beff
--
ORTE has lost communication with its daemon located on node:

  hostname: stek346

This is usually due to either a failure of the TCP network connection to the node, or possibly an internal failure of the daemon itself. We cannot recover from this failure, and therefore will terminate the job.
--
--
An ORTE daemon has unexpectedly failed after launch and before communicating back to mpirun. This could be caused by a number of factors, including an inability to create a connection back to mpirun due to a lack of common network interfaces and/or no route found between them. Please check network connectivity (including firewalls and network routing requirements).
--
Program finished with exit code 0 at: Mon Apr 11 15:54:41 CEST 2016
##

I found a similar question on the list by Emyr James (2015-10-01), but it never received an answer.

Cluster: dual Intel Xeon E5-2630 v3 (Haswell), Intel/QLogic TrueScale IB QDR, Debian Jessie 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u2, openmpi-1.10.2, slurm-15.08.9; homes mounted via NFS/RDMA/IPoIB, MPI messages over PSM/IB, plus 1G Ethernet (management, PXE boot, ssh, Open MPI TCP network etc.). Jobs are started via a Slurm sbatch script (mpirun --mca mtl psm ~/path/to/binary).

Already tested:

* several MCA settings (in ...many... combinations)
  mtl_psm_connect_timeout 600
  oob_tcp_keepalive_time 600
  oob_tcp_if_include eth0
  oob_tcp_listen_mode listen_thread

* several network/sysctl settings (in ...many... combinations)
  /sbin/sysctl -w net.core.somaxconn=2
  /sbin/sysctl -w net.core.netdev_max_backlog=20
  /sbin/sysctl -w net.ipv4.tcp_max_syn_backlog=102400
  /sbin/sysctl -w net.ipv4.ip_local_port_range="15000 61000"
  /sbin/sysctl -w net.ipv4.tcp_fin_timeout=10
  /sbin/sysctl -w net.ipv4.tcp_tw_recycle=1
  /sbin/sysctl -w net.ipv4.tcp_tw_reuse=1
  /sbin/sysctl -w net.ipv4.tcp_mem="383865 511820 2303190"
  echo 2500 > /proc/sys/fs/nr_open

* ulimit stuff

Routing on the nodes: two private networks, 10.203.0.0/22 on eth0 and 10.203.40.0/22 on ib0, both with their routes, no default route.

If I start the job with debugging/logging (--mca oob_tcp_debug 5 --mca oob_base_verbose 8) it takes much longer until the error occurs; the job starts on the nodes (producing some timesteps of output) but fails at some later point.

Any hint? PSM? Some kernel limit that must be increased? Wrong network/routing (which should not happen with --mca oob_tcp_if_include eth0)?

MfG/Sincerely
Stefan Friedel
--
IWR * 4.317 * INF205 * 69120 Heidelberg
T +49 6221 5414404 * F +49 6221 5414427
stefan.frie...@iwr.uni-heidelberg.de
Re: [OMPI users] orte has lost communication
Stefan,

which version of Open MPI are you using?

when does the error occur? is it before MPI_Init() completes? is it in the middle of the job? if yes, are you sure no task invoked MPI_Abort()?

also, you might want to check the system logs and make sure there was no OOM (Out Of Memory). a possible explanation could be that some tasks caused an OOM, and the OOM killer chose to kill orted instead of a.out.

if you cannot access your system log, you can try with a large number of nodes and one MPI task per node, and then increase the number of tasks per node and see if the problem starts happening.

of course, you can try
mpirun --mca oob_tcp_if_include eth0 ...
to be on the safe side.

you can also try to run your application over TCP and see if it helps (note, the issue might be hidden since TCP is much slower than native PSM):
mpirun --mca btl tcp,vader,self --mca btl_tcp_if_include eth0 ...
or
mpirun --mca btl tcp,vader,self --mca btl_tcp_if_include ib0 ...
/* feel free to replace vader with sm, if vader is not available on your system */

Cheers,

Gilles
Re: [OMPI users] orte has lost communication
On Tue, Apr 12, 2016 at 05:11:59PM +0900, Gilles Gouaillardet wrote:

Dear Gilles,

> which version of OpenMPI are you using ?
as I wrote:
> openmpi-1.10.2, slurm-15.08.9; homes mounted via NFS/RDMA/ipoib, mpi

> when does the error occur ?
> is it before MPI_Init() completes ?
> is it in the middle of the job ? if yes, are you sure no task invoked MPI_Abort
During the setup of the job (in most cases), and there is no output from the application. I will build a minimal program to get some printf debugging... I'll report...

> also, you might want to check the system logs and make sure there was no OOM (Out Of Memory).
No OOM messages from the nodes. No relevant messages at all from the nodes... (remote syslog is running from all nodes to a central system)

> mpirun --mca oob_tcp_if_include eth0 ...
I already tested this.

> mpirun --mca btl tcp,vader,self --mca btl_tcp_if_include eth0 ...
> or
> mpirun --mca btl tcp,vader,self --mca btl_tcp_if_include ib0 ...
Just tested this on 350 nodes - two out of seven jobs, spawned one after the other, were successful, but subsequent jobs were failing again:

* tcp,vader,self eth0 failed
* tcp,sm,self eth0 failed
* tcp,vader,self ib0 failed
* tcp,sm,self ib0 success!
* tcp,sm,self ib0 failed :-/
* tcp,sm,self ib0 success again!
* tcp,sm,self ib0 failed...

hhmmm. tcp+sm is a little bit more reliable??

For the sake of completeness - I forgot the ompi_info output:

 Package: Open MPI root@dyaus Distribution
 Open MPI: 1.10.2
 Open MPI repo revision: v1.10.1-145-g799148f
 Open MPI release date: Jan 21, 2016
 Open RTE: 1.10.2
 Open RTE repo revision: v1.10.1-145-g799148f
 Open RTE release date: Jan 21, 2016
 OPAL: 1.10.2
 OPAL repo revision: v1.10.1-145-g799148f
 OPAL release date: Jan 21, 2016
 MPI API: 3.0.0
 Ident string: 1.10.2
 Prefix: /opt/openmpi/1.10.2/gcc/4.9.2
 Configured architecture: x86_64-pc-linux-gnu
 Configure host: dyaus
 Configured by: root
 Configured on: Mon Apr 11 09:54:21 CEST 2016
 Configure host: dyaus
 Built by: root
 Built on: Mon Apr 11 10:12:25 CEST 2016
 Built host: dyaus
 C bindings: yes
 C++ bindings: yes
 Fort mpif.h: yes (all)
 Fort use mpi: yes (full: ignore TKR)
 Fort use mpi size: deprecated-ompi-info-value
 Fort use mpi_f08: yes
 Fort mpi_f08 compliance: The mpi_f08 module is available, but due to limitations in the gfortran compiler, does not support the following: array subsections, direct passthru (where possible) to underlying Open MPI's C functionality
 Fort mpi_f08 subarrays: no
 Java bindings: no
 Wrapper compiler rpath: runpath
 C compiler: gcc
 C compiler absolute: /usr/bin/gcc
 C compiler family name: GNU
 C compiler version: 4.9.2
 C++ compiler: g++
 C++ compiler absolute: /usr/bin/g++
 Fort compiler: gfortran
 Fort compiler abs: /usr/bin/gfortran
 Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::)
 Fort 08 assumed shape: yes
 Fort optional args: yes
 Fort INTERFACE: yes
 Fort ISO_FORTRAN_ENV: yes
 Fort STORAGE_SIZE: yes
 Fort BIND(C) (all): yes
 Fort ISO_C_BINDING: yes
 Fort SUBROUTINE BIND(C): yes
 Fort TYPE,BIND(C): yes
 Fort T,BIND(C,name="a"): yes
 Fort PRIVATE: yes
 Fort PROTECTED: yes
 Fort ABSTRACT: yes
 Fort ASYNCHRONOUS: yes
 Fort PROCEDURE: yes
 Fort USE...ONLY: yes
 Fort C_FUNLOC: yes
 Fort f08 using wrappers: yes
 Fort MPI_SIZEOF: yes
 C profiling: yes
 C++ profiling: yes
 Fort mpif.h profiling: yes
 Fort use mpi profiling: yes
 Fort use mpi_f08 prof: yes
 C++ exceptions: no
 Thread support: posix (MPI_THREAD_MULTIPLE: no, OPAL support: yes, OMPI progress: no, ORTE progress: yes, Event lib: yes)
 Sparse Groups: no
 Internal debug support: no
 MPI interface warnings: yes
 MPI parameter check: runtime
 Memory profiling support: no
 Memory debugging support: no
 dl support: yes
 Heterogeneous support: no
 mpirun default --prefix: no
 MPI I/O support: yes
 MPI_WTIME support: gettimeofday
 Symbol vis. support: yes
 Host topology support: yes
 MPI extensions:
 FT Checkpoint support: no (checkpoint thread: no)
 C/R Enabled Debugging: no
 VampirTrace support: yes
 MPI_MAX_PROCESSOR_NAME: 256
 MPI_MAX_ERROR_STRING: 256
 MPI_MAX_OBJECT_NAME: 64
 MPI_MAX_INFO_KEY: 36
 MPI_MAX_INFO_VAL: 256
 MPI_MAX_PORT_NAME: 1024
 MPI_MAX_DATAREP_STRING: 128
 MCA backtrace: execinfo (MCA v2.0.0, API v2.0.0, Component v1.10.2)
 MCA compress: gzip (MCA v2.0.0, API v2.0.0, Component v1.10.2)
 MCA compress: bzip (MCA v2.0.0, API v2.0.0, Component v1.10.2)
 MCA crs: none (MCA v2.0.0, API v2.0.0, Component v1.1
Re: [OMPI users] orte has lost communication
Stefan,

what if you set "ulimit -c unlimited" - does orted generate a core dump?

Cheers,

Gilles
Re: [OMPI users] orte has lost communication
On Tue, Apr 12, 2016 at 07:51:48PM +0900, Gilles Gouaillardet wrote:
> what if you ulimit -c unlimited
> do orted generate some core dump ?

Hi Gilles,

- thanks for your support! - nope, no core, just the "orte has lost"...

I now tested with a simple hello-world MPI program - a printf("rank, processor") in the middle and a printf("before mpi_init")/printf("after mpi_init"). Started in the batch script with:

mpirun -verbose --mca mtl psm --mca btl vader,self --mca orte_base_help_aggregate 0 ~/mpihw/mpi_hello_world

Results:

Starting at Tue Apr 12 13:06:38 CEST 2016
Running on hosts: stek[090-189]
Running on 100 nodes.
Current working directory is /export/homelocal/sfriedel/mpihw
Hello world before mpi_init
[...]
Hello world from processor stek150, rank 971 out of 1600 processors
Program finished with exit code 0 at: Tue Apr 12 13:06:42 CEST 2016

Even with just 100 nodes: some jobs are failing (50/50); failing jobs produce _no output_ and _no core dump_... only "orte has lost"...

Running on >=350 nodes: almost all jobs are failing, but some jobs succeeded (similar picture: only "orte has lost..." for failing jobs and the expected output for the others). Weird.

MfG/Sincerely
Stefan Friedel
--
IWR * 4.317 * INF205 * 69120 Heidelberg
T +49 6221 5414404 * F +49 6221 5414427
stefan.frie...@iwr.uni-heidelberg.de
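A minimal test program along these lines (a sketch only, assuming a plain hello world with extra printfs around MPI_Init - not Stefan's actual mpi_hello_world code) could look like this:

/* print before and after MPI_Init so a failing job shows how far the
 * launch got; flush stdout so the first line is not lost if the job dies */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    printf("Hello world before mpi_init\n");
    fflush(stdout);
    MPI_Init(&argc, &argv);
    printf("Hello world after mpi_init\n");

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);
    printf("Hello world from processor %s, rank %d out of %d processors\n",
           name, rank, size);

    MPI_Finalize();
    return 0;
}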
Re: [OMPI users] orte has lost communication
On Tue, Apr 12, 2016 at 01:30:37PM +0200, Stefan Friedel wrote:
> - thanks for your support! - nope, no core, just the "orte has lost"...

Dear list - the problem is _not_ related to Open MPI. I compiled mvapich2 and I get communication errors, too. Probably this is a hardware problem. Sorry for the noise - I will report back on the real reason for the "orte has lost..." message.

MfG/Sincerely
Stefan Friedel
--
IWR * 4.317 * INF205 * 69120 Heidelberg
T +49 6221 5414404 * F +49 6221 5414427
stefan.frie...@iwr.uni-heidelberg.de
Re: [OMPI users] orte has lost communication
My apologies for the tardy response - been stuck in meetings. I'm glad to hear that you are making progress tracking this down.

FWIW: the error message you received indicates that the socket from that node unexpectedly reset during execution of the application. So it sounds like there is something flaky in the Ethernet.

One thing I've found that can cause that problem is two nodes having the same IP address. This causes periodic random resets of the connections. So you might want to just do an IP scan to ensure that all the addresses are unique.

Let us know if we can be of help
Ralph
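A rough way to eyeball address uniqueness from inside an MPI job - a sketch only, assuming the job can launch at all, and no substitute for the network-level IP scan Ralph suggests - is to gather each node's first non-loopback IPv4 address at rank 0 and look for repeats:

/* gather "hostname ip" strings from every rank at rank 0 so duplicate
 * addresses are easy to spot (run with one rank per node) */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <ifaddrs.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(int argc, char **argv)
{
    int rank, size, len, i;
    char host[MPI_MAX_PROCESSOR_NAME], entry[128], *all = NULL;
    char ip[INET_ADDRSTRLEN] = "none";
    struct ifaddrs *ifa, *p;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    /* pick the first non-loopback IPv4 address on this node */
    if (getifaddrs(&ifa) == 0) {
        for (p = ifa; p != NULL; p = p->ifa_next) {
            if (p->ifa_addr && p->ifa_addr->sa_family == AF_INET &&
                strcmp(p->ifa_name, "lo") != 0) {
                inet_ntop(AF_INET,
                          &((struct sockaddr_in *)p->ifa_addr)->sin_addr,
                          ip, sizeof(ip));
                break;
            }
        }
        freeifaddrs(ifa);
    }
    snprintf(entry, sizeof(entry), "%s %s", host, ip);

    if (rank == 0)
        all = malloc((size_t)size * sizeof(entry));
    MPI_Gather(entry, (int)sizeof(entry), MPI_CHAR,
               all, (int)sizeof(entry), MPI_CHAR, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        for (i = 0; i < size; i++)
            printf("rank %d: %s\n", i, all + (size_t)i * sizeof(entry));
        free(all);
    }
    MPI_Finalize();
    return 0;
}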
[OMPI users] Debugging help
Hello all

I am trying to set a breakpoint during the modex exchange process so I can see the data being passed for different transport types. I assume that this is being done in the context of orted, since this is part of process launch.

Here is what I did (all of this pertains to the master branch and NOT the 1.10 release):

1. Built and installed Open MPI like this (on two nodes):
   ./configure --enable-debug --enable-debug-symbols --disable-dlopen && make && sudo make install

2. Compiled a tiny hello-world MPI program, mpitest (on both nodes).

3. Since the modex exchange is a macro now (it used to be a function call before), I have to put the breakpoint inside a line of code in the macro; I chose the function mca_base_component_to_string(). I hoped that choosing --enable-debug-symbols and --disable-dlopen would make this function visible, but maybe I am wrong. Do I need to explicitly add a DLSPEC to libtool?

4. I launched gdb like this:
   gdb mpirun
   set args -np 2 -H bigMPI,smallMPI -mca btl tcp,self ./mpitest
   b mca_base_component_to_string
   run

This told me that the breakpoint is not present in the executable and gdb will try to load a shared object if needed; I chose 'yes'. However, the breakpoint never triggers and the program runs to completion and exits.

I have two requests:
1. Please help me understand what I am doing wrong.
2. Is there a switch (or perhaps a sequence of switches) to 'configure' that will create the most debuggable image, while throwing all performance optimization out of the window? This would be a great thing for a developer.

Thank you
Durga

We learn from history that we never learn from history.
[OMPI users] Possible bug in MPI_Barrier() ?
Hi all

I have reported this issue before, but had brushed it off as something caused by my modifications to the source tree. It looks like that is not the case.

Just now, I did the following:

1. Cloned a fresh copy from master.
2. Configured with the following flags, built and installed it on my two-node "cluster":
   --enable-debug --enable-debug-symbols --disable-dlopen
3. Compiled the following program, mpitest.c, with these flags: -g3 -Wall -Wextra
4. Ran it like this:
   [durga@smallMPI ~]$ mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp -mca pml ob1 ./mpitest

With this, the code hangs at MPI_Barrier() on both nodes, after generating the following output:

Hello world from processor smallMPI, rank 0 out of 2 processors
Hello world from processor bigMPI, rank 1 out of 2 processors
smallMPI sent haha!
bigMPI received haha!

Attaching to the hung process at one node gives the following backtrace:

(gdb) bt
#0  0x7f55b0f41c3d in poll () from /lib64/libc.so.6
#1  0x7f55b03ccde6 in poll_dispatch (base=0x70e7b0, tv=0x7ffd1bb551c0) at poll.c:165
#2  0x7f55b03c4a90 in opal_libevent2022_event_base_loop (base=0x70e7b0, flags=2) at event.c:1630
#3  0x7f55b02f0144 in opal_progress () at runtime/opal_progress.c:171
#4  0x7f55b14b4d8b in opal_condition_wait (c=0x7f55b19fec40, m=0x7f55b19febc0) at ../opal/threads/condition.h:76
#5  0x7f55b14b531b in ompi_request_default_wait_all (count=2, requests=0x7ffd1bb55370, statuses=0x7ffd1bb55340) at request/req_wait.c:287
#6  0x7f55b157a225 in ompi_coll_base_sendrecv_zero (dest=1, stag=-16, source=1, rtag=-16, comm=0x601280) at base/coll_base_barrier.c:63
#7  0x7f55b157a92a in ompi_coll_base_barrier_intra_two_procs (comm=0x601280, module=0x7c2630) at base/coll_base_barrier.c:308
#8  0x7f55b15aafec in ompi_coll_tuned_barrier_intra_dec_fixed (comm=0x601280, module=0x7c2630) at coll_tuned_decision_fixed.c:196
#9  0x7f55b14d36fd in PMPI_Barrier (comm=0x601280) at pbarrier.c:63
#10 0x00400b0b in main (argc=1, argv=0x7ffd1bb55658) at mpitest.c:26
(gdb)

Thinking that this might be a bug in the tuned collectives, since that is what the stack shows, I ran the program like this (basically adding the ^tuned part):

[durga@smallMPI ~]$ mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp -mca pml ob1 -mca coll ^tuned ./mpitest

It still hangs, but now with a different stack trace:

(gdb) bt
#0  0x7f910d38ac3d in poll () from /lib64/libc.so.6
#1  0x7f910c815de6 in poll_dispatch (base=0x1a317b0, tv=0x7fff43ee3610) at poll.c:165
#2  0x7f910c80da90 in opal_libevent2022_event_base_loop (base=0x1a317b0, flags=2) at event.c:1630
#3  0x7f910c739144 in opal_progress () at runtime/opal_progress.c:171
#4  0x7f910db130f7 in opal_condition_wait (c=0x7f910de47c40, m=0x7f910de47bc0) at ../../../../opal/threads/condition.h:76
#5  0x7f910db132d8 in ompi_request_wait_completion (req=0x1b07680) at ../../../../ompi/request/request.h:383
#6  0x7f910db1533b in mca_pml_ob1_send (buf=0x0, count=0, datatype=0x7f910de1e340, dst=1, tag=-16, sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x601280) at pml_ob1_isend.c:259
#7  0x7f910d9c3b38 in ompi_coll_base_barrier_intra_basic_linear (comm=0x601280, module=0x1b092c0) at base/coll_base_barrier.c:368
#8  0x7f910d91c6fd in PMPI_Barrier (comm=0x601280) at pbarrier.c:63
#9  0x00400b0b in main (argc=1, argv=0x7fff43ee3a58) at mpitest.c:26
(gdb)

The mpitest.c program is as follows:

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char** argv)
{
    int world_size, world_rank, name_len;
    char hostname[MPI_MAX_PROCESSOR_NAME], buf[8];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Get_processor_name(hostname, &name_len);
    printf("Hello world from processor %s, rank %d out of %d processors\n",
           hostname, world_rank, world_size);
    if (world_rank == 1)
    {
        MPI_Recv(buf, 6, MPI_CHAR, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("%s received %s\n", hostname, buf);
    }
    else
    {
        strcpy(buf, "haha!");
        MPI_Send(buf, 6, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
        printf("%s sent %s\n", hostname, buf);
    }
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}

The hostfile is as follows:

10.10.10.10 slots=1
10.10.10.11 slots=1

The two nodes are connected by three physical and three logical networks:
Physical: Gigabit Ethernet, 10G iWARP, 20G Infiniband
Logical: IP (all three), PSM (QLogic Infiniband), Verbs (iWARP and Infiniband)

Please note again that this is a fresh, brand new clone. Is this a bug (perhaps a side effect of --disable-dlopen) or something I am doing wrong?

Thanks
Durga

We learn from history that we never learn from history.
Re: [OMPI users] Debugging help
On Apr 12, 2016, at 2:38 PM, dpchoudh . wrote:
>
> Hello all
>
> I am trying to set a breakpoint during the modex exchange process so I can see the data being passed for different transport types. I assume that this is being done in the context of orted since this is part of process launch.
>
> Here is what I did: (All of this pertains to the master branch and NOT the 1.10 release)
>
> 1. Built and installed OpenMPI like this: (on two nodes)
> ./configure --enable-debug --enable-debug-symbols --disable-dlopen && make && sudo make install

FWIW: You don't need to --disable-dlopen for this; using dlopen and plugins is very, very helpful (and a giant time-saver) when you're building/debugging a single BTL plugin, for example (because you can "cd opal/mca/btl/YOUR_BTL; make install" instead of a top-level install).

> 2. Compiled a tiny hello-world MPI program, mpitest (on both nodes)
>
> 3. Since the modex exchange is a macro now (it used to be a function call before), I have to put the breakpoint inside a line of code in the macro; I chose the function mca_base_component_to_string(). I hoped that choosing --enable-debug-symbols and --disable-dlopen would make this function visible, but maybe I am wrong. Do I need to explicitly add a DLSPEC to libtool?

No, you don't need to add anything to libtool.

There's two parts to the modex:

1. each component modex sending their data
2. each component selectively/lazily reading data from peers

> 4. I launched gdb like this:
> gdb mpirun
> set args -np 2 -H bigMPI,smallMPI -mca btl tcp,self ./mpitest
> b mca_base_component_to_string
> run

That looks reasonable, but you are probably breaking in the wrong function. Also, if your mpitest program doesn't do any MPI_Send/MPI_Recv functionality, the modex receive functionality may not be invoked. It might be better to use examples/ring_c.c as your test program.

If you upgrade your GDB to the latest version, you should be able to break on a macro.

> This told me that the breakpoint is not present in the executable and gdb will try to load a shared object if needed; I chose 'yes'.
> However, the breakpoint never triggers and the program runs to completion and exits.
>
> I have two requests:
> 1. Please help me understand what I am doing wrong.
> 2. Is there a (perhaps a sequence of) switch to 'configure' that will create the most debuggable image, while throwing all performance optimization out of the window? This would be a great thing for a developer.

--enable-debug should do ya. You might want to enable --enable-mem-debug and --enable-mem-profile, too, but those are supplementary.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
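If the test program does no point-to-point traffic, the lazy modex reads Jeff mentions may never run. examples/ring_c.c ships with the Open MPI source tree; a minimal program in the same spirit (a sketch, not the shipped example) that forces connections between neighbouring peers looks like this:

/* pass a token around a ring: every rank sends to rank+1 and receives
 * from rank-1, which triggers connection setup (and hence modex reads) */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, next, prev, token;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    next = (rank + 1) % size;
    prev = (rank + size - 1) % size;

    if (rank == 0) {
        token = 42;
        MPI_Send(&token, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, prev, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 0 got token %d back after a full loop\n", token);
    } else {
        MPI_Recv(&token, 1, MPI_INT, prev, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(&token, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}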
Re: [OMPI users] Possible bug in MPI_Barrier() ?
This is quite unlikely, and fwiw, your test program works for me.

i suggest you check that your 3 TCP networks are usable, for example:

$ mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp -mca pml ob1 --mca btl_tcp_if_include xxx ./mpitest

in which xxx is a [list of] interface name(s):
eth0
eth1
ib0
eth0,eth1
eth0,ib0
...
eth0,eth1,ib0

and see where the problem starts occurring.

btw, are your 3 interfaces in 3 different subnets? is routing required between two interfaces of the same type?

Cheers,

Gilles