Dong --

I do not see an obvious cause for the error.

Are you able to run trivial hello world / ring-style MPI jobs? (A minimal
sketch follows below.)
Is the problem localized to a specific set of nodes in the cluster?
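
For reference, a minimal hello-world test along those lines could look like the
sketch below (the file name and process count are just placeholders, and the
hostnames are taken from your log); build it with the cluster's mpicc and
launch it with the same mpirun / --mca options as your application:

  /* hello_mpi.c -- minimal MPI sanity check (sketch) */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, size, len;
      char name[MPI_MAX_PROCESSOR_NAME];

      MPI_Init(&argc, &argv);                /* basic init; your app may use MPI_Init_thread */
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's rank */
      MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of ranks */
      MPI_Get_processor_name(name, &len);    /* which node we landed on */
      printf("Hello from rank %d of %d on %s\n", rank, size, name);
      MPI_Finalize();
      return 0;
  }

  $ mpicc hello_mpi.c -o hello_mpi
  $ mpirun --mca btl self,sm,openib -np 4 ./hello_mpi

If that runs cleanly, you can try pinning the job to the nodes that show up in
your error output, e.g. "mpirun --host cn039,cn014,cn024,cn019 ..." -- that
should tell you quickly whether the failure follows particular nodes.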



> On Apr 14, 2017, at 4:30 PM, Dong Young Yoon <dy...@umich.edu> wrote:
> 
> Hi everyone,
> 
> I am a student working on a project using InfiniBand + RDMA. 
> I use the university's HPC cluster and my application seems to work on some 
> nodes, but fails with errors on other nodes. 
> It gives the following error messages when it is assigned to certain
> specific nodes (at least, that is how it appears to me):
>> [cn039][[8119,1],1][btl_openib_proc.c:157:mca_btl_openib_proc_create] 
>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[8119,1],3]
>> *** An error occurred in MPI_Init_thread
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> ***    and potentially your MPI job)
>> [cn039:90629] Local abort before MPI_INIT completed successfully; not able 
>> to aggregate error messages, and not able to guarantee that all other 
>> processes were killed!
>> [cn014][[8119,1],2][btl_openib_proc.c:157:mca_btl_openib_proc_create] 
>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[8119,1],3]
>> *** An error occurred in MPI_Init_thread
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> [cn024][[8119,1],0][btl_openib_proc.c:157:mca_btl_openib_proc_create] 
>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[8119,1],3]
>> ***    and potentially your MPI job)
>> [cn014:9916] Local abort before MPI_INIT completed successfully; not able to 
>> aggregate error messages, and not able to guarantee that all other processes 
>> were killed!
>> *** An error occurred in MPI_Init_thread
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> ***    and potentially your MPI job)
>> [cn024:142408] Local abort before MPI_INIT completed successfully; not able 
>> to aggregate error messages, and not able to guarantee that all other 
>> processes were killed!
>> *** An error occurred in MPI_Init_thread
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> ***    and potentially your MPI job)
>> [cn019:111090] Local abort before MPI_INIT completed successfully; not able 
>> to aggregate error messages, and not able to guarantee that all other 
>> processes were killed!
>> mpirun: Forwarding signal 18 to job
> 
> The following is the output of ‘ompi_info --all’ from the node from which I am 
> submitting the job using mpirun:
>> [dyoon@ln001 err]$ ompi_info --all
>>                Package: Open MPI raeker@ln001 Distribution
>>               Open MPI: 1.10.3
>> Open MPI repo revision: v1.10.2-251-g9acf492
>>  Open MPI release date: Jun 14, 2016
>>               Open RTE: 1.10.3
>> Open RTE repo revision: v1.10.2-251-g9acf492
>>  Open RTE release date: Jun 14, 2016
>>                   OPAL: 1.10.3
>>     OPAL repo revision: v1.10.2-251-g9acf492
>>      OPAL release date: Jun 14, 2016
>>                MPI API: 3.0.0
>>           Ident string: 1.10.3
>>                 Prefix: 
>> /gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0
>>            Exec_prefix: 
>> /gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0
>>                 Bindir: 
>> /gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0/bin
>>                Sbindir: 
>> /gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0/sbin
>>                 Libdir: 
>> /gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0/lib
>>                 Incdir: 
>> /gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0/include
>>                 Mandir: 
>> /gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0/man
>>              Pkglibdir: 
>> /gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0/lib/openmpi
>>             Libexecdir: 
>> /gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0/libexec
>>            Datarootdir: 
>> /gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0/share
>>                Datadir: 
>> /gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0/share
>>             Sysconfdir: 
>> /gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0/etc
>>         Sharedstatedir: 
>> /gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0/com
>>          Localstatedir: 
>> /gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0/var
>>                Infodir: 
>> /gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0/share/info
>>             Pkgdatadir: 
>> /gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0/share/openmpi
>>              Pkglibdir: 
>> /gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0/lib/openmpi
>>          Pkgincludedir: 
>> /gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0/include/openmpi
>> Configured architecture: powerpc64le-unknown-linux-gnu
>>         Configure host: ln001
>>          Configured by: raeker
>>          Configured on: Fri Jun 17 14:26:10 EDT 2016
>>         Configure host: ln001
>>               Built by: raeker
>>               Built on: Fri Jun 17 14:48:13 EDT 2016
>>             Built host: ln001
>>             C bindings: yes
>>           C++ bindings: yes
>>            Fort mpif.h: yes (all)
>>           Fort use mpi: yes (full: ignore TKR)
>>      Fort use mpi size: deprecated-ompi-info-value
>>       Fort use mpi_f08: yes
>> Fort mpi_f08 compliance: The mpi_f08 module is available, but due to
>>                         limitations in the gfortran compiler, does not
>>                         support the following: array subsections, direct
>>                         passthru (where possible) to underlying Open MPI's
>>                         C functionality
>> Fort mpi_f08 subarrays: no
>>          Java bindings: no
>> Wrapper compiler rpath: runpath
>>             C compiler: gcc
>>    C compiler absolute: 
>> /gpfs/gpfs0/software/rhel72/packages/gcc/5.4.0/bin/gcc
>> C compiler family name: GNU
>>     C compiler version: 5.4.0
>>            C char size: 1
>>            C bool size: 1
>>           C short size: 2
>>             C int size: 4
>>            C long size: 8
>>           C float size: 4
>>          C double size: 8
>>         C pointer size: 8
>>           C char align: 1
>>           C bool align: 1
>>            C int align: 4
>>          C float align: 4
>>         C double align: 8
>>           C++ compiler: g++
>>  C++ compiler absolute: 
>> /gpfs/gpfs0/software/rhel72/packages/gcc/5.4.0/bin/g++
>>          Fort compiler: gfortran
>>      Fort compiler abs: 
>> /gpfs/gpfs0/software/rhel72/packages/gcc/5.4.0/bin/gfortran
>>        Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::)
>>  Fort 08 assumed shape: yes
>>     Fort optional args: yes
>>         Fort INTERFACE: yes
>>   Fort ISO_FORTRAN_ENV: yes
>>      Fort STORAGE_SIZE: yes
>>     Fort BIND(C) (all): yes
>>     Fort ISO_C_BINDING: yes
>> Fort SUBROUTINE BIND(C): yes
>>      Fort TYPE,BIND(C): yes
>> Fort T,BIND(C,name="a"): yes
>>           Fort PRIVATE: yes
>>         Fort PROTECTED: yes
>>          Fort ABSTRACT: yes
>>      Fort ASYNCHRONOUS: yes
>>         Fort PROCEDURE: yes
>>        Fort USE...ONLY: yes
>>          Fort C_FUNLOC: yes
>> Fort f08 using wrappers: yes
>>        Fort MPI_SIZEOF: yes
>>      Fort integer size: 4
>>      Fort logical size: 4
>> Fort logical value true: 1
>>     Fort have integer1: yes
>>     Fort have integer2: yes
>>     Fort have integer4: yes
>>     Fort have integer8: yes
>>    Fort have integer16: no
>>        Fort have real4: yes
>>        Fort have real8: yes
>>       Fort have real16: yes
>>     Fort have complex8: yes
>>    Fort have complex16: yes
>>    Fort have complex32: yes
>>     Fort integer1 size: 1
>>     Fort integer2 size: 2
>>     Fort integer4 size: 4
>>     Fort integer8 size: 8
>>    Fort integer16 size: -1
>>         Fort real size: 4
>>        Fort real4 size: 4
>>        Fort real8 size: 8
>>       Fort real16 size: 16
>>     Fort dbl prec size: 8
>>         Fort cplx size: 8
>>     Fort dbl cplx size: 16
>>        Fort cplx8 size: 8
>>       Fort cplx16 size: 16
>>       Fort cplx32 size: 32
>>     Fort integer align: 4
>>    Fort integer1 align: 1
>>    Fort integer2 align: 2
>>    Fort integer4 align: 4
>>    Fort integer8 align: 8
>>   Fort integer16 align: -1
>>        Fort real align: 4
>>       Fort real4 align: 4
>>       Fort real8 align: 8
>>      Fort real16 align: 16
>>    Fort dbl prec align: 8
>>        Fort cplx align: 4
>>    Fort dbl cplx align: 8
>>       Fort cplx8 align: 4
>>      Fort cplx16 align: 8
>>      Fort cplx32 align: 16
>>            C profiling: yes
>>          C++ profiling: yes
>>  Fort mpif.h profiling: yes
>> Fort use mpi profiling: yes
>>  Fort use mpi_f08 prof: yes
>>         C++ exceptions: no
>>         Thread support: posix (MPI_THREAD_MULTIPLE: no, OPAL support: yes,
>>                         OMPI progress: no, ORTE progress: yes, Event lib:
>>                         yes)
>>          Sparse Groups: no
>>           Build CFLAGS: -O3 -DNDEBUG -finline-functions
>>                         -fno-strict-aliasing -pthread
>>         Build CXXFLAGS: -O3 -DNDEBUG -finline-functions -pthread
>>          Build FCFLAGS:
>>          Build LDFLAGS:
>>             Build LIBS: -lrt -lm -lutil
>>   Wrapper extra CFLAGS: -pthread
>> Wrapper extra CXXFLAGS: -pthread
>>  Wrapper extra FCFLAGS: -pthread
>>  Wrapper extra LDFLAGS: 
>> -L/gpfs/gpfs0/systems/sw/LSF/9.1/linux3.10-glibc2.17-ppc64le/lib
>>                            -Wl,-rpath
>>                         
>> -Wl,/gpfs/gpfs0/systems/sw/LSF/9.1/linux3.10-glibc2.17-ppc64le/lib
>>                         -Wl,-rpath -Wl,@{libdir} -Wl,--enable-new-dtags
>>     Wrapper extra LIBS: -lm -lnuma -ldl -lrt -lbat -llsf -lnsl -losmcomp
>>                         -libverbs -lrdmacm -lutil
>> Internal debug support: no
>> MPI interface warnings: yes
>>    MPI parameter check: runtime
>> Memory profiling support: no
>> Memory debugging support: no
>>             dl support: no
>>  Heterogeneous support: no
>> mpirun default --prefix: no
>>        MPI I/O support: yes
>>      MPI_WTIME support: gettimeofday
>>    Symbol vis. support: yes
>>  Host topology support: yes
>>         MPI extensions:
>>  FT Checkpoint support: no (checkpoint thread: no)
>>  C/R Enabled Debugging: no
>>    VampirTrace support: yes
>> MPI_MAX_PROCESSOR_NAME: 256
>>   MPI_MAX_ERROR_STRING: 256
>>    MPI_MAX_OBJECT_NAME: 64
>>       MPI_MAX_INFO_KEY: 36
>>       MPI_MAX_INFO_VAL: 256
>>      MPI_MAX_PORT_NAME: 1024
>> MPI_MAX_DATAREP_STRING: 128
>>                MCA mca: parameter "mca_param_files" (current value:
>>                         
>> "/gpfs/gpfs0/groups/mozafari/dyoon/.openmpi/mca-params.conf:/gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0/etc/openmpi-mca-params.conf",
>>                         data source: default, level: 2 user/detail, type:
>>                         string, deprecated, synonym of:
>>                         mca_base_param_files)
>>                         Path for MCA configuration files containing
>>                         variable values
>>                MCA mca: parameter "mca_component_path" (current value:
>>                         
>> "/gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0/lib/openmpi:/gpfs/gpfs0/groups/mozafari/dyoon/.openmpi/components",
>>                         data source: default, level: 9 dev/all, type:
>>                         string, deprecated, synonym of:
>>                         mca_base_component_path)
>>                         Path where to look for Open MPI and ORTE components
>>                MCA mca: parameter "mca_component_show_load_errors" (current
>>                         value: "true", data source: default, level: 9
>>                         dev/all, type: bool, deprecated, synonym of:
>>                         mca_base_component_show_load_errors)
>>                         Whether to show errors for components that failed
>>                         to load or not
[snip]
> 
> I ran my job with the following command:
>> mpirun --mca btl self,sm,openib <my application>
> 
> 
> My PATH and LD_LIBRARY_PATH are set as follows:
>> [dyoon@ln001 err]$ echo $LD_LIBRARY_PATH
>> /gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0/lib:/gpfs/gpfs0/software/rhel72/packages/gcc/5.4.0/lib64:/lib64:/gpfs/gpfs0/groups/mozafari/dyoon/work/rdma_2pc_dev/lib:/gpfs/gpfs0/systems/sw/LSF/9.1/linux3.10-glibc2.17-ppc64le/lib:/lib64
>> [dyoon@ln001 err]$ echo $PATH
>> /gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0/bin:/gpfs/gpfs0/software/rhel72/packages/gcc/5.4.0/bin:/gpfs/gpfs0/groups/mozafari/dyoon/.bin:/gpfs/gpfs0/groups/mozafari/dyoon/.local/scons/bin:/usr/mpi/gcc/openmpi-1.8.8/bin:/gpfs/gpfs0/systems/sw/LSF/9.1/linux3.10-glibc2.17-ppc64le/etc:/gpfs/gpfs0/systems/sw/LSF/9.1/linux3.10-glibc2.17-ppc64le/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/ibutils/bin:/gpfs/gpfs0/groups/mozafari/dyoon/.local/bin:/gpfs/gpfs0/groups/mozafari/dyoon/bin
> 
> Could this be a node-specific issue? I hope it is instead a problem with my 
> application, since I do not have much control over the cluster's setup, and I 
> could not find any resources to identify the actual cause of this issue.
> Any feedback or help will be greatly appreciated. Thank you.


-- 
Jeff Squyres
jsquy...@cisco.com

