Dong -- I do not see an obvious cause for the error.
Are you able to run trivial hello world / ring kinds of MPI jobs? Is the problem localized to a specific set of nodes in the cluster? (A minimal sketch of such a test appears at the end of this message.)

> On Apr 14, 2017, at 4:30 PM, Dong Young Yoon <dy...@umich.edu> wrote:
>
> Hi everyone,
>
> I am a student working on a project using InfiniBand+RDMA.
> I use the university's HPC cluster, and my application seems to work on some
> nodes but fails with errors on other nodes.
> It gives the following error messages when it is assigned to certain specific
> (at least it seems that way to me) nodes:
>> [cn039][[8119,1],1][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[8119,1],3]
>> *** An error occurred in MPI_Init_thread
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> *** and potentially your MPI job)
>> [cn039:90629] Local abort before MPI_INIT completed successfully; not able
>> to aggregate error messages, and not able to guarantee that all other
>> processes were killed!
>> [cn014][[8119,1],2][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[8119,1],3]
>> *** An error occurred in MPI_Init_thread
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> [cn024][[8119,1],0][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[8119,1],3]
>> *** and potentially your MPI job)
>> [cn014:9916] Local abort before MPI_INIT completed successfully; not able to
>> aggregate error messages, and not able to guarantee that all other processes
>> were killed!
>> *** An error occurred in MPI_Init_thread
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> *** and potentially your MPI job)
>> [cn024:142408] Local abort before MPI_INIT completed successfully; not able
>> to aggregate error messages, and not able to guarantee that all other
>> processes were killed!
>> *** An error occurred in MPI_Init_thread
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> *** and potentially your MPI job)
>> [cn019:111090] Local abort before MPI_INIT completed successfully; not able
>> to aggregate error messages, and not able to guarantee that all other
>> processes were killed!
>> mpirun: Forwarding signal 18 to job
>
> The following is the output of 'ompi_info --all' from the node where I am
> submitting the job using mpirun:
>> [dyoon@ln001 err]$ ompi_info --all
>> Package: Open MPI raeker@ln001 Distribution
>> Open MPI: 1.10.3
>> Open MPI repo revision: v1.10.2-251-g9acf492
>> Open MPI release date: Jun 14, 2016
>> Open RTE: 1.10.3
>> Open RTE repo revision: v1.10.2-251-g9acf492
>> Open RTE release date: Jun 14, 2016
>> OPAL: 1.10.3
>> OPAL repo revision: v1.10.2-251-g9acf492
>> OPAL release date: Jun 14, 2016
>> MPI API: 3.0.0
>> Ident string: 1.10.3
>> Prefix: /gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0
>> Exec_prefix: /gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0
>> Bindir: /gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0/bin
>> Sbindir: /gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0/sbin
>> Libdir: /gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0/lib
>> Incdir: /gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0/include
>> Mandir: /gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0/man
>> Pkglibdir: /gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0/lib/openmpi
>> Libexecdir: /gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0/libexec
>> Datarootdir: /gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0/share
>> Datadir: /gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0/share
>> Sysconfdir: /gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0/etc
>> Sharedstatedir: /gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0/com
>> Localstatedir: /gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0/var
>> Infodir: /gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0/share/info
>> Pkgdatadir: /gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0/share/openmpi
>> Pkglibdir: /gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0/lib/openmpi
>> Pkgincludedir: /gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0/include/openmpi
>> Configured architecture: powerpc64le-unknown-linux-gnu
>> Configure host: ln001
>> Configured by: raeker
>> Configured on: Fri Jun 17 14:26:10 EDT 2016
>> Configure host: ln001
>> Built by: raeker
>> Built on: Fri Jun 17 14:48:13 EDT 2016
>> Built host: ln001
>> C bindings: yes
>> C++ bindings: yes
>> Fort mpif.h: yes (all)
>> Fort use mpi: yes (full: ignore TKR)
>> Fort use mpi size: deprecated-ompi-info-value
>> Fort use mpi_f08: yes
>> Fort mpi_f08 compliance: The mpi_f08 module is available, but due to
>> limitations in the gfortran compiler, does not support the following:
>> array subsections, direct passthru (where possible) to underlying Open
>> MPI's C functionality
>> Fort mpi_f08 subarrays: no
>> Java bindings: no
>> Wrapper compiler rpath: runpath
>> C compiler: gcc
>> C compiler absolute: /gpfs/gpfs0/software/rhel72/packages/gcc/5.4.0/bin/gcc
>> C compiler family name: GNU
>> C compiler version: 5.4.0
>> C char size: 1
>> C bool size: 1
>> C short size: 2
>> C int size: 4
>> C long size: 8
>> C float size: 4
>> C double size: 8
>> C pointer size: 8
>> C char align: 1
>> C bool align: 1
>> C int align: 4
>> C float align: 4
>> C double align: 8
>> C++ compiler: g++
>> C++ compiler absolute: /gpfs/gpfs0/software/rhel72/packages/gcc/5.4.0/bin/g++
>> Fort compiler: gfortran
>> Fort compiler abs: /gpfs/gpfs0/software/rhel72/packages/gcc/5.4.0/bin/gfortran
>> Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::)
>> Fort 08 assumed shape: yes
>> Fort optional args: yes
>> Fort INTERFACE: yes
>> Fort ISO_FORTRAN_ENV: yes
>> Fort STORAGE_SIZE: yes
>> Fort BIND(C) (all): yes
>> Fort ISO_C_BINDING: yes
>> Fort SUBROUTINE BIND(C): yes
>> Fort TYPE,BIND(C): yes
>> Fort T,BIND(C,name="a"): yes
>> Fort PRIVATE: yes
>> Fort PROTECTED: yes
>> Fort ABSTRACT: yes
>> Fort ASYNCHRONOUS: yes
>> Fort PROCEDURE: yes
>> Fort USE...ONLY: yes
>> Fort C_FUNLOC: yes
>> Fort f08 using wrappers: yes
>> Fort MPI_SIZEOF: yes
>> Fort integer size: 4
>> Fort logical size: 4
>> Fort logical value true: 1
>> Fort have integer1: yes
>> Fort have integer2: yes
>> Fort have integer4: yes
>> Fort have integer8: yes
>> Fort have integer16: no
>> Fort have real4: yes
>> Fort have real8: yes
>> Fort have real16: yes
>> Fort have complex8: yes
>> Fort have complex16: yes
>> Fort have complex32: yes
>> Fort integer1 size: 1
>> Fort integer2 size: 2
>> Fort integer4 size: 4
>> Fort integer8 size: 8
>> Fort integer16 size: -1
>> Fort real size: 4
>> Fort real4 size: 4
>> Fort real8 size: 8
>> Fort real16 size: 16
>> Fort dbl prec size: 8
>> Fort cplx size: 8
>> Fort dbl cplx size: 16
>> Fort cplx8 size: 8
>> Fort cplx16 size: 16
>> Fort cplx32 size: 32
>> Fort integer align: 4
>> Fort integer1 align: 1
>> Fort integer2 align: 2
>> Fort integer4 align: 4
>> Fort integer8 align: 8
>> Fort integer16 align: -1
>> Fort real align: 4
>> Fort real4 align: 4
>> Fort real8 align: 8
>> Fort real16 align: 16
>> Fort dbl prec align: 8
>> Fort cplx align: 4
>> Fort dbl cplx align: 8
>> Fort cplx8 align: 4
>> Fort cplx16 align: 8
>> Fort cplx32 align: 16
>> C profiling: yes
>> C++ profiling: yes
>> Fort mpif.h profiling: yes
>> Fort use mpi profiling: yes
>> Fort use mpi_f08 prof: yes
>> C++ exceptions: no
>> Thread support: posix (MPI_THREAD_MULTIPLE: no, OPAL support: yes,
>> OMPI progress: no, ORTE progress: yes, Event lib: yes)
>> Sparse Groups: no
>> Build CFLAGS: -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -pthread
>> Build CXXFLAGS: -O3 -DNDEBUG -finline-functions -pthread
>> Build FCFLAGS:
>> Build LDFLAGS:
>> Build LIBS: -lrt -lm -lutil
>> Wrapper extra CFLAGS: -pthread
>> Wrapper extra CXXFLAGS: -pthread
>> Wrapper extra FCFLAGS: -pthread
>> Wrapper extra LDFLAGS: -L/gpfs/gpfs0/systems/sw/LSF/9.1/linux3.10-glibc2.17-ppc64le/lib
>> -Wl,-rpath -Wl,/gpfs/gpfs0/systems/sw/LSF/9.1/linux3.10-glibc2.17-ppc64le/lib
>> -Wl,-rpath -Wl,@{libdir} -Wl,--enable-new-dtags
>> Wrapper extra LIBS: -lm -lnuma -ldl -lrt -lbat -llsf -lnsl -losmcomp
>> -libverbs -lrdmacm -lutil
>> Internal debug support: no
>> MPI interface warnings: yes
>> MPI parameter check: runtime
>> Memory profiling support: no
>> Memory debugging support: no
>> dl support: no
>> Heterogeneous support: no
>> mpirun default --prefix: no
>> MPI I/O support: yes
>> MPI_WTIME support: gettimeofday
>> Symbol vis. support: yes
>> Host topology support: yes
>> MPI extensions:
>> FT Checkpoint support: no (checkpoint thread: no)
>> C/R Enabled Debugging: no
>> VampirTrace support: yes
>> MPI_MAX_PROCESSOR_NAME: 256
>> MPI_MAX_ERROR_STRING: 256
>> MPI_MAX_OBJECT_NAME: 64
>> MPI_MAX_INFO_KEY: 36
>> MPI_MAX_INFO_VAL: 256
>> MPI_MAX_PORT_NAME: 1024
>> MPI_MAX_DATAREP_STRING: 128
>> MCA mca: parameter "mca_param_files" (current value:
>> "/gpfs/gpfs0/groups/mozafari/dyoon/.openmpi/mca-params.conf:/gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0/etc/openmpi-mca-params.conf",
>> data source: default, level: 2 user/detail, type: string, deprecated,
>> synonym of: mca_base_param_files)
>> Path for MCA configuration files containing variable values
>> MCA mca: parameter "mca_component_path" (current value:
>> "/gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0/lib/openmpi:/gpfs/gpfs0/groups/mozafari/dyoon/.openmpi/components",
>> data source: default, level: 9 dev/all, type: string, deprecated,
>> synonym of: mca_base_component_path)
>> Path where to look for Open MPI and ORTE components
>> MCA mca: parameter "mca_component_show_load_errors" (current value: "true",
>> data source: default, level: 9 dev/all, type: bool, deprecated,
>> synonym of: mca_base_component_show_load_errors)
>> Whether to show errors for components that failed to load or not
[snip]
>
> I ran my job with the following command:
>> mpirun --mca btl self,sm,openib <my application>
>
> My PATH and LD_LIBRARY_PATH are set as follows:
>> [dyoon@ln001 err]$ echo $LD_LIBRARY_PATH
>> /gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0/lib:/gpfs/gpfs0/software/rhel72/packages/gcc/5.4.0/lib64:/lib64:/gpfs/gpfs0/groups/mozafari/dyoon/work/rdma_2pc_dev/lib:/gpfs/gpfs0/systems/sw/LSF/9.1/linux3.10-glibc2.17-ppc64le/lib:/lib64
>> [dyoon@ln001 err]$ echo $PATH
>> /gpfs/gpfs0/software/rhel72/packages/openmpi/1.10.3/gcc-5.4.0/bin:/gpfs/gpfs0/software/rhel72/packages/gcc/5.4.0/bin:/gpfs/gpfs0/groups/mozafari/dyoon/.bin:/gpfs/gpfs0/groups/mozafari/dyoon/.local/scons/bin:/usr/mpi/gcc/openmpi-1.8.8/bin:/gpfs/gpfs0/systems/sw/LSF/9.1/linux3.10-glibc2.17-ppc64le/etc:/gpfs/gpfs0/systems/sw/LSF/9.1/linux3.10-glibc2.17-ppc64le/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/ibutils/bin:/gpfs/gpfs0/groups/mozafari/dyoon/.local/bin:/gpfs/gpfs0/groups/mozafari/dyoon/bin
>
> Could this be a node-specific issue? I hope it is rather a problem with my
> application, as I do not have much control over the setup of the cluster, and
> I could not find any resources to identify the actual cause of this issue.
> Any feedback or help would be greatly appreciated. Thank you.

--
Jeff Squyres
jsquy...@cisco.com

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
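For reference, here is a minimal sketch of the kind of "hello world / ring" test suggested above. The file name ring.c and the exact compile/launch commands mentioned afterwards are only illustrative assumptions, not something taken from this thread:

/*
 * ring.c -- minimal MPI sanity test: each rank prints the host it runs on,
 * then a token is passed once around the ring.  Sketch only, for testing
 * whether MPI_Init and basic point-to-point traffic work on given nodes.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len, token = 42;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);
    printf("hello from rank %d of %d on %s\n", rank, size, host);

    if (size > 1) {
        /* Rank 0 starts the token; every other rank receives from its
         * left neighbor before forwarding to its right neighbor. */
        if (rank == 0) {
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 0 received the token back from rank %d\n", size - 1);
        } else {
            MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}

Compiled with something like "mpicc ring.c -o ring" and launched with the same BTL selection on the suspect nodes, for example "mpirun --mca btl self,sm,openib -np 4 --host cn014,cn019,cn024,cn039 ./ring", it should show whether MPI startup and simple openib traffic succeed on those nodes independently of the application.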