Stefan,

what if you run ulimit -c unlimited: does orted generate a core dump?

Cheers,

Gilles

On Tuesday, April 12, 2016, Stefan Friedel <stefan.frie...@iwr.uni-heidelberg.de> wrote:

> On Tue, Apr 12, 2016 at 05:11:59PM +0900, Gilles Gouaillardet wrote:
> Dear Gilles,
>
>> which version of OpenMPI are you using ?
>
> as I wrote:
>
>> openmpi-1.10.2, slurm-15.08.9; homes mounted via NFS/RDMA/ipoib, mpi
>
>> when does the error occur ?
>> is it before MPI_Init() completes ?
>> is it in the middle of the job ? if yes, are you sure no task invoked
>> MPI_Abort
>
> During the setup of the job (in most cases) and there is no output from the
> application. I will build a minimal program to get some printf debugging
> ...I'll report...
>
>> also, you might want to check the system logs and make sure there was no
>> OOM (Out Of Memory).
>
> No OOM messages from the nodes. No relevant messages at all from the
> nodes...(remote syslog is running from all nodes to a central system)
>
>> mpirun --mca oob_tcp_if_include eth0 ...
>
> I already tested this.
>
>> mpirun --mca btl tcp,vader,self --mca btl_tcp_if_include eth0 ...
>> or
>> mpirun --mca btl tcp,vader,self --mca btl_tcp_if_include ib0 ...
>
> Just tested this on 350 nodes - two out of seven jobs spawned one after
> another were successful but subsequent jobs were failing again:
>
> *tcp,vader,self eth0 failed
> *tcp,sm,self eth0 failed
> *tcp,vader,self ib0 failed
> *tcp,sm,self ib0 success!
> *tcp,sm,self ib0 failed :-/
> *tcp,sm,self ib0 success again!
> *tcp,sm,self ib0 failed...
>
> hhmmm. tcp+sm is a little bit more reliable??
>
> For the sake of completeness - I forgot the ompi_info output:
>
> Package: Open MPI root@dyaus Distribution
> Open MPI: 1.10.2
> Open MPI repo revision: v1.10.1-145-g799148f
> Open MPI release date: Jan 21, 2016
> Open RTE: 1.10.2
> Open RTE repo revision: v1.10.1-145-g799148f
> Open RTE release date: Jan 21, 2016
> OPAL: 1.10.2
> OPAL repo revision: v1.10.1-145-g799148f
> OPAL release date: Jan 21, 2016
> MPI API: 3.0.0
> Ident string: 1.10.2
> Prefix: /opt/openmpi/1.10.2/gcc/4.9.2
> Configured architecture: x86_64-pc-linux-gnu
> Configure host: dyaus
> Configured by: root
> Configured on: Mon Apr 11 09:54:21 CEST 2016
> Configure host: dyaus
> Built by: root
> Built on: Mon Apr 11 10:12:25 CEST 2016
> Built host: dyaus
> C bindings: yes
> C++ bindings: yes
> Fort mpif.h: yes (all)
> Fort use mpi: yes (full: ignore TKR)
> Fort use mpi size: deprecated-ompi-info-value
> Fort use mpi_f08: yes
> Fort mpi_f08 compliance: The mpi_f08 module is available, but due to
>   limitations in the gfortran compiler, does not support the following:
>   array subsections, direct passthru (where possible) to underlying
>   Open MPI's C functionality
> Fort mpi_f08 subarrays: no
> Java bindings: no
> Wrapper compiler rpath: runpath
> C compiler: gcc
> C compiler absolute: /usr/bin/gcc
> C compiler family name: GNU
> C compiler version: 4.9.2
> C++ compiler: g++
> C++ compiler absolute: /usr/bin/g++
> Fort compiler: gfortran
> Fort compiler abs: /usr/bin/gfortran
> Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::)
> Fort 08 assumed shape: yes
> Fort optional args: yes
> Fort INTERFACE: yes
> Fort ISO_FORTRAN_ENV: yes
> Fort STORAGE_SIZE: yes
> Fort BIND(C) (all): yes
> Fort ISO_C_BINDING: yes
> Fort SUBROUTINE BIND(C): yes
> Fort TYPE,BIND(C): yes
> Fort T,BIND(C,name="a"): yes
> Fort PRIVATE: yes
> Fort PROTECTED: yes
> Fort ABSTRACT: yes
> Fort ASYNCHRONOUS: yes
> Fort PROCEDURE: yes
> Fort USE...ONLY: yes
> Fort C_FUNLOC: yes
> Fort f08 using wrappers: yes
> Fort MPI_SIZEOF: yes
> C profiling: yes
> C++ profiling: yes
> Fort mpif.h profiling: yes
> Fort use mpi profiling: yes
> Fort use mpi_f08 prof: yes
> C++ exceptions: no
> Thread support: posix (MPI_THREAD_MULTIPLE: no, OPAL support: yes,
>   OMPI progress: no, ORTE progress: yes, Event lib: yes)
> Sparse Groups: no
> Internal debug support: no
> MPI interface warnings: yes
> MPI parameter check: runtime
> Memory profiling support: no
> Memory debugging support: no
> dl support: yes
> Heterogeneous support: no
> mpirun default --prefix: no
> MPI I/O support: yes
> MPI_WTIME support: gettimeofday
> Symbol vis. support: yes
> Host topology support: yes
> MPI extensions:
> FT Checkpoint support: no (checkpoint thread: no)
> C/R Enabled Debugging: no
> VampirTrace support: yes
> MPI_MAX_PROCESSOR_NAME: 256
> MPI_MAX_ERROR_STRING: 256
> MPI_MAX_OBJECT_NAME: 64
> MPI_MAX_INFO_KEY: 36
> MPI_MAX_INFO_VAL: 256
> MPI_MAX_PORT_NAME: 1024
> MPI_MAX_DATAREP_STRING: 128
> MCA backtrace: execinfo (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA compress: gzip (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA compress: bzip (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA crs: none (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA db: hash (MCA v2.0.0, API v1.0.0, Component v1.10.2)
> MCA db: print (MCA v2.0.0, API v1.0.0, Component v1.10.2)
> MCA dl: dlopen (MCA v2.0.0, API v1.0.0, Component v1.10.2)
> MCA event: libevent2021 (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA hwloc: hwloc191 (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA if: posix_ipv4 (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA if: linux_ipv6 (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA installdirs: env (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA installdirs: config (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA memory: linux (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA pstat: linux (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA sec: basic (MCA v2.0.0, API v1.0.0, Component v1.10.2)
> MCA shmem: posix (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA shmem: sysv (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA shmem: mmap (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA timer: linux (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA dfs: app (MCA v2.0.0, API v1.0.0, Component v1.10.2)
> MCA dfs: test (MCA v2.0.0, API v1.0.0, Component v1.10.2)
> MCA dfs: orted (MCA v2.0.0, API v1.0.0, Component v1.10.2)
> MCA errmgr: default_tool (MCA v2.0.0, API v3.0.0, Component v1.10.2)
> MCA errmgr: default_hnp (MCA v2.0.0, API v3.0.0, Component v1.10.2)
> MCA errmgr: default_app (MCA v2.0.0, API v3.0.0, Component v1.10.2)
> MCA errmgr: default_orted (MCA v2.0.0, API v3.0.0, Component v1.10.2)
> MCA ess: slurm (MCA v2.0.0, API v3.0.0, Component v1.10.2)
> MCA ess: tool (MCA v2.0.0, API v3.0.0, Component v1.10.2)
> MCA ess: singleton (MCA v2.0.0, API v3.0.0, Component v1.10.2)
> MCA ess: env (MCA v2.0.0, API v3.0.0, Component v1.10.2)
> MCA ess: hnp (MCA v2.0.0, API v3.0.0, Component v1.10.2)
> MCA filem: raw (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA grpcomm: bad (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA iof: orted (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA iof: hnp (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA iof: mr_hnp (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA iof: tool (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA iof: mr_orted (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA odls: default (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA oob: tcp (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA plm: isolated (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA plm: slurm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA plm: rsh (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA ras: simulator (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA ras: slurm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA ras: loadleveler (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA rmaps: round_robin (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA rmaps: ppr (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA rmaps: resilient (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA rmaps: rank_file (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA rmaps: mindist (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA rmaps: staged (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA rmaps: seq (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA rml: oob (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA routed: binomial (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA routed: radix (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA routed: direct (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA routed: debruijn (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA state: staged_orted (MCA v2.0.0, API v1.0.0, Component v1.10.2)
> MCA state: dvm (MCA v2.0.0, API v1.0.0, Component v1.10.2)
> MCA state: orted (MCA v2.0.0, API v1.0.0, Component v1.10.2)
> MCA state: tool (MCA v2.0.0, API v1.0.0, Component v1.10.2)
> MCA state: app (MCA v2.0.0, API v1.0.0, Component v1.10.2)
> MCA state: novm (MCA v2.0.0, API v1.0.0, Component v1.10.2)
> MCA state: staged_hnp (MCA v2.0.0, API v1.0.0, Component v1.10.2)
> MCA state: hnp (MCA v2.0.0, API v1.0.0, Component v1.10.2)
> MCA allocator: bucket (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA allocator: basic (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA bcol: basesmuma (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA bcol: ptpcoll (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA bml: r2 (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA btl: openib (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA btl: self (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA btl: tcp (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA btl: sm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA btl: vader (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA coll: tuned (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA coll: self (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA coll: basic (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA coll: sm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA coll: inter (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA coll: ml (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA coll: libnbc (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA coll: hierarch (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA dpm: orte (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA fbtl: posix (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA fcoll: individual (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA fcoll: two_phase (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA fcoll: static (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA fcoll: dynamic (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA fcoll: ylib (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA fs: ufs (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA io: romio (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA io: ompio (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA mpool: grdma (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA mpool: sm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA mtl: psm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA osc: pt2pt (MCA v2.0.0, API v3.0.0, Component v1.10.2)
> MCA osc: sm (MCA v2.0.0, API v3.0.0, Component v1.10.2)
> MCA pml: v (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA pml: ob1 (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA pml: cm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA pml: bfo (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA pubsub: orte (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA rcache: vma (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA rte: orte (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA sbgp: basesmuma (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA sbgp: p2p (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA sbgp: basesmsocket (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA sharedfp: lockedfile (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA sharedfp: sm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA sharedfp: individual (MCA v2.0.0, API v2.0.0, Component v1.10.2)
> MCA topo: basic (MCA v2.0.0, API v2.1.0, Component v1.10.2)
> MCA vprotocol: pessimist (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>
> MfG/Sincerely
> Stefan Friedel
> --
> IWR * 4.317 * INF205 * 69120 Heidelberg
> T +49 6221 5414404 * F +49 6221 5414427
> stefan.frie...@iwr.uni-heidelberg.de
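
For reference, the core-dump check asked about at the top of this message could be tried roughly as follows. This is only a sketch under assumptions: the helper list_cores and the variable CORE_DIR are hypothetical names made up for illustration, not Open MPI or SLURM features. Whether the raised limit actually reaches the orted daemons on the compute nodes depends on the SLURM configuration, which this sketch does not check.

```shell
#!/bin/sh
# Sketch only: list_cores and CORE_DIR are hypothetical names.

# Try to lift the soft core-file size limit for this shell and its children.
# This fails when the hard limit is lower, so report instead of aborting.
if ulimit -c unlimited 2>/dev/null; then
    echo "soft core limit: $(ulimit -c)"
else
    echo "cannot raise core limit; hard limit is $(ulimit -Hc)"
fi

# On Linux, core_pattern shows where the kernel writes core files.
if [ -r /proc/sys/kernel/core_pattern ]; then
    echo "core_pattern: $(cat /proc/sys/kernel/core_pattern)"
fi

# Launch the job from this same shell afterwards, for example:
#   mpirun --mca btl tcp,sm,self --mca btl_tcp_if_include ib0 ...

# List regular files whose names start with "core" directly inside a directory.
list_cores() {
    find "${1:-.}" -maxdepth 1 -type f -name 'core*'
}

# After a failed run, look in the job's working directory.
list_cores "${CORE_DIR:-.}"
```

If nothing is listed after a failed run, the core_pattern line is worth a second look, since the kernel may be writing cores elsewhere or piping them to a handler.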