Stefan,

what if you run

    ulimit -c unlimited

before starting the job - does orted generate a core dump then?
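For reference, a quick way to enable and locate core dumps on a node (a sketch; the gdb line just uses the install prefix from your ompi_info output below):

```shell
# Allow core dumps in this shell (and in jobs launched from it)
ulimit -c unlimited
ulimit -c                               # should now print "unlimited"

# Where does the kernel write core files on this node?
cat /proc/sys/kernel/core_pattern

# If orted crashed and left a core, get a backtrace, e.g.:
#   gdb /opt/openmpi/1.10.2/gcc/4.9.2/bin/orted core.<pid>
#   (gdb) bt
```

Note that daemons started via slurm inherit their limits from slurmd, so the limit may need to be raised there (or propagated by slurm) as well.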

Cheers

Gilles

On Tuesday, April 12, 2016, Stefan Friedel <
stefan.frie...@iwr.uni-heidelberg.de> wrote:

> On Tue, Apr 12, 2016 at 05:11:59PM +0900, Gilles Gouaillardet wrote:
> Dear Gilles,
>
>> which version of OpenMPI are you using ?
>>
> as I wrote:
>
>>    openmpi-1.10.2, slurm-15.08.9; homes mounted via NFS/RDMA/ipoib, mpi
>>
>
> when does the error occur ?
>> is it before MPI_Init() completes ?
>> is it in the middle of the job ? if yes, are you sure no task invoked
>> MPI_Abort
>>
> During job setup (in most cases), and there is no output from the
> application. I will build a minimal program to get some printf debugging
> going - I'll report back...
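a minimal reproducer along those lines can be very small (a sketch - mpi_min.c is just an example name; compile with your mpicc and run under mpirun):

```shell
# Write a tiny MPI program that prints before and after MPI_Init
cat > mpi_min.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    printf("before MPI_Init\n");
    fflush(stdout);                 /* flush so output survives a crash */

    MPI_Init(&argc, &argv);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d: after MPI_Init\n", rank, size);
    fflush(stdout);

    MPI_Finalize();
    return 0;
}
EOF

# then, on the cluster:
#   mpicc mpi_min.c -o mpi_min
#   mpirun -np 2 ./mpi_min
```

If "before MPI_Init" is printed but the second line never appears, the failure is inside MPI_Init() itself.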
>
> also, you might want to check the system logs and make sure there was no
>> OOM
>> (Out Of Memory).
>>
> No OOM messages from the nodes. No relevant messages at all from the
> nodes...(remote syslog is running from all nodes to a central system)
>
> mpirun --mca oob_tcp_if_include eth0 ...
>>
> I already tested this.
>
> mpirun --mca btl tcp,vader,self --mca btl_tcp_if_include eth0 ...
>> or
>> mpirun --mca btl tcp,vader,self --mca btl_tcp_if_include ib0 ...
>>
> Just tested this on 350 nodes - two out of seven jobs, spawned one after
> another, were successful, but subsequent jobs failed again:
>
> *tcp,vader,self eth0 failed
> *tcp,sm,self eth0 failed
> *tcp,vader,self ib0 failed
> *tcp,sm,self ib0 success!
> *tcp,sm,self ib0 failed :-/
> *tcp,sm,self ib0 success again!
> *tcp,sm,self ib0 failed...
>
> hhmmm. tcp+sm is a little bit more reliable??
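to get better statistics than single runs, it may help to enumerate the combinations systematically (a sketch; ./mpi_min stands for any small test binary):

```shell
# Build the list of mpirun invocations to retest, one per BTL/interface pair
cmds=$(
  for btl in "tcp,vader,self" "tcp,sm,self"; do
    for ifc in eth0 ib0; do
      echo "mpirun --mca btl $btl --mca btl_tcp_if_include $ifc ./mpi_min"
    done
  done
)
echo "$cmds"
```

Running each of the four command lines, say, ten times and counting the failures would show whether tcp+sm really is more reliable or whether the two successes were coincidence.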
>
> For the sake of completeness - I forgot the ompi_info output:
>
>                 Package: Open MPI root@dyaus Distribution
>                Open MPI: 1.10.2
>  Open MPI repo revision: v1.10.1-145-g799148f
>   Open MPI release date: Jan 21, 2016
>                Open RTE: 1.10.2
>  Open RTE repo revision: v1.10.1-145-g799148f
>   Open RTE release date: Jan 21, 2016
>                    OPAL: 1.10.2
>      OPAL repo revision: v1.10.1-145-g799148f
>       OPAL release date: Jan 21, 2016
>                 MPI API: 3.0.0
>            Ident string: 1.10.2
>                  Prefix: /opt/openmpi/1.10.2/gcc/4.9.2
> Configured architecture: x86_64-pc-linux-gnu
>          Configure host: dyaus
>           Configured by: root
>           Configured on: Mon Apr 11 09:54:21 CEST 2016
>          Configure host: dyaus
>                Built by: root
>                Built on: Mon Apr 11 10:12:25 CEST 2016
>              Built host: dyaus
>              C bindings: yes
>            C++ bindings: yes
>             Fort mpif.h: yes (all)
>            Fort use mpi: yes (full: ignore TKR)
>       Fort use mpi size: deprecated-ompi-info-value
>        Fort use mpi_f08: yes
> Fort mpi_f08 compliance: The mpi_f08 module is available, but due to
> limitations in the gfortran compiler, does not support the following: array
> subsections, direct passthru (where possible) to underlying Open MPI's C
> functionality
>  Fort mpi_f08 subarrays: no
>           Java bindings: no
>  Wrapper compiler rpath: runpath
>              C compiler: gcc
>     C compiler absolute: /usr/bin/gcc
>  C compiler family name: GNU
>      C compiler version: 4.9.2
>            C++ compiler: g++
>   C++ compiler absolute: /usr/bin/g++
>           Fort compiler: gfortran
>       Fort compiler abs: /usr/bin/gfortran
>         Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::)
>   Fort 08 assumed shape: yes
>      Fort optional args: yes
>          Fort INTERFACE: yes
>    Fort ISO_FORTRAN_ENV: yes
>       Fort STORAGE_SIZE: yes
>      Fort BIND(C) (all): yes
>      Fort ISO_C_BINDING: yes
> Fort SUBROUTINE BIND(C): yes
>       Fort TYPE,BIND(C): yes
> Fort T,BIND(C,name="a"): yes
>            Fort PRIVATE: yes
>          Fort PROTECTED: yes
>           Fort ABSTRACT: yes
>       Fort ASYNCHRONOUS: yes
>          Fort PROCEDURE: yes
>         Fort USE...ONLY: yes
>           Fort C_FUNLOC: yes
> Fort f08 using wrappers: yes
>         Fort MPI_SIZEOF: yes
>             C profiling: yes
>           C++ profiling: yes
>   Fort mpif.h profiling: yes
>  Fort use mpi profiling: yes
>   Fort use mpi_f08 prof: yes
>          C++ exceptions: no
>          Thread support: posix (MPI_THREAD_MULTIPLE: no, OPAL support:
> yes, OMPI progress: no, ORTE progress: yes, Event lib: yes)
>           Sparse Groups: no
>  Internal debug support: no
>  MPI interface warnings: yes
>     MPI parameter check: runtime
> Memory profiling support: no
> Memory debugging support: no
>              dl support: yes
>   Heterogeneous support: no
> mpirun default --prefix: no
>         MPI I/O support: yes
>       MPI_WTIME support: gettimeofday
>     Symbol vis. support: yes
>   Host topology support: yes
>          MPI extensions:   FT Checkpoint support: no (checkpoint thread:
> no)
>   C/R Enabled Debugging: no
>     VampirTrace support: yes
>  MPI_MAX_PROCESSOR_NAME: 256
>    MPI_MAX_ERROR_STRING: 256
>     MPI_MAX_OBJECT_NAME: 64
>        MPI_MAX_INFO_KEY: 36
>        MPI_MAX_INFO_VAL: 256
>       MPI_MAX_PORT_NAME: 1024
>  MPI_MAX_DATAREP_STRING: 128
>           MCA backtrace: execinfo (MCA v2.0.0, API v2.0.0, Component
> v1.10.2)
>            MCA compress: gzip (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>            MCA compress: bzip (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>                 MCA crs: none (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>                  MCA db: hash (MCA v2.0.0, API v1.0.0, Component v1.10.2)
>                  MCA db: print (MCA v2.0.0, API v1.0.0, Component v1.10.2)
>                  MCA dl: dlopen (MCA v2.0.0, API v1.0.0, Component v1.10.2)
>               MCA event: libevent2021 (MCA v2.0.0, API v2.0.0, Component
> v1.10.2)
>               MCA hwloc: hwloc191 (MCA v2.0.0, API v2.0.0, Component
> v1.10.2)
>                  MCA if: posix_ipv4 (MCA v2.0.0, API v2.0.0, Component
> v1.10.2)
>                  MCA if: linux_ipv6 (MCA v2.0.0, API v2.0.0, Component
> v1.10.2)
>         MCA installdirs: env (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>         MCA installdirs: config (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>              MCA memory: linux (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>               MCA pstat: linux (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>                 MCA sec: basic (MCA v2.0.0, API v1.0.0, Component v1.10.2)
>               MCA shmem: posix (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>               MCA shmem: sysv (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>               MCA shmem: mmap (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>               MCA timer: linux (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>                 MCA dfs: app (MCA v2.0.0, API v1.0.0, Component v1.10.2)
>                 MCA dfs: test (MCA v2.0.0, API v1.0.0, Component v1.10.2)
>                 MCA dfs: orted (MCA v2.0.0, API v1.0.0, Component v1.10.2)
>              MCA errmgr: default_tool (MCA v2.0.0, API v3.0.0, Component
> v1.10.2)
>              MCA errmgr: default_hnp (MCA v2.0.0, API v3.0.0, Component
> v1.10.2)
>              MCA errmgr: default_app (MCA v2.0.0, API v3.0.0, Component
> v1.10.2)
>              MCA errmgr: default_orted (MCA v2.0.0, API v3.0.0, Component
> v1.10.2)
>                 MCA ess: slurm (MCA v2.0.0, API v3.0.0, Component v1.10.2)
>                 MCA ess: tool (MCA v2.0.0, API v3.0.0, Component v1.10.2)
>                 MCA ess: singleton (MCA v2.0.0, API v3.0.0, Component
> v1.10.2)
>                 MCA ess: env (MCA v2.0.0, API v3.0.0, Component v1.10.2)
>                 MCA ess: hnp (MCA v2.0.0, API v3.0.0, Component v1.10.2)
>               MCA filem: raw (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>             MCA grpcomm: bad (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>                 MCA iof: orted (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>                 MCA iof: hnp (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>                 MCA iof: mr_hnp (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>                 MCA iof: tool (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>                 MCA iof: mr_orted (MCA v2.0.0, API v2.0.0, Component
> v1.10.2)
>                MCA odls: default (MCA v2.0.0, API v2.0.0, Component
> v1.10.2)
>                 MCA oob: tcp (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>                 MCA plm: isolated (MCA v2.0.0, API v2.0.0, Component
> v1.10.2)
>                 MCA plm: slurm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>                 MCA plm: rsh (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>                 MCA ras: simulator (MCA v2.0.0, API v2.0.0, Component
> v1.10.2)
>                 MCA ras: slurm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>                 MCA ras: loadleveler (MCA v2.0.0, API v2.0.0, Component
> v1.10.2)
>               MCA rmaps: round_robin (MCA v2.0.0, API v2.0.0, Component
> v1.10.2)
>               MCA rmaps: ppr (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>               MCA rmaps: resilient (MCA v2.0.0, API v2.0.0, Component
> v1.10.2)
>               MCA rmaps: rank_file (MCA v2.0.0, API v2.0.0, Component
> v1.10.2)
>               MCA rmaps: mindist (MCA v2.0.0, API v2.0.0, Component
> v1.10.2)
>               MCA rmaps: staged (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>               MCA rmaps: seq (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>                 MCA rml: oob (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>              MCA routed: binomial (MCA v2.0.0, API v2.0.0, Component
> v1.10.2)
>              MCA routed: radix (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>              MCA routed: direct (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>              MCA routed: debruijn (MCA v2.0.0, API v2.0.0, Component
> v1.10.2)
>               MCA state: staged_orted (MCA v2.0.0, API v1.0.0, Component
> v1.10.2)
>               MCA state: dvm (MCA v2.0.0, API v1.0.0, Component v1.10.2)
>               MCA state: orted (MCA v2.0.0, API v1.0.0, Component v1.10.2)
>               MCA state: tool (MCA v2.0.0, API v1.0.0, Component v1.10.2)
>               MCA state: app (MCA v2.0.0, API v1.0.0, Component v1.10.2)
>               MCA state: novm (MCA v2.0.0, API v1.0.0, Component v1.10.2)
>               MCA state: staged_hnp (MCA v2.0.0, API v1.0.0, Component
> v1.10.2)
>               MCA state: hnp (MCA v2.0.0, API v1.0.0, Component v1.10.2)
>           MCA allocator: bucket (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>           MCA allocator: basic (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>                MCA bcol: basesmuma (MCA v2.0.0, API v2.0.0, Component
> v1.10.2)
>                MCA bcol: ptpcoll (MCA v2.0.0, API v2.0.0, Component
> v1.10.2)
>                 MCA bml: r2 (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>                 MCA btl: openib (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>                 MCA btl: self (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>                 MCA btl: tcp (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>                 MCA btl: sm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>                 MCA btl: vader (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>                MCA coll: tuned (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>                MCA coll: self (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>                MCA coll: basic (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>                MCA coll: sm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>                MCA coll: inter (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>                MCA coll: ml (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>                MCA coll: libnbc (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>                MCA coll: hierarch (MCA v2.0.0, API v2.0.0, Component
> v1.10.2)
>                 MCA dpm: orte (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>                MCA fbtl: posix (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>               MCA fcoll: individual (MCA v2.0.0, API v2.0.0, Component
> v1.10.2)
>               MCA fcoll: two_phase (MCA v2.0.0, API v2.0.0, Component
> v1.10.2)
>               MCA fcoll: static (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>               MCA fcoll: dynamic (MCA v2.0.0, API v2.0.0, Component
> v1.10.2)
>               MCA fcoll: ylib (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>                  MCA fs: ufs (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>                  MCA io: romio (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>                  MCA io: ompio (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>               MCA mpool: grdma (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>               MCA mpool: sm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>                 MCA mtl: psm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>                 MCA osc: pt2pt (MCA v2.0.0, API v3.0.0, Component v1.10.2)
>                 MCA osc: sm (MCA v2.0.0, API v3.0.0, Component v1.10.2)
>                 MCA pml: v (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>                 MCA pml: ob1 (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>                 MCA pml: cm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>                 MCA pml: bfo (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>              MCA pubsub: orte (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>              MCA rcache: vma (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>                 MCA rte: orte (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>                MCA sbgp: basesmuma (MCA v2.0.0, API v2.0.0, Component
> v1.10.2)
>                MCA sbgp: p2p (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>                MCA sbgp: basesmsocket (MCA v2.0.0, API v2.0.0, Component
> v1.10.2)
>            MCA sharedfp: lockedfile (MCA v2.0.0, API v2.0.0, Component
> v1.10.2)
>            MCA sharedfp: sm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
>            MCA sharedfp: individual (MCA v2.0.0, API v2.0.0, Component
> v1.10.2)
>                MCA topo: basic (MCA v2.0.0, API v2.1.0, Component v1.10.2)
>           MCA vprotocol: pessimist (MCA v2.0.0, API v2.0.0, Component
> v1.10.2)
>
> MfG/Sincerely
> Stefan Friedel
> --
> IWR * 4.317 * INF205 * 69120 Heidelberg
> T +49 6221 5414404 * F +49 6221 5414427
> stefan.frie...@iwr.uni-heidelberg.de
>
