On Tue, Apr 12, 2016 at 05:11:59PM +0900, Gilles Gouaillardet wrote:
Dear Gilles,
> which version of OpenMPI are you using ?
As I wrote:
   openmpi-1.10.2, slurm-15.08.9; homes mounted via NFS/RDMA/ipoib, mpi

> when does the error occur ?
> is it before MPI_Init() completes ?
> is it in the middle of the job ? if yes, are you sure no task invoked MPI_Abort ?
During the setup of the job (in most cases), and there is no output from the
application. I will build a minimal program to get some printf debugging and
report back.

> also, you might want to check the system logs and make sure there was no OOM
> (Out Of Memory).
No OOM messages from the nodes, and no other relevant messages from the
nodes either (remote syslog is running from all nodes to a central system).

> mpirun --mca oob_tcp_if_include eth0 ...
I already tested this.

> mpirun --mca btl tcp,vader,self --mca btl_tcp_if_include eth0 ...
> or
> mpirun --mca btl tcp,vader,self --mca btl_tcp_if_include ib0 ...
Just tested this on 350 nodes: two out of seven jobs, launched one after
another, were successful, but the others failed again:

* tcp,vader,self eth0 failed
* tcp,sm,self    eth0 failed
* tcp,vader,self ib0  failed
* tcp,sm,self    ib0  success!
* tcp,sm,self    ib0  failed :-/
* tcp,sm,self    ib0  success again!
* tcp,sm,self    ib0  failed...

Hmm, is tcp+sm a little bit more reliable?
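
The BTL/interface combinations tried above could be cycled through with a small script rather than by hand. This is only a sketch: the application name ./mpi_app and the APP and RUN variables are placeholders, not taken from the real job. The --mca options are the ones quoted above. By default it only prints the command lines; with RUN=1 it would execute each one and report the exit status.

```shell
#!/bin/sh
# APP is a hypothetical placeholder for the real MPI binary.
APP="${APP:-./mpi_app}"

# Emit one mpirun command line per interface/BTL pair,
# in the same order as the manual tests listed above.
build_matrix() {
    for iface in eth0 ib0; do
        for btl in tcp,vader,self tcp,sm,self; do
            echo "mpirun --mca btl $btl --mca btl_tcp_if_include $iface $APP"
        done
    done
}

build_matrix | while read -r cmd; do
    if [ "${RUN:-0}" = 1 ]; then
        # Run the command (stdin detached so it cannot eat the loop input)
        # and record whether it exited with status 0.
        if $cmd </dev/null; then
            echo "success: $cmd"
        else
            echo "failed: $cmd"
        fi
    else
        # Dry run: only show what would be executed.
        echo "$cmd"
    fi
done
```

Since the failures are intermittent, a single pass proves little; wrapping the final loop in an outer repeat count would give a rough success rate per combination.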

For the sake of completeness - I forgot the ompi_info output:

                Package: Open MPI root@dyaus Distribution
               Open MPI: 1.10.2
 Open MPI repo revision: v1.10.1-145-g799148f
  Open MPI release date: Jan 21, 2016
               Open RTE: 1.10.2
 Open RTE repo revision: v1.10.1-145-g799148f
  Open RTE release date: Jan 21, 2016
                   OPAL: 1.10.2
     OPAL repo revision: v1.10.1-145-g799148f
      OPAL release date: Jan 21, 2016
                MPI API: 3.0.0
           Ident string: 1.10.2
                 Prefix: /opt/openmpi/1.10.2/gcc/4.9.2
Configured architecture: x86_64-pc-linux-gnu
         Configure host: dyaus
          Configured by: root
          Configured on: Mon Apr 11 09:54:21 CEST 2016
         Configure host: dyaus
               Built by: root
               Built on: Mon Apr 11 10:12:25 CEST 2016
             Built host: dyaus
             C bindings: yes
           C++ bindings: yes
            Fort mpif.h: yes (all)
           Fort use mpi: yes (full: ignore TKR)
      Fort use mpi size: deprecated-ompi-info-value
       Fort use mpi_f08: yes
Fort mpi_f08 compliance: The mpi_f08 module is available, but due to limitations in the gfortran compiler, does not support the following: array subsections, direct passthru (where possible) to underlying Open MPI's C functionality
 Fort mpi_f08 subarrays: no
          Java bindings: no
 Wrapper compiler rpath: runpath
             C compiler: gcc
    C compiler absolute: /usr/bin/gcc
 C compiler family name: GNU
     C compiler version: 4.9.2
           C++ compiler: g++
  C++ compiler absolute: /usr/bin/g++
          Fort compiler: gfortran
      Fort compiler abs: /usr/bin/gfortran
        Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::)
  Fort 08 assumed shape: yes
     Fort optional args: yes
         Fort INTERFACE: yes
   Fort ISO_FORTRAN_ENV: yes
      Fort STORAGE_SIZE: yes
     Fort BIND(C) (all): yes
     Fort ISO_C_BINDING: yes
Fort SUBROUTINE BIND(C): yes
      Fort TYPE,BIND(C): yes
Fort T,BIND(C,name="a"): yes
           Fort PRIVATE: yes
         Fort PROTECTED: yes
          Fort ABSTRACT: yes
      Fort ASYNCHRONOUS: yes
         Fort PROCEDURE: yes
        Fort USE...ONLY: yes
          Fort C_FUNLOC: yes
Fort f08 using wrappers: yes
        Fort MPI_SIZEOF: yes
            C profiling: yes
          C++ profiling: yes
  Fort mpif.h profiling: yes
 Fort use mpi profiling: yes
  Fort use mpi_f08 prof: yes
         C++ exceptions: no
         Thread support: posix (MPI_THREAD_MULTIPLE: no, OPAL support: yes, OMPI progress: no, ORTE progress: yes, Event lib: yes)
          Sparse Groups: no
 Internal debug support: no
 MPI interface warnings: yes
    MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
             dl support: yes
  Heterogeneous support: no
mpirun default --prefix: no
        MPI I/O support: yes
      MPI_WTIME support: gettimeofday
    Symbol vis. support: yes
  Host topology support: yes
         MPI extensions:
  FT Checkpoint support: no (checkpoint thread: no)
  C/R Enabled Debugging: no
    VampirTrace support: yes
 MPI_MAX_PROCESSOR_NAME: 256
   MPI_MAX_ERROR_STRING: 256
    MPI_MAX_OBJECT_NAME: 64
       MPI_MAX_INFO_KEY: 36
       MPI_MAX_INFO_VAL: 256
      MPI_MAX_PORT_NAME: 1024
 MPI_MAX_DATAREP_STRING: 128
          MCA backtrace: execinfo (MCA v2.0.0, API v2.0.0, Component v1.10.2)
           MCA compress: gzip (MCA v2.0.0, API v2.0.0, Component v1.10.2)
           MCA compress: bzip (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                MCA crs: none (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                 MCA db: hash (MCA v2.0.0, API v1.0.0, Component v1.10.2)
                 MCA db: print (MCA v2.0.0, API v1.0.0, Component v1.10.2)
                 MCA dl: dlopen (MCA v2.0.0, API v1.0.0, Component v1.10.2)
              MCA event: libevent2021 (MCA v2.0.0, API v2.0.0, Component v1.10.2)
              MCA hwloc: hwloc191 (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                 MCA if: posix_ipv4 (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                 MCA if: linux_ipv6 (MCA v2.0.0, API v2.0.0, Component v1.10.2)
        MCA installdirs: env (MCA v2.0.0, API v2.0.0, Component v1.10.2)
        MCA installdirs: config (MCA v2.0.0, API v2.0.0, Component v1.10.2)
             MCA memory: linux (MCA v2.0.0, API v2.0.0, Component v1.10.2)
              MCA pstat: linux (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                MCA sec: basic (MCA v2.0.0, API v1.0.0, Component v1.10.2)
              MCA shmem: posix (MCA v2.0.0, API v2.0.0, Component v1.10.2)
              MCA shmem: sysv (MCA v2.0.0, API v2.0.0, Component v1.10.2)
              MCA shmem: mmap (MCA v2.0.0, API v2.0.0, Component v1.10.2)
              MCA timer: linux (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                MCA dfs: app (MCA v2.0.0, API v1.0.0, Component v1.10.2)
                MCA dfs: test (MCA v2.0.0, API v1.0.0, Component v1.10.2)
                MCA dfs: orted (MCA v2.0.0, API v1.0.0, Component v1.10.2)
             MCA errmgr: default_tool (MCA v2.0.0, API v3.0.0, Component v1.10.2)
             MCA errmgr: default_hnp (MCA v2.0.0, API v3.0.0, Component v1.10.2)
             MCA errmgr: default_app (MCA v2.0.0, API v3.0.0, Component v1.10.2)
             MCA errmgr: default_orted (MCA v2.0.0, API v3.0.0, Component v1.10.2)
                MCA ess: slurm (MCA v2.0.0, API v3.0.0, Component v1.10.2)
                MCA ess: tool (MCA v2.0.0, API v3.0.0, Component v1.10.2)
                MCA ess: singleton (MCA v2.0.0, API v3.0.0, Component v1.10.2)
                MCA ess: env (MCA v2.0.0, API v3.0.0, Component v1.10.2)
                MCA ess: hnp (MCA v2.0.0, API v3.0.0, Component v1.10.2)
              MCA filem: raw (MCA v2.0.0, API v2.0.0, Component v1.10.2)
            MCA grpcomm: bad (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                MCA iof: orted (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                MCA iof: hnp (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                MCA iof: mr_hnp (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                MCA iof: tool (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                MCA iof: mr_orted (MCA v2.0.0, API v2.0.0, Component v1.10.2)
               MCA odls: default (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                MCA oob: tcp (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                MCA plm: isolated (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                MCA plm: slurm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                MCA plm: rsh (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                MCA ras: simulator (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                MCA ras: slurm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                MCA ras: loadleveler (MCA v2.0.0, API v2.0.0, Component v1.10.2)
              MCA rmaps: round_robin (MCA v2.0.0, API v2.0.0, Component v1.10.2)
              MCA rmaps: ppr (MCA v2.0.0, API v2.0.0, Component v1.10.2)
              MCA rmaps: resilient (MCA v2.0.0, API v2.0.0, Component v1.10.2)
              MCA rmaps: rank_file (MCA v2.0.0, API v2.0.0, Component v1.10.2)
              MCA rmaps: mindist (MCA v2.0.0, API v2.0.0, Component v1.10.2)
              MCA rmaps: staged (MCA v2.0.0, API v2.0.0, Component v1.10.2)
              MCA rmaps: seq (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                MCA rml: oob (MCA v2.0.0, API v2.0.0, Component v1.10.2)
             MCA routed: binomial (MCA v2.0.0, API v2.0.0, Component v1.10.2)
             MCA routed: radix (MCA v2.0.0, API v2.0.0, Component v1.10.2)
             MCA routed: direct (MCA v2.0.0, API v2.0.0, Component v1.10.2)
             MCA routed: debruijn (MCA v2.0.0, API v2.0.0, Component v1.10.2)
              MCA state: staged_orted (MCA v2.0.0, API v1.0.0, Component v1.10.2)
              MCA state: dvm (MCA v2.0.0, API v1.0.0, Component v1.10.2)
              MCA state: orted (MCA v2.0.0, API v1.0.0, Component v1.10.2)
              MCA state: tool (MCA v2.0.0, API v1.0.0, Component v1.10.2)
              MCA state: app (MCA v2.0.0, API v1.0.0, Component v1.10.2)
              MCA state: novm (MCA v2.0.0, API v1.0.0, Component v1.10.2)
              MCA state: staged_hnp (MCA v2.0.0, API v1.0.0, Component v1.10.2)
              MCA state: hnp (MCA v2.0.0, API v1.0.0, Component v1.10.2)
          MCA allocator: bucket (MCA v2.0.0, API v2.0.0, Component v1.10.2)
          MCA allocator: basic (MCA v2.0.0, API v2.0.0, Component v1.10.2)
               MCA bcol: basesmuma (MCA v2.0.0, API v2.0.0, Component v1.10.2)
               MCA bcol: ptpcoll (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                MCA bml: r2 (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                MCA btl: openib (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                MCA btl: self (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                MCA btl: tcp (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                MCA btl: sm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                MCA btl: vader (MCA v2.0.0, API v2.0.0, Component v1.10.2)
               MCA coll: tuned (MCA v2.0.0, API v2.0.0, Component v1.10.2)
               MCA coll: self (MCA v2.0.0, API v2.0.0, Component v1.10.2)
               MCA coll: basic (MCA v2.0.0, API v2.0.0, Component v1.10.2)
               MCA coll: sm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
               MCA coll: inter (MCA v2.0.0, API v2.0.0, Component v1.10.2)
               MCA coll: ml (MCA v2.0.0, API v2.0.0, Component v1.10.2)
               MCA coll: libnbc (MCA v2.0.0, API v2.0.0, Component v1.10.2)
               MCA coll: hierarch (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                MCA dpm: orte (MCA v2.0.0, API v2.0.0, Component v1.10.2)
               MCA fbtl: posix (MCA v2.0.0, API v2.0.0, Component v1.10.2)
              MCA fcoll: individual (MCA v2.0.0, API v2.0.0, Component v1.10.2)
              MCA fcoll: two_phase (MCA v2.0.0, API v2.0.0, Component v1.10.2)
              MCA fcoll: static (MCA v2.0.0, API v2.0.0, Component v1.10.2)
              MCA fcoll: dynamic (MCA v2.0.0, API v2.0.0, Component v1.10.2)
              MCA fcoll: ylib (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                 MCA fs: ufs (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                 MCA io: romio (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                 MCA io: ompio (MCA v2.0.0, API v2.0.0, Component v1.10.2)
              MCA mpool: grdma (MCA v2.0.0, API v2.0.0, Component v1.10.2)
              MCA mpool: sm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                MCA mtl: psm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                MCA osc: pt2pt (MCA v2.0.0, API v3.0.0, Component v1.10.2)
                MCA osc: sm (MCA v2.0.0, API v3.0.0, Component v1.10.2)
                MCA pml: v (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                MCA pml: ob1 (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                MCA pml: cm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                MCA pml: bfo (MCA v2.0.0, API v2.0.0, Component v1.10.2)
             MCA pubsub: orte (MCA v2.0.0, API v2.0.0, Component v1.10.2)
             MCA rcache: vma (MCA v2.0.0, API v2.0.0, Component v1.10.2)
                MCA rte: orte (MCA v2.0.0, API v2.0.0, Component v1.10.2)
               MCA sbgp: basesmuma (MCA v2.0.0, API v2.0.0, Component v1.10.2)
               MCA sbgp: p2p (MCA v2.0.0, API v2.0.0, Component v1.10.2)
               MCA sbgp: basesmsocket (MCA v2.0.0, API v2.0.0, Component v1.10.2)
           MCA sharedfp: lockedfile (MCA v2.0.0, API v2.0.0, Component v1.10.2)
           MCA sharedfp: sm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
           MCA sharedfp: individual (MCA v2.0.0, API v2.0.0, Component v1.10.2)
               MCA topo: basic (MCA v2.0.0, API v2.1.0, Component v1.10.2)
          MCA vprotocol: pessimist (MCA v2.0.0, API v2.0.0, Component v1.10.2)
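
For comparing against the --mca btl lists used above, only a few lines of this output matter. A small filter such as the following sketch can pull the component names of one framework out of the dump. The function name list_components is made up for illustration; it relies only on the "MCA <framework>: <component> (...)" line layout visible above.

```shell
# Read ompi_info output on stdin and print the component names of one
# MCA framework (default: btl), one per line.
# list_components is a hypothetical helper, not part of Open MPI.
list_components() {
    framework="${1:-btl}"
    # Keep only lines of the form "   MCA <framework>: <name> (...)"
    # and reduce them to <name>.
    sed -n "s/^ *MCA $framework: \([^ ]*\) .*/\1/p"
}

# Typical use on a node where ompi_info is in PATH:
#   ompi_info | list_components btl
```

Run against the output above, the btl framework should list openib, self, tcp, sm and vader, which confirms that both shared-memory BTLs tried in the tests are built in.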

MfG/Sincerely
Stefan Friedel
--
IWR * 4.317 * INF205 * 69120 Heidelberg
T +49 6221 5414404 * F +49 6221 5414427
stefan.frie...@iwr.uni-heidelberg.de
