Hi Gilles,

thank you very much for your help. What does "incorrect slot list"
mean? My machine has two 6-core processors, so I specified
"--slot-list 0:0-5,1:0-5". Does "incorrect" mean that it isn't
allowed to specify more slots than are available, fewer slots
than are available, or more slots than the processes need?
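For context, a sketch of the invocation under discussion (the host name loki and the binary spawn_master are taken from the logs below; the "socket:core-range" reading of the slot-list syntax is my assumption here):

```shell
# Sketch only: assumes --slot-list takes "socket:core-range" entries,
# host "loki" and binary "spawn_master" as in the gdb session below.

# Two 6-core sockets: offer slots on cores 0-5 of socket 0 and socket 1.
mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master

# A narrower list, e.g. restricted to socket 0 only:
# mpiexec -np 1 --host loki --slot-list 0:0-5 spawn_master
```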


Kind regards

Siegmar

On 11.01.2017 at 10:04, Gilles Gouaillardet wrote:
Siegmar,

I was able to reproduce the issue on my vm
(No need for a real heterogeneous cluster here)

I will keep digging tomorrow.
Note that if you specify an incorrect slot list, MPI_Comm_spawn fails with a
very unfriendly error message.
Right now, the 4th spawned task crashes, so this is a different issue.

Cheers,

Gilles

r...@open-mpi.org wrote:
I think there is some relevant discussion here: 
https://github.com/open-mpi/ompi/issues/1569

It looks like Gilles had (at least at one point) a fix for master when
--enable-heterogeneous is used, but I don't know if that was committed.

On Jan 9, 2017, at 8:23 AM, Howard Pritchard <hpprit...@gmail.com> wrote:

Hi Siegmar,

You have some config parameters I wasn't trying that may have an impact.
I'll give it a try with these parameters.

This should be enough info for now,

Thanks,

Howard


2017-01-09 0:59 GMT-07:00 Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>:

    Hi Howard,

    I use the following commands to build and install the package.
    ${SYSTEM_ENV} is "Linux" and ${MACHINE_ENV} is "x86_64" for my
    Linux machine.

    mkdir openmpi-2.0.2rc3-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
    cd openmpi-2.0.2rc3-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc

    ../openmpi-2.0.2rc3/configure \
      --prefix=/usr/local/openmpi-2.0.2_64_cc \
      --libdir=/usr/local/openmpi-2.0.2_64_cc/lib64 \
      --with-jdk-bindir=/usr/local/jdk1.8.0_66/bin \
      --with-jdk-headers=/usr/local/jdk1.8.0_66/include \
      JAVA_HOME=/usr/local/jdk1.8.0_66 \
      LDFLAGS="-m64 -mt -Wl,-z -Wl,noexecstack" CC="cc" CXX="CC" FC="f95" \
      CFLAGS="-m64 -mt" CXXFLAGS="-m64" FCFLAGS="-m64" \
      CPP="cpp" CXXCPP="cpp" \
      --enable-mpi-cxx \
      --enable-mpi-cxx-bindings \
      --enable-cxx-exceptions \
      --enable-mpi-java \
      --enable-heterogeneous \
      --enable-mpi-thread-multiple \
      --with-hwloc=internal \
      --without-verbs \
      --with-wrapper-cflags="-m64 -mt" \
      --with-wrapper-cxxflags="-m64" \
      --with-wrapper-fcflags="-m64" \
      --with-wrapper-ldflags="-mt" \
      --enable-debug \
      |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_cc

    make |& tee log.make.$SYSTEM_ENV.$MACHINE_ENV.64_cc
    rm -r /usr/local/openmpi-2.0.2_64_cc.old
    mv /usr/local/openmpi-2.0.2_64_cc /usr/local/openmpi-2.0.2_64_cc.old
    make install |& tee log.make-install.$SYSTEM_ENV.$MACHINE_ENV.64_cc
    make check |& tee log.make-check.$SYSTEM_ENV.$MACHINE_ENV.64_cc


    I get a different error if I run the program with gdb.

    loki spawn 118 gdb /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec
    GNU gdb (GDB; SUSE Linux Enterprise 12) 7.11.1
    Copyright (C) 2016 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
    and "show warranty" for details.
    This GDB was configured as "x86_64-suse-linux".
    Type "show configuration" for configuration details.
    For bug reporting instructions, please see:
    <http://bugs.opensuse.org/>.
    Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.
    For help, type "help".
    Type "apropos word" to search for commands related to "word"...
    Reading symbols from /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec...done.
    (gdb) r -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
    Starting program: /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
    Missing separate debuginfos, use: zypper install glibc-debuginfo-2.24-2.3.x86_64
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib64/libthread_db.so.1".
    [New Thread 0x7ffff3b97700 (LWP 13582)]
    [New Thread 0x7ffff18a4700 (LWP 13583)]
    [New Thread 0x7ffff10a3700 (LWP 13584)]
    [New Thread 0x7fffebbba700 (LWP 13585)]
    Detaching after fork from child process 13586.

    Parent process 0 running on loki
      I create 4 slave processes

    Detaching after fork from child process 13589.
    Detaching after fork from child process 13590.
    Detaching after fork from child process 13591.
    [loki:13586] OPAL ERROR: Timeout in file ../../../../openmpi-2.0.2rc3/opal/mca/pmix/base/pmix_base_fns.c at line 193
    [loki:13586] *** An error occurred in MPI_Comm_spawn
    [loki:13586] *** reported by process [2873294849,0]
    [loki:13586] *** on communicator MPI_COMM_WORLD
    [loki:13586] *** MPI_ERR_UNKNOWN: unknown error
    [loki:13586] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
    [loki:13586] ***    and potentially your MPI job)
    [Thread 0x7fffebbba700 (LWP 13585) exited]
    [Thread 0x7ffff10a3700 (LWP 13584) exited]
    [Thread 0x7ffff18a4700 (LWP 13583) exited]
    [Thread 0x7ffff3b97700 (LWP 13582) exited]
    [Inferior 1 (process 13567) exited with code 016]
    Missing separate debuginfos, use: zypper install libpciaccess0-debuginfo-0.13.2-5.1.x86_64 libudev1-debuginfo-210-116.3.3.x86_64
    (gdb) bt
    No stack.
    (gdb)

    Do you need anything else?


    Kind regards

    Siegmar

    On 08.01.2017 at 17:02, Howard Pritchard wrote:

        Hi Siegmar,

        Could you post the configure options you used when building 2.0.2rc3?
        Maybe that will help in reproducing the segfault you are observing.

        Howard


        2017-01-07 2:30 GMT-07:00 Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>:

            Hi,

            I have installed openmpi-2.0.2rc3 on my "SUSE Linux Enterprise
            Server 12 (x86_64)" with Sun C 5.14 and gcc-6.3.0. Unfortunately,
            I still get the same error that I reported for rc2.

            I would be grateful if somebody could fix the problem before
            the final version is released. Thank you very much in advance
            for any help.


            Kind regards

            Siegmar
            _______________________________________________
            users mailing list
            users@lists.open-mpi.org
            https://rfd.newmexicoconsortium.org/mailman/listinfo/users





