Hi Gilles,
thank you very much for your help. What exactly does an incorrect slot
list mean? My machine has two 6-core processors, so I specified
"--slot-list 0:0-5,1:0-5". Does "incorrect" mean that it isn't
allowed to specify more slots than are available, fewer slots than
are available, or more slots than are needed for the processes?
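For illustration, the two invocations I am comparing would be something
like the following (the first is what I actually ran; the second is only
my guess at a "minimal" list with one slot per process, i.e. five cores
for one master and four slaves):

mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
mpiexec -np 1 --host loki --slot-list 0:0-4 spawn_master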
Kind regards
Siegmar
On 11.01.2017 at 10:04, Gilles Gouaillardet wrote:
Siegmar,
I was able to reproduce the issue on my VM
(no need for a real heterogeneous cluster here).
I will keep digging tomorrow.
Note that if you specify an incorrect slot list, MPI_Comm_spawn fails with a
very unfriendly error message.
Right now, the 4th spawned task crashes, so this is a different issue.
Cheers,
Gilles
r...@open-mpi.org wrote:
I think there is some relevant discussion here:
https://github.com/open-mpi/ompi/issues/1569
It looks like Gilles had (at least at one point) a fix for master when
building with --enable-heterogeneous, but I don't know if that was committed.
On Jan 9, 2017, at 8:23 AM, Howard Pritchard <hpprit...@gmail.com> wrote:
HI Siegmar,
You have some config parameters I wasn't trying that may have some impact.
I'll give it a try with these parameters.
This should be enough info for now,
Thanks,
Howard
2017-01-09 0:59 GMT-07:00 Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>:
Hi Howard,
I use the following commands to build and install the package.
${SYSTEM_ENV} is "Linux" and ${MACHINE_ENV} is "x86_64" for my
Linux machine.
mkdir openmpi-2.0.2rc3-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
cd openmpi-2.0.2rc3-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
../openmpi-2.0.2rc3/configure \
--prefix=/usr/local/openmpi-2.0.2_64_cc \
--libdir=/usr/local/openmpi-2.0.2_64_cc/lib64 \
--with-jdk-bindir=/usr/local/jdk1.8.0_66/bin \
--with-jdk-headers=/usr/local/jdk1.8.0_66/include \
JAVA_HOME=/usr/local/jdk1.8.0_66 \
LDFLAGS="-m64 -mt -Wl,-z -Wl,noexecstack" CC="cc" CXX="CC" FC="f95" \
CFLAGS="-m64 -mt" CXXFLAGS="-m64" FCFLAGS="-m64" \
CPP="cpp" CXXCPP="cpp" \
--enable-mpi-cxx \
--enable-mpi-cxx-bindings \
--enable-cxx-exceptions \
--enable-mpi-java \
--enable-heterogeneous \
--enable-mpi-thread-multiple \
--with-hwloc=internal \
--without-verbs \
--with-wrapper-cflags="-m64 -mt" \
--with-wrapper-cxxflags="-m64" \
--with-wrapper-fcflags="-m64" \
--with-wrapper-ldflags="-mt" \
--enable-debug \
|& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_cc
make |& tee log.make.$SYSTEM_ENV.$MACHINE_ENV.64_cc
rm -r /usr/local/openmpi-2.0.2_64_cc.old
mv /usr/local/openmpi-2.0.2_64_cc /usr/local/openmpi-2.0.2_64_cc.old
make install |& tee log.make-install.$SYSTEM_ENV.$MACHINE_ENV.64_cc
make check |& tee log.make-check.$SYSTEM_ENV.$MACHINE_ENV.64_cc
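In case it is useful, I believe the result of the heterogeneous configure
option can be cross-checked with ompi_info, which reports whether
heterogeneous support was compiled in (the grep pattern below is just my
guess at the relevant line of its output):

/usr/local/openmpi-2.0.2_64_cc/bin/ompi_info | grep -i hetero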
I get a different error if I run the program with gdb.
loki spawn 118 gdb /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec
GNU gdb (GDB; SUSE Linux Enterprise 12) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-suse-linux".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://bugs.opensuse.org/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec...done.
(gdb) r -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
Starting program: /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec -np 1 --host
loki --slot-list 0:0-5,1:0-5 spawn_master
Missing separate debuginfos, use: zypper install
glibc-debuginfo-2.24-2.3.x86_64
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff3b97700 (LWP 13582)]
[New Thread 0x7ffff18a4700 (LWP 13583)]
[New Thread 0x7ffff10a3700 (LWP 13584)]
[New Thread 0x7fffebbba700 (LWP 13585)]
Detaching after fork from child process 13586.
Parent process 0 running on loki
I create 4 slave processes
Detaching after fork from child process 13589.
Detaching after fork from child process 13590.
Detaching after fork from child process 13591.
[loki:13586] OPAL ERROR: Timeout in file
../../../../openmpi-2.0.2rc3/opal/mca/pmix/base/pmix_base_fns.c at line 193
[loki:13586] *** An error occurred in MPI_Comm_spawn
[loki:13586] *** reported by process [2873294849,0]
[loki:13586] *** on communicator MPI_COMM_WORLD
[loki:13586] *** MPI_ERR_UNKNOWN: unknown error
[loki:13586] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will
now abort,
[loki:13586] *** and potentially your MPI job)
[Thread 0x7fffebbba700 (LWP 13585) exited]
[Thread 0x7ffff10a3700 (LWP 13584) exited]
[Thread 0x7ffff18a4700 (LWP 13583) exited]
[Thread 0x7ffff3b97700 (LWP 13582) exited]
[Inferior 1 (process 13567) exited with code 016]
Missing separate debuginfos, use: zypper install
libpciaccess0-debuginfo-0.13.2-5.1.x86_64 libudev1-debuginfo-210-116.3.3.x86_64
(gdb) bt
No stack.
(gdb)
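For context, the spawn in my spawn_master is essentially equivalent to the
following minimal sketch (not my exact program; the slave executable name
"spawn_slave" is a placeholder):

#include <stdio.h>
#include <mpi.h>

#define NUM_SLAVES 4

int main (int argc, char *argv[])
{
  int rank, len;
  char host[MPI_MAX_PROCESSOR_NAME];
  MPI_Comm intercomm;

  MPI_Init (&argc, &argv);
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);
  MPI_Get_processor_name (host, &len);
  printf ("Parent process %d running on %s\n", rank, host);
  printf ("I create %d slave processes\n", NUM_SLAVES);

  /* spawn the slaves; errors are fatal by default, which matches
     the MPI_ERRORS_ARE_FATAL abort in the log above */
  MPI_Comm_spawn ("spawn_slave", MPI_ARGV_NULL, NUM_SLAVES,
                  MPI_INFO_NULL, 0, MPI_COMM_WORLD,
                  &intercomm, MPI_ERRCODES_IGNORE);

  MPI_Comm_disconnect (&intercomm);
  MPI_Finalize ();
  return 0;
}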
Do you need anything else?
Kind regards
Siegmar
On 08.01.2017 at 17:02, Howard Pritchard wrote:
HI Siegmar,
Could you post the configure options you use when building 2.0.2rc3?
Maybe that will help in trying to reproduce the segfault you are
observing.
Howard
2017-01-07 2:30 GMT-07:00 Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>:
Hi,
I have installed openmpi-2.0.2rc3 on my "SUSE Linux Enterprise
Server 12 (x86_64)" machine with Sun C 5.14 and gcc-6.3.0. Unfortunately,
I still get the same error that I reported for rc2.
I would be grateful if somebody could fix the problem before
the final version is released. Thank you very much for any help
in advance.
Kind regards
Siegmar
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users