Hi Gilles,
I configured Open MPI with the following command.
../openmpi-v3.x-201705250239-d5200ea/configure \
--prefix=/usr/local/openmpi-3.0.0_64_cc \
--libdir=/usr/local/openmpi-3.0.0_64_cc/lib64 \
--with-jdk-bindir=/usr/local/jdk1.8.0_66/bin \
--with-jdk-headers=/usr/local/jdk1.8.0_66/include \
JAVA_HOME=/usr/local/jdk1.8.0_66 \
LDFLAGS="-m64 -mt -Wl,-z -Wl,noexecstack -L/usr/local/lib64 -L/usr/local/cuda/lib64" \
CC="cc" CXX="CC" FC="f95" \
CFLAGS="-m64 -mt -I/usr/local/include -I/usr/local/cuda/include" \
CXXFLAGS="-m64 -I/usr/local/include -I/usr/local/cuda/include" \
FCFLAGS="-m64" \
CPP="cpp -I/usr/local/include -I/usr/local/cuda/include" \
CXXCPP="cpp -I/usr/local/include -I/usr/local/cuda/include" \
--enable-mpi-cxx \
--enable-cxx-exceptions \
--enable-mpi-java \
--with-cuda=/usr/local/cuda \
--with-valgrind=/usr/local/valgrind \
--enable-mpi-thread-multiple \
--with-hwloc=internal \
--without-verbs \
--with-wrapper-cflags="-m64 -mt" \
--with-wrapper-cxxflags="-m64" \
--with-wrapper-fcflags="-m64" \
--with-wrapper-ldflags="-mt" \
--enable-debug \
|& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_cc
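In case it helps with your suggestion below to double check that both installs use the same source and options, this is a sketch of how I would compare them (not taken from my logs; it assumes ompi_info prints the configure command line in its output):
ssh loki 'which mpiexec; ompi_info | grep -i "configure command"'
ssh exin 'which mpiexec; ompi_info | grep -i "configure command"'
Both commands should point to /usr/local/openmpi-3.0.0_64_cc and report the configure options listed above.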
Do you know when the fixes pending in the big ORTE update PR will be committed? Perhaps Ralph has a point in suggesting not to spend time on the problem if it may already be resolved. Nevertheless, I have added the requested information after the commands below.
On 31.05.2017 at 04:43, Gilles Gouaillardet wrote:
Ralph,
the issue Siegmar initially reported was
loki hello_1 111 mpiexec -np 3 --host loki:2,exin hello_1_mpi
per what you wrote, this should be equivalent to
loki hello_1 111 mpiexec -np 3 --host loki:2,exin:1 hello_1_mpi
and this is what i initially wanted to double check (but i made a typo in my
reply)
anyway, the logs Siegmar posted indicate the two commands produce the same
output
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 3 slots
that were requested by the application:
hello_1_mpi
Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
to me, this is incorrect, since the command line made 3 slots available.
also, i am unable to reproduce any of these issues :-(
Siegmar,
can you please post your configure command line, and try these commands from
loki
mpiexec -np 3 --host loki:2,exin --mca plm_base_verbose 5 hostname
loki hello_1 112 mpiexec -np 3 --host loki:2,exin --mca plm_base_verbose 5
hostname
[loki:25620] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[loki:25620] plm:base:set_hnp_name: initial bias 25620 nodename hash 3121685933
[loki:25620] plm:base:set_hnp_name: final jobfam 64424
[loki:25620] [[64424,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[loki:25620] [[64424,0],0] plm:base:receive start comm
[loki:25620] [[64424,0],0] plm:base:setup_job
[loki:25620] [[64424,0],0] plm:base:setup_vm
[loki:25620] [[64424,0],0] plm:base:setup_vm creating map
[loki:25620] [[64424,0],0] setup:vm: working unmanaged allocation
[loki:25620] [[64424,0],0] using dash_host
[loki:25620] [[64424,0],0] checking node loki
[loki:25620] [[64424,0],0] ignoring myself
[loki:25620] [[64424,0],0] checking node exin
[loki:25620] [[64424,0],0] plm:base:setup_vm add new daemon [[64424,0],1]
[loki:25620] [[64424,0],0] plm:base:setup_vm assigning new daemon [[64424,0],1]
to node exin
[loki:25620] [[64424,0],0] plm:rsh: launching vm
[loki:25620] [[64424,0],0] plm:rsh: local shell: 2 (tcsh)
[loki:25620] [[64424,0],0] plm:rsh: assuming same remote shell as local shell
[loki:25620] [[64424,0],0] plm:rsh: remote shell: 2 (tcsh)
[loki:25620] [[64424,0],0] plm:rsh: final template argv:
/usr/bin/ssh <template> orted -mca ess "env" -mca ess_base_jobid "4222091264" -mca ess_base_vpid "<template>" -mca ess_base_num_procs "2" -mca
orte_hnp_uri "4222091264.0;tcp://193.174.24.40:38978" -mca orte_node_regex "loki,exin" --mca plm_base_verbose "5" -mca plm "rsh"
[loki:25620] [[64424,0],0] plm:rsh:launch daemon 0 not a child of mine
[loki:25620] [[64424,0],0] plm:rsh: adding node exin to launch list
[loki:25620] [[64424,0],0] plm:rsh: activating launch event
[loki:25620] [[64424,0],0] plm:rsh: recording launch of daemon [[64424,0],1]
[loki:25620] [[64424,0],0] plm:rsh: executing: (/usr/bin/ssh) [/usr/bin/ssh exin orted -mca ess "env" -mca ess_base_jobid "4222091264" -mca ess_base_vpid 1
-mca ess_base_num_procs "2" -mca orte_hnp_uri "4222091264.0;tcp://193.174.24.40:38978" -mca orte_node_regex "loki,exin" --mca plm_base_verbose "5" -mca plm
"rsh"]
[exin:19816] [[64424,0],1] plm:rsh_lookup on agent ssh : rsh path NULL
[exin:19816] [[64424,0],1] plm:rsh_setup on agent ssh : rsh path NULL
[exin:19816] [[64424,0],1] plm:base:receive start comm
[loki:25620] [[64424,0],0] plm:base:orted_report_launch from daemon
[[64424,0],1]
[loki:25620] [[64424,0],0] plm:base:orted_report_launch from daemon
[[64424,0],1] on node exin
[loki:25620] [[64424,0],0] RECEIVED TOPOLOGY SIG
0N:2S:0L3:12L2:24L1:12C:24H:x86_64 FROM NODE exin
[loki:25620] [[64424,0],0] NEW TOPOLOGY - ADDING
[loki:25620] [[64424,0],0] plm:base:orted_report_launch completed for daemon
[[64424,0],1] at contact 4222091264.1;tcp://192.168.75.71:49169
[loki:25620] [[64424,0],0] plm:base:orted_report_launch recvd 2 of 2 reported
daemons
[loki:25620] [[64424,0],0] complete_setup on job [64424,1]
[loki:25620] [[64424,0],0] plm:base:launch_apps for job [64424,1]
[exin:19816] [[64424,0],1] plm:rsh: remote spawn called
[exin:19816] [[64424,0],1] plm:rsh: remote spawn - have no children!
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 3 slots
that were requested by the application:
hostname
Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
[loki:25620] [[64424,0],0] plm:base:orted_cmd sending orted_exit commands
[exin:19816] [[64424,0],1] plm:base:receive stop comm
[loki:25620] [[64424,0],0] plm:base:receive stop comm
loki hello_1 112
mpiexec -np 1 --host exin --mca plm_base_verbose 5 hostname
loki hello_1 113 mpiexec -np 1 --host exin --mca plm_base_verbose 5 hostname
[loki:25750] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[loki:25750] plm:base:set_hnp_name: initial bias 25750 nodename hash 3121685933
[loki:25750] plm:base:set_hnp_name: final jobfam 64298
[loki:25750] [[64298,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[loki:25750] [[64298,0],0] plm:base:receive start comm
[loki:25750] [[64298,0],0] plm:base:setup_job
[loki:25750] [[64298,0],0] plm:base:setup_vm
[loki:25750] [[64298,0],0] plm:base:setup_vm creating map
[loki:25750] [[64298,0],0] setup:vm: working unmanaged allocation
[loki:25750] [[64298,0],0] using dash_host
[loki:25750] [[64298,0],0] checking node exin
[loki:25750] [[64298,0],0] plm:base:setup_vm add new daemon [[64298,0],1]
[loki:25750] [[64298,0],0] plm:base:setup_vm assigning new daemon [[64298,0],1]
to node exin
[loki:25750] [[64298,0],0] plm:rsh: launching vm
[loki:25750] [[64298,0],0] plm:rsh: local shell: 2 (tcsh)
[loki:25750] [[64298,0],0] plm:rsh: assuming same remote shell as local shell
[loki:25750] [[64298,0],0] plm:rsh: remote shell: 2 (tcsh)
[loki:25750] [[64298,0],0] plm:rsh: final template argv:
/usr/bin/ssh <template> orted -mca ess "env" -mca ess_base_jobid "4213833728" -mca ess_base_vpid "<template>" -mca ess_base_num_procs "2" -mca
orte_hnp_uri "4213833728.0;tcp://193.174.24.40:53840" -mca orte_node_regex "loki,exin" --mca plm_base_verbose "5" -mca plm "rsh"
[loki:25750] [[64298,0],0] plm:rsh:launch daemon 0 not a child of mine
[loki:25750] [[64298,0],0] plm:rsh: adding node exin to launch list
[loki:25750] [[64298,0],0] plm:rsh: activating launch event
[loki:25750] [[64298,0],0] plm:rsh: recording launch of daemon [[64298,0],1]
[loki:25750] [[64298,0],0] plm:rsh: executing: (/usr/bin/ssh) [/usr/bin/ssh exin orted -mca ess "env" -mca ess_base_jobid "4213833728" -mca ess_base_vpid 1
-mca ess_base_num_procs "2" -mca orte_hnp_uri "4213833728.0;tcp://193.174.24.40:53840" -mca orte_node_regex "loki,exin" --mca plm_base_verbose "5" -mca plm
"rsh"]
[exin:19978] [[64298,0],1] plm:rsh_lookup on agent ssh : rsh path NULL
[exin:19978] [[64298,0],1] plm:rsh_setup on agent ssh : rsh path NULL
[exin:19978] [[64298,0],1] plm:base:receive start comm
[loki:25750] [[64298,0],0] plm:base:orted_report_launch from daemon
[[64298,0],1]
[loki:25750] [[64298,0],0] plm:base:orted_report_launch from daemon
[[64298,0],1] on node exin
[loki:25750] [[64298,0],0] RECEIVED TOPOLOGY SIG
0N:2S:0L3:12L2:24L1:12C:24H:x86_64 FROM NODE exin
[loki:25750] [[64298,0],0] NEW TOPOLOGY - ADDING
[loki:25750] [[64298,0],0] plm:base:orted_report_launch completed for daemon
[[64298,0],1] at contact 4213833728.1;tcp://192.168.75.71:56878
[loki:25750] [[64298,0],0] plm:base:orted_report_launch recvd 2 of 2 reported
daemons
[loki:25750] [[64298,0],0] plm:base:setting slots for node loki by cores
[loki:25750] [[64298,0],0] complete_setup on job [64298,1]
[loki:25750] [[64298,0],0] plm:base:launch_apps for job [64298,1]
[exin:19978] [[64298,0],1] plm:rsh: remote spawn called
[exin:19978] [[64298,0],1] plm:rsh: remote spawn - have no children!
[loki:25750] [[64298,0],0] plm:base:receive processing msg
[loki:25750] [[64298,0],0] plm:base:receive update proc state command from
[[64298,0],1]
[loki:25750] [[64298,0],0] plm:base:receive got update_proc_state for job
[64298,1]
[loki:25750] [[64298,0],0] plm:base:receive got update_proc_state for vpid 0
state RUNNING exit_code 0
[loki:25750] [[64298,0],0] plm:base:receive done processing commands
[loki:25750] [[64298,0],0] plm:base:launch wiring up iof for job [64298,1]
[loki:25750] [[64298,0],0] plm:base:launch job [64298,1] is not a dynamic spawn
exin
[loki:25750] [[64298,0],0] plm:base:receive processing msg
[loki:25750] [[64298,0],0] plm:base:receive update proc state command from
[[64298,0],1]
[loki:25750] [[64298,0],0] plm:base:receive got update_proc_state for job
[64298,1]
[loki:25750] [[64298,0],0] plm:base:receive got update_proc_state for vpid 0
state NORMALLY TERMINATED exit_code 0
[loki:25750] [[64298,0],0] plm:base:receive done processing commands
[loki:25750] [[64298,0],0] plm:base:orted_cmd sending orted_exit commands
[exin:19978] [[64298,0],1] plm:base:receive stop comm
[loki:25750] [[64298,0],0] plm:base:receive stop comm
loki hello_1 113
mpiexec -np 1 --host exin ldd ./hello_1_mpi
I have to adapt the path because the executables are not in the current
directory (due to my old heterogeneous environment).
loki hello_1 169 mpiexec -np 1 --host exin which -a hello_1_mpi
/home/fd1026/Linux/x86_64/bin/hello_1_mpi
loki hello_1 165 mpiexec -np 1 --host exin ldd
$HOME/Linux/x86_64/bin/hello_1_mpi
linux-vdso.so.1 (0x00007ffc81ffb000)
libmpi.so.0 => /usr/local/openmpi-3.0.0_64_cc/lib64/libmpi.so.0
(0x00007f7e242ac000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f7e2408f000)
libc.so.6 => /lib64/libc.so.6 (0x00007f7e23cec000)
libopen-rte.so.0 =>
/usr/local/openmpi-3.0.0_64_cc/lib64/libopen-rte.so.0 (0x00007f7e23569000)
libopen-pal.so.0 =>
/usr/local/openmpi-3.0.0_64_cc/lib64/libopen-pal.so.0 (0x00007f7e22ddd000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f7e22bd9000)
libnuma.so.1 => /usr/local/lib64/libnuma.so.1 (0x00007f7e229cc000)
libudev.so.1 => /usr/lib64/libudev.so.1 (0x00007f7e227ac000)
libpciaccess.so.0 => /usr/lib64/libpciaccess.so.0 (0x00007f7e225a2000)
libnvidia-ml.so.1 => /usr/lib64/libnvidia-ml.so.1 (0x00007f7e22262000)
librt.so.1 => /lib64/librt.so.1 (0x00007f7e2205a000)
libm.so.6 => /lib64/libm.so.6 (0x00007f7e21d5d000)
libutil.so.1 => /lib64/libutil.so.1 (0x00007f7e21b59000)
libz.so.1 => /lib64/libz.so.1 (0x00007f7e21943000)
/lib64/ld-linux-x86-64.so.2 (0x0000555d41aa1000)
libselinux.so.1 => /lib64/libselinux.so.1 (0x00007f7e2171c000)
libcap.so.2 => /lib64/libcap.so.2 (0x00007f7e21517000)
libresolv.so.2 => /lib64/libresolv.so.2 (0x00007f7e21300000)
libpcre.so.1 => /usr/lib64/libpcre.so.1 (0x00007f7e21090000)
loki hello_1 166
Thank you very much for your help
Siegmar
if Open MPI is not installed on a shared filesystem (NFS for example), please
also double check that both installs were built from the same source and with the same options
Cheers,
Gilles
On 5/30/2017 10:20 PM, r...@open-mpi.org wrote:
This behavior is as-expected. When you specify "-host foo,bar", you have told us to assign one slot to each of those nodes. Thus, running 3 procs exceeds
the number of slots you assigned.
You can tell it to set the #slots to the #cores it discovers on the node by
using "-host foo:*,bar:*".
I cannot replicate your behavior of "-np 3 -host foo:2,bar:3" running more than
3 procs.
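To summarize the three forms described above (a sketch only; the host names are the ones from this thread, and the ":*" form is quoted so the shell does not try to expand it):
mpiexec -np 2 --host loki,exin hello_1_mpi           # one slot per listed host
mpiexec -np 3 --host loki:2,exin:1 hello_1_mpi       # explicit slot counts per host
mpiexec -np 3 --host "loki:*,exin:*" hello_1_mpi     # slots set to the cores detected on each host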
On May 30, 2017, at 5:24 AM, Siegmar Gross
<siegmar.gr...@informatik.hs-fulda.de> wrote:
Hi Gilles,
what if you run
mpiexec --host loki:1,exin:1 -np 3 hello_1_mpi
I need as many slots as processes, so I use "-np 2".
"mpiexec --host loki,exin -np 2 hello_1_mpi" works as well. The command
breaks if I use at least "-np 3" and distribute the processes across at
least two machines.
loki hello_1 118 mpiexec --host loki:1,exin:1 -np 2 hello_1_mpi
Process 0 of 2 running on loki
Process 1 of 2 running on exin
Now 1 slave tasks are sending greetings.
Greetings from task 1:
message type: 3
msg length: 131 characters
message:
hostname: exin
operating system: Linux
release: 4.4.49-92.11-default
processor: x86_64
loki hello_1 119
are loki and exin different? (OS, sockets, cores)
Yes, loki is a real machine and exin is a virtual one. "exin" uses a newer
kernel.
loki fd1026 108 uname -a
Linux loki 4.4.38-93-default #1 SMP Wed Dec 14 12:59:43 UTC 2016 (2d3e9d4)
x86_64 x86_64 x86_64 GNU/Linux
loki fd1026 109 ssh exin uname -a
Linux exin 4.4.49-92.11-default #1 SMP Fri Feb 17 08:29:30 UTC 2017 (8f9478a)
x86_64 x86_64 x86_64 GNU/Linux
loki fd1026 110
The number of sockets and cores is identical, but the processor types are
different as you can see at the end of my previous email. "loki" uses two
"Intel(R) Xeon(R) CPU E5-2620 v3" processors and "exin" two "Intel Core
Processor (Haswell, no TSX)" from QEMU. I can provide a pdf file with both
topologies (89 K) if you are interested in the output from lstopo. I've
added some runs. Most interesting in my opinion are the last two
"mpiexec --host exin:2,loki:3 -np 3 hello_1_mpi" and
"mpiexec -np 3 --host exin:2,loki:3 hello_1_mpi".
Why does mpiexec create five processes although I've asked for only three
processes? Why do I have to break the program with <Ctrl-c> for the first
of the above commands?
loki hello_1 110 mpiexec --host loki:2,exin:1 -np 3 hello_1_mpi
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 3 slots
that were requested by the application:
hello_1_mpi
Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
loki hello_1 111 mpiexec --host exin:3 -np 3 hello_1_mpi
Process 0 of 3 running on exin
Process 1 of 3 running on exin
Process 2 of 3 running on exin
...
loki hello_1 115 mpiexec --host exin:2,loki:3 -np 3 hello_1_mpi
Process 1 of 3 running on loki
Process 0 of 3 running on loki
Process 2 of 3 running on loki
...
Process 0 of 3 running on exin
Process 1 of 3 running on exin
[exin][[52173,1],1][../../../../../openmpi-v3.x-201705250239-d5200ea/opal/mca/btl/tcp/btl_tcp_endpoint.c:794:mca_btl_tcp_endpoint_complete_connect]
connect() to 193.xxx.xxx.xxx failed: Connection refused (111)
^Cloki hello_1 116
loki hello_1 116 mpiexec -np 3 --host exin:2,loki:3 hello_1_mpi
Process 0 of 3 running on loki
Process 2 of 3 running on loki
Process 1 of 3 running on loki
...
Process 1 of 3 running on exin
Process 0 of 3 running on exin
[exin][[51638,1],1][../../../../../openmpi-v3.x-201705250239-d5200ea/opal/mca/btl/tcp/btl_tcp_endpoint.c:590:mca_btl_tcp_endpoint_recv_blocking] recv(16,
0/8) failed: Connection reset by peer (104)
[exin:31909]
../../../../../openmpi-v3.x-201705250239-d5200ea/ompi/mca/pml/ob1/pml_ob1_sendreq.c:191
FATAL
loki hello_1 117
Do you need anything else?
Kind regards and thank you very much for your help
Siegmar
Cheers,
Gilles
----- Original Message -----
Hi,
I have installed openmpi-v3.x-201705250239-d5200ea on my "SUSE Linux
Enterprise Server 12.2 (x86_64)" with Sun C 5.14 and gcc-7.1.0.
Depending on the machine that I use to start my processes, I have
a problem with "--host" for versions "v3.x" and "master", while
everything works as expected with earlier versions.
loki hello_1 111 mpiexec -np 3 --host loki:2,exin hello_1_mpi
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 3 slots
that were requested by the application:
hello_1_mpi
Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
Everything is ok if I use the same command on "exin".
exin fd1026 107 mpiexec -np 3 --host loki:2,exin hello_1_mpi
Process 0 of 3 running on loki
Process 1 of 3 running on loki
Process 2 of 3 running on exin
...
Everything is also ok if I use openmpi-v2.x-201705260340-58c6b3c on "loki".
loki hello_1 114 which mpiexec
/usr/local/openmpi-2.1.2_64_cc/bin/mpiexec
loki hello_1 115 mpiexec -np 3 --host loki:2,exin hello_1_mpi
Process 0 of 3 running on loki
Process 1 of 3 running on loki
Process 2 of 3 running on exin
...
"exin" is a virtual machine on QEMU so that it uses a slightly
different
processor architecture, e.g., it has no L3 cache but larger L2 caches.
loki fd1026 117 cat /proc/cpuinfo | grep -e "model name" -e "physical
id" -e
"cpu cores" -e "cache size" | sort | uniq
cache size : 15360 KB
cpu cores : 6
model name : Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
physical id : 0
physical id : 1
loki fd1026 118 ssh exin cat /proc/cpuinfo | grep -e "model name" -e "physical id" -e "cpu cores" -e "cache size" | sort | uniq
cache size : 4096 KB
cpu cores : 6
model name : Intel Core Processor (Haswell, no TSX)
physical id : 0
physical id : 1
Any ideas what's different in the newer versions of Open MPI? Is the new
behavior intended? I would be grateful if somebody could fix the problem,
if "mpiexec -np 3 --host loki:2,exin hello_1_mpi" is supposed to print my
messages in versions "3.x" and "master" as well, regardless of the machine
on which the programs are started. Do you need anything else? Thank you
very much for any help in advance.
Kind regards
Siegmar
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users