Hi Ralph and Gilles,
the program only breaks if I combine "--host" and "--slot-list". Perhaps this
information is helpful. I am using a different machine now, so you can see that
the problem is not restricted to "loki".
pc03 spawn 115 ompi_info | grep -e "OPAL repo revision:" -e "C compiler
absolute:"
OPAL repo revision: v1.10.2-201-gd23dda8
C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc
pc03 spawn 116 uname -a
Linux pc03 3.12.55-52.42-default #1 SMP Thu Mar 3 10:35:46 UTC 2016 (4354e1d)
x86_64 x86_64 x86_64 GNU/Linux
pc03 spawn 117 cat host_pc03.openmpi
pc03.informatik.hs-fulda.de slots=12 max_slots=12
pc03 spawn 118 mpicc simple_spawn.c
pc03 spawn 119 mpiexec -np 1 --report-bindings a.out
[pc03:03711] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]:
[BB/../../../../..][../../../../../..]
[pid 3713] starting up!
0 completed MPI_Init
Parent [pid 3713] about to spawn!
[pc03:03711] MCW rank 0 bound to socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt
0-1]], socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core
10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: [../../../../../..][BB/BB/BB/BB/BB/BB]
[pc03:03711] MCW rank 1 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt
0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt
0-1]], socket 0[core 5[hwt 0-1]]: [BB/BB/BB/BB/BB/BB][../../../../../..]
[pc03:03711] MCW rank 2 bound to socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt
0-1]], socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core
10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: [../../../../../..][BB/BB/BB/BB/BB/BB]
[pid 3715] starting up!
[pid 3716] starting up!
[pid 3717] starting up!
Parent done with spawn
Parent sending message to child
0 completed MPI_Init
Hello from the child 0 of 3 on host pc03 pid 3715
1 completed MPI_Init
Hello from the child 1 of 3 on host pc03 pid 3716
2 completed MPI_Init
Hello from the child 2 of 3 on host pc03 pid 3717
Child 0 received msg: 38
Child 0 disconnected
Child 2 disconnected
Parent disconnected
Child 1 disconnected
3713: exiting
3715: exiting
3716: exiting
3717: exiting
pc03 spawn 120 mpiexec -np 1 --hostfile host_pc03.openmpi --slot-list
0:0-1,1:0-1 --report-bindings a.out
[pc03:03729] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt
0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]:
[BB/BB/../../../..][BB/BB/../../../..]
[pid 3731] starting up!
0 completed MPI_Init
Parent [pid 3731] about to spawn!
[pc03:03729] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt
0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]:
[BB/BB/../../../..][BB/BB/../../../..]
[pc03:03729] MCW rank 1 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt
0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]:
[BB/BB/../../../..][BB/BB/../../../..]
[pc03:03729] MCW rank 2 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt
0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]:
[BB/BB/../../../..][BB/BB/../../../..]
[pid 3733] starting up!
[pid 3734] starting up!
[pid 3735] starting up!
Parent done with spawn
Parent sending message to child
2 completed MPI_Init
Hello from the child 2 of 3 on host pc03 pid 3735
1 completed MPI_Init
Hello from the child 1 of 3 on host pc03 pid 3734
0 completed MPI_Init
Hello from the child 0 of 3 on host pc03 pid 3733
Child 0 received msg: 38
Child 0 disconnected
Child 2 disconnected
Child 1 disconnected
Parent disconnected
3731: exiting
3734: exiting
3733: exiting
3735: exiting
pc03 spawn 121 mpiexec -np 1 --host pc03 --slot-list 0:0-1,1:0-1
--report-bindings a.out
[pc03:03744] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt
0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]:
[BB/BB/../../../..][BB/BB/../../../..]
[pid 3746] starting up!
0 completed MPI_Init
Parent [pid 3746] about to spawn!
[pc03:03744] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt
0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]:
[BB/BB/../../../..][BB/BB/../../../..]
[pc03:03744] MCW rank 2 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt
0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]:
[BB/BB/../../../..][BB/BB/../../../..]
[pid 3748] starting up!
[pid 3749] starting up!
[pc03:03749] *** Process received signal ***
[pc03:03749] Signal: Segmentation fault (11)
[pc03:03749] Signal code: Address not mapped (1)
[pc03:03749] Failing at address: 0x8
[pc03:03749] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7fe6f0d1f870]
[pc03:03749] [ 1]
/usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_proc_self+0x35)[0x7fe6f0f825b0]
[pc03:03749] [ 2]
/usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_comm_init+0x68b)[0x7fe6f0f61b08]
[pc03:03749] [ 3]
/usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_mpi_init+0xa90)[0x7fe6f0f87e8a]
[pc03:03749] [ 4]
/usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(MPI_Init+0x1a0)[0x7fe6f0fc42ae]
[pc03:03749] [ 5] a.out[0x400d0c]
[pc03:03749] [ 6] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fe6f0989b05]
[pc03:03749] [ 7] a.out[0x400bf9]
[pc03:03749] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 2 with PID 3749 on node pc03 exited on signal
11 (Segmentation fault).
--------------------------------------------------------------------------
pc03 spawn 122
Kind regards
Siegmar
On 05/24/16 15:44, Ralph Castain wrote:
On May 24, 2016, at 6:21 AM, Siegmar Gross
<siegmar.gr...@informatik.hs-fulda.de> wrote:
Hi Ralph,
I have copied the relevant lines here, so that it is easier to see what
happens. "a.out" is your program, which I compiled with mpicc.
loki spawn 153 ompi_info | grep -e "OPAL repo revision:" -e "C compiler
absolute:"
OPAL repo revision: v1.10.2-201-gd23dda8
C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc
loki spawn 154 mpicc simple_spawn.c
loki spawn 155 mpiexec -np 1 a.out
[pid 24008] starting up!
0 completed MPI_Init
...
"mpiexec -np 1 a.out" works.
I don’t know what “a.out” is, but it looks like there is some memory
corruption there.
"a.out" is still your program. I get the same error on different
machines, so that it is not very likely, that the (hardware) memory
is corrupted.
loki spawn 156 mpiexec -np 1 --host loki --slot-list 0-5 a.out
[pid 24102] starting up!
0 completed MPI_Init
Parent [pid 24102] about to spawn!
[pid 24104] starting up!
[pid 24105] starting up!
[loki:24105] *** Process received signal ***
[loki:24105] Signal: Segmentation fault (11)
[loki:24105] Signal code: Address not mapped (1)
...
"mpiexec -np 1 --host loki --slot-list 0-5 a.out" breaks with a segmentation
faUlt. Can I do something, so that you can find out, what happens?
I honestly have no idea - perhaps Gilles can help, as I have no access to that
kind of environment. We aren’t seeing such problems elsewhere, so it is likely
something local.
Kind regards
Siegmar
On 05/24/16 15:07, Ralph Castain wrote:
On May 24, 2016, at 4:19 AM, Siegmar Gross
<siegmar.gr...@informatik.hs-fulda.de> wrote:
Hi Ralph,
thank you very much for your answer and your example program.
On 05/23/16 17:45, Ralph Castain wrote:
I cannot replicate the problem - both scenarios work fine for me. I'm not
convinced your test code is correct, however, as you call Comm_free on the
inter-communicator but didn't call Comm_disconnect. Check out the attached
for correct code and see if it works for you.
I thought that I would only need MPI_Comm_disconnect if I had established a
connection with MPI_Comm_connect before. The man page for MPI_Comm_free states
"This operation marks the communicator object for deallocation. The
handle is set to MPI_COMM_NULL. Any pending operations that use this
communicator will complete normally; the object is actually deallocated only
if there are no other active references to it.".
The man page for MPI_Comm_disconnect states
"MPI_Comm_disconnect waits for all pending communication on comm to complete
internally, deallocates the communicator object, and sets the handle to
MPI_COMM_NULL. It is a collective operation.".
I don't see a difference for my spawned processes, because both functions
"wait" until all pending operations have finished before the object is
destroyed. Nevertheless, perhaps my small example program has worked all these
years by chance.
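For reference, here is a minimal sketch of the spawn/disconnect pattern under
discussion. It is only an illustration of the structure, not the attached
simple_spawn.c; the parent and the children run the same binary, and all calls
are standard MPI:

#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm parent, intercomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* Parent: spawn three children that run this same executable. */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 3, MPI_INFO_NULL, 0,
                       MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
        /* Wait for pending traffic on the inter-communicator, then release it. */
        MPI_Comm_disconnect(&intercomm);
    } else {
        /* Child: release the inter-communicator to the parent. */
        MPI_Comm_disconnect(&parent);
    }

    MPI_Finalize();
    return 0;
}

As far as I understand it, the practical difference is that MPI_Comm_disconnect
also severs the connection between the two process groups, which MPI_Comm_free
alone does not do.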
However, I don't understand why my program works with
"mpiexec -np 1 --host loki,loki,loki,loki,loki spawn_master" and breaks with
"mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master". You are right,
my slot-list is equivalent to "--bind-to none". I could also have used
"mpiexec -np 1 --host loki --oversubscribe spawn_master", which works as well.
Well, you are only giving us one slot when you specify "--host loki", and then
you are trying to launch multiple processes into it. The "slot-list" option only
tells us which CPUs to bind each process to - it doesn't allocate process slots.
So you have to tell us how many processes are allowed to run on this node.
The program breaks with "There are not enough slots available in the system
to satisfy ..." if I only use "--host loki" or different host names, without
listing five host names, using "--slot-list", or using "--oversubscribe".
Unfortunately, "--host <host name>:<number of slots>" isn't available in
openmpi-1.10.3rc2 to specify the number of available slots.
Correct - we did not backport the new syntax
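For reference, the combination that did provide enough slots in the pc03 runs
above was a hostfile with an explicit slot count, passed via "--hostfile"
instead of "--host". A corresponding file and command for loki might look like
the following (the file name and slot count are only an example, assuming the
same two-socket, six-cores-per-socket layout as pc03):

cat host_loki.openmpi
loki.informatik.hs-fulda.de slots=12 max_slots=12

mpiexec -np 1 --hostfile host_loki.openmpi --slot-list 0:0-5,1:0-5 a.out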
Your program behaves the same way as mine, so MPI_Comm_disconnect
will not solve my problem. I had to modify your program in a negligible way
to get it to compile.
loki spawn 153 ompi_info | grep -e "OPAL repo revision:" -e "C compiler
absolute:"
OPAL repo revision: v1.10.2-201-gd23dda8
C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc
loki spawn 154 mpicc simple_spawn.c
loki spawn 155 mpiexec -np 1 a.out
[pid 24008] starting up!
0 completed MPI_Init
Parent [pid 24008] about to spawn!
[pid 24010] starting up!
[pid 24011] starting up!
[pid 24012] starting up!
Parent done with spawn
Parent sending message to child
0 completed MPI_Init
Hello from the child 0 of 3 on host loki pid 24010
1 completed MPI_Init
Hello from the child 1 of 3 on host loki pid 24011
2 completed MPI_Init
Hello from the child 2 of 3 on host loki pid 24012
Child 0 received msg: 38
Child 0 disconnected
Child 1 disconnected
Child 2 disconnected
Parent disconnected
24012: exiting
24010: exiting
24008: exiting
24011: exiting
Is something wrong with my command line? I haven't used slot-list before, so
I'm not sure whether I am using it in the intended way.
I don’t know what “a.out” is, but it looks like there is some memory corruption
there.
loki spawn 156 mpiexec -np 1 --host loki --slot-list 0-5 a.out
[pid 24102] starting up!
0 completed MPI_Init
Parent [pid 24102] about to spawn!
[pid 24104] starting up!
[pid 24105] starting up!
[loki:24105] *** Process received signal ***
[loki:24105] Signal: Segmentation fault (11)
[loki:24105] Signal code: Address not mapped (1)
[loki:24105] Failing at address: 0x8
[loki:24105] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7f39aa76f870]
[loki:24105] [ 1]
/usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_proc_self+0x35)[0x7f39aa9d25b0]
[loki:24105] [ 2]
/usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_comm_init+0x68b)[0x7f39aa9b1b08]
[loki:24105] [ 3] *** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[loki:24104] Local abort before MPI_INIT completed successfully; not able to
aggregate error messages, and not able to guarantee that all other processes
were killed!
/usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_mpi_init+0xa90)[0x7f39aa9d7e8a]
[loki:24105] [ 4]
/usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(MPI_Init+0x1a0)[0x7f39aaa142ae]
[loki:24105] [ 5] a.out[0x400d0c]
[loki:24105] [ 6] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f39aa3d9b05]
[loki:24105] [ 7] a.out[0x400bf9]
[loki:24105] *** End of error message ***
-------------------------------------------------------
Child job 2 terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus
causing
the job to be terminated. The first process to do so was:
Process name: [[49560,2],0]
Exit code: 1
--------------------------------------------------------------------------
loki spawn 157
Hopefully you will find out what happens. Please let me know if I can
help you in any way.
Kind regards
Siegmar
FWIW: I don't know how many cores you have on your sockets, but if you
have 6 cores/socket, then your slot-list is equivalent to "--bind-to none",
as the slot-list applies to every process being launched.
On May 23, 2016, at 6:26 AM, Siegmar Gross
<siegmar.gr...@informatik.hs-fulda.de> wrote:
Hi,
I installed openmpi-1.10.3rc2 on my "SUSE Linux Enterprise Server
12 (x86_64)" with Sun C 5.13 and gcc-6.1.0. Unfortunately I get
a segmentation fault for "--slot-list" for one of my small programs.
loki spawn 119 ompi_info | grep -e "OPAL repo revision:" -e "C compiler
absolute:"
OPAL repo revision: v1.10.2-201-gd23dda8
C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc
loki spawn 120 mpiexec -np 1 --host loki,loki,loki,loki,loki spawn_master
Parent process 0 running on loki
I create 4 slave processes
Parent process 0: tasks in MPI_COMM_WORLD: 1
tasks in COMM_CHILD_PROCESSES local group: 1
tasks in COMM_CHILD_PROCESSES remote group: 4
Slave process 0 of 4 running on loki
Slave process 1 of 4 running on loki
Slave process 2 of 4 running on loki
spawn_slave 2: argv[0]: spawn_slave
Slave process 3 of 4 running on loki
spawn_slave 0: argv[0]: spawn_slave
spawn_slave 1: argv[0]: spawn_slave
spawn_slave 3: argv[0]: spawn_slave
loki spawn 121 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
Parent process 0 running on loki
I create 4 slave processes
[loki:17326] *** Process received signal ***
[loki:17326] Signal: Segmentation fault (11)
[loki:17326] Signal code: Address not mapped (1)
[loki:17326] Failing at address: 0x8
[loki:17326] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7f4e469b3870]
[loki:17326] [ 1] *** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[loki:17324] Local abort before MPI_INIT completed successfully; not able to
aggregate error messages, and not able to guarantee that all other processes
were killed!
/usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_proc_self+0x35)[0x7f4e46c165b0]
[loki:17326] [ 2]
/usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_comm_init+0x68b)[0x7f4e46bf5b08]
[loki:17326] [ 3] *** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[loki:17325] Local abort before MPI_INIT completed successfully; not able to
aggregate error messages, and not able to guarantee that all other processes
were killed!
/usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_mpi_init+0xa90)[0x7f4e46c1be8a]
[loki:17326] [ 4]
/usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(MPI_Init+0x180)[0x7f4e46c5828e]
[loki:17326] [ 5] spawn_slave[0x40097e]
[loki:17326] [ 6] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f4e4661db05]
[loki:17326] [ 7] spawn_slave[0x400a54]
[loki:17326] *** End of error message ***
-------------------------------------------------------
Child job 2 terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status,
thus causing
the job to be terminated. The first process to do so was:
Process name: [[56340,2],0]
Exit code: 1
--------------------------------------------------------------------------
loki spawn 122
I would be grateful if somebody could fix the problem. Thank you
very much in advance for any help.
Kind regards
Siegmar
<simple_spawn_modified.c>