Hi Ralph,
thank you very much for your answer and your example program.
On 05/23/16 17:45, Ralph Castain wrote:
I cannot replicate the problem - both scenarios work fine for me. I’m not
convinced your test code is correct, however, as you call Comm_free on the
inter-communicator but didn’t call Comm_disconnect. Check out the attached
for correct code and see if it works for you.
I thought that I would only need MPI_Comm_disconnect if I had established a
connection with MPI_Comm_connect before. The man page for MPI_Comm_free states
"This operation marks the communicator object for deallocation. The
handle is set to MPI_COMM_NULL. Any pending operations that use this
communicator will complete normally; the object is actually deallocated only
if there are no other active references to it.".
The man page for MPI_Comm_disconnect states
"MPI_Comm_disconnect waits for all pending communication on comm to complete
internally, deallocates the communicator object, and sets the handle to
MPI_COMM_NULL. It is a collective operation.".
I don't see a difference for my spawned processes, because both functions
"wait" until all pending operations have finished before the object is
destroyed. Nevertheless, perhaps my small example program has only worked all
these years by chance.
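To make the comparison concrete, here is a minimal sketch of the parent-side cleanup being discussed (the helper name `cleanup` is mine; the handle comes from MPI_Comm_spawn as in the attached simple_spawn.c). Only which of the two calls is used differs:

```c
/* Sketch only: contrasting the two cleanup calls on the
 * inter-communicator returned by MPI_Comm_spawn. */
#include <mpi.h>

static void cleanup(MPI_Comm *child)
{
    /* Collective over both groups: waits for all pending communication
     * on the inter-communicator, deallocates the object, and sets the
     * handle to MPI_COMM_NULL. */
    MPI_Comm_disconnect(child);

    /* Alternative under discussion: marks the object for deallocation
     * (pending operations still complete normally) and likewise sets
     * the handle to MPI_COMM_NULL.  Use one call or the other, not both. */
    /* MPI_Comm_free(child); */
}
```

Either way the handle ends up as MPI_COMM_NULL, which is why the two calls look interchangeable from the spawned processes' point of view.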
However, I don't understand why my program works with
"mpiexec -np 1 --host loki,loki,loki,loki,loki spawn_master" and breaks with
"mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master". You are right,
my slot-list is equivalent to "-bind-to none". I could also have used
"mpiexec -np 1 --host loki --oversubscribe spawn_master", which works as well.
The program breaks with "There are not enough slots available in the system
to satisfy ..." if I only use "--host loki" or different host names, without
mentioning five host names, using "slot-list", or using "oversubscribe".
Unfortunately, "--host <host name>:<number of slots>" isn't available in
openmpi-1.10.3rc2 to specify the number of available slots.
Your program behaves the same way as mine, so MPI_Comm_disconnect will not
solve my problem. I had to modify your program in a negligible way to get it
to compile.
loki spawn 153 ompi_info | grep -e "OPAL repo revision:" -e "C compiler
absolute:"
OPAL repo revision: v1.10.2-201-gd23dda8
C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc
loki spawn 154 mpicc simple_spawn.c
loki spawn 155 mpiexec -np 1 a.out
[pid 24008] starting up!
0 completed MPI_Init
Parent [pid 24008] about to spawn!
[pid 24010] starting up!
[pid 24011] starting up!
[pid 24012] starting up!
Parent done with spawn
Parent sending message to child
0 completed MPI_Init
Hello from the child 0 of 3 on host loki pid 24010
1 completed MPI_Init
Hello from the child 1 of 3 on host loki pid 24011
2 completed MPI_Init
Hello from the child 2 of 3 on host loki pid 24012
Child 0 received msg: 38
Child 0 disconnected
Child 1 disconnected
Child 2 disconnected
Parent disconnected
24012: exiting
24010: exiting
24008: exiting
24011: exiting
Is something wrong with my command line? I haven't used slot-list before, so
I'm not sure whether I'm using it in the intended way.
loki spawn 156 mpiexec -np 1 --host loki --slot-list 0-5 a.out
[pid 24102] starting up!
0 completed MPI_Init
Parent [pid 24102] about to spawn!
[pid 24104] starting up!
[pid 24105] starting up!
[loki:24105] *** Process received signal ***
[loki:24105] Signal: Segmentation fault (11)
[loki:24105] Signal code: Address not mapped (1)
[loki:24105] Failing at address: 0x8
[loki:24105] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7f39aa76f870]
[loki:24105] [ 1]
/usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_proc_self+0x35)[0x7f39aa9d25b0]
[loki:24105] [ 2]
/usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_comm_init+0x68b)[0x7f39aa9b1b08]
[loki:24105] [ 3] *** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[loki:24104] Local abort before MPI_INIT completed successfully; not able to
aggregate error messages, and not able to guarantee that all other processes
were killed!
/usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_mpi_init+0xa90)[0x7f39aa9d7e8a]
[loki:24105] [ 4]
/usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(MPI_Init+0x1a0)[0x7f39aaa142ae]
[loki:24105] [ 5] a.out[0x400d0c]
[loki:24105] [ 6] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f39aa3d9b05]
[loki:24105] [ 7] a.out[0x400bf9]
[loki:24105] *** End of error message ***
-------------------------------------------------------
Child job 2 terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus
causing
the job to be terminated. The first process to do so was:
Process name: [[49560,2],0]
Exit code: 1
--------------------------------------------------------------------------
loki spawn 157
Hopefully you will find out what happens. Please let me know if I can
help you in any way.
Kind regards
Siegmar
FWIW: I don’t know how many cores you have on your sockets, but if you
have 6 cores/socket, then your slot-list is equivalent to "--bind-to none",
as the slot-list applies to every process being launched.
On May 23, 2016, at 6:26 AM, Siegmar Gross
<siegmar.gr...@informatik.hs-fulda.de> wrote:
Hi,
I installed openmpi-1.10.3rc2 on my "SUSE Linux Enterprise Server
12 (x86_64)" with Sun C 5.13 and gcc-6.1.0. Unfortunately I get
a segmentation fault with "--slot-list" for one of my small programs.
loki spawn 119 ompi_info | grep -e "OPAL repo revision:" -e "C compiler
absolute:"
OPAL repo revision: v1.10.2-201-gd23dda8
C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc
loki spawn 120 mpiexec -np 1 --host loki,loki,loki,loki,loki spawn_master
Parent process 0 running on loki
I create 4 slave processes
Parent process 0: tasks in MPI_COMM_WORLD: 1
tasks in COMM_CHILD_PROCESSES local group: 1
tasks in COMM_CHILD_PROCESSES remote group: 4
Slave process 0 of 4 running on loki
Slave process 1 of 4 running on loki
Slave process 2 of 4 running on loki
spawn_slave 2: argv[0]: spawn_slave
Slave process 3 of 4 running on loki
spawn_slave 0: argv[0]: spawn_slave
spawn_slave 1: argv[0]: spawn_slave
spawn_slave 3: argv[0]: spawn_slave
loki spawn 121 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
Parent process 0 running on loki
I create 4 slave processes
[loki:17326] *** Process received signal ***
[loki:17326] Signal: Segmentation fault (11)
[loki:17326] Signal code: Address not mapped (1)
[loki:17326] Failing at address: 0x8
[loki:17326] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7f4e469b3870]
[loki:17326] [ 1] *** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[loki:17324] Local abort before MPI_INIT completed successfully; not able to
aggregate error messages, and not able to guarantee that all other processes
were killed!
/usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_proc_self+0x35)[0x7f4e46c165b0]
[loki:17326] [ 2]
/usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_comm_init+0x68b)[0x7f4e46bf5b08]
[loki:17326] [ 3] *** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[loki:17325] Local abort before MPI_INIT completed successfully; not able to
aggregate error messages, and not able to guarantee that all other processes
were killed!
/usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_mpi_init+0xa90)[0x7f4e46c1be8a]
[loki:17326] [ 4]
/usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(MPI_Init+0x180)[0x7f4e46c5828e]
[loki:17326] [ 5] spawn_slave[0x40097e]
[loki:17326] [ 6] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f4e4661db05]
[loki:17326] [ 7] spawn_slave[0x400a54]
[loki:17326] *** End of error message ***
-------------------------------------------------------
Child job 2 terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus
causing
the job to be terminated. The first process to do so was:
Process name: [[56340,2],0]
Exit code: 1
--------------------------------------------------------------------------
loki spawn 122
I would be grateful if somebody could fix the problem. Thank you
very much in advance for any help.
Kind regards
Siegmar
_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2016/05/29281.php
_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2016/05/29284.php
/* #include "orte_config.h" */
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>
#include <mpi.h>

int gethostname(char *name, size_t namelen);

int main(int argc, char* argv[])
{
    int msg, rc;
    MPI_Comm parent, child;
    int rank, size;
    /* char hostname[OPAL_MAXHOSTNAMELEN]; */
    char hostname[128];
    pid_t pid;

    pid = getpid();
    printf("[pid %ld] starting up!\n", (long)pid);
    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("%d completed MPI_Init\n", rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_get_parent(&parent);

    /* If we get COMM_NULL back, then we're the parent */
    if (MPI_COMM_NULL == parent) {
        pid = getpid();
        printf("Parent [pid %ld] about to spawn!\n", (long)pid);
        if (MPI_SUCCESS != (rc = MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 3,
                                                MPI_INFO_NULL, 0, MPI_COMM_WORLD,
                                                &child, MPI_ERRCODES_IGNORE))) {
            printf("Child failed to spawn\n");
            return rc;
        }
        printf("Parent done with spawn\n");
        if (0 == rank) {
            msg = 38;
            printf("Parent sending message to child\n");
            MPI_Send(&msg, 1, MPI_INT, 0, 1, child);
        }
        MPI_Comm_disconnect(&child);
        printf("Parent disconnected\n");
    }
    /* Otherwise, we're the child */
    else {
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        gethostname(hostname, sizeof(hostname));
        pid = getpid();
        printf("Hello from the child %d of %d on host %s pid %ld\n",
               rank, 3, hostname, (long)pid);
        if (0 == rank) {
            MPI_Recv(&msg, 1, MPI_INT, 0, 1, parent, MPI_STATUS_IGNORE);
            printf("Child %d received msg: %d\n", rank, msg);
        }
        MPI_Comm_disconnect(&parent);
        printf("Child %d disconnected\n", rank);
    }

    MPI_Finalize();
    fprintf(stderr, "%ld: exiting\n", (long)pid);  /* pid_t printed via long */
    return 0;
}