> On May 24, 2016, at 4:19 AM, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:
>
> Hi Ralph,
>
> thank you very much for your answer and your example program.
>
> On 05/23/16 17:45, Ralph Castain wrote:
>> I cannot replicate the problem - both scenarios work fine for me. I'm not
>> convinced your test code is correct, however, as you call MPI_Comm_free on the
>> inter-communicator but don't call MPI_Comm_disconnect. Check out the attached
>> code for a correct version and see if it works for you.
>
> I thought that I only need MPI_Comm_disconnect if I had established a
> connection with MPI_Comm_connect before. The man page for MPI_Comm_free states:
>
> "This operation marks the communicator object for deallocation. The
> handle is set to MPI_COMM_NULL. Any pending operations that use this
> communicator will complete normally; the object is actually deallocated only
> if there are no other active references to it."
>
> The man page for MPI_Comm_disconnect states:
>
> "MPI_Comm_disconnect waits for all pending communication on comm to complete
> internally, deallocates the communicator object, and sets the handle to
> MPI_COMM_NULL. It is a collective operation."
>
> I don't see a difference for my spawned processes, because both functions
> "wait" until all pending operations have finished before the object is
> destroyed. Nevertheless, perhaps my small example program has only worked all
> these years by chance.
>
> However, I don't understand why my program works with
> "mpiexec -np 1 --host loki,loki,loki,loki,loki spawn_master" and breaks with
> "mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master". You are right,
> my slot-list is equivalent to "--bind-to none". I could also have used
> "mpiexec -np 1 --host loki --oversubscribe spawn_master", which works as well.
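Regarding the Comm_free/Comm_disconnect question: for illustration only, here is a minimal sketch of the spawn pattern with MPI_Comm_disconnect on both sides. This is not the attached simple_spawn.c; the self-spawn via argv[0] and the child count of 4 are just simplifications.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm parent, intercomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* parent: spawn 4 copies of this program, yielding an inter-communicator */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 4, MPI_INFO_NULL, 0,
                       MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
        /* collective over the inter-communicator; waits for pending traffic */
        MPI_Comm_disconnect(&intercomm);
    } else {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("child %d running\n", rank);
        /* children release their side of the inter-communicator the same way */
        MPI_Comm_disconnect(&parent);
    }

    MPI_Finalize();
    return 0;
}

The only point of the sketch is that the inter-communicator from MPI_Comm_spawn is released with MPI_Comm_disconnect rather than MPI_Comm_free; whether that matters for the crash below is a separate question.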
Well, you are only giving us one slot when you specify "-host loki", and then you are
trying to launch multiple processes into it. The "slot-list" option only tells us what
cpus to bind each process to - it doesn't allocate process slots. So you have to tell
us how many processes are allowed to run on this node (one way to do that is sketched
at the end of this message).

> The program breaks with "There are not enough slots available in the system
> to satisfy ..." if I only use "--host loki" or different host names, i.e.
> without listing five host names, using "slot-list", or using "oversubscribe".
> Unfortunately, "--host <host name>:<number of slots>" isn't available in
> openmpi-1.10.3rc2 to specify the number of available slots.

Correct - we did not backport the new syntax.

> Your program behaves the same way as mine, so MPI_Comm_disconnect
> will not solve my problem. I had to modify your program in a negligible way
> to get it to compile.
>
> loki spawn 153 ompi_info | grep -e "OPAL repo revision:" -e "C compiler absolute:"
> OPAL repo revision: v1.10.2-201-gd23dda8
> C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc
> loki spawn 154 mpicc simple_spawn.c
> loki spawn 155 mpiexec -np 1 a.out
> [pid 24008] starting up!
> 0 completed MPI_Init
> Parent [pid 24008] about to spawn!
> [pid 24010] starting up!
> [pid 24011] starting up!
> [pid 24012] starting up!
> Parent done with spawn
> Parent sending message to child
> 0 completed MPI_Init
> Hello from the child 0 of 3 on host loki pid 24010
> 1 completed MPI_Init
> Hello from the child 1 of 3 on host loki pid 24011
> 2 completed MPI_Init
> Hello from the child 2 of 3 on host loki pid 24012
> Child 0 received msg: 38
> Child 0 disconnected
> Child 1 disconnected
> Child 2 disconnected
> Parent disconnected
> 24012: exiting
> 24010: exiting
> 24008: exiting
> 24011: exiting
>
> Is something wrong with my command line? I hadn't used slot-list before, so
> I'm not sure whether I use it in the intended way.

I don't know what "a.out" is, but it looks like there is some memory corruption there.

> loki spawn 156 mpiexec -np 1 --host loki --slot-list 0-5 a.out
> [pid 24102] starting up!
> 0 completed MPI_Init
> Parent [pid 24102] about to spawn!
> [pid 24104] starting up!
> [pid 24105] starting up!
> [loki:24105] *** Process received signal ***
> [loki:24105] Signal: Segmentation fault (11)
> [loki:24105] Signal code: Address not mapped (1)
> [loki:24105] Failing at address: 0x8
> [loki:24105] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7f39aa76f870]
> [loki:24105] [ 1] /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_proc_self+0x35)[0x7f39aa9d25b0]
> [loki:24105] [ 2] /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_comm_init+0x68b)[0x7f39aa9b1b08]
> [loki:24105] [ 3] *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> *** and potentially your MPI job)
> [loki:24104] Local abort before MPI_INIT completed successfully; not able to
> aggregate error messages, and not able to guarantee that all other processes
> were killed!
> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_mpi_init+0xa90)[0x7f39aa9d7e8a]
> [loki:24105] [ 4] /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(MPI_Init+0x1a0)[0x7f39aaa142ae]
> [loki:24105] [ 5] a.out[0x400d0c]
> [loki:24105] [ 6] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f39aa3d9b05]
> [loki:24105] [ 7] a.out[0x400bf9]
> [loki:24105] *** End of error message ***
> -------------------------------------------------------
> Child job 2 terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpiexec detected that one or more processes exited with non-zero status,
> thus causing the job to be terminated. The first process to do so was:
>
>   Process name: [[49560,2],0]
>   Exit code:    1
> --------------------------------------------------------------------------
> loki spawn 157
>
>
> Hopefully you will find out what happens. Please let me know if I can
> help you in any way.
>
> Kind regards
>
> Siegmar
>

>> FWIW: I don't know how many cores you have on your sockets, but if you
>> have 6 cores/socket, then your slot-list is equivalent to "--bind-to none",
>> as the slot-list applies to every process being launched.
>>
>>
>>> On May 23, 2016, at 6:26 AM, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:
>>>
>>> Hi,
>>>
>>> I installed openmpi-1.10.3rc2 on my "SUSE Linux Enterprise Server
>>> 12 (x86_64)" with Sun C 5.13 and gcc-6.1.0. Unfortunately I get
>>> a segmentation fault for "--slot-list" for one of my small programs.
>>>
>>>
>>> loki spawn 119 ompi_info | grep -e "OPAL repo revision:" -e "C compiler absolute:"
>>> OPAL repo revision: v1.10.2-201-gd23dda8
>>> C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc
>>>
>>>
>>> loki spawn 120 mpiexec -np 1 --host loki,loki,loki,loki,loki spawn_master
>>>
>>> Parent process 0 running on loki
>>> I create 4 slave processes
>>>
>>> Parent process 0: tasks in MPI_COMM_WORLD:                    1
>>>                   tasks in COMM_CHILD_PROCESSES local group:  1
>>>                   tasks in COMM_CHILD_PROCESSES remote group: 4
>>>
>>> Slave process 0 of 4 running on loki
>>> Slave process 1 of 4 running on loki
>>> Slave process 2 of 4 running on loki
>>> spawn_slave 2: argv[0]: spawn_slave
>>> Slave process 3 of 4 running on loki
>>> spawn_slave 0: argv[0]: spawn_slave
>>> spawn_slave 1: argv[0]: spawn_slave
>>> spawn_slave 3: argv[0]: spawn_slave
>>>
>>>
>>>
>>> loki spawn 121 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
>>>
>>> Parent process 0 running on loki
>>> I create 4 slave processes
>>>
>>> [loki:17326] *** Process received signal ***
>>> [loki:17326] Signal: Segmentation fault (11)
>>> [loki:17326] Signal code: Address not mapped (1)
>>> [loki:17326] Failing at address: 0x8
>>> [loki:17326] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7f4e469b3870]
>>> [loki:17326] [ 1] *** An error occurred in MPI_Init
>>> *** on a NULL communicator
>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>> *** and potentially your MPI job)
>>> [loki:17324] Local abort before MPI_INIT completed successfully; not able
>>> to aggregate error messages, and not able to guarantee that all other
>>> processes were killed!
>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_proc_self+0x35)[0x7f4e46c165b0]
>>> [loki:17326] [ 2] /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_comm_init+0x68b)[0x7f4e46bf5b08]
>>> [loki:17326] [ 3] *** An error occurred in MPI_Init
>>> *** on a NULL communicator
>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>> *** and potentially your MPI job)
>>> [loki:17325] Local abort before MPI_INIT completed successfully; not able
>>> to aggregate error messages, and not able to guarantee that all other
>>> processes were killed!
>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_mpi_init+0xa90)[0x7f4e46c1be8a]
>>> [loki:17326] [ 4] /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(MPI_Init+0x180)[0x7f4e46c5828e]
>>> [loki:17326] [ 5] spawn_slave[0x40097e]
>>> [loki:17326] [ 6] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f4e4661db05]
>>> [loki:17326] [ 7] spawn_slave[0x400a54]
>>> [loki:17326] *** End of error message ***
>>> -------------------------------------------------------
>>> Child job 2 terminated normally, but 1 process returned
>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>> -------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> mpiexec detected that one or more processes exited with non-zero status,
>>> thus causing the job to be terminated. The first process to do so was:
>>>
>>>   Process name: [[56340,2],0]
>>>   Exit code:    1
>>> --------------------------------------------------------------------------
>>> loki spawn 122
>>>
>>>
>>> I would be grateful if somebody could fix the problem. Thank you
>>> very much in advance for any help.
>>>
>>>
>>> Kind regards
>>>
>>> Siegmar
>
> <simple_spawn_modified.c>
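As mentioned above, one way to tell mpiexec how many processes are allowed to run on loki, without listing the host name five times, is a hostfile with a slot count. A sketch, where the file name "myhosts" and the value 6 are only examples - use the number of slots your node actually provides:

    loki slots=6

and then, for example:

    mpiexec -np 1 --hostfile myhosts spawn_master

This only addresses the "There are not enough slots available" message; whether it avoids the slot-list segfault is a separate question.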