> On May 24, 2016, at 4:19 AM, Siegmar Gross 
> <siegmar.gr...@informatik.hs-fulda.de> wrote:
> 
> Hi Ralph,
> 
> thank you very much for your answer and your example program.
> 
> On 05/23/16 17:45, Ralph Castain wrote:
>> I cannot replicate the problem - both scenarios work fine for me. I'm not
>> convinced your test code is correct, however, as you call Comm_free on the
>> inter-communicator but didn't call Comm_disconnect. Check out the attached
>> for a correct code and see if it works for you.
> 
> I thought that I would only need MPI_Comm_disconnect if I had established a
> connection with MPI_Comm_connect before. The man page for MPI_Comm_free states
> 
> "This  operation marks the communicator object for deallocation. The
> handle is set to MPI_COMM_NULL. Any pending operations that use this
> communicator will complete normally; the object is actually deallocated only
> if there are no other active references to it.".
> 
> The man page for MPI_Comm_disconnect states
> 
> "MPI_Comm_disconnect waits for all pending communication on comm to complete
> internally, deallocates the communicator object, and sets the handle to
> MPI_COMM_NULL. It is  a  collective operation.".
> 
> I don't see a difference for my spawned processes, because both functions
> "wait" until all pending operations have finished before the object is
> destroyed. Nevertheless, perhaps my small example program has worked all
> these years just by chance.
> 
> However, I don't understand why my program works with
> "mpiexec -np 1 --host loki,loki,loki,loki,loki spawn_master" and breaks with
> "mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master". You are
> right, my slot-list is equivalent to "--bind-to none". I could also have used
> "mpiexec -np 1 --host loki --oversubscribe spawn_master", which works as well.

Well, you are only giving us one slot when you specify "-host loki", and then
you are trying to launch multiple processes into it. The "slot-list" option
only tells us which CPUs to bind each process to - it doesn't allocate process
slots. So you have to tell us how many processes are allowed to run on this
node.
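
If it helps, a hostfile is the usual way to do that on the 1.10 series. The
file name and the slot count below are just an example - set "slots" to
however many processes you want to allow on loki:

  $ cat myhostfile
  loki slots=12

  $ mpiexec -np 1 --hostfile myhostfile --slot-list 0:0-5,1:0-5 spawn_master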

> 
> The program breaks with "There are not enough slots available in the system
> to satisfy ..." if I only use "--host loki" or different host names, i.e.
> unless I mention five host names, use "--slot-list", or use "--oversubscribe".
> Unfortunately, "--host <host name>:<number of slots>" isn't available in
> openmpi-1.10.3rc2 to specify the number of available slots.

Correct - we did not backport the new syntax.
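
For anyone finding this thread later: on releases that do include the new
syntax, the equivalent command would be something like

  mpiexec -np 1 --host loki:5 --slot-list 0:0-5,1:0-5 spawn_master

where 5 is just the number of slots needed here (one master plus four spawned
slaves).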

> 
> Your program behaves the same way as mine, so MPI_Comm_disconnect
> will not solve my problem. I had to modify your program slightly to get it
> to compile.
> 
> loki spawn 153 ompi_info | grep -e "OPAL repo revision:" -e "C compiler 
> absolute:"
>      OPAL repo revision: v1.10.2-201-gd23dda8
>     C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc
> loki spawn 154 mpicc simple_spawn.c
> loki spawn 155 mpiexec -np 1 a.out
> [pid 24008] starting up!
> 0 completed MPI_Init
> Parent [pid 24008] about to spawn!
> [pid 24010] starting up!
> [pid 24011] starting up!
> [pid 24012] starting up!
> Parent done with spawn
> Parent sending message to child
> 0 completed MPI_Init
> Hello from the child 0 of 3 on host loki pid 24010
> 1 completed MPI_Init
> Hello from the child 1 of 3 on host loki pid 24011
> 2 completed MPI_Init
> Hello from the child 2 of 3 on host loki pid 24012
> Child 0 received msg: 38
> Child 0 disconnected
> Child 1 disconnected
> Child 2 disconnected
> Parent disconnected
> 24012: exiting
> 24010: exiting
> 24008: exiting
> 24011: exiting
> 
> 
> Is something wrong with my command line? I haven't used slot-list before, so
> I'm not sure whether I'm using it in the intended way.

I don’t know what “a.out” is, but it looks like there is some memory corruption 
there.

> 
> loki spawn 156 mpiexec -np 1 --host loki --slot-list 0-5 a.out
> [pid 24102] starting up!
> 0 completed MPI_Init
> Parent [pid 24102] about to spawn!
> [pid 24104] starting up!
> [pid 24105] starting up!
> [loki:24105] *** Process received signal ***
> [loki:24105] Signal: Segmentation fault (11)
> [loki:24105] Signal code: Address not mapped (1)
> [loki:24105] Failing at address: 0x8
> [loki:24105] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7f39aa76f870]
> [loki:24105] [ 1] 
> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_proc_self+0x35)[0x7f39aa9d25b0]
> [loki:24105] [ 2] 
> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_comm_init+0x68b)[0x7f39aa9b1b08]
> [loki:24105] [ 3] *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [loki:24104] Local abort before MPI_INIT completed successfully; not able to 
> aggregate error messages, and not able to guarantee that all other processes 
> were killed!
> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_mpi_init+0xa90)[0x7f39aa9d7e8a]
> [loki:24105] [ 4] 
> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(MPI_Init+0x1a0)[0x7f39aaa142ae]
> [loki:24105] [ 5] a.out[0x400d0c]
> [loki:24105] [ 6] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f39aa3d9b05]
> [loki:24105] [ 7] a.out[0x400bf9]
> [loki:24105] *** End of error message ***
> -------------------------------------------------------
> Child job 2 terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpiexec detected that one or more processes exited with non-zero status, thus 
> causing
> the job to be terminated. The first process to do so was:
> 
>  Process name: [[49560,2],0]
>  Exit code:    1
> --------------------------------------------------------------------------
> loki spawn 157
> 
> 
> Hopefully, you will find out what happens. Please let me know, if I can
> help you in any way.
> 
> Kind regards
> 
> Siegmar
> 
> 
>> FWIW: I don't know how many cores you have on your sockets, but if you
>> have 6 cores/socket, then your slot-list is equivalent to "--bind-to none",
>> as the slot-list applies to every process being launched.
>> 
>> 
>> 
>> 
>> 
>>> On May 23, 2016, at 6:26 AM, Siegmar Gross 
>>> <siegmar.gr...@informatik.hs-fulda.de> wrote:
>>> 
>>> Hi,
>>> 
>>> I installed openmpi-1.10.3rc2 on my "SUSE Linux Enterprise Server
>>> 12 (x86_64)" with Sun C 5.13 and gcc-6.1.0. Unfortunately, I get
>>> a segmentation fault with "--slot-list" for one of my small programs.
>>> 
>>> 
>>> loki spawn 119 ompi_info | grep -e "OPAL repo revision:" -e "C compiler 
>>> absolute:"
>>>     OPAL repo revision: v1.10.2-201-gd23dda8
>>>    C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc
>>> 
>>> 
>>> loki spawn 120 mpiexec -np 1 --host loki,loki,loki,loki,loki spawn_master
>>> 
>>> Parent process 0 running on loki
>>> I create 4 slave processes
>>> 
>>> Parent process 0: tasks in MPI_COMM_WORLD:                    1
>>>                 tasks in COMM_CHILD_PROCESSES local group:  1
>>>                 tasks in COMM_CHILD_PROCESSES remote group: 4
>>> 
>>> Slave process 0 of 4 running on loki
>>> Slave process 1 of 4 running on loki
>>> Slave process 2 of 4 running on loki
>>> spawn_slave 2: argv[0]: spawn_slave
>>> Slave process 3 of 4 running on loki
>>> spawn_slave 0: argv[0]: spawn_slave
>>> spawn_slave 1: argv[0]: spawn_slave
>>> spawn_slave 3: argv[0]: spawn_slave
>>> 
>>> 
>>> 
>>> 
>>> loki spawn 121 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 
>>> spawn_master
>>> 
>>> Parent process 0 running on loki
>>> I create 4 slave processes
>>> 
>>> [loki:17326] *** Process received signal ***
>>> [loki:17326] Signal: Segmentation fault (11)
>>> [loki:17326] Signal code: Address not mapped (1)
>>> [loki:17326] Failing at address: 0x8
>>> [loki:17326] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7f4e469b3870]
>>> [loki:17326] [ 1] *** An error occurred in MPI_Init
>>> *** on a NULL communicator
>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>> ***    and potentially your MPI job)
>>> [loki:17324] Local abort before MPI_INIT completed successfully; not able 
>>> to aggregate error messages, and not able to guarantee that all other 
>>> processes were killed!
>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_proc_self+0x35)[0x7f4e46c165b0]
>>> [loki:17326] [ 2] 
>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_comm_init+0x68b)[0x7f4e46bf5b08]
>>> [loki:17326] [ 3] *** An error occurred in MPI_Init
>>> *** on a NULL communicator
>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>> ***    and potentially your MPI job)
>>> [loki:17325] Local abort before MPI_INIT completed successfully; not able 
>>> to aggregate error messages, and not able to guarantee that all other 
>>> processes were killed!
>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_mpi_init+0xa90)[0x7f4e46c1be8a]
>>> [loki:17326] [ 4] 
>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(MPI_Init+0x180)[0x7f4e46c5828e]
>>> [loki:17326] [ 5] spawn_slave[0x40097e]
>>> [loki:17326] [ 6] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f4e4661db05]
>>> [loki:17326] [ 7] spawn_slave[0x400a54]
>>> [loki:17326] *** End of error message ***
>>> -------------------------------------------------------
>>> Child job 2 terminated normally, but 1 process returned
>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>> -------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> mpiexec detected that one or more processes exited with non-zero status, 
>>> thus causing
>>> the job to be terminated. The first process to do so was:
>>> 
>>> Process name: [[56340,2],0]
>>> Exit code:    1
>>> --------------------------------------------------------------------------
>>> loki spawn 122
>>> 
>>> 
>>> 
>>> 
>>> I would be grateful if somebody could fix the problem. Thank you
>>> very much in advance for any help.
>>> 
>>> 
>>> Kind regards
>>> 
>>> Siegmar
> <simple_spawn_modified.c>
