> On May 24, 2016, at 6:21 AM, Siegmar Gross 
> <siegmar.gr...@informatik.hs-fulda.de> wrote:
> 
> Hi Ralph,
> 
> I have copied the relevant lines here so that it is easier to see what
> happens. "a.out" is your program, which I compiled with mpicc.
> 
> >> loki spawn 153 ompi_info | grep -e "OPAL repo revision:" -e "C compiler
> >> absolute:"
> >>      OPAL repo revision: v1.10.2-201-gd23dda8
> >>     C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc
> >> loki spawn 154 mpicc simple_spawn.c
> 
> >> loki spawn 155 mpiexec -np 1 a.out
> >> [pid 24008] starting up!
> >> 0 completed MPI_Init
> ...
> 
> "mpiexec -np 1 a.out" works.
> 
> 
> 
> > I don’t know what “a.out” is, but it looks like there is some memory
> > corruption there.
> 
> "a.out" is still your program. I get the same error on different
> machines, so it is not very likely that the (hardware) memory
> is corrupted.
> 
> 
> >> loki spawn 156 mpiexec -np 1 --host loki --slot-list 0-5 a.out
> >> [pid 24102] starting up!
> >> 0 completed MPI_Init
> >> Parent [pid 24102] about to spawn!
> >> [pid 24104] starting up!
> >> [pid 24105] starting up!
> >> [loki:24105] *** Process received signal ***
> >> [loki:24105] Signal: Segmentation fault (11)
> >> [loki:24105] Signal code: Address not mapped (1)
> ...
> 
> "mpiexec -np 1 --host loki --slot-list 0-5 a.out" breaks with a segmentation
> fault. Can I do something to help you find out what happens?

I honestly have no idea - perhaps Gilles can help, as I have no access to that 
kind of environment. We aren’t seeing such problems elsewhere, so it is likely 
something local.

> 
> 
> Kind regards
> 
> Siegmar
> 
> 
> 
> On 05/24/16 15:07, Ralph Castain wrote:
>> 
>>> On May 24, 2016, at 4:19 AM, Siegmar Gross
>>> <siegmar.gr...@informatik.hs-fulda.de
>>> <mailto:siegmar.gr...@informatik.hs-fulda.de>> wrote:
>>> 
>>> Hi Ralph,
>>> 
>>> thank you very much for your answer and your example program.
>>> 
>>> On 05/23/16 17:45, Ralph Castain wrote:
>>>> I cannot replicate the problem - both scenarios work fine for me. I’m not
>>>> convinced your test code is correct, however, as you call Comm_free the
>>>> inter-communicator but didn’t call Comm_disconnect. Checkout the attached
>>>> for a correct code and see if it works for you.
>>> 
>>> I thought that I would only need MPI_Comm_disconnect if I had established a
>>> connection with MPI_Comm_connect before. The man page for MPI_Comm_free
>>> states
>>> 
>>> "This  operation marks the communicator object for deallocation. The
>>> handle is set to MPI_COMM_NULL. Any pending operations that use this
>>> communicator will complete normally; the object is actually deallocated only
>>> if there are no other active references to it.".
>>> 
>>> The man page for MPI_Comm_disconnect states
>>> 
>>> "MPI_Comm_disconnect waits for all pending communication on comm to complete
>>> internally, deallocates the communicator object, and sets the handle to
>>> MPI_COMM_NULL. It is  a  collective operation.".
>>> 
>>> I don't see a difference for my spawned processes, because both functions
>>> will "wait" until all pending operations have finished before the object
>>> is destroyed. Nevertheless, perhaps my small example program has worked
>>> all these years only by chance.
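The pattern under discussion can be sketched as follows. This is a minimal, illustrative self-spawning parent/child pair, not the exact simple_spawn.c attached to this thread; it shows the MPI_Comm_disconnect teardown on both sides of the inter-communicator, as opposed to MPI_Comm_free:

```c
/* Minimal sketch of the spawn pattern discussed above: the parent
 * spawns copies of its own executable, and both sides tear down the
 * inter-communicator with MPI_Comm_disconnect (collective, waits for
 * pending communication) instead of MPI_Comm_free (which only marks
 * the object for deallocation).  Illustrative only -- not the exact
 * simple_spawn.c attached to this thread. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm parent, intercomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* Parent: spawn 3 children running the same binary. */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 3, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
        MPI_Comm_disconnect(&intercomm);
    } else {
        /* Child: disconnect from the parent's inter-communicator. */
        MPI_Comm_disconnect(&parent);
    }

    MPI_Finalize();
    return 0;
}
```

Compile with mpicc and launch with mpiexec; the block requires an installed MPI and cannot run standalone.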
>>> 
>>> However, I don't understand why my program works with
>>> "mpiexec -np 1 --host loki,loki,loki,loki,loki spawn_master" and breaks with
>>> "mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master". You are
>>> right, my slot-list is equivalent to "-bind-to none". I could also have used
>>> "mpiexec -np 1 --host loki --oversubscribe spawn_master", which works as
>>> well.
>> 
>> Well, you are only giving us one slot when you specify "-host loki”, and then
>> you are trying to launch multiple processes into it. The “slot-list” option
>> only tells us what cpus to bind each process to - it doesn’t allocate process
>> slots. So you have to tell us how many processes are allowed to run on this
>> node.
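The distinction can be summarized with the invocations already shown in this thread (all commands are taken from the transcripts above; behavior described is for the 1.10 series):

```shell
# Grants only ONE slot on loki; spawning 4 more processes then fails
# with "There are not enough slots available in the system ...":
mpiexec -np 1 --host loki spawn_master

# Repeating the host name grants one slot per mention (5 slots total):
mpiexec -np 1 --host loki,loki,loki,loki,loki spawn_master

# Alternatively, explicitly allow more processes than slots:
mpiexec -np 1 --host loki --oversubscribe spawn_master
```

By contrast, "--slot-list" only chooses which cpus each launched process is bound to; it does not add slots.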
>> 
>>> 
>>> The program breaks with "There are not enough slots available in the system
>>> to satisfy ..." if I only use "--host loki" or different host names without
>>> mentioning five host names, using "slot-list", or using "oversubscribe".
>>> Unfortunately, "--host <host name>:<number of slots>" isn't available in
>>> openmpi-1.10.3rc2 to specify the number of available slots.
>> 
>> Correct - we did not backport the new syntax.
>> 
>>> 
>>> Your program behaves the same way as mine, so MPI_Comm_disconnect
>>> will not solve my problem. I had to modify your program in a negligible way
>>> to get it to compile.
>>> 
>>> loki spawn 153 ompi_info | grep -e "OPAL repo revision:" -e "C compiler 
>>> absolute:"
>>>     OPAL repo revision: v1.10.2-201-gd23dda8
>>>    C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc
>>> loki spawn 154 mpicc simple_spawn.c
>>> loki spawn 155 mpiexec -np 1 a.out
>>> [pid 24008] starting up!
>>> 0 completed MPI_Init
>>> Parent [pid 24008] about to spawn!
>>> [pid 24010] starting up!
>>> [pid 24011] starting up!
>>> [pid 24012] starting up!
>>> Parent done with spawn
>>> Parent sending message to child
>>> 0 completed MPI_Init
>>> Hello from the child 0 of 3 on host loki pid 24010
>>> 1 completed MPI_Init
>>> Hello from the child 1 of 3 on host loki pid 24011
>>> 2 completed MPI_Init
>>> Hello from the child 2 of 3 on host loki pid 24012
>>> Child 0 received msg: 38
>>> Child 0 disconnected
>>> Child 1 disconnected
>>> Child 2 disconnected
>>> Parent disconnected
>>> 24012: exiting
>>> 24010: exiting
>>> 24008: exiting
>>> 24011: exiting
>>> 
>>> 
>>> Is something wrong with my command line? I hadn't used slot-list before, so
>>> I'm not sure whether I'm using it in the intended way.
>> 
>> I don’t know what “a.out” is, but it looks like there is some memory 
>> corruption
>> there.
>> 
>>> 
>>> loki spawn 156 mpiexec -np 1 --host loki --slot-list 0-5 a.out
>>> [pid 24102] starting up!
>>> 0 completed MPI_Init
>>> Parent [pid 24102] about to spawn!
>>> [pid 24104] starting up!
>>> [pid 24105] starting up!
>>> [loki:24105] *** Process received signal ***
>>> [loki:24105] Signal: Segmentation fault (11)
>>> [loki:24105] Signal code: Address not mapped (1)
>>> [loki:24105] Failing at address: 0x8
>>> [loki:24105] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7f39aa76f870]
>>> [loki:24105] [ 1]
>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_proc_self+0x35)[0x7f39aa9d25b0]
>>> [loki:24105] [ 2]
>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_comm_init+0x68b)[0x7f39aa9b1b08]
>>> [loki:24105] [ 3] *** An error occurred in MPI_Init
>>> *** on a NULL communicator
>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>> ***    and potentially your MPI job)
>>> [loki:24104] Local abort before MPI_INIT completed successfully; not able to
>>> aggregate error messages, and not able to guarantee that all other processes
>>> were killed!
>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_mpi_init+0xa90)[0x7f39aa9d7e8a]
>>> [loki:24105] [ 4]
>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(MPI_Init+0x1a0)[0x7f39aaa142ae]
>>> [loki:24105] [ 5] a.out[0x400d0c]
>>> [loki:24105] [ 6] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f39aa3d9b05]
>>> [loki:24105] [ 7] a.out[0x400bf9]
>>> [loki:24105] *** End of error message ***
>>> -------------------------------------------------------
>>> Child job 2 terminated normally, but 1 process returned
>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>> -------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> mpiexec detected that one or more processes exited with non-zero status, 
>>> thus
>>> causing
>>> the job to be terminated. The first process to do so was:
>>> 
>>> Process name: [[49560,2],0]
>>> Exit code:    1
>>> --------------------------------------------------------------------------
>>> loki spawn 157
>>> 
>>> 
>>> Hopefully, you will find out what happens. Please let me know if I can
>>> help you in any way.
>>> 
>>> Kind regards
>>> 
>>> Siegmar
>>> 
>>> 
>>>> FWIW: I don’t know how many cores you have on your sockets, but if you
>>>> have 6 cores/socket, then your slot-list is equivalent to “--bind-to none”,
>>>> as the slot-list applies to every process being launched.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On May 23, 2016, at 6:26 AM, Siegmar Gross
>>>>> <siegmar.gr...@informatik.hs-fulda.de
>>>>> <mailto:siegmar.gr...@informatik.hs-fulda.de>> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> I installed openmpi-1.10.3rc2 on my "SUSE Linux Enterprise Server
>>>>> 12 (x86_64)" with Sun C 5.13 and gcc-6.1.0. Unfortunately, I get
>>>>> a segmentation fault with "--slot-list" for one of my small programs.
>>>>> 
>>>>> 
>>>>> loki spawn 119 ompi_info | grep -e "OPAL repo revision:" -e "C compiler
>>>>> absolute:"
>>>>>    OPAL repo revision: v1.10.2-201-gd23dda8
>>>>>   C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc
>>>>> 
>>>>> 
>>>>> loki spawn 120 mpiexec -np 1 --host loki,loki,loki,loki,loki spawn_master
>>>>> 
>>>>> Parent process 0 running on loki
>>>>> I create 4 slave processes
>>>>> 
>>>>> Parent process 0: tasks in MPI_COMM_WORLD:                    1
>>>>>                tasks in COMM_CHILD_PROCESSES local group:  1
>>>>>                tasks in COMM_CHILD_PROCESSES remote group: 4
>>>>> 
>>>>> Slave process 0 of 4 running on loki
>>>>> Slave process 1 of 4 running on loki
>>>>> Slave process 2 of 4 running on loki
>>>>> spawn_slave 2: argv[0]: spawn_slave
>>>>> Slave process 3 of 4 running on loki
>>>>> spawn_slave 0: argv[0]: spawn_slave
>>>>> spawn_slave 1: argv[0]: spawn_slave
>>>>> spawn_slave 3: argv[0]: spawn_slave
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> loki spawn 121 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 
>>>>> spawn_master
>>>>> 
>>>>> Parent process 0 running on loki
>>>>> I create 4 slave processes
>>>>> 
>>>>> [loki:17326] *** Process received signal ***
>>>>> [loki:17326] Signal: Segmentation fault (11)
>>>>> [loki:17326] Signal code: Address not mapped (1)
>>>>> [loki:17326] Failing at address: 0x8
>>>>> [loki:17326] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7f4e469b3870]
>>>>> [loki:17326] [ 1] *** An error occurred in MPI_Init
>>>>> *** on a NULL communicator
>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>> ***    and potentially your MPI job)
>>>>> [loki:17324] Local abort before MPI_INIT completed successfully; not able 
>>>>> to
>>>>> aggregate error messages, and not able to guarantee that all other 
>>>>> processes
>>>>> were killed!
>>>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_proc_self+0x35)[0x7f4e46c165b0]
>>>>> [loki:17326] [ 2]
>>>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_comm_init+0x68b)[0x7f4e46bf5b08]
>>>>> [loki:17326] [ 3] *** An error occurred in MPI_Init
>>>>> *** on a NULL communicator
>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>> ***    and potentially your MPI job)
>>>>> [loki:17325] Local abort before MPI_INIT completed successfully; not able 
>>>>> to
>>>>> aggregate error messages, and not able to guarantee that all other 
>>>>> processes
>>>>> were killed!
>>>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_mpi_init+0xa90)[0x7f4e46c1be8a]
>>>>> [loki:17326] [ 4]
>>>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(MPI_Init+0x180)[0x7f4e46c5828e]
>>>>> [loki:17326] [ 5] spawn_slave[0x40097e]
>>>>> [loki:17326] [ 6] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f4e4661db05]
>>>>> [loki:17326] [ 7] spawn_slave[0x400a54]
>>>>> [loki:17326] *** End of error message ***
>>>>> -------------------------------------------------------
>>>>> Child job 2 terminated normally, but 1 process returned
>>>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>>>> -------------------------------------------------------
>>>>> --------------------------------------------------------------------------
>>>>> mpiexec detected that one or more processes exited with non-zero status,
>>>>> thus causing
>>>>> the job to be terminated. The first process to do so was:
>>>>> 
>>>>> Process name: [[56340,2],0]
>>>>> Exit code:    1
>>>>> --------------------------------------------------------------------------
>>>>> loki spawn 122
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> I would be grateful if somebody could fix the problem. Thank you
>>>>> very much in advance for any help.
>>>>> 
>>>>> 
>>>>> Kind regards
>>>>> 
>>>>> Siegmar
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> us...@open-mpi.org <mailto:us...@open-mpi.org>
>>>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>> Link to this post:
>>>>> http://www.open-mpi.org/community/lists/users/2016/05/29281.php
>>>> 
>>>> 
>>>> 
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org <mailto:us...@open-mpi.org>
>>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>> Link to this
>>>> post: http://www.open-mpi.org/community/lists/users/2016/05/29284.php
>>>> 
>>> <simple_spawn_modified.c>_______________________________________________
>>> users mailing list
>>> us...@open-mpi.org <mailto:us...@open-mpi.org>
>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post: 
>>> http://www.open-mpi.org/community/lists/users/2016/05/29300.php
>> 
>> 
>> 
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/users/2016/05/29301.php
>> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2016/05/29304.php
