I addressed the “not enough slots” problem here:  
https://github.com/open-mpi/ompi-release/pull/1163 

The multiple slot-list problem is new to me - we’ve never had someone try that 
before, and I’m not sure how it would work, given that the slot-list is an MCA 
param and can therefore have only one value per job. Probably something for the 
future.
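
For illustration, a single job-wide slot list would look roughly like the
sketch below. This is only a sketch, not a confirmed interface: it assumes
the underlying MCA parameter is named rmaps_base_slot_list (the exact name
may differ across releases) and it relies on the usual OMPI_MCA_<param>
environment-variable convention. The point is simply that one value covers
every app context, which is why a different slot list per app context does
not map onto the current design.

  # One slot list for the whole job, passed as an MCA param (param name assumed):
  mpiexec --mca rmaps_base_slot_list 0:0-5,1:0-5 \
      -np 1 --host loki hello_2_mpi : -np 3 --host loki hello_2_slave_mpi

  # Equivalent environment-variable form (same assumed param name):
  export OMPI_MCA_rmaps_base_slot_list=0:0-5,1:0-5
  mpiexec -np 1 --host loki hello_2_mpi : -np 3 --host loki hello_2_slave_mpi

Supporting a different slot list per app context would mean turning that
single value into per-app-context data, which is the part that would need
new work.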


> On May 15, 2016, at 7:55 AM, Ralph Castain <r...@open-mpi.org> wrote:
> 
> You are showing different cmd lines than last time :-)
> 
> I’ll try to take a look as time permits
> 
>> On May 15, 2016, at 7:47 AM, Siegmar Gross 
>> <siegmar.gr...@informatik.hs-fulda.de> wrote:
>> 
>> Hi Jeff,
>> 
>> today I upgraded to the latest version and I still have
>> problems. I compiled with gcc-6.1.0 and also tried to compile
>> with Sun C 5.14 beta. Sun C still breaks with "unrecognized
>> option '-path'", which was reported before, so I use
>> my gcc version. By the way, this problem is solved in
>> openmpi-v2.x-dev-1425-ga558e90 and openmpi-dev-4050-g7f65c2b.
>> 
>> loki hello_2 124 ompi_info | grep -e "OPAL repo revision" -e "C compiler 
>> absolute"
>>     OPAL repo revision: v1.10.2-189-gfc05056
>>    C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc
>> loki hello_2 125 mpiexec -np 1 --host loki hello_2_mpi : -np 1 --host loki 
>> --slot-list 0:0-5,1:0-5 hello_2_slave_mpi
>> --------------------------------------------------------------------------
>> There are not enough slots available in the system to satisfy the 1 slots
>> that were requested by the application:
>> hello_2_slave_mpi
>> 
>> Either request fewer slots for your application, or make more slots available
>> for use.
>> --------------------------------------------------------------------------
>> 
>> 
>> 
>> I get a result if I add "--slot-list" to the master process
>> as well. I changed "-np 2" to "-np 1" for the slave processes
>> to show the new problems.
>> 
>> loki hello_2 126 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 
>> hello_2_mpi : -np 1 --host loki --slot-list 0:0-5,1:0-5 hello_2_slave_mpi
>> Process 0 of 2 running on loki
>> Process 1 of 2 running on loki
>> 
>> Now 1 slave tasks are sending greetings.
>> 
>> Greetings from task 1:
>> message type:        3
>> msg length:          132 characters
>> message:
>>   hostname:          loki
>>   operating system:  Linux
>>   release:           3.12.55-52.42-default
>>   processor:         x86_64
>> 
>> 
>> Now let's increase the number of slave processes to 2.
>> I still get greetings from only one slave process, and
>> if I increase the number of slave processes to 3, I get
>> a segmentation fault. It's nearly the same for
>> openmpi-v2.x-dev-1425-ga558e90 (the only difference is
>> that the program hangs forever with 3 slave processes
>> for both my cc and gcc versions). Everything works as expected
>> with openmpi-dev-4050-g7f65c2b (although it takes very long
>> until I get all messages). It even works if I put
>> "--slot-list" only once on the command line, as you can see
>> below.
>> 
>> loki hello_2 127 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 
>> hello_2_mpi : -np 2 --host loki --slot-list 0:0-5,1:0-5 hello_2_slave_mpi
>> Process 0 of 2 running on loki
>> Process 1 of 2 running on loki
>> 
>> Now 1 slave tasks are sending greetings.
>> 
>> Greetings from task 1:
>> message type:        3
>> msg length:          132 characters
>> message:
>>   hostname:          loki
>>   operating system:  Linux
>>   release:           3.12.55-52.42-default
>>   processor:         x86_64
>> 
>> 
>> loki hello_2 128 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 
>> hello_2_mpi : -np 3 --host loki --slot-list 0:0-5,1:0-5 hello_2_slave_mpi
>> [loki:28536] *** Process received signal ***
>> [loki:28536] Signal: Segmentation fault (11)
>> [loki:28536] Signal code: Address not mapped (1)
>> [loki:28536] Failing at address: 0x8
>> [loki:28536] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7fd40eb75870]
>> [loki:28536] [ 1] 
>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_proc_self+0x35)[0x7fd40edd85b0]
>> [loki:28536] [ 2] 
>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_comm_init+0x68b)[0x7fd40edb7b08]
>> [loki:28536] [ 3] 
>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_mpi_init+0xa90)[0x7fd40eddde8a]
>> [loki:28536] [ 4] 
>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(MPI_Init+0x180)[0x7fd40ee1a28e]
>> [loki:28536] [ 5] hello_2_slave_mpi[0x400bee]
>> [loki:28536] [ 6] *** An error occurred in MPI_Init
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> ***    and potentially your MPI job)
>> [loki:28534] Local abort before MPI_INIT completed successfully; not able to 
>> aggregate error messages, and not able to guarantee that all other processes 
>> were killed!
>> *** An error occurred in MPI_Init
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> ***    and potentially your MPI job)
>> [loki:28535] Local abort before MPI_INIT completed successfully; not able to 
>> aggregate error messages, and not able to guarantee that all other processes 
>> were killed!
>> /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fd40e7dfb05]
>> [loki:28536] [ 7] hello_2_slave_mpi[0x400fb0]
>> [loki:28536] *** End of error message ***
>> -------------------------------------------------------
>> Primary job  terminated normally, but 1 process returned
>> a non-zero exit code.. Per user-direction, the job has been aborted.
>> -------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpiexec detected that one or more processes exited with non-zero status, 
>> thus causing
>> the job to be terminated. The first process to do so was:
>> 
>> Process name: [[61640,1],0]
>> Exit code:    1
>> --------------------------------------------------------------------------
>> loki hello_2 129
>> 
>> 
>> 
>> loki hello_2 114 ompi_info | grep -e "OPAL repo revision" -e "C compiler 
>> absolute"
>>     OPAL repo revision: dev-4050-g7f65c2b
>>    C compiler absolute: /opt/solstudio12.5b/bin/cc
>> loki hello_2 115 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 
>> hello_2_mpi : -np 3 --host loki --slot-list 0:0-5,1:0-5 hello_2_slave_mpi
>> Process 0 of 4 running on loki
>> Process 1 of 4 running on loki
>> Process 2 of 4 running on loki
>> Process 3 of 4 running on loki
>> ...
>> 
>> 
>> It even works if I put "--slot-list" only once on the command
>> line.
>> 
>> loki hello_2 116 mpiexec -np 1 --host loki hello_2_mpi : -np 3 --host loki 
>> --slot-list 0:0-5,1:0-5 hello_2_slave_mpi
>> Process 1 of 4 running on loki
>> Process 2 of 4 running on loki
>> Process 0 of 4 running on loki
>> Process 3 of 4 running on loki
>> ...
>> 
>> 
>> Hopefully you know what happens and why, so that
>> you can fix the problem for openmpi-1.10.x and openmpi-2.x.
>> My three spawn programs also work with openmpi-master,
>> while "spawn_master" breaks on both openmpi-1.10.x and
>> openmpi-2.x with the same failure as my hello master/slave
>> program.
>> 
>> Do you know when the Java problem will be solved?
>> 
>> 
>> Kind regards
>> 
>> Siegmar
>> 
>> 
>> 
>> On 15.05.2016 at 01:27, Ralph Castain wrote:
>>> 
>>>> On May 7, 2016, at 1:13 AM, Siegmar Gross 
>>>> <siegmar.gr...@informatik.hs-fulda.de> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> yesterday I installed openmpi-v1.10.2-176-g9d45e07 on my "SUSE Linux
>>>> Enterprise Server 12 (x86_64)" with Sun C 5.13  and gcc-5.3.0. The
>>>> following programs don't run anymore.
>>>> 
>>>> 
>>>> loki hello_2 112 ompi_info | grep -e "OPAL repo revision" -e "C compiler 
>>>> absolute"
>>>>    OPAL repo revision: v1.10.2-176-g9d45e07
>>>>   C compiler absolute: /opt/solstudio12.4/bin/cc
>>>> loki hello_2 113 mpiexec -np 1 --host loki hello_2_mpi : -np 2 --host 
>>>> loki,loki hello_2_slave_mpi
>>>> --------------------------------------------------------------------------
>>>> There are not enough slots available in the system to satisfy the 2 slots
>>>> that were requested by the application:
>>>> hello_2_slave_mpi
>>>> 
>>>> Either request fewer slots for your application, or make more slots 
>>>> available
>>>> for use.
>>>> --------------------------------------------------------------------------
>>>> loki hello_2 114
>>>> 
>>> 
>>> The above worked fine for me with:
>>> 
>>> OPAL repo revision: v1.10.2-182-g52c7573
>>> 
>>> You might try updating.
>>> 
>>>> 
>>>> 
>>>> Everything worked as expected with openmpi-v1.10.0-178-gb80f802.
>>>> 
>>>> loki hello_2 114 ompi_info | grep -e "OPAL repo revision" -e "C compiler 
>>>> absolute"
>>>>    OPAL repo revision: v1.10.0-178-gb80f802
>>>>   C compiler absolute: /opt/solstudio12.4/bin/cc
>>>> loki hello_2 115 mpiexec -np 1 --host loki hello_2_mpi : -np 2 --host 
>>>> loki,loki hello_2_slave_mpi
>>>> Process 0 of 3 running on loki
>>>> Process 1 of 3 running on loki
>>>> Process 2 of 3 running on loki
>>>> 
>>>> Now 2 slave tasks are sending greetings.
>>>> 
>>>> Greetings from task 2:
>>>> message type:        3
>>>> ...
>>>> 
>>>> 
>>>> I have the same problem with openmpi-v2.x-dev-1404-g74d8ea0 if I use
>>>> the following commands.
>>>> 
>>>> mpiexec -np 1 --host loki hello_2_mpi : -np 2 --host loki,loki 
>>>> hello_2_slave_mpi
>>>> mpiexec -np 1 --host loki hello_2_mpi : -np 2 --host loki,nfs1 
>>>> hello_2_slave_mpi
>>>> mpiexec -np 1 --host loki hello_2_mpi : -np 2 --host loki --slot-list 
>>>> 0:0-5,1:0-5 hello_2_slave_mpi
>>>> 
>>>> 
>>>> I also have the same problem with openmpi-dev-4010-g6c9d65c if I use
>>>> the following command.
>>>> 
>>>> mpiexec -np 1 --host loki hello_2_mpi : -np 2 --host loki,loki 
>>>> hello_2_slave_mpi
>>>> 
>>>> 
>>>> openmpi-dev-4010-g6c9d65c works as expected with the following commands.
>>>> 
>>>> mpiexec -np 1 --host loki hello_2_mpi : -np 2 --host loki,nfs1 
>>>> hello_2_slave_mpi
>>>> mpiexec -np 1 --host loki hello_2_mpi : -np 2 --host loki --slot-list 
>>>> 0:0-5,1:0-5 hello_2_slave_mpi
>>>> 
>>>> 
>>>> Has the interface changed so that I'm no longer allowed to use some of my
>>>> commands? I would be grateful if somebody could fix the
>>>> problem, if it is indeed a problem. Thank you very much in advance
>>>> for any help.
>>>> 
>>>> 
>>>> 
>>>> Kind regards
>>>> 
>>>> Siegmar
>>>> <hello_2_mpi.c><hello_2_slave_mpi.c>
>>> 
>>> 
>> <hello_2_mpi.c><hello_2_slave_mpi.c>
> 
