I addressed the “not enough slots” problem here: https://github.com/open-mpi/ompi-release/pull/1163
The multiple slot-list problem is new to me - we’ve never had someone try that before, and I’m not sure how that would work, given that the slot-list is an MCA param and can have only one value. Probably something for the future.

> On May 15, 2016, at 7:55 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
> You are showing different cmd lines than last time :-)
>
> I’ll try to take a look as time permits
>
>> On May 15, 2016, at 7:47 AM, Siegmar Gross
>> <siegmar.gr...@informatik.hs-fulda.de> wrote:
>>
>> Hi Jeff,
>>
>> today I upgraded to the latest version and I still have
>> problems. I compiled with gcc-6.1.0 and I tried to compile
>> with Sun C 5.14 beta. Sun C still broke with "unrecognized
>> option '-path'", which was reported before, so I use
>> my gcc version. By the way, this problem is solved for
>> openmpi-v2.x-dev-1425-ga558e90 and openmpi-dev-4050-g7f65c2b.
>>
>> loki hello_2 124 ompi_info | grep -e "OPAL repo revision" -e "C compiler absolute"
>> OPAL repo revision: v1.10.2-189-gfc05056
>> C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc
>> loki hello_2 125 mpiexec -np 1 --host loki hello_2_mpi : -np 1 --host loki --slot-list 0:0-5,1:0-5 hello_2_slave_mpi
>> --------------------------------------------------------------------------
>> There are not enough slots available in the system to satisfy the 1 slots
>> that were requested by the application:
>>   hello_2_slave_mpi
>>
>> Either request fewer slots for your application, or make more slots available
>> for use.
>> --------------------------------------------------------------------------
>>
>>
>> I get a result if I add "--slot-list" to the master process
>> as well. I changed "-np 2" to "-np 1" for the slave processes
>> to show new problems.
>>
>> loki hello_2 126 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 hello_2_mpi : -np 1 --host loki --slot-list 0:0-5,1:0-5 hello_2_slave_mpi
>> Process 0 of 2 running on loki
>> Process 1 of 2 running on loki
>>
>> Now 1 slave tasks are sending greetings.
>>
>> Greetings from task 1:
>>   message type:      3
>>   msg length:        132 characters
>>   message:
>>     hostname:          loki
>>     operating system:  Linux
>>     release:           3.12.55-52.42-default
>>     processor:         x86_64
>>
>>
>> Now let's increase the number of slave processes to 2.
>> I still get only greetings from one slave process, and
>> if I increase the number of slave processes to 3, I get
>> a segmentation fault. It's nearly the same for
>> openmpi-v2.x-dev-1425-ga558e90 (the only difference is
>> that the program hangs forever with 3 slave processes
>> for my cc and gcc version). Everything works as expected
>> for openmpi-dev-4050-g7f65c2b (although it takes very long
>> until I get all messages). It even works if I put
>> "--slot-list" only once on the command line, as you can see
>> below.
>>
>> loki hello_2 127 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 hello_2_mpi : -np 2 --host loki --slot-list 0:0-5,1:0-5 hello_2_slave_mpi
>> Process 0 of 2 running on loki
>> Process 1 of 2 running on loki
>>
>> Now 1 slave tasks are sending greetings.
>>
>> Greetings from task 1:
>>   message type:      3
>>   msg length:        132 characters
>>   message:
>>     hostname:          loki
>>     operating system:  Linux
>>     release:           3.12.55-52.42-default
>>     processor:         x86_64
>>
>>
>> loki hello_2 128 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 hello_2_mpi : -np 3 --host loki --slot-list 0:0-5,1:0-5 hello_2_slave_mpi
>> [loki:28536] *** Process received signal ***
>> [loki:28536] Signal: Segmentation fault (11)
>> [loki:28536] Signal code: Address not mapped (1)
>> [loki:28536] Failing at address: 0x8
>> [loki:28536] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7fd40eb75870]
>> [loki:28536] [ 1] /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_proc_self+0x35)[0x7fd40edd85b0]
>> [loki:28536] [ 2] /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_comm_init+0x68b)[0x7fd40edb7b08]
>> [loki:28536] [ 3] /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_mpi_init+0xa90)[0x7fd40eddde8a]
>> [loki:28536] [ 4] /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(MPI_Init+0x180)[0x7fd40ee1a28e]
>> [loki:28536] [ 5] hello_2_slave_mpi[0x400bee]
>> [loki:28536] [ 6] *** An error occurred in MPI_Init
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> *** and potentially your MPI job)
>> [loki:28534] Local abort before MPI_INIT completed successfully; not able to
>> aggregate error messages, and not able to guarantee that all other processes
>> were killed!
>> *** An error occurred in MPI_Init
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> *** and potentially your MPI job)
>> [loki:28535] Local abort before MPI_INIT completed successfully; not able to
>> aggregate error messages, and not able to guarantee that all other processes
>> were killed!
>> /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fd40e7dfb05]
>> [loki:28536] [ 7] hello_2_slave_mpi[0x400fb0]
>> [loki:28536] *** End of error message ***
>> -------------------------------------------------------
>> Primary job terminated normally, but 1 process returned
>> a non-zero exit code.. Per user-direction, the job has been aborted.
>> -------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpiexec detected that one or more processes exited with non-zero status,
>> thus causing the job to be terminated. The first process to do so was:
>>
>>   Process name: [[61640,1],0]
>>   Exit code:    1
>> --------------------------------------------------------------------------
>> loki hello_2 129
>>
>>
>>
>> loki hello_2 114 ompi_info | grep -e "OPAL repo revision" -e "C compiler absolute"
>> OPAL repo revision: dev-4050-g7f65c2b
>> C compiler absolute: /opt/solstudio12.5b/bin/cc
>> loki hello_2 115 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 hello_2_mpi : -np 3 --host loki --slot-list 0:0-5,1:0-5 hello_2_slave_mpi
>> Process 0 of 4 running on loki
>> Process 1 of 4 running on loki
>> Process 2 of 4 running on loki
>> Process 3 of 4 running on loki
>> ...
>>
>>
>> It even works if I put "--slot-list" only once on the command
>> line.
>>
>> loki hello_2 116 mpiexec -np 1 --host loki hello_2_mpi : -np 3 --host loki --slot-list 0:0-5,1:0-5 hello_2_slave_mpi
>> Process 1 of 4 running on loki
>> Process 2 of 4 running on loki
>> Process 0 of 4 running on loki
>> Process 3 of 4 running on loki
>> ...
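(The attached hello_2_mpi.c and hello_2_slave_mpi.c are not reproduced in this post. For readers following along, a minimal master/slave pair consistent with the output in the transcripts above might look roughly like the sketch below; the message tag, buffer size, field layout, and printing details are assumptions for illustration, not the actual attached code.)

    /* hello_2_mpi.c -- sketch of the master (first app context of the MPMD launch) */
    #include <stdio.h>
    #include <string.h>
    #include <mpi.h>

    #define MSG_TAG 3                       /* assumed "message type" */

    int main(int argc, char *argv[])
    {
        int  rank, size, namelen, i;
        char name[MPI_MAX_PROCESSOR_NAME], msg[256];
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(name, &namelen);
        printf("Process %d of %d running on %s\n", rank, size, name);

        /* all other ranks in the MPMD launch run the slave executable */
        printf("\nNow %d slave tasks are sending greetings.\n", size - 1);
        for (i = 1; i < size; ++i) {
            MPI_Recv(msg, sizeof(msg), MPI_CHAR, MPI_ANY_SOURCE, MSG_TAG,
                     MPI_COMM_WORLD, &status);
            printf("\nGreetings from task %d:\n  message type:      %d\n"
                   "  msg length:        %d characters\n  message:\n%s",
                   status.MPI_SOURCE, status.MPI_TAG, (int)strlen(msg), msg);
        }
        MPI_Finalize();
        return 0;
    }

    /* hello_2_slave_mpi.c -- sketch of the slave (remaining app contexts) */
    #include <stdio.h>
    #include <string.h>
    #include <sys/utsname.h>
    #include <mpi.h>

    #define MSG_TAG 3

    int main(int argc, char *argv[])
    {
        int  rank, size, namelen;
        char name[MPI_MAX_PROCESSOR_NAME], msg[256];
        struct utsname un;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(name, &namelen);
        printf("Process %d of %d running on %s\n", rank, size, name);

        uname(&un);                         /* hostname, OS, release, processor */
        snprintf(msg, sizeof(msg),
                 "    hostname:          %s\n"
                 "    operating system:  %s\n"
                 "    release:           %s\n"
                 "    processor:         %s\n",
                 un.nodename, un.sysname, un.release, un.machine);
        MPI_Send(msg, (int)strlen(msg) + 1, MPI_CHAR, 0, MSG_TAG, MPI_COMM_WORLD);
        MPI_Finalize();
        return 0;
    }

Built with mpicc and launched exactly as in the transcripts (e.g. mpiexec -np 1 --host loki hello_2_mpi : -np 2 --host loki hello_2_slave_mpi), rank 0 runs the master and every remaining rank runs the slave, which is why the reported failures show up in MPI_Init of hello_2_slave_mpi.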
>>
>>
>> Hopefully you know what happens and why it happens, so that
>> you can fix the problem for openmpi-1.10.x and openmpi-2.x.
>> My three spawn programs work with openmpi-master as well,
>> while "spawn_master" breaks on both openmpi-1.10.x and
>> openmpi-2.x with the same failure as my hello master/slave
>> program.
>>
>> Do you know when the Java problem will be solved?
>>
>>
>> Kind regards
>>
>> Siegmar
>>
>>
>>
>> On 15.05.2016 at 01:27, Ralph Castain wrote:
>>>
>>>> On May 7, 2016, at 1:13 AM, Siegmar Gross
>>>> <siegmar.gr...@informatik.hs-fulda.de> wrote:
>>>>
>>>> Hi,
>>>>
>>>> yesterday I installed openmpi-v1.10.2-176-g9d45e07 on my "SUSE Linux
>>>> Enterprise Server 12 (x86_64)" with Sun C 5.13 and gcc-5.3.0. The
>>>> following programs don't run anymore.
>>>>
>>>>
>>>> loki hello_2 112 ompi_info | grep -e "OPAL repo revision" -e "C compiler absolute"
>>>> OPAL repo revision: v1.10.2-176-g9d45e07
>>>> C compiler absolute: /opt/solstudio12.4/bin/cc
>>>> loki hello_2 113 mpiexec -np 1 --host loki hello_2_mpi : -np 2 --host loki,loki hello_2_slave_mpi
>>>> --------------------------------------------------------------------------
>>>> There are not enough slots available in the system to satisfy the 2 slots
>>>> that were requested by the application:
>>>>   hello_2_slave_mpi
>>>>
>>>> Either request fewer slots for your application, or make more slots available
>>>> for use.
>>>> --------------------------------------------------------------------------
>>>> loki hello_2 114
>>>>
>>>
>>> The above worked fine for me with:
>>>
>>>   OPAL repo revision: v1.10.2-182-g52c7573
>>>
>>> You might try updating.
>>>
>>>>
>>>>
>>>> Everything worked as expected with openmpi-v1.10.0-178-gb80f802.
>>>>
>>>> loki hello_2 114 ompi_info | grep -e "OPAL repo revision" -e "C compiler absolute"
>>>> OPAL repo revision: v1.10.0-178-gb80f802
>>>> C compiler absolute: /opt/solstudio12.4/bin/cc
>>>> loki hello_2 115 mpiexec -np 1 --host loki hello_2_mpi : -np 2 --host loki,loki hello_2_slave_mpi
>>>> Process 0 of 3 running on loki
>>>> Process 1 of 3 running on loki
>>>> Process 2 of 3 running on loki
>>>>
>>>> Now 2 slave tasks are sending greetings.
>>>>
>>>> Greetings from task 2:
>>>>   message type:      3
>>>> ...
>>>>
>>>>
>>>> I have the same problem with openmpi-v2.x-dev-1404-g74d8ea0 if I use
>>>> the following commands.
>>>>
>>>> mpiexec -np 1 --host loki hello_2_mpi : -np 2 --host loki,loki hello_2_slave_mpi
>>>> mpiexec -np 1 --host loki hello_2_mpi : -np 2 --host loki,nfs1 hello_2_slave_mpi
>>>> mpiexec -np 1 --host loki hello_2_mpi : -np 2 --host loki --slot-list 0:0-5,1:0-5 hello_2_slave_mpi
>>>>
>>>>
>>>> I also have the same problem with openmpi-dev-4010-g6c9d65c if I use
>>>> the following command.
>>>>
>>>> mpiexec -np 1 --host loki hello_2_mpi : -np 2 --host loki,loki hello_2_slave_mpi
>>>>
>>>>
>>>> openmpi-dev-4010-g6c9d65c works as expected with the following commands.
>>>>
>>>> mpiexec -np 1 --host loki hello_2_mpi : -np 2 --host loki,nfs1 hello_2_slave_mpi
>>>> mpiexec -np 1 --host loki hello_2_mpi : -np 2 --host loki --slot-list 0:0-5,1:0-5 hello_2_slave_mpi
>>>>
>>>>
>>>> Has the interface changed, so that I'm no longer allowed to use some
>>>> of my commands? I would be grateful if somebody could fix the problem,
>>>> if it is a problem. Thank you very much for any help in advance.
>>>>
>>>>
>>>> Kind regards
>>>>
>>>> Siegmar
>>>>
>>>> <hello_2_mpi.c><hello_2_slave_mpi.c>
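(Siegmar's spawn tests, including the "spawn_master" program he mentions above, are likewise not attached here. For context, a spawn-style master typically looks something like the sketch below; the executable name "spawn_slave", the child count, and the overall structure are assumptions made for illustration, not the actual test code.)

    /* Rough sketch of a spawn-style master; names and counts are assumptions. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        MPI_Comm intercomm;
        int rank, nchildren = 3;            /* assumed number of children */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* The parent asks the runtime to start the slave executable; the
         * children then go through MPI_Init themselves, which is where the
         * report above says the same failure as in the hello_2 runs shows
         * up on the broken 1.10.x/2.x snapshots. */
        MPI_Comm_spawn("spawn_slave", MPI_ARGV_NULL, nchildren, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);

        if (rank == 0)
            printf("Parent spawned %d child processes.\n", nchildren);

        /* MPI_Finalize is collective over all connected processes, so the
         * parent and the spawned children synchronize here. */
        MPI_Finalize();
        return 0;
    }

The spawned side would call MPI_Init and MPI_Comm_get_parent; per the report above, it is MPI_Init in the secondary processes that fails on the affected openmpi-1.10.x and openmpi-2.x snapshots.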