> On May 24, 2016, at 6:21 AM, Siegmar Gross
> <siegmar.gr...@informatik.hs-fulda.de> wrote:
>
> Hi Ralph,
>
> I copy the relevant lines to this place, so that it is easier to see what
> happens. "a.out" is your program, which I compiled with mpicc.
>
> >> loki spawn 153 ompi_info | grep -e "OPAL repo revision:" -e "C compiler
> >> absolute:"
> >> OPAL repo revision: v1.10.2-201-gd23dda8
> >> C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc
> >> loki spawn 154 mpicc simple_spawn.c
>
> >> loki spawn 155 mpiexec -np 1 a.out
> >> [pid 24008] starting up!
> >> 0 completed MPI_Init
> ...
>
> "mpiexec -np 1 a.out" works.
>
>
> > I don’t know what “a.out” is, but it looks like there is some memory
> > corruption there.
>
> "a.out" is still your program. I get the same error on different
> machines, so it is not very likely that the (hardware) memory
> is corrupted.
>
>
> >> loki spawn 156 mpiexec -np 1 --host loki --slot-list 0-5 a.out
> >> [pid 24102] starting up!
> >> 0 completed MPI_Init
> >> Parent [pid 24102] about to spawn!
> >> [pid 24104] starting up!
> >> [pid 24105] starting up!
> >> [loki:24105] *** Process received signal ***
> >> [loki:24105] Signal: Segmentation fault (11)
> >> [loki:24105] Signal code: Address not mapped (1)
> ...
>
> "mpiexec -np 1 --host loki --slot-list 0-5 a.out" breaks with a segmentation
> fault. Can I do something so that you can find out what happens?
I honestly have no idea - perhaps Gilles can help, as I have no access to that
kind of environment. We aren’t seeing such problems elsewhere, so it is likely
something local.

>
> Kind regards
>
> Siegmar
>
>
> On 05/24/16 15:07, Ralph Castain wrote:
>>
>>> On May 24, 2016, at 4:19 AM, Siegmar Gross
>>> <siegmar.gr...@informatik.hs-fulda.de
>>> <mailto:siegmar.gr...@informatik.hs-fulda.de>> wrote:
>>>
>>> Hi Ralph,
>>>
>>> thank you very much for your answer and your example program.
>>>
>>> On 05/23/16 17:45, Ralph Castain wrote:
>>>> I cannot replicate the problem - both scenarios work fine for me. I’m not
>>>> convinced your test code is correct, however, as you call Comm_free on the
>>>> inter-communicator but didn’t call Comm_disconnect. Check out the attached
>>>> for a correct code and see if it works for you.
>>>
>>> I thought that I only need MPI_Comm_disconnect if I had established a
>>> connection with MPI_Comm_connect before. The man page for MPI_Comm_free
>>> states
>>>
>>> "This operation marks the communicator object for deallocation. The
>>> handle is set to MPI_COMM_NULL. Any pending operations that use this
>>> communicator will complete normally; the object is actually deallocated
>>> only if there are no other active references to it."
>>>
>>> The man page for MPI_Comm_disconnect states
>>>
>>> "MPI_Comm_disconnect waits for all pending communication on comm to
>>> complete internally, deallocates the communicator object, and sets the
>>> handle to MPI_COMM_NULL. It is a collective operation."
>>>
>>> I don't see a difference for my spawned processes, because both functions
>>> will "wait" until all pending operations have finished before the object
>>> is destroyed. Nevertheless, perhaps my small example program has worked
>>> all these years by chance.
>>>
>>> However, I don't understand why my program works with
>>> "mpiexec -np 1 --host loki,loki,loki,loki,loki spawn_master" and breaks with
>>> "mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master". You are
>>> right, my slot-list is equivalent to "-bind-to none". I could also have
>>> used "mpiexec -np 1 --host loki --oversubscribe spawn_master", which works
>>> as well.
>>
>> Well, you are only giving us one slot when you specify "-host loki", and
>> then you are trying to launch multiple processes into it. The "slot-list"
>> option only tells us which cpus to bind each process to - it doesn’t
>> allocate process slots. So you have to tell us how many processes are
>> allowed to run on this node.
>>
>>>
>>> The program breaks with "There are not enough slots available in the system
>>> to satisfy ..." if I only use "--host loki" or different host names,
>>> without mentioning five host names, using "slot-list", or using
>>> "oversubscribe". Unfortunately "--host <host name>:<number of slots>" isn't
>>> available in openmpi-1.10.3rc2 to specify the number of available slots.
>>
>> Correct - we did not backport the new syntax
>>
>>>
>>> Your program behaves the same way as mine, so MPI_Comm_disconnect
>>> will not solve my problem. I had to modify your program in a negligible way
>>> to get it compiled.
>>>
>>> loki spawn 153 ompi_info | grep -e "OPAL repo revision:" -e "C compiler
>>> absolute:"
>>> OPAL repo revision: v1.10.2-201-gd23dda8
>>> C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc
>>> loki spawn 154 mpicc simple_spawn.c
>>> loki spawn 155 mpiexec -np 1 a.out
>>> [pid 24008] starting up!
>>> 0 completed MPI_Init
>>> Parent [pid 24008] about to spawn!
>>> [pid 24010] starting up!
>>> [pid 24011] starting up!
>>> [pid 24012] starting up!
>>> Parent done with spawn
>>> Parent sending message to child
>>> 0 completed MPI_Init
>>> Hello from the child 0 of 3 on host loki pid 24010
>>> 1 completed MPI_Init
>>> Hello from the child 1 of 3 on host loki pid 24011
>>> 2 completed MPI_Init
>>> Hello from the child 2 of 3 on host loki pid 24012
>>> Child 0 received msg: 38
>>> Child 0 disconnected
>>> Child 1 disconnected
>>> Child 2 disconnected
>>> Parent disconnected
>>> 24012: exiting
>>> 24010: exiting
>>> 24008: exiting
>>> 24011: exiting
>>>
>>>
>>> Is something wrong with my command line? I didn't use slot-list before,
>>> so I'm not sure whether I use it in the intended way.
>>
>> I don’t know what “a.out” is, but it looks like there is some memory
>> corruption there.
>>
>>>
>>> loki spawn 156 mpiexec -np 1 --host loki --slot-list 0-5 a.out
>>> [pid 24102] starting up!
>>> 0 completed MPI_Init
>>> Parent [pid 24102] about to spawn!
>>> [pid 24104] starting up!
>>> [pid 24105] starting up!
>>> [loki:24105] *** Process received signal ***
>>> [loki:24105] Signal: Segmentation fault (11)
>>> [loki:24105] Signal code: Address not mapped (1)
>>> [loki:24105] Failing at address: 0x8
>>> [loki:24105] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7f39aa76f870]
>>> [loki:24105] [ 1]
>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_proc_self+0x35)[0x7f39aa9d25b0]
>>> [loki:24105] [ 2]
>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_comm_init+0x68b)[0x7f39aa9b1b08]
>>> [loki:24105] [ 3] *** An error occurred in MPI_Init
>>> *** on a NULL communicator
>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>> *** and potentially your MPI job)
>>> [loki:24104] Local abort before MPI_INIT completed successfully; not able
>>> to aggregate error messages, and not able to guarantee that all other
>>> processes were killed!
>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_mpi_init+0xa90)[0x7f39aa9d7e8a]
>>> [loki:24105] [ 4]
>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(MPI_Init+0x1a0)[0x7f39aaa142ae]
>>> [loki:24105] [ 5] a.out[0x400d0c]
>>> [loki:24105] [ 6] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f39aa3d9b05]
>>> [loki:24105] [ 7] a.out[0x400bf9]
>>> [loki:24105] *** End of error message ***
>>> -------------------------------------------------------
>>> Child job 2 terminated normally, but 1 process returned
>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>> -------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> mpiexec detected that one or more processes exited with non-zero status,
>>> thus causing the job to be terminated. The first process to do so was:
>>>
>>> Process name: [[49560,2],0]
>>> Exit code: 1
>>> --------------------------------------------------------------------------
>>> loki spawn 157
>>>
>>>
>>> Hopefully, you will find out what happens. Please let me know if I can
>>> help you in any way.
>>>
>>> Kind regards
>>>
>>> Siegmar
>>>
>>>
>>>> FWIW: I don’t know how many cores you have on your sockets, but if you
>>>> have 6 cores/socket, then your slot-list is equivalent to "--bind-to none",
>>>> as the slot-list applies to every process being launched
>>>>
>>>>
>>>>> On May 23, 2016, at 6:26 AM, Siegmar Gross
>>>>> <siegmar.gr...@informatik.hs-fulda.de
>>>>> <mailto:siegmar.gr...@informatik.hs-fulda.de>> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I installed openmpi-1.10.3rc2 on my "SUSE Linux Enterprise Server
>>>>> 12 (x86_64)" with Sun C 5.13 and gcc-6.1.0. Unfortunately I get
>>>>> a segmentation fault with "--slot-list" for one of my small programs.
>>>>>
>>>>>
>>>>> loki spawn 119 ompi_info | grep -e "OPAL repo revision:" -e "C compiler
>>>>> absolute:"
>>>>> OPAL repo revision: v1.10.2-201-gd23dda8
>>>>> C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc
>>>>>
>>>>>
>>>>> loki spawn 120 mpiexec -np 1 --host loki,loki,loki,loki,loki spawn_master
>>>>>
>>>>> Parent process 0 running on loki
>>>>> I create 4 slave processes
>>>>>
>>>>> Parent process 0: tasks in MPI_COMM_WORLD: 1
>>>>> tasks in COMM_CHILD_PROCESSES local group: 1
>>>>> tasks in COMM_CHILD_PROCESSES remote group: 4
>>>>>
>>>>> Slave process 0 of 4 running on loki
>>>>> Slave process 1 of 4 running on loki
>>>>> Slave process 2 of 4 running on loki
>>>>> spawn_slave 2: argv[0]: spawn_slave
>>>>> Slave process 3 of 4 running on loki
>>>>> spawn_slave 0: argv[0]: spawn_slave
>>>>> spawn_slave 1: argv[0]: spawn_slave
>>>>> spawn_slave 3: argv[0]: spawn_slave
>>>>>
>>>>>
>>>>> loki spawn 121 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5
>>>>> spawn_master
>>>>>
>>>>> Parent process 0 running on loki
>>>>> I create 4 slave processes
>>>>>
>>>>> [loki:17326] *** Process received signal ***
>>>>> [loki:17326] Signal: Segmentation fault (11)
>>>>> [loki:17326] Signal code: Address not mapped (1)
>>>>> [loki:17326] Failing at address: 0x8
>>>>> [loki:17326] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7f4e469b3870]
>>>>> [loki:17326] [ 1] *** An error occurred in MPI_Init
>>>>> *** on a NULL communicator
>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>> *** and potentially your MPI job)
>>>>> [loki:17324] Local abort before MPI_INIT completed successfully; not able
>>>>> to aggregate error messages, and not able to guarantee that all other
>>>>> processes were killed!
>>>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_proc_self+0x35)[0x7f4e46c165b0]
>>>>> [loki:17326] [ 2]
>>>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_comm_init+0x68b)[0x7f4e46bf5b08]
>>>>> [loki:17326] [ 3] *** An error occurred in MPI_Init
>>>>> *** on a NULL communicator
>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>> *** and potentially your MPI job)
>>>>> [loki:17325] Local abort before MPI_INIT completed successfully; not able
>>>>> to aggregate error messages, and not able to guarantee that all other
>>>>> processes were killed!
>>>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_mpi_init+0xa90)[0x7f4e46c1be8a]
>>>>> [loki:17326] [ 4]
>>>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(MPI_Init+0x180)[0x7f4e46c5828e]
>>>>> [loki:17326] [ 5] spawn_slave[0x40097e]
>>>>> [loki:17326] [ 6] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f4e4661db05]
>>>>> [loki:17326] [ 7] spawn_slave[0x400a54]
>>>>> [loki:17326] *** End of error message ***
>>>>> -------------------------------------------------------
>>>>> Child job 2 terminated normally, but 1 process returned
>>>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>>>> -------------------------------------------------------
>>>>> --------------------------------------------------------------------------
>>>>> mpiexec detected that one or more processes exited with non-zero status,
>>>>> thus causing the job to be terminated. The first process to do so was:
>>>>>
>>>>> Process name: [[56340,2],0]
>>>>> Exit code: 1
>>>>> --------------------------------------------------------------------------
>>>>> loki spawn 122
>>>>>
>>>>>
>>>>> I would be grateful if somebody can fix the problem. Thank you
>>>>> very much for any help in advance.
>>>>>
>>>>>
>>>>> Kind regards
>>>>>
>>>>> Siegmar
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> us...@open-mpi.org <mailto:us...@open-mpi.org>
>>>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>> Link to this post:
>>>>> http://www.open-mpi.org/community/lists/users/2016/05/29281.php
>>>>
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org <mailto:us...@open-mpi.org>
>>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>> Link to this post:
>>>> http://www.open-mpi.org/community/lists/users/2016/05/29284.php
>>>>
>>> <simple_spawn_modified.c>_______________________________________________
>>> users mailing list
>>> us...@open-mpi.org <mailto:us...@open-mpi.org>
>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/users/2016/05/29300.php
>>
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2016/05/29301.php
>>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/05/29304.php
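
For readers of the archive, the pattern being compared in this thread
(calling MPI_Comm_disconnect rather than MPI_Comm_free on the
inter-communicator returned by MPI_Comm_spawn) looks roughly like the sketch
below. This is only a minimal illustration, not Ralph's attached
simple_spawn.c nor Siegmar's spawn_master: the file name, the message value,
and the mpiexec line in the comment are assumptions made for the example.

/* spawn_disconnect.c - minimal sketch of the spawn/disconnect pattern
 * discussed in this thread (illustrative only, not the attached program).
 *
 * Assumed build/run:
 *   mpicc spawn_disconnect.c -o spawn_disconnect
 *   mpiexec -np 1 --host loki,loki,loki,loki spawn_disconnect
 * With the 1.10 series, repeating the host supplies one slot per occurrence;
 * "--oversubscribe" would work as well, as Ralph explains above.
 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm parent, intercomm;
    int rank, msg;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* Parent: spawn three children running this same binary. */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 3, MPI_INFO_NULL, 0,
                       MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
        msg = 38;                                    /* arbitrary payload */
        MPI_Send(&msg, 1, MPI_INT, 0, 0, intercomm); /* to child rank 0 */
        /* Collective over both groups: waits for pending communication,
         * then releases the inter-communicator. */
        MPI_Comm_disconnect(&intercomm);
    } else {
        /* Child: rank 0 receives the parent's message. */
        if (rank == 0) {
            MPI_Recv(&msg, 1, MPI_INT, 0, 0, parent, MPI_STATUS_IGNORE);
            printf("child %d received msg: %d\n", rank, msg);
        }
        MPI_Comm_disconnect(&parent);
    }

    MPI_Finalize();
    return 0;
}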