Hi,
I've successfully built openmpi-v1.10.1-140-g31ff573 on my machine
(SUSE Linux Enterprise Server 12.0 x86_64) with gcc-5.2.0 and
Sun C 5.13. Unfortunately, I get a runtime error for a small
program that spawns processes. Everything works as expected with my
programs "spawn_multiple_master" and "spawn_intra_comm". It doesn't
matter whether I use the cc or the gcc version of Open MPI.
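For context, the failing program is roughly of this shape — a minimal sketch, reconstructed from the output below, of a master that uses MPI_Comm_spawn to launch copies of a "spawn_slave" binary. The actual spawn_master source is not shown here, so the details (argument handling, error checking) are assumptions:

```c
/* spawn_master.c -- hypothetical minimal reconstruction of the failing test.
 * Build:  mpicc spawn_master.c -o spawn_master
 * Run:    mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
 */
#include <stdio.h>
#include <mpi.h>

#define NUM_SLAVES 4   /* matches "I create 4 slave processes" below */

int main(int argc, char *argv[])
{
    int      rank, len;
    char     name[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm child_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(name, &len);
    printf("Parent process %d running on %s\n", rank, name);
    printf("I create %d slave processes\n", NUM_SLAVES);

    /* In the run below, the spawned slaves segfault inside MPI_Init
     * (ompi_proc_self) during this call. */
    MPI_Comm_spawn("spawn_slave", MPI_ARGV_NULL, NUM_SLAVES,
                   MPI_INFO_NULL, 0, MPI_COMM_WORLD,
                   &child_comm, MPI_ERRCODES_IGNORE);

    MPI_Comm_free(&child_comm);
    MPI_Finalize();
    return 0;
}
```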
loki spawn 136 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
Parent process 0 running on loki
I create 4 slave processes
[loki:18287] *** Process received signal ***
[loki:18287] Signal: Segmentation fault (11)
[loki:18287] Signal code: Address not mapped (1)
[loki:18287] Failing at address: 0x8
[loki:18287] [ 0] /lib64/libpthread.so.0(+0xf890)[0x7fd2c9a9a890]
[loki:18287] [ 1] /usr/local/openmpi-1.10.2_64_gcc/lib64/libmpi.so.12(ompi_proc_self+0x35)[0x7fd2c9cfd53a]
[loki:18287] [ 2] *** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[loki:18285] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
/usr/local/openmpi-1.10.2_64_gcc/lib64/libmpi.so.12(ompi_comm_init+0x68b)[0x7fd2c9cdcadd]
[loki:18287] [ 3] /usr/local/openmpi-1.10.2_64_gcc/lib64/libmpi.so.12(ompi_mpi_init+0xa69)[0x7fd2c9d02ddb]
[loki:18287] [ 4] *** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[loki:18286] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
/usr/local/openmpi-1.10.2_64_gcc/lib64/libmpi.so.12(MPI_Init+0x180)[0x7fd2c9d3f0ac]
[loki:18287] [ 5] spawn_slave[0x40097e]
[loki:18287] [ 6] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fd2c9705b05]
[loki:18287] [ 7] spawn_slave[0x400a54]
[loki:18287] *** End of error message ***
-------------------------------------------------------
Child job 2 terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus
causing the job to be terminated. The first process to do so was:
Process name: [[55509,2],0]
Exit code: 1
--------------------------------------------------------------------------
loki spawn 136
loki spawn 136 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_multiple_master
Parent process 0 running on loki
I create 3 slave processes.
Parent process 0: tasks in MPI_COMM_WORLD: 1
tasks in COMM_CHILD_PROCESSES local group: 1
tasks in COMM_CHILD_PROCESSES remote group: 2
Slave process 1 of 2 running on loki
spawn_slave 1: argv[0]: spawn_slave
spawn_slave 1: argv[1]: program type 2
spawn_slave 1: argv[2]: another parameter
Slave process 0 of 2 running on loki
spawn_slave 0: argv[0]: spawn_slave
spawn_slave 0: argv[1]: program type 1
loki spawn 137 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_intra_comm
Parent process 0: I create 2 slave processes
Parent process 0 running on loki
MPI_COMM_WORLD ntasks: 1
COMM_CHILD_PROCESSES ntasks_local: 1
COMM_CHILD_PROCESSES ntasks_remote: 1
COMM_ALL_PROCESSES ntasks: 2
mytid in COMM_ALL_PROCESSES: 0
Child process 0 running on loki
MPI_COMM_WORLD ntasks: 1
COMM_ALL_PROCESSES ntasks: 2
mytid in COMM_ALL_PROCESSES: 1
loki spawn 138
I would be grateful if somebody could fix this problem. Please let me
know if you need anything else. Thank you very much in advance for
any help.
Best regards
Siegmar