Hi,

I've successfully built openmpi-v1.10.1-140-g31ff573 on my machine
(SUSE Linux Enterprise Server 12.0 x86_64) with gcc-5.2.0 and
Sun C 5.13. Unfortunately, I get a runtime error for a small
program that spawns processes. Everything works as expected with my
programs "spawn_multiple_master" and "spawn_intra_comm". It doesn't
matter whether I use the cc or the gcc build of Open MPI.
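For reference, the failing case is equivalent to a minimal MPI_Comm_spawn reproducer along the following lines (this is only a sketch, not my exact source; the slave executable name "spawn_slave" and the slave count are placeholders):

```c
/* Minimal MPI_Comm_spawn reproducer (sketch).
 * Build:  mpicc -o spawn_master spawn_master.c
 * Run:    mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
 */
#include <stdio.h>
#include <mpi.h>

#define NUM_SLAVES 4   /* placeholder slave count */

int main(int argc, char *argv[])
{
  int      rank, len;
  char     host[MPI_MAX_PROCESSOR_NAME];
  MPI_Comm intercomm;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Get_processor_name(host, &len);

  if (rank == 0) {
    printf("Parent process %d running on %s\n", rank, host);
    printf("  I create %d slave processes\n", NUM_SLAVES);
  }

  /* The segfault shown below occurs while the spawned children
   * are still inside MPI_Init (ompi_proc_self in the backtrace). */
  MPI_Comm_spawn("spawn_slave", MPI_ARGV_NULL, NUM_SLAVES,
                 MPI_INFO_NULL, 0, MPI_COMM_WORLD, &intercomm,
                 MPI_ERRCODES_IGNORE);

  MPI_Comm_free(&intercomm);
  MPI_Finalize();
  return 0;
}
```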


loki spawn 136 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master

Parent process 0 running on loki
  I create 4 slave processes

[loki:18287] *** Process received signal ***
[loki:18287] Signal: Segmentation fault (11)
[loki:18287] Signal code: Address not mapped (1)
[loki:18287] Failing at address: 0x8
[loki:18287] [ 0] /lib64/libpthread.so.0(+0xf890)[0x7fd2c9a9a890]
[loki:18287] [ 1] 
/usr/local/openmpi-1.10.2_64_gcc/lib64/libmpi.so.12(ompi_proc_self+0x35)[0x7fd2c9cfd53a]
[loki:18287] [ 2] *** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[loki:18285] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
/usr/local/openmpi-1.10.2_64_gcc/lib64/libmpi.so.12(ompi_comm_init+0x68b)[0x7fd2c9cdcadd]
[loki:18287] [ 3] 
/usr/local/openmpi-1.10.2_64_gcc/lib64/libmpi.so.12(ompi_mpi_init+0xa69)[0x7fd2c9d02ddb]
[loki:18287] [ 4] *** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[loki:18286] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
/usr/local/openmpi-1.10.2_64_gcc/lib64/libmpi.so.12(MPI_Init+0x180)[0x7fd2c9d3f0ac]
[loki:18287] [ 5] spawn_slave[0x40097e]
[loki:18287] [ 6] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fd2c9705b05]
[loki:18287] [ 7] spawn_slave[0x400a54]
[loki:18287] *** End of error message ***
-------------------------------------------------------
Child job 2 terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[55509,2],0]
  Exit code:    1
--------------------------------------------------------------------------
loki spawn 136




loki spawn 136 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_multiple_master

Parent process 0 running on loki
  I create 3 slave processes.

Parent process 0: tasks in MPI_COMM_WORLD:                    1
                  tasks in COMM_CHILD_PROCESSES local group:  1
                  tasks in COMM_CHILD_PROCESSES remote group: 2

Slave process 1 of 2 running on loki
spawn_slave 1: argv[0]: spawn_slave
spawn_slave 1: argv[1]: program type 2
spawn_slave 1: argv[2]: another parameter
Slave process 0 of 2 running on loki
spawn_slave 0: argv[0]: spawn_slave
spawn_slave 0: argv[1]: program type 1


loki spawn 137 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_intra_comm
Parent process 0: I create 2 slave processes

Parent process 0 running on loki
    MPI_COMM_WORLD ntasks:              1
    COMM_CHILD_PROCESSES ntasks_local:  1
    COMM_CHILD_PROCESSES ntasks_remote: 1
    COMM_ALL_PROCESSES ntasks:          2
    mytid in COMM_ALL_PROCESSES:        0

Child process 0 running on loki
    MPI_COMM_WORLD ntasks:              1
    COMM_ALL_PROCESSES ntasks:          2
    mytid in COMM_ALL_PROCESSES:        1
loki spawn 138



I would be grateful if somebody could fix the problem. Please let me
know if you need anything else. Thank you very much in advance for
any help.


Best regards

Siegmar
