I’m afraid I honestly can’t make any sense of it. It seems you at least have a 
simple workaround (use a hostfile instead of -host), yes?
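
For example, something along the lines of what already worked for you on pc03 (the hostfile name here is just a placeholder; adjust the slot counts to match loki):

  $ cat host_loki.openmpi
  loki slots=12 max_slots=12
  $ mpiexec -np 1 --hostfile host_loki.openmpi --slot-list 0:0-1,1:0-1 simple_spawn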


> On May 26, 2016, at 5:48 AM, Siegmar Gross 
> <siegmar.gr...@informatik.hs-fulda.de> wrote:
> 
> Hi Ralph and Gilles,
> 
> It's strange that the program works with "--host" and "--slot-list"
> in your environment but not in mine. I get the following output if
> I run the program in gdb without a breakpoint.
> 
> 
> loki spawn 142 gdb /usr/local/openmpi-1.10.3_64_gcc/bin/mpiexec
> GNU gdb (GDB; SUSE Linux Enterprise 12) 7.9.1
> ...
> (gdb) set args -np 1 --host loki --slot-list 0:0-1,1:0-1 simple_spawn
> (gdb) run
> Starting program: /usr/local/openmpi-1.10.3_64_gcc/bin/mpiexec -np 1 --host 
> loki --slot-list 0:0-1,1:0-1 simple_spawn
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> Detaching after fork from child process 18031.
> [pid 18031] starting up!
> 0 completed MPI_Init
> Parent [pid 18031] about to spawn!
> Detaching after fork from child process 18033.
> Detaching after fork from child process 18034.
> [pid 18033] starting up!
> [pid 18034] starting up!
> [loki:18034] *** Process received signal ***
> [loki:18034] Signal: Segmentation fault (11)
> ...
> 
> 
> 
> I get different output if I run the program in gdb with
> a breakpoint.
> 
> gdb /usr/local/openmpi-1.10.3_64_gcc/bin/mpiexec
> (gdb) set args -np 1 --host loki --slot-list 0:0-1,1:0-1 simple_spawn
> (gdb) set follow-fork-mode child
> (gdb) break ompi_proc_self
> (gdb) run
> (gdb) next
> 
> Repeating "next" many times results in the following output.
> 
> ...
> Starting program: 
> /home/fd1026/work/skripte/master/parallel/prog/mpi/spawn/simple_spawn
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> [pid 13277] starting up!
> [New Thread 0x7ffff42ef700 (LWP 13289)]
> 
> Breakpoint 1, ompi_proc_self (size=0x7fffffffc060)
>    at ../../openmpi-1.10.3rc3/ompi/proc/proc.c:413
> 413         ompi_proc_t **procs = (ompi_proc_t**) 
> malloc(sizeof(ompi_proc_t*));
> (gdb) n
> 414         if (NULL == procs) {
> (gdb)
> 423         OBJ_RETAIN(ompi_proc_local_proc);
> (gdb)
> 424         *procs = ompi_proc_local_proc;
> (gdb)
> 425         *size = 1;
> (gdb)
> 426         return procs;
> (gdb)
> 427     }
> (gdb)
> ompi_comm_init () at ../../openmpi-1.10.3rc3/ompi/communicator/comm_init.c:138
> 138         group->grp_my_rank       = 0;
> (gdb)
> 139         group->grp_proc_count    = (int)size;
> ...
> 193         ompi_comm_reg_init();
> (gdb)
> 196         ompi_comm_request_init ();
> (gdb)
> 198         return OMPI_SUCCESS;
> (gdb)
> 199     }
> (gdb)
> ompi_mpi_init (argc=0, argv=0x0, requested=0, provided=0x7fffffffc21c)
>    at ../../openmpi-1.10.3rc3/ompi/runtime/ompi_mpi_init.c:738
> 738         if (OMPI_SUCCESS != (ret = ompi_file_init())) {
> (gdb)
> 744         if (OMPI_SUCCESS != (ret = ompi_win_init())) {
> (gdb)
> 750         if (OMPI_SUCCESS != (ret = ompi_attr_init())) {
> ...
> 988         ompi_mpi_initialized = true;
> (gdb)
> 991         if (ompi_enable_timing && 0 == OMPI_PROC_MY_NAME->vpid) {
> (gdb)
> 999         return MPI_SUCCESS;
> (gdb)
> 1000    }
> (gdb)
> PMPI_Init (argc=0x0, argv=0x0) at pinit.c:94
> 94          if (MPI_SUCCESS != err) {
> (gdb)
> 104         return MPI_SUCCESS;
> (gdb)
> 105     }
> (gdb)
> 0x0000000000400d0c in main ()
> (gdb)
> Single stepping until exit from function main,
> which has no line number information.
> 0 completed MPI_Init
> Parent [pid 13277] about to spawn!
> [New process 13472]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> process 13472 is executing new program: 
> /usr/local/openmpi-1.10.3_64_gcc/bin/orted
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> [New process 13474]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> process 13474 is executing new program: 
> /home/fd1026/work/skripte/master/parallel/prog/mpi/spawn/simple_spawn
> [pid 13475] starting up!
> [pid 13476] starting up!
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> [pid 13474] starting up!
> [New Thread 0x7ffff491b700 (LWP 13480)]
> [Switching to Thread 0x7ffff7ff1740 (LWP 13474)]
> 
> Breakpoint 1, ompi_proc_self (size=0x7fffffffba30)
>    at ../../openmpi-1.10.3rc3/ompi/proc/proc.c:413
> 413         ompi_proc_t **procs = (ompi_proc_t**) 
> malloc(sizeof(ompi_proc_t*));
> (gdb)
> 414         if (NULL == procs) {
> ...
> 426         return procs;
> (gdb)
> 427     }
> (gdb)
> ompi_comm_init () at ../../openmpi-1.10.3rc3/ompi/communicator/comm_init.c:138
> 138         group->grp_my_rank       = 0;
> (gdb)
> 139         group->grp_proc_count    = (int)size;
> (gdb)
> 140         OMPI_GROUP_SET_INTRINSIC (group);
> ...
> 193         ompi_comm_reg_init();
> (gdb)
> 196         ompi_comm_request_init ();
> (gdb)
> 198         return OMPI_SUCCESS;
> (gdb)
> 199     }
> (gdb)
> ompi_mpi_init (argc=0, argv=0x0, requested=0, provided=0x7fffffffbbec)
>    at ../../openmpi-1.10.3rc3/ompi/runtime/ompi_mpi_init.c:738
> 738         if (OMPI_SUCCESS != (ret = ompi_file_init())) {
> (gdb)
> 744         if (OMPI_SUCCESS != (ret = ompi_win_init())) {
> (gdb)
> 750         if (OMPI_SUCCESS != (ret = ompi_attr_init())) {
> ...
> 863         if (OMPI_SUCCESS != (ret = ompi_pubsub_base_select())) {
> (gdb)
> 869         if (OMPI_SUCCESS != (ret = 
> mca_base_framework_open(&ompi_dpm_base_framework, 0))) {
> (gdb)
> 873         if (OMPI_SUCCESS != (ret = ompi_dpm_base_select())) {
> (gdb)
> 884         if ( OMPI_SUCCESS !=
> (gdb)
> 894         if (OMPI_SUCCESS !=
> (gdb)
> 900         if (OMPI_SUCCESS !=
> (gdb)
> 911         if (OMPI_SUCCESS != (ret = ompi_dpm.dyn_init())) {
> (gdb)
> Parent done with spawn
> Parent sending message to child
> 2 completed MPI_Init
> Hello from the child 2 of 3 on host loki pid 13476
> 1 completed MPI_Init
> Hello from the child 1 of 3 on host loki pid 13475
> 921         if (OMPI_SUCCESS != (ret = ompi_cr_init())) {
> (gdb)
> 931         opal_progress_event_users_decrement();
> (gdb)
> 934         opal_progress_set_yield_when_idle(ompi_mpi_yield_when_idle);
> (gdb)
> 937         if (ompi_mpi_event_tick_rate >= 0) {
> (gdb)
> 946         if (OMPI_SUCCESS != (ret = ompi_mpiext_init())) {
> (gdb)
> 953         if (ret != OMPI_SUCCESS) {
> (gdb)
> 972         OBJ_CONSTRUCT(&ompi_registered_datareps, opal_list_t);
> (gdb)
> 977         OBJ_CONSTRUCT( &ompi_mpi_f90_integer_hashtable, 
> opal_hash_table_t);
> (gdb)
> 978         opal_hash_table_init(&ompi_mpi_f90_integer_hashtable, 16 /* why 
> not? */);
> (gdb)
> 980         OBJ_CONSTRUCT( &ompi_mpi_f90_real_hashtable, opal_hash_table_t);
> (gdb)
> 981         opal_hash_table_init(&ompi_mpi_f90_real_hashtable, 
> FLT_MAX_10_EXP);
> (gdb)
> 983         OBJ_CONSTRUCT( &ompi_mpi_f90_complex_hashtable, 
> opal_hash_table_t);
> (gdb)
> 984         opal_hash_table_init(&ompi_mpi_f90_complex_hashtable, 
> FLT_MAX_10_EXP);
> (gdb)
> 988         ompi_mpi_initialized = true;
> (gdb)
> 991         if (ompi_enable_timing && 0 == OMPI_PROC_MY_NAME->vpid) {
> (gdb)
> 999         return MPI_SUCCESS;
> (gdb)
> 1000    }
> (gdb)
> PMPI_Init (argc=0x0, argv=0x0) at pinit.c:94
> 94          if (MPI_SUCCESS != err) {
> (gdb)
> 104         return MPI_SUCCESS;
> (gdb)
> 105     }
> (gdb)
> 0x0000000000400d0c in main ()
> (gdb)
> Single stepping until exit from function main,
> which has no line number information.
> 0 completed MPI_Init
> Hello from the child 0 of 3 on host loki pid 13474
> 
> Child 2 disconnected
> Child 1 disconnected
> Child 0 received msg: 38
> Parent disconnected
> 13277: exiting
> 
> Program received signal SIGTERM, Terminated.
> 0x0000000000400f0a in main ()
> (gdb)
> Single stepping until exit from function main,
> which has no line number information.
> [tcsetpgrp failed in terminal_inferior: No such process]
> [Thread 0x7ffff491b700 (LWP 13480) exited]
> 
> Program terminated with signal SIGTERM, Terminated.
> The program no longer exists.
> (gdb)
> The program is not being run.
> (gdb)
> The program is not being run.
> (gdb) info break
> Num     Type           Disp Enb Address            What
> 1       breakpoint     keep y   0x00007ffff7aa35c7 in ompi_proc_self
>                                                   at 
> ../../openmpi-1.10.3rc3/ompi/proc/proc.c:413 inf 8, 7, 6, 5, 4, 3, 2, 1
>        breakpoint already hit 2 times
> (gdb) delete 1
> (gdb) r
> Starting program: 
> /home/fd1026/work/skripte/master/parallel/prog/mpi/spawn/simple_spawn
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> [pid 16708] starting up!
> 0 completed MPI_Init
> Parent [pid 16708] about to spawn!
> [New process 16720]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> process 16720 is executing new program: 
> /usr/local/openmpi-1.10.3_64_gcc/bin/orted
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> [New process 16722]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> process 16722 is executing new program: 
> /home/fd1026/work/skripte/master/parallel/prog/mpi/spawn/simple_spawn
> [pid 16723] starting up!
> [pid 16724] starting up!
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> [pid 16722] starting up!
> Parent done with spawn
> Parent sending message to child
> 1 completed MPI_Init
> Hello from the child 1 of 3 on host loki pid 16723
> 2 completed MPI_Init
> Hello from the child 2 of 3 on host loki pid 16724
> 0 completed MPI_Init
> Hello from the child 0 of 3 on host loki pid 16722
> Child 0 received msg: 38
> Child 0 disconnected
> Parent disconnected
> Child 1 disconnected
> Child 2 disconnected
> 16708: exiting
> 16724: exiting
> 16723: exiting
> [New Thread 0x7ffff491b700 (LWP 16729)]
> 
> Program received signal SIGTERM, Terminated.
> [Switching to Thread 0x7ffff7ff1740 (LWP 16722)]
> __GI__dl_debug_state () at dl-debug.c:74
> 74      dl-debug.c: No such file or directory.
> (gdb) 
> --------------------------------------------------------------------------
> WARNING: A process refused to die despite all the efforts!
> This process may still be running and/or consuming resources.
> 
> Host: loki
> PID:  16722
> 
> --------------------------------------------------------------------------
> 
> 
> The following simple_spawn processes still exist afterwards.
> 
> loki spawn 171 ps -aef | grep simple_spawn
> fd1026   11079 11053  0 14:00 pts/0    00:00:00 
> /usr/local/openmpi-1.10.3_64_gcc/bin/mpiexec -np 1 --host loki --slot-list 
> 0:0-1,1:0-1 simple_spawn
> fd1026   11095 11079 29 14:01 pts/0    00:09:37 [simple_spawn] <defunct>
> fd1026   16722     1  0 14:31 ?        00:00:00 [simple_spawn] <defunct>
> fd1026   17271 29963  0 14:33 pts/2    00:00:00 grep simple_spawn
> loki spawn 172
> 
> 
> Is it possible that there is a race condition? What can I do to help
> find a solution for my problem?
> 
> 
> Kind regards
> 
> Siegmar
> 
> On 24.05.2016 16:54, Ralph Castain wrote:
>> Works perfectly for me, so I believe this must be an environment issue - I 
>> am using gcc 6.0.0 on CentOS7 with x86:
>> 
>> $ mpirun -n 1 -host bend001 --slot-list 0:0-1,1:0-1 --report-bindings 
>> ./simple_spawn
>> [bend001:17599] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 
>> 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: 
>> [BB/BB/../../../..][BB/BB/../../../..]
>> [pid 17601] starting up!
>> 0 completed MPI_Init
>> Parent [pid 17601] about to spawn!
>> [pid 17603] starting up!
>> [bend001:17599] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 
>> 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: 
>> [BB/BB/../../../..][BB/BB/../../../..]
>> [bend001:17599] MCW rank 1 bound to socket 0[core 0[hwt 0-1]], socket 0[core 
>> 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: 
>> [BB/BB/../../../..][BB/BB/../../../..]
>> [bend001:17599] MCW rank 2 bound to socket 0[core 0[hwt 0-1]], socket 0[core 
>> 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: 
>> [BB/BB/../../../..][BB/BB/../../../..]
>> [pid 17604] starting up!
>> [pid 17605] starting up!
>> Parent done with spawn
>> Parent sending message to child
>> 0 completed MPI_Init
>> Hello from the child 0 of 3 on host bend001 pid 17603
>> Child 0 received msg: 38
>> 1 completed MPI_Init
>> Hello from the child 1 of 3 on host bend001 pid 17604
>> 2 completed MPI_Init
>> Hello from the child 2 of 3 on host bend001 pid 17605
>> Child 0 disconnected
>> Child 2 disconnected
>> Parent disconnected
>> Child 1 disconnected
>> 17603: exiting
>> 17605: exiting
>> 17601: exiting
>> 17604: exiting
>> $
>> 
>>> On May 24, 2016, at 7:18 AM, Siegmar Gross 
>>> <siegmar.gr...@informatik.hs-fulda.de> wrote:
>>> 
>>> Hi Ralph and Gilles,
>>> 
>>> The program breaks only if I combine "--host" and "--slot-list". Perhaps this
>>> information is helpful. I am using a different machine now, so that you can see
>>> that the problem is not restricted to "loki".
>>> 
>>> 
>>> pc03 spawn 115 ompi_info | grep -e "OPAL repo revision:" -e "C compiler 
>>> absolute:"
>>>     OPAL repo revision: v1.10.2-201-gd23dda8
>>>    C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc
>>> 
>>> 
>>> pc03 spawn 116 uname -a
>>> Linux pc03 3.12.55-52.42-default #1 SMP Thu Mar 3 10:35:46 UTC 2016 
>>> (4354e1d) x86_64 x86_64 x86_64 GNU/Linux
>>> 
>>> 
>>> pc03 spawn 117 cat host_pc03.openmpi
>>> pc03.informatik.hs-fulda.de slots=12 max_slots=12
>>> 
>>> 
>>> pc03 spawn 118 mpicc simple_spawn.c
>>> 
>>> 
>>> pc03 spawn 119 mpiexec -np 1 --report-bindings a.out
>>> [pc03:03711] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: 
>>> [BB/../../../../..][../../../../../..]
>>> [pid 3713] starting up!
>>> 0 completed MPI_Init
>>> Parent [pid 3713] about to spawn!
>>> [pc03:03711] MCW rank 0 bound to socket 1[core 6[hwt 0-1]], socket 1[core 
>>> 7[hwt 0-1]], socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 
>>> 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: 
>>> [../../../../../..][BB/BB/BB/BB/BB/BB]
>>> [pc03:03711] MCW rank 1 bound to socket 0[core 0[hwt 0-1]], socket 0[core 
>>> 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 
>>> 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]]: 
>>> [BB/BB/BB/BB/BB/BB][../../../../../..]
>>> [pc03:03711] MCW rank 2 bound to socket 1[core 6[hwt 0-1]], socket 1[core 
>>> 7[hwt 0-1]], socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 
>>> 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: 
>>> [../../../../../..][BB/BB/BB/BB/BB/BB]
>>> [pid 3715] starting up!
>>> [pid 3716] starting up!
>>> [pid 3717] starting up!
>>> Parent done with spawn
>>> Parent sending message to child
>>> 0 completed MPI_Init
>>> Hello from the child 0 of 3 on host pc03 pid 3715
>>> 1 completed MPI_Init
>>> Hello from the child 1 of 3 on host pc03 pid 3716
>>> 2 completed MPI_Init
>>> Hello from the child 2 of 3 on host pc03 pid 3717
>>> Child 0 received msg: 38
>>> Child 0 disconnected
>>> Child 2 disconnected
>>> Parent disconnected
>>> Child 1 disconnected
>>> 3713: exiting
>>> 3715: exiting
>>> 3716: exiting
>>> 3717: exiting
>>> 
>>> 
>>> pc03 spawn 120 mpiexec -np 1 --hostfile host_pc03.openmpi --slot-list 
>>> 0:0-1,1:0-1 --report-bindings a.out
>>> [pc03:03729] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 
>>> 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: 
>>> [BB/BB/../../../..][BB/BB/../../../..]
>>> [pid 3731] starting up!
>>> 0 completed MPI_Init
>>> Parent [pid 3731] about to spawn!
>>> [pc03:03729] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 
>>> 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: 
>>> [BB/BB/../../../..][BB/BB/../../../..]
>>> [pc03:03729] MCW rank 1 bound to socket 0[core 0[hwt 0-1]], socket 0[core 
>>> 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: 
>>> [BB/BB/../../../..][BB/BB/../../../..]
>>> [pc03:03729] MCW rank 2 bound to socket 0[core 0[hwt 0-1]], socket 0[core 
>>> 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: 
>>> [BB/BB/../../../..][BB/BB/../../../..]
>>> [pid 3733] starting up!
>>> [pid 3734] starting up!
>>> [pid 3735] starting up!
>>> Parent done with spawn
>>> Parent sending message to child
>>> 2 completed MPI_Init
>>> Hello from the child 2 of 3 on host pc03 pid 3735
>>> 1 completed MPI_Init
>>> Hello from the child 1 of 3 on host pc03 pid 3734
>>> 0 completed MPI_Init
>>> Hello from the child 0 of 3 on host pc03 pid 3733
>>> Child 0 received msg: 38
>>> Child 0 disconnected
>>> Child 2 disconnected
>>> Child 1 disconnected
>>> Parent disconnected
>>> 3731: exiting
>>> 3734: exiting
>>> 3733: exiting
>>> 3735: exiting
>>> 
>>> 
>>> pc03 spawn 121 mpiexec -np 1 --host pc03 --slot-list 0:0-1,1:0-1 
>>> --report-bindings a.out
>>> [pc03:03744] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 
>>> 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: 
>>> [BB/BB/../../../..][BB/BB/../../../..]
>>> [pid 3746] starting up!
>>> 0 completed MPI_Init
>>> Parent [pid 3746] about to spawn!
>>> [pc03:03744] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 
>>> 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: 
>>> [BB/BB/../../../..][BB/BB/../../../..]
>>> [pc03:03744] MCW rank 2 bound to socket 0[core 0[hwt 0-1]], socket 0[core 
>>> 1[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: 
>>> [BB/BB/../../../..][BB/BB/../../../..]
>>> [pid 3748] starting up!
>>> [pid 3749] starting up!
>>> [pc03:03749] *** Process received signal ***
>>> [pc03:03749] Signal: Segmentation fault (11)
>>> [pc03:03749] Signal code: Address not mapped (1)
>>> [pc03:03749] Failing at address: 0x8
>>> [pc03:03749] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7fe6f0d1f870]
>>> [pc03:03749] [ 1] 
>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_proc_self+0x35)[0x7fe6f0f825b0]
>>> [pc03:03749] [ 2] 
>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_comm_init+0x68b)[0x7fe6f0f61b08]
>>> [pc03:03749] [ 3] 
>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_mpi_init+0xa90)[0x7fe6f0f87e8a]
>>> [pc03:03749] [ 4] 
>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(MPI_Init+0x1a0)[0x7fe6f0fc42ae]
>>> [pc03:03749] [ 5] a.out[0x400d0c]
>>> [pc03:03749] [ 6] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fe6f0989b05]
>>> [pc03:03749] [ 7] a.out[0x400bf9]
>>> [pc03:03749] *** End of error message ***
>>> --------------------------------------------------------------------------
>>> mpiexec noticed that process rank 2 with PID 3749 on node pc03 exited on 
>>> signal 11 (Segmentation fault).
>>> --------------------------------------------------------------------------
>>> pc03 spawn 122
>>> 
>>> 
>>> 
>>> Kind regards
>>> 
>>> Siegmar
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On 05/24/16 15:44, Ralph Castain wrote:
>>>> 
>>>>> On May 24, 2016, at 6:21 AM, Siegmar Gross 
>>>>> <siegmar.gr...@informatik.hs-fulda.de> wrote:
>>>>> 
>>>>> Hi Ralph,
>>>>> 
>>>>> I have copied the relevant lines here, so that it is easier to see what
>>>>> happens. "a.out" is your program, which I compiled with mpicc.
>>>>> 
>>>>>>> loki spawn 153 ompi_info | grep -e "OPAL repo revision:" -e "C compiler
>>>>>>> absolute:"
>>>>>>>    OPAL repo revision: v1.10.2-201-gd23dda8
>>>>>>>   C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc
>>>>>>> loki spawn 154 mpicc simple_spawn.c
>>>>> 
>>>>>>> loki spawn 155 mpiexec -np 1 a.out
>>>>>>> [pid 24008] starting up!
>>>>>>> 0 completed MPI_Init
>>>>> ...
>>>>> 
>>>>> "mpiexec -np 1 a.out" works.
>>>>> 
>>>>> 
>>>>> 
>>>>>> I don’t know what “a.out” is, but it looks like there is some memory
>>>>>> corruption there.
>>>>> 
>>>>> "a.out" is still your program. I get the same error on different
>>>>> machines, so that it is not very likely, that the (hardware) memory
>>>>> is corrupted.
>>>>> 
>>>>> 
>>>>>>> loki spawn 156 mpiexec -np 1 --host loki --slot-list 0-5 a.out
>>>>>>> [pid 24102] starting up!
>>>>>>> 0 completed MPI_Init
>>>>>>> Parent [pid 24102] about to spawn!
>>>>>>> [pid 24104] starting up!
>>>>>>> [pid 24105] starting up!
>>>>>>> [loki:24105] *** Process received signal ***
>>>>>>> [loki:24105] Signal: Segmentation fault (11)
>>>>>>> [loki:24105] Signal code: Address not mapped (1)
>>>>> ...
>>>>> 
>>>>> "mpiexec -np 1 --host loki --slot-list 0-5 a.out" breaks with a 
>>>>> segmentation
>>>>> faUlt. Can I do something, so that you can find out, what happens?
>>>> 
>>>> I honestly have no idea - perhaps Gilles can help, as I have no access to 
>>>> that kind of environment. We aren’t seeing such problems elsewhere, so it 
>>>> is likely something local.
>>>> 
>>>>> 
>>>>> 
>>>>> Kind regards
>>>>> 
>>>>> Siegmar
>>>>> 
>>>>> 
>>>>> 
>>>>> On 05/24/16 15:07, Ralph Castain wrote:
>>>>>> 
>>>>>>> On May 24, 2016, at 4:19 AM, Siegmar Gross
>>>>>>> <siegmar.gr...@informatik.hs-fulda.de
>>>>>>> <mailto:siegmar.gr...@informatik.hs-fulda.de>> wrote:
>>>>>>> 
>>>>>>> Hi Ralph,
>>>>>>> 
>>>>>>> thank you very much for your answer and your example program.
>>>>>>> 
>>>>>>> On 05/23/16 17:45, Ralph Castain wrote:
>>>>>>>> I cannot replicate the problem - both scenarios work fine for me. I’m
>>>>>>>> not convinced your test code is correct, however, as you call Comm_free
>>>>>>>> on the inter-communicator but didn’t call Comm_disconnect. Check out the
>>>>>>>> attached for correct code and see if it works for you.
>>>>>>> 
>>>>>>> I thought that I only needed MPI_Comm_disconnect if I had established a
>>>>>>> connection with MPI_Comm_connect before. The man page for MPI_Comm_free
>>>>>>> states
>>>>>>> 
>>>>>>> "This  operation marks the communicator object for deallocation. The
>>>>>>> handle is set to MPI_COMM_NULL. Any pending operations that use this
>>>>>>> communicator will complete normally; the object is actually deallocated 
>>>>>>> only
>>>>>>> if there are no other active references to it.".
>>>>>>> 
>>>>>>> The man page for MPI_Comm_disconnect states
>>>>>>> 
>>>>>>> "MPI_Comm_disconnect waits for all pending communication on comm to 
>>>>>>> complete
>>>>>>> internally, deallocates the communicator object, and sets the handle to
>>>>>>> MPI_COMM_NULL. It is  a  collective operation.".
>>>>>>> 
>>>>>>> I don't see a difference for my spawned processes, because both functions
>>>>>>> "wait" until all pending operations have finished before the object is
>>>>>>> destroyed. Nevertheless, perhaps my small example program has worked all
>>>>>>> these years only by chance.
>>>>>>> 
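>>>>>>> For reference, here is a minimal sketch of the spawn/disconnect pattern in
>>>>>>> question (parent side only, error handling omitted; the spawned command
>>>>>>> name is only illustrative):
>>>>>>>
>>>>>>> #include <mpi.h>
>>>>>>>
>>>>>>> int main(int argc, char **argv)
>>>>>>> {
>>>>>>>     MPI_Comm intercomm;
>>>>>>>     MPI_Init(&argc, &argv);
>>>>>>>     /* spawn three children running "spawn_slave" */
>>>>>>>     MPI_Comm_spawn("spawn_slave", MPI_ARGV_NULL, 3, MPI_INFO_NULL, 0,
>>>>>>>                    MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
>>>>>>>     /* MPI_Comm_disconnect is collective and waits for pending
>>>>>>>        communication before deallocating the communicator, while
>>>>>>>        MPI_Comm_free only marks it for deallocation */
>>>>>>>     MPI_Comm_disconnect(&intercomm);
>>>>>>>     MPI_Finalize();
>>>>>>>     return 0;
>>>>>>> }
>>>>>>>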
>>>>>>> However, I don't understand why my program works with
>>>>>>> "mpiexec -np 1 --host loki,loki,loki,loki,loki spawn_master" and breaks with
>>>>>>> "mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master". You are
>>>>>>> right, my slot-list is equivalent to "-bind-to none". I could also have used
>>>>>>> "mpiexec -np 1 --host loki --oversubscribe spawn_master", which works as well.
>>>>>> 
>>>>>> Well, you are only giving us one slot when you specify "-host loki”, and 
>>>>>> then
>>>>>> you are trying to launch multiple processes into it. The “slot-list” 
>>>>>> option only
>>>>>> tells us what cpus to bind each process to - it doesn’t allocate process 
>>>>>> slots.
>>>>>> So you have to tell us how many processes are allowed to run on this 
>>>>>> node.
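>>>>>>
>>>>>> For example, either of these gives the job enough process slots on one node
>>>>>> (the first is the form that already works for you; the hostfile name is
>>>>>> just a placeholder):
>>>>>>
>>>>>>   # one slot per listed host name
>>>>>>   mpiexec -np 1 --host loki,loki,loki,loki,loki spawn_master
>>>>>>
>>>>>>   # or declare the slots in a hostfile containing "loki slots=12"
>>>>>>   mpiexec -np 1 --hostfile host_loki.openmpi spawn_master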
>>>>>> 
>>>>>>> 
>>>>>>> The program breaks with "There are not enough slots available in the system
>>>>>>> to satisfy ..." if I only use "--host loki" or other host names without
>>>>>>> listing five host names, using "slot-list", or using "oversubscribe".
>>>>>>> Unfortunately, "--host <host name>:<number of slots>" isn't available in
>>>>>>> openmpi-1.10.3rc2 to specify the number of available slots.
>>>>>> 
>>>>>> Correct - we did not backport the new syntax
>>>>>> 
>>>>>>> 
>>>>>>> Your program behaves the same way as mine, so MPI_Comm_disconnect
>>>>>>> will not solve my problem. I had to modify your program in a negligible
>>>>>>> way to get it to compile.
>>>>>>> 
>>>>>>> loki spawn 153 ompi_info | grep -e "OPAL repo revision:" -e "C compiler 
>>>>>>> absolute:"
>>>>>>>   OPAL repo revision: v1.10.2-201-gd23dda8
>>>>>>>  C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc
>>>>>>> loki spawn 154 mpicc simple_spawn.c
>>>>>>> loki spawn 155 mpiexec -np 1 a.out
>>>>>>> [pid 24008] starting up!
>>>>>>> 0 completed MPI_Init
>>>>>>> Parent [pid 24008] about to spawn!
>>>>>>> [pid 24010] starting up!
>>>>>>> [pid 24011] starting up!
>>>>>>> [pid 24012] starting up!
>>>>>>> Parent done with spawn
>>>>>>> Parent sending message to child
>>>>>>> 0 completed MPI_Init
>>>>>>> Hello from the child 0 of 3 on host loki pid 24010
>>>>>>> 1 completed MPI_Init
>>>>>>> Hello from the child 1 of 3 on host loki pid 24011
>>>>>>> 2 completed MPI_Init
>>>>>>> Hello from the child 2 of 3 on host loki pid 24012
>>>>>>> Child 0 received msg: 38
>>>>>>> Child 0 disconnected
>>>>>>> Child 1 disconnected
>>>>>>> Child 2 disconnected
>>>>>>> Parent disconnected
>>>>>>> 24012: exiting
>>>>>>> 24010: exiting
>>>>>>> 24008: exiting
>>>>>>> 24011: exiting
>>>>>>> 
>>>>>>> 
>>>>>>> Is something wrong with my command line? I haven't used slot-list before,
>>>>>>> so I'm not sure whether I am using it in the intended way.
>>>>>> 
>>>>>> I don’t know what “a.out” is, but it looks like there is some memory 
>>>>>> corruption
>>>>>> there.
>>>>>> 
>>>>>>> 
>>>>>>> loki spawn 156 mpiexec -np 1 --host loki --slot-list 0-5 a.out
>>>>>>> [pid 24102] starting up!
>>>>>>> 0 completed MPI_Init
>>>>>>> Parent [pid 24102] about to spawn!
>>>>>>> [pid 24104] starting up!
>>>>>>> [pid 24105] starting up!
>>>>>>> [loki:24105] *** Process received signal ***
>>>>>>> [loki:24105] Signal: Segmentation fault (11)
>>>>>>> [loki:24105] Signal code: Address not mapped (1)
>>>>>>> [loki:24105] Failing at address: 0x8
>>>>>>> [loki:24105] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7f39aa76f870]
>>>>>>> [loki:24105] [ 1]
>>>>>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_proc_self+0x35)[0x7f39aa9d25b0]
>>>>>>> [loki:24105] [ 2]
>>>>>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_comm_init+0x68b)[0x7f39aa9b1b08]
>>>>>>> [loki:24105] [ 3] *** An error occurred in MPI_Init
>>>>>>> *** on a NULL communicator
>>>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>> ***    and potentially your MPI job)
>>>>>>> [loki:24104] Local abort before MPI_INIT completed successfully; not 
>>>>>>> able to
>>>>>>> aggregate error messages, and not able to guarantee that all other 
>>>>>>> processes
>>>>>>> were killed!
>>>>>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_mpi_init+0xa90)[0x7f39aa9d7e8a]
>>>>>>> [loki:24105] [ 4]
>>>>>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(MPI_Init+0x1a0)[0x7f39aaa142ae]
>>>>>>> [loki:24105] [ 5] a.out[0x400d0c]
>>>>>>> [loki:24105] [ 6] 
>>>>>>> /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f39aa3d9b05]
>>>>>>> [loki:24105] [ 7] a.out[0x400bf9]
>>>>>>> [loki:24105] *** End of error message ***
>>>>>>> -------------------------------------------------------
>>>>>>> Child job 2 terminated normally, but 1 process returned
>>>>>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>>>>>> -------------------------------------------------------
>>>>>>> --------------------------------------------------------------------------
>>>>>>> mpiexec detected that one or more processes exited with non-zero 
>>>>>>> status, thus
>>>>>>> causing
>>>>>>> the job to be terminated. The first process to do so was:
>>>>>>> 
>>>>>>> Process name: [[49560,2],0]
>>>>>>> Exit code:    1
>>>>>>> --------------------------------------------------------------------------
>>>>>>> loki spawn 157
>>>>>>> 
>>>>>>> 
>>>>>>> Hopefully you will find out what happens. Please let me know if I can
>>>>>>> help you in any way.
>>>>>>> 
>>>>>>> Kind regards
>>>>>>> 
>>>>>>> Siegmar
>>>>>>> 
>>>>>>> 
>>>>>>>> FWIW: I don’t know how many cores you have on your sockets, but if you
>>>>>>>> have 6 cores/socket, then your slot-list is equivalent to “--bind-to none”,
>>>>>>>> as the slot-list applies to every process being launched.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On May 23, 2016, at 6:26 AM, Siegmar Gross
>>>>>>>>> <siegmar.gr...@informatik.hs-fulda.de
>>>>>>>>> <mailto:siegmar.gr...@informatik.hs-fulda.de>> wrote:
>>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>> 
>>>>>>>>> I installed openmpi-1.10.3rc2 on my "SUSE Linux Enterprise Server
>>>>>>>>> 12 (x86_64)" with Sun C 5.13 and gcc-6.1.0. Unfortunately I get
>>>>>>>>> a segmentation fault with "--slot-list" for one of my small programs.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> loki spawn 119 ompi_info | grep -e "OPAL repo revision:" -e "C 
>>>>>>>>> compiler
>>>>>>>>> absolute:"
>>>>>>>>>  OPAL repo revision: v1.10.2-201-gd23dda8
>>>>>>>>> C compiler absolute: /usr/local/gcc-6.1.0/bin/gcc
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> loki spawn 120 mpiexec -np 1 --host loki,loki,loki,loki,loki 
>>>>>>>>> spawn_master
>>>>>>>>> 
>>>>>>>>> Parent process 0 running on loki
>>>>>>>>> I create 4 slave processes
>>>>>>>>> 
>>>>>>>>> Parent process 0: tasks in MPI_COMM_WORLD:                    1
>>>>>>>>>              tasks in COMM_CHILD_PROCESSES local group:  1
>>>>>>>>>              tasks in COMM_CHILD_PROCESSES remote group: 4
>>>>>>>>> 
>>>>>>>>> Slave process 0 of 4 running on loki
>>>>>>>>> Slave process 1 of 4 running on loki
>>>>>>>>> Slave process 2 of 4 running on loki
>>>>>>>>> spawn_slave 2: argv[0]: spawn_slave
>>>>>>>>> Slave process 3 of 4 running on loki
>>>>>>>>> spawn_slave 0: argv[0]: spawn_slave
>>>>>>>>> spawn_slave 1: argv[0]: spawn_slave
>>>>>>>>> spawn_slave 3: argv[0]: spawn_slave
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> loki spawn 121 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 
>>>>>>>>> spawn_master
>>>>>>>>> 
>>>>>>>>> Parent process 0 running on loki
>>>>>>>>> I create 4 slave processes
>>>>>>>>> 
>>>>>>>>> [loki:17326] *** Process received signal ***
>>>>>>>>> [loki:17326] Signal: Segmentation fault (11)
>>>>>>>>> [loki:17326] Signal code: Address not mapped (1)
>>>>>>>>> [loki:17326] Failing at address: 0x8
>>>>>>>>> [loki:17326] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7f4e469b3870]
>>>>>>>>> [loki:17326] [ 1] *** An error occurred in MPI_Init
>>>>>>>>> *** on a NULL communicator
>>>>>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now 
>>>>>>>>> abort,
>>>>>>>>> ***    and potentially your MPI job)
>>>>>>>>> [loki:17324] Local abort before MPI_INIT completed successfully; not 
>>>>>>>>> able to
>>>>>>>>> aggregate error messages, and not able to guarantee that all other 
>>>>>>>>> processes
>>>>>>>>> were killed!
>>>>>>>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_proc_self+0x35)[0x7f4e46c165b0]
>>>>>>>>> [loki:17326] [ 2]
>>>>>>>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_comm_init+0x68b)[0x7f4e46bf5b08]
>>>>>>>>> [loki:17326] [ 3] *** An error occurred in MPI_Init
>>>>>>>>> *** on a NULL communicator
>>>>>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now 
>>>>>>>>> abort,
>>>>>>>>> ***    and potentially your MPI job)
>>>>>>>>> [loki:17325] Local abort before MPI_INIT completed successfully; not 
>>>>>>>>> able to
>>>>>>>>> aggregate error messages, and not able to guarantee that all other 
>>>>>>>>> processes
>>>>>>>>> were killed!
>>>>>>>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(ompi_mpi_init+0xa90)[0x7f4e46c1be8a]
>>>>>>>>> [loki:17326] [ 4]
>>>>>>>>> /usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12(MPI_Init+0x180)[0x7f4e46c5828e]
>>>>>>>>> [loki:17326] [ 5] spawn_slave[0x40097e]
>>>>>>>>> [loki:17326] [ 6] 
>>>>>>>>> /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f4e4661db05]
>>>>>>>>> [loki:17326] [ 7] spawn_slave[0x400a54]
>>>>>>>>> [loki:17326] *** End of error message ***
>>>>>>>>> -------------------------------------------------------
>>>>>>>>> Child job 2 terminated normally, but 1 process returned
>>>>>>>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>>>>>>>> -------------------------------------------------------
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> mpiexec detected that one or more processes exited with non-zero 
>>>>>>>>> status,
>>>>>>>>> thus causing
>>>>>>>>> the job to be terminated. The first process to do so was:
>>>>>>>>> 
>>>>>>>>> Process name: [[56340,2],0]
>>>>>>>>> Exit code:    1
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> loki spawn 122
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I would be grateful if somebody could fix the problem. Thank you
>>>>>>>>> very much in advance for any help.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Kind regards
>>>>>>>>> 
>>>>>>>>> Siegmar
>>>>>>>>> _______________________________________________
>>>>>>>>> users mailing list
>>>>>>>>> us...@open-mpi.org <mailto:us...@open-mpi.org>
>>>>>>>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>> Link to this post:
>>>>>>>>> http://www.open-mpi.org/community/lists/users/2016/05/29281.php
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> _______________________________________________
>>>>>>>> users mailing list
>>>>>>>> us...@open-mpi.org <mailto:us...@open-mpi.org>
>>>>>>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>> Link to this
>>>>>>>> post: http://www.open-mpi.org/community/lists/users/2016/05/29284.php
>>>>>>>> 
>>>>>>> <simple_spawn_modified.c>
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> us...@open-mpi.org <mailto:us...@open-mpi.org>
>>>>>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>> Link to this post: 
>>>>>>> http://www.open-mpi.org/community/lists/users/2016/05/29300.php
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> us...@open-mpi.org
>>>>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>> Link to this post: 
>>>>>> http://www.open-mpi.org/community/lists/users/2016/05/29301.php
>>>>>> 
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> us...@open-mpi.org
>>>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>> Link to this post: 
>>>>> http://www.open-mpi.org/community/lists/users/2016/05/29304.php
>>>> 
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>> Link to this post: 
>>>> http://www.open-mpi.org/community/lists/users/2016/05/29307.php
>>>> 
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post: 
>>> http://www.open-mpi.org/community/lists/users/2016/05/29308.php
>> 
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/users/2016/05/29309.php
>> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2016/05/29315.php
