Hi Gilles,

is the following output helpful for finding the error? Below the gdb
output I've added two more runs, which show that things are a little
bit "random" if I use only the two Sparc machines (3+2 or 4+1 slots).
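
Since the exact source isn't attached, here is roughly what my
spawn_multiple_master test does, reduced to a minimal sketch (the
command names, argument strings and process counts are taken from the
successful run further below; error handling and some printouts are
left out):

#include <stdio.h>
#include "mpi.h"

int main (int argc, char *argv[])
{
  int      ntasks_world, ntasks_local, ntasks_remote, mytid, namelen;
  char     processor_name[MPI_MAX_PROCESSOR_NAME];
  char    *commands[2]   = {"spawn_slave", "spawn_slave"};
  char    *argv_cmd1[]   = {"program type 1", NULL};
  char    *argv_cmd2[]   = {"program type 2", "another parameter", NULL};
  char   **spawn_argv[2] = {argv_cmd1, argv_cmd2};
  int      maxprocs[2]   = {1, 2};            /* 1 + 2 = 3 slave processes */
  MPI_Info infos[2]      = {MPI_INFO_NULL, MPI_INFO_NULL};
  MPI_Comm COMM_CHILD_PROCESSES;

  MPI_Init (&argc, &argv);
  MPI_Comm_rank (MPI_COMM_WORLD, &mytid);
  MPI_Get_processor_name (processor_name, &namelen);
  printf ("Parent process %d running on %s\n"
          "  I create 3 slave processes.\n\n", mytid, processor_name);

  /* spawn one slave with "program type 1" and two slaves with
   * "program type 2" / "another parameter" as command line arguments */
  MPI_Comm_spawn_multiple (2, commands, spawn_argv, maxprocs, infos, 0,
                           MPI_COMM_WORLD, &COMM_CHILD_PROCESSES,
                           MPI_ERRCODES_IGNORE);

  MPI_Comm_size        (MPI_COMM_WORLD,       &ntasks_world);
  MPI_Comm_size        (COMM_CHILD_PROCESSES, &ntasks_local);
  MPI_Comm_remote_size (COMM_CHILD_PROCESSES, &ntasks_remote);
  printf ("Parent process %d: tasks in MPI_COMM_WORLD:                    %d\n"
          "                  tasks in COMM_CHILD_PROCESSES local group:  %d\n"
          "                  tasks in COMM_CHILD_PROCESSES remote group: %d\n",
          mytid, ntasks_world, ntasks_local, ntasks_remote);

  MPI_Comm_free (&COMM_CHILD_PROCESSES);
  MPI_Finalize ();
  return 0;
}

As the backtraces show, the assertion fires inside MPI_Init of the
spawned spawn_slave process in the first run below, and inside
MPI_Comm_spawn_multiple of the master in the second one.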


tyr spawn 127 /usr/local/gdb-7.6.1_64_gcc/bin/gdb mpiexec
GNU gdb (GDB) 7.6.1
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "sparc-sun-solaris2.10".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from 
/export2/prog/SunOS_sparc/openmpi-1.10.3_64_cc/bin/orterun...done.
(gdb) set args -np 1 --host tyr,sunpc1,linpc1,ruester spawn_multiple_master
(gdb) run
Starting program: /usr/local/openmpi-1.10.3_64_cc/bin/mpiexec -np 1 --host 
tyr,sunpc1,linpc1,ruester spawn_multiple_master
[Thread debugging using libthread_db enabled]
[New Thread 1 (LWP 1)]
[New LWP    2        ]

Parent process 0 running on tyr.informatik.hs-fulda.de
  I create 3 slave processes.

Assertion failed: OPAL_OBJ_MAGIC_ID == ((opal_object_t *) (proc_pointer))->obj_magic_id, file ../../openmpi-v1.10.2-163-g42da15d/ompi/group/group_init.c, line 215, function ompi_group_increment_proc_count
[ruester:17809] *** Process received signal ***
[ruester:17809] Signal: Abort (6)
[ruester:17809] Signal code:  (-1)
/usr/local/openmpi-1.10.3_64_cc/lib64/libopen-pal.so.13.0.2:opal_backtrace_print+0x1c
/usr/local/openmpi-1.10.3_64_cc/lib64/libopen-pal.so.13.0.2:0x1b10f0
/lib/sparcv9/libc.so.1:0xd8c28
/lib/sparcv9/libc.so.1:0xcc79c
/lib/sparcv9/libc.so.1:0xcc9a8
/lib/sparcv9/libc.so.1:__lwp_kill+0x8 [ Signal 2091943080 (?)]
/lib/sparcv9/libc.so.1:abort+0xd0
/lib/sparcv9/libc.so.1:_assert_c99+0x78
/usr/local/openmpi-1.10.3_64_cc/lib64/libmpi.so.12.0.3:ompi_group_increment_proc_count+0x10c
/usr/local/openmpi-1.10.3_64_cc/lib64/openmpi/mca_dpm_orte.so:0xe758
/usr/local/openmpi-1.10.3_64_cc/lib64/openmpi/mca_dpm_orte.so:0x113d4
/usr/local/openmpi-1.10.3_64_cc/lib64/libmpi.so.12.0.3:ompi_mpi_init+0x188c
/usr/local/openmpi-1.10.3_64_cc/lib64/libmpi.so.12.0.3:MPI_Init+0x26c
/home/fd1026/SunOS/sparc/bin/spawn_slave:main+0x18
/home/fd1026/SunOS/sparc/bin/spawn_slave:_start+0x108
[ruester:17809] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 2 with PID 0 on node ruester exited on signal 
6 (Abort).
--------------------------------------------------------------------------
[LWP    2         exited]
[New Thread 2        ]
[Switching to Thread 1 (LWP 1)]
sol_thread_fetch_registers: td_ta_map_id2thr: no thread can be found to satisfy 
query
(gdb) bt
#0  0xffffffff7f6173d0 in rtld_db_dlactivity () from /usr/lib/sparcv9/ld.so.1
#1  0xffffffff7f6175a8 in rd_event () from /usr/lib/sparcv9/ld.so.1
#2  0xffffffff7f618950 in lm_delete () from /usr/lib/sparcv9/ld.so.1
#3  0xffffffff7f6226bc in remove_so () from /usr/lib/sparcv9/ld.so.1
#4  0xffffffff7f624574 in remove_hdl () from /usr/lib/sparcv9/ld.so.1
#5  0xffffffff7f61d97c in dlclose_core () from /usr/lib/sparcv9/ld.so.1
#6  0xffffffff7f61d9d4 in dlclose_intn () from /usr/lib/sparcv9/ld.so.1
#7  0xffffffff7f61db0c in dlclose () from /usr/lib/sparcv9/ld.so.1
#8  0xffffffff7e5f9718 in dlopen_close (handle=0x100)
    at 
../../../../../openmpi-v1.10.2-163-g42da15d/opal/mca/dl/dlopen/dl_dlopen_module.c:144
#9  0xffffffff7e5f364c in opal_dl_close (handle=0xffffff7d700200ff)
    at 
../../../../openmpi-v1.10.2-163-g42da15d/opal/mca/dl/base/dl_base_fns.c:53
#10 0xffffffff7e546714 in ri_destructor (obj=0x1200)
    at 
../../../../openmpi-v1.10.2-163-g42da15d/opal/mca/base/mca_base_component_repository.c:357
#11 0xffffffff7e543840 in opal_obj_run_destructors (object=0xffffff7f607a6cff)
    at ../../../../openmpi-v1.10.2-163-g42da15d/opal/class/opal_object.h:451
#12 0xffffffff7e545f54 in mca_base_component_repository_release 
(component=0xffffff7c801df0ff)
    at 
../../../../openmpi-v1.10.2-163-g42da15d/opal/mca/base/mca_base_component_repository.c:223
#13 0xffffffff7e54d0d8 in mca_base_component_unload 
(component=0xffffff7d00003000, output_id=-1610596097)
    at 
../../../../openmpi-v1.10.2-163-g42da15d/opal/mca/base/mca_base_components_close.c:47
#14 0xffffffff7e54d17c in mca_base_component_close (component=0x100, 
output_id=-1878702080)
    at 
../../../../openmpi-v1.10.2-163-g42da15d/opal/mca/base/mca_base_components_close.c:60
#15 0xffffffff7e54d28c in mca_base_components_close (output_id=1942099968, 
components=0xff,
    skip=0xffffff7f61c5a800)
    at 
../../../../openmpi-v1.10.2-163-g42da15d/opal/mca/base/mca_base_components_close.c:86
#16 0xffffffff7e54d1cc in mca_base_framework_components_close 
(framework=0x1000000ff, skip=0x10018ebb000)
    at 
../../../../openmpi-v1.10.2-163-g42da15d/opal/mca/base/mca_base_components_close.c:68
#17 0xffffffff7ee4db88 in orte_oob_base_close ()
    at 
../../../../openmpi-v1.10.2-163-g42da15d/orte/mca/oob/base/oob_base_frame.c:94
#18 0xffffffff7e580054 in mca_base_framework_close 
(framework=0xffffff0000004fff)
    at 
../../../../openmpi-v1.10.2-163-g42da15d/opal/mca/base/mca_base_framework.c:198
#19 0xffffffff7c514cdc in rte_finalize ()
    at 
../../../../../openmpi-v1.10.2-163-g42da15d/orte/mca/ess/hnp/ess_hnp_module.c:882
#20 0xffffffff7ec5c414 in orte_finalize () at 
../../openmpi-v1.10.2-163-g42da15d/orte/runtime/orte_finalize.c:65
#21 0x000000010000eb24 in orterun (argc=1423033599, argv=0xffffff7fffce41ff)
    at 
../../../../openmpi-v1.10.2-163-g42da15d/orte/tools/orterun/orterun.c:1151
#22 0x0000000100004d4c in main (argc=416477439, argv=0xffffff7fffd7f000)
    at ../../../../openmpi-v1.10.2-163-g42da15d/orte/tools/orterun/main.c:13
(gdb)




tyr spawn 145 mpiexec -np 1 --host ruester,ruester,ruester,tyr,tyr 
spawn_multiple_master

Parent process 0 running on ruester.informatik.hs-fulda.de
  I create 3 slave processes.

Assertion failed: OPAL_OBJ_MAGIC_ID == ((opal_object_t *) (proc_pointer))->obj_magic_id, file ../../openmpi-v1.10.2-163-g42da15d/ompi/group/group_init.c, line 215, function ompi_group_increment_proc_count
[ruester:18238] *** Process received signal ***
[ruester:18238] Signal: Abort (6)
[ruester:18238] Signal code:  (-1)
/usr/local/openmpi-1.10.3_64_cc/lib64/libopen-pal.so.13.0.2:opal_backtrace_print+0x1c
/usr/local/openmpi-1.10.3_64_cc/lib64/libopen-pal.so.13.0.2:0x1b10f0
/lib/sparcv9/libc.so.1:0xd8c28
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
  Local host:    ruester
  Remote host:   ruester
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------
/lib/sparcv9/libc.so.1:0xcc79c
/lib/sparcv9/libc.so.1:0xcc9a8
/lib/sparcv9/libc.so.1:__lwp_kill+0x8 [ Signal 2091943080 (?)]
/lib/sparcv9/libc.so.1:abort+0xd0
/lib/sparcv9/libc.so.1:_assert_c99+0x78
/usr/local/openmpi-1.10.3_64_cc/lib64/libmpi.so.12.0.3:ompi_group_increment_proc_count+0x10c
/usr/local/openmpi-1.10.3_64_cc/lib64/openmpi/mca_dpm_orte.so:0xe758
/usr/local/openmpi-1.10.3_64_cc/lib64/libmpi.so.12.0.3:MPI_Comm_spawn_multiple+0x8f4
/home/fd1026/SunOS/sparc/bin/spawn_multiple_master:main+0x188
/home/fd1026/SunOS/sparc/bin/spawn_multiple_master:_start+0x108
[ruester:18238] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 0 on node ruester exited on signal 
6 (Abort).
--------------------------------------------------------------------------



tyr spawn 146 mpiexec -np 1 --host ruester,ruester,ruester,ruester,tyr 
spawn_multiple_master

Parent process 0 running on ruester.informatik.hs-fulda.de
  I create 3 slave processes.

Parent process 0: tasks in MPI_COMM_WORLD:                    1
                  tasks in COMM_CHILD_PROCESSES local group:  1
                  tasks in COMM_CHILD_PROCESSES remote group: 3

Slave process 2 of 3 running on ruester.informatik.hs-fulda.de
Slave process 0 of 3 running on ruester.informatik.hs-fulda.de
spawn_slave 0: argv[0]: spawn_slave
Slave process 1 of 3 running on ruester.informatik.hs-fulda.de
spawn_slave 1: argv[0]: spawn_slave
spawn_slave 1: argv[1]: program type 2
spawn_slave 1: argv[2]: another parameter
spawn_slave 2: argv[0]: spawn_slave
spawn_slave 2: argv[1]: program type 2
spawn_slave 2: argv[2]: another parameter
spawn_slave 0: argv[1]: program type 1
tyr spawn 147
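
For completeness, the spawn_slave program that produces the slave lines
above does roughly the following (again only a sketch reconstructed from
its output, not the exact source):

#include <stdio.h>
#include "mpi.h"

int main (int argc, char *argv[])
{
  int  mytid, ntasks, namelen, i;
  char processor_name[MPI_MAX_PROCESSOR_NAME];

  MPI_Init (&argc, &argv);      /* the assertion in the gdb run fires here */
  MPI_Comm_rank (MPI_COMM_WORLD, &mytid);
  MPI_Comm_size (MPI_COMM_WORLD, &ntasks);
  MPI_Get_processor_name (processor_name, &namelen);
  printf ("Slave process %d of %d running on %s\n",
          mytid, ntasks, processor_name);
  for (i = 0; i < argc; ++i)
    printf ("spawn_slave %d: argv[%d]: %s\n", mytid, i, argv[i]);
  MPI_Finalize ();
  return 0;
}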



Hopefully you can sort these things out. I have no idea what is
happening, or why I get different results when I use different
combinations of the same machines.


Kind regards

Siegmar



On 05.05.2016 at 11:13, Gilles Gouaillardet wrote:
Siegmar,

is this Solaris 10 specific (e.g., does Solaris 11 work fine)?

(I only have an x86_64 VM with Solaris 11 and the Sun compilers ...)

Cheers,

Gilles

On Thursday, May 5, 2016, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:

    Hi Ralph and Gilles,

    On 04.05.2016 at 20:02, rhc54 wrote:

        @ggouaillardet <https://github.com/ggouaillardet> Where does this stand?

        <https://github.com/open-mpi/ompi/issues/1569#issuecomment-216950103>


    With my last installed version of openmpi-v1.10.x all of my
    spawn programs fail on Solaris Sparc and x86_64 with the same
    error for both compilers (gcc-5.1.0 and Sun C 5.13). Everything
    works as expected on Linux. Tomorrow I'm back in my office and
    I can try to build and test the latest version.

    sunpc1 fd1026 108 ompi_info | grep -e "OPAL repo" -e "C compiler absolute"
          OPAL repo revision: v1.10.2-163-g42da15d
         C compiler absolute: /opt/solstudio12.4/bin/cc
    sunpc1 fd1026 114 mpiexec -np 1 --host sunpc1,sunpc1,sunpc1,sunpc1,sunpc1 spawn_master
    [sunpc1:00957] *** Process received signal ***
    [sunpc1:00957] Signal: Segmentation Fault (11)
    [sunpc1:00957] Signal code: Address not mapped (1)
    [sunpc1:00957] Failing at address: 0
    /export2/prog/SunOS_x86_64/openmpi-2.0.0_64_cc/lib64/libopen-pal.so.20.0.0:opal_backtrace_print+0x2d
    /export2/prog/SunOS_x86_64/openmpi-2.0.0_64_cc/lib64/libopen-pal.so.20.0.0:0x2383c
    /lib/amd64/libc.so.1:0xdd6b6
    /lib/amd64/libc.so.1:0xd1f82
    /lib/amd64/libc.so.1:strlen+0x30 [ Signal 11 (SEGV)]
    /lib/amd64/libc.so.1:vsnprintf+0x51
    /lib/amd64/libc.so.1:vasprintf+0x49
    /export2/prog/SunOS_x86_64/openmpi-2.0.0_64_cc/lib64/libopen-pal.so.20.0.0:opal_show_help_vstring+0x83
    /export2/prog/SunOS_x86_64/openmpi-2.0.0_64_cc/lib64/libopen-rte.so.20.0.0:orte_show_help+0xd6
    /export2/prog/SunOS_x86_64/openmpi-2.0.0_64_cc/lib64/libmpi.so.20.0.0:ompi_mpi_init+0x1010
    /export2/prog/SunOS_x86_64/openmpi-2.0.0_64_cc/lib64/libmpi.so.20.0.0:PMPI_Init+0x9d
    /home/fd1026/SunOS/x86_64/bin/spawn_master:main+0x21
    [sunpc1:00957] *** End of error message ***
    --------------------------------------------------------------------------
    mpiexec noticed that process rank 0 with PID 957 on node sunpc1 exited on signal 11 (Segmentation Fault).
    --------------------------------------------------------------------------



    sunpc1 fd1026 115 mpiexec -np 1 --host sunpc1,sunpc1,sunpc1,sunpc1,sunpc1 spawn_multiple_master
    [sunpc1:00960] *** Process received signal ***
    [sunpc1:00960] Signal: Segmentation Fault (11)
    [sunpc1:00960] Signal code: Address not mapped (1)
    [sunpc1:00960] Failing at address: 0
    /export2/prog/SunOS_x86_64/openmpi-2.0.0_64_cc/lib64/libopen-pal.so.20.0.0:opal_backtrace_print+0x2d
    /export2/prog/SunOS_x86_64/openmpi-2.0.0_64_cc/lib64/libopen-pal.so.20.0.0:0x2383c
    /lib/amd64/libc.so.1:0xdd6b6
    /lib/amd64/libc.so.1:0xd1f82
    /lib/amd64/libc.so.1:strlen+0x30 [ Signal 11 (SEGV)]
    /lib/amd64/libc.so.1:vsnprintf+0x51
    /lib/amd64/libc.so.1:vasprintf+0x49
    /export2/prog/SunOS_x86_64/openmpi-2.0.0_64_cc/lib64/libopen-pal.so.20.0.0:opal_show_help_vstring+0x83
    /export2/prog/SunOS_x86_64/openmpi-2.0.0_64_cc/lib64/libopen-rte.so.20.0.0:orte_show_help+0xd6
    /export2/prog/SunOS_x86_64/openmpi-2.0.0_64_cc/lib64/libmpi.so.20.0.0:ompi_mpi_init+0x1010
    /export2/prog/SunOS_x86_64/openmpi-2.0.0_64_cc/lib64/libmpi.so.20.0.0:PMPI_Init+0x9d
    /home/fd1026/SunOS/x86_64/bin/spawn_multiple_master:main+0x5d
    [sunpc1:00960] *** End of error message ***
    --------------------------------------------------------------------------
    mpiexec noticed that process rank 0 with PID 960 on node sunpc1 exited on signal 11 (Segmentation Fault).
    --------------------------------------------------------------------------



    sunpc1 fd1026 116 mpiexec -np 1 --host sunpc1,sunpc1,sunpc1,sunpc1,sunpc1 spawn_intra_comm
    [sunpc1:00963] *** Process received signal ***
    [sunpc1:00963] Signal: Segmentation Fault (11)
    [sunpc1:00963] Signal code: Address not mapped (1)
    [sunpc1:00963] Failing at address: 0
    /export2/prog/SunOS_x86_64/openmpi-2.0.0_64_cc/lib64/libopen-pal.so.20.0.0:opal_backtrace_print+0x2d
    /export2/prog/SunOS_x86_64/openmpi-2.0.0_64_cc/lib64/libopen-pal.so.20.0.0:0x2383c
    /lib/amd64/libc.so.1:0xdd6b6
    /lib/amd64/libc.so.1:0xd1f82
    /lib/amd64/libc.so.1:strlen+0x30 [ Signal 11 (SEGV)]
    /lib/amd64/libc.so.1:vsnprintf+0x51
    /lib/amd64/libc.so.1:vasprintf+0x49
    /export2/prog/SunOS_x86_64/openmpi-2.0.0_64_cc/lib64/libopen-pal.so.20.0.0:opal_show_help_vstring+0x83
    /export2/prog/SunOS_x86_64/openmpi-2.0.0_64_cc/lib64/libopen-rte.so.20.0.0:orte_show_help+0xd6
    /export2/prog/SunOS_x86_64/openmpi-2.0.0_64_cc/lib64/libmpi.so.20.0.0:ompi_mpi_init+0x1010
    /export2/prog/SunOS_x86_64/openmpi-2.0.0_64_cc/lib64/libmpi.so.20.0.0:PMPI_Init+0x9d
    /home/fd1026/SunOS/x86_64/bin/spawn_intra_comm:main+0x23
    [sunpc1:00963] *** End of error message ***
    --------------------------------------------------------------------------
    mpiexec noticed that process rank 0 with PID 963 on node sunpc1 exited on signal 11 (Segmentation Fault).
    --------------------------------------------------------------------------
    sunpc1 fd1026 117


    Kind regards

    Siegmar
    _______________________________________________
    users mailing list
    us...@open-mpi.org
    Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
    Link to this post: 
http://www.open-mpi.org/community/lists/users/2016/05/29090.php



_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2016/05/29092.php


--

########################################################################
#                                                                      #
# Hochschule Fulda          University of Applied Sciences             #
# FB Angewandte Informatik  Department of Applied Computer Science     #
#                                                                      #
# Prof. Dr. Siegmar Gross   Tel.: +49 (0)661 9640 - 333                #
#                           Fax:  +49 (0)661 9640 - 349                #
# Leipziger Str. 123        WWW:  http://www.hs-fulda.de/~gross        #
# D-36037 Fulda             Mail: siegmar.gr...@informatik.hs-fulda.de #
#                                                                      #
#                                                                      #
# IT-Sicherheit: http://www.hs-fulda.de/it-sicherheit                  #
#                                                                      #
########################################################################
