Hi Ralph,

Your commit r32459 fixed the bus error by correcting opal/dss/dss_copy.c. That is sufficient for trunk, because mca_dstore_hash calls dss to copy data. It is not sufficient for v1.8, because mca_db_hash doesn't call dss and copies the data itself.

The attached patch is the minimal change needed to fix it in v1.8. My fix doesn't call dss; it uses memcpy instead. I have confirmed it on SPARC64/Linux.

Sorry to respond so late.

Regards,
Takahiro Kawashima,
MPI development team,
Fujitsu

> Siegmar, Ralph,
>
> I'm sorry to response so late since last week.
>
> Ralph fixed the problem in r32459 and it was merged to v1.8
> in r32474. But in v1.8 an additional custom patch is needed
> because the db/dstore source codes are different between trunk
> and v1.8.
>
> I'm preparing and testing the custom patch just now.
> Wait wait a minute please.
>
> Takahiro Kawashima,
> MPI development team,
> Fujitsu
>
> > Hi,
> >
> > thank you very much to everybody who tried to solve my bus
> > error problem on Solaris 10 Sparc. I thought that you found
> > and fixed it, so that I installed openmpi-1.8.2rc4r32485 on
> > my machines (Solaris 10 Sparc (tyr), Solaris 10 x86_64 (sunpc1),
> > openSUSE Linux 12.1 x86_64 (linpc1)) with gcc-4.9.0. A small
> > program works on my x86_64 architectures, but still breaks
> > with a bus error on my Sparc system.
> >
> > linpc1 fd1026 106 mpiexec -np 1 init_finalize
> > Hello!
> > linpc1 fd1026 106 exit
> > logout
> > tyr small_prog 113 ssh sunpc1
> > sunpc1 fd1026 101 mpiexec -np 1 init_finalize
> > Hello!
> > sunpc1 fd1026 102 exit
> > logout
> > tyr small_prog 114 mpiexec -np 1 init_finalize
> > [tyr:21109] *** Process received signal ***
> > [tyr:21109] Signal: Bus Error (10)
> > ...
> >
> >
> > gdb shows the following backtrace.
> >
> > tyr small_prog 122 /usr/local/gdb-7.6.1_64_gcc/bin/gdb
> > /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec
> > GNU gdb (GDB) 7.6.1
> > ...
> > (gdb) run -np 1 init_finalize
> > Starting program: /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec -np 1
> > init_finalize
> > [Thread debugging using libthread_db enabled]
> > [New Thread 1 (LWP 1)]
> > [New LWP 2 ]
> > [tyr:21158] *** Process received signal ***
> > [tyr:21158] Signal: Bus Error (10)
> > [tyr:21158] Signal code: Invalid address alignment (1)
> > [tyr:21158] Failing at address: ffffffff7fffd224
> > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_backtrace_print+0x2c
> > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:0xcd130
> > /lib/sparcv9/libc.so.1:0xd8b98
> > /lib/sparcv9/libc.so.1:0xcc70c
> > /lib/sparcv9/libc.so.1:0xcc918
> > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_db_hash.so:0x3ee8
> > [ Signal 10 (BUS)]
> > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_db_base_store+0xc8
> > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_util_decode_pidmap+0x798
> > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_util_nidmap_init+0x3cc
> > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_ess_env.so:0x226c
> > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_init+0x308
> > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:ompi_mpi_init+0x31c
> > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:MPI_Init+0x2a8
> > /home/fd1026/SunOS/sparc/bin/init_finalize:main+0x10
> > /home/fd1026/SunOS/sparc/bin/init_finalize:_start+0x7c
> > [tyr:21158] *** End of error message ***
> > --------------------------------------------------------------------------
> > mpiexec noticed that process rank 0 with PID 21158 on node tyr exited on
> > signal 10 (Bus Error).
> > --------------------------------------------------------------------------
> > [LWP 2 exited]
> > [New Thread 2 ]
> > [Switching to Thread 1 (LWP 1)]
> > sol_thread_fetch_registers: td_ta_map_id2thr: no thread can be found to
> > satisfy query
> > (gdb) bt
> > #0 0xffffffff7f6173d0 in rtld_db_dlactivity () from
> > /usr/lib/sparcv9/ld.so.1
> > #1 0xffffffff7f6175a8 in rd_event () from /usr/lib/sparcv9/ld.so.1
> > #2 0xffffffff7f618950 in lm_delete () from /usr/lib/sparcv9/ld.so.1
> > #3 0xffffffff7f6226bc in remove_so () from /usr/lib/sparcv9/ld.so.1
> > #4 0xffffffff7f624574 in remove_hdl () from /usr/lib/sparcv9/ld.so.1
> > #5 0xffffffff7f61d97c in dlclose_core () from /usr/lib/sparcv9/ld.so.1
> > #6 0xffffffff7f61d9d4 in dlclose_intn () from /usr/lib/sparcv9/ld.so.1
> > #7 0xffffffff7f61db0c in dlclose () from /usr/lib/sparcv9/ld.so.1
> > #8 0xffffffff7ec7748c in vm_close () from
> > /usr/local/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6
> > #9 0xffffffff7ec74a6c in lt_dlclose () from
> > /usr/local/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6
> > #10 0xffffffff7ec99b90 in ri_destructor (obj=0x1001ead30)
> > at
> > ../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_component_repository.c:391
> > #11 0xffffffff7ec984a8 in opal_obj_run_destructors (object=0x1001ead30)
> > at ../../../../openmpi-1.8.2rc4r32485/opal/class/opal_object.h:446
> > #12 0xffffffff7ec9940c in mca_base_component_repository_release (
> > component=0xffffffff7b023df0 <mca_oob_tcp_component>)
> > at
> > ../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_component_repository.c:244
> > #13 0xffffffff7ec9b754 in mca_base_component_unload (
> > component=0xffffffff7b023df0 <mca_oob_tcp_component>, output_id=-1)
> > at
> > ../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_components_close.c:47
> > #14 0xffffffff7ec9b7e8 in mca_base_component_close (
> > component=0xffffffff7b023df0 <mca_oob_tcp_component>, output_id=-1)
> > at
> > ../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_components_close.c:60
> > #15 0xffffffff7ec9b8bc in mca_base_components_close (output_id=-1,
> > components=0xffffffff7f12b930 <orte_oob_base_framework+80>, skip=0x0)
> > at
> > ../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_components_close.c:86
> > #16 0xffffffff7ec9b824 in mca_base_framework_components_close (
> > framework=0xffffffff7f12b8e0 <orte_oob_base_framework>, skip=0x0)
> > at
> > ../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_components_close.c:66
> > #17 0xffffffff7efae21c in orte_oob_base_close ()
> > at
> > ../../../../openmpi-1.8.2rc4r32485/orte/mca/oob/base/oob_base_frame.c:94
> > #18 0xffffffff7ecb28cc in mca_base_framework_close (
> > framework=0xffffffff7f12b8e0 <orte_oob_base_framework>)
> > at
> > ../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_framework.c:187
> > #19 0xffffffff7bf078c0 in rte_finalize ()
> > at
> > ../../../../../openmpi-1.8.2rc4r32485/orte/mca/ess/hnp/ess_hnp_module.c:858
> > #20 0xffffffff7ef30a44 in orte_finalize ()
> > at ../../openmpi-1.8.2rc4r32485/orte/runtime/orte_finalize.c:65
> > #21 0x00000001000070c4 in orterun (argc=4, argv=0xffffffff7fffe0d8)
> > at ../../../../openmpi-1.8.2rc4r32485/orte/tools/orterun/orterun.c:1096
> > #22 0x0000000100003d70 in main (argc=4, argv=0xffffffff7fffe0d8)
> > at ../../../../openmpi-1.8.2rc4r32485/orte/tools/orterun/main.c:13
> > (gdb)
> >
> >
> > Is this a new problem? I would be grateful if somebody could
> > fix it. Thank you very much for any help in advance.
> >
> > Kind regards
> >
> > Siegmar
Index: opal/mca/db/hash/db_hash.c
===================================================================
--- opal/mca/db/hash/db_hash.c (revision 32498)
+++ opal/mca/db/hash/db_hash.c (working copy)
@@ -249,7 +249,8 @@
             return OPAL_ERR_BAD_PARAM;
         }
         kv->type = OPAL_UINT64;
-        kv->data.uint64 = *(uint64_t*)(data);
+        /* to avoid alignment issues */
+        memcpy(&kv->data.uint64, data, 8);
         break;
     case OPAL_UINT32:
         if (NULL == data) {
@@ -257,7 +258,8 @@
             return OPAL_ERR_BAD_PARAM;
         }
         kv->type = OPAL_UINT32;
-        kv->data.uint32 = *(uint32_t*)data;
+        /* to avoid alignment issues */
+        memcpy(&kv->data.uint32, data, 4);
         break;
     case OPAL_UINT16:
         if (NULL == data) {
@@ -265,7 +267,8 @@
             return OPAL_ERR_BAD_PARAM;
         }
         kv->type = OPAL_UINT16;
-        kv->data.uint16 = *(uint16_t*)(data);
+        /* to avoid alignment issues */
+        memcpy(&kv->data.uint16, data, 2);
         break;
     case OPAL_INT:
         if (NULL == data) {
@@ -273,7 +276,8 @@
             return OPAL_ERR_BAD_PARAM;
         }
         kv->type = OPAL_INT;
-        kv->data.integer = *(int*)(data);
+        /* to avoid alignment issues */
+        memcpy(&kv->data.integer, data, sizeof(int));
         break;
     case OPAL_UINT:
         if (NULL == data) {
@@ -281,7 +285,8 @@
             return OPAL_ERR_BAD_PARAM;
         }
         kv->type = OPAL_UINT;
-        kv->data.uint = *(unsigned int*)(data);
+        /* to avoid alignment issues */
+        memcpy(&kv->data.uint, data, sizeof(unsigned int));
         break;
     case OPAL_FLOAT:
         if (NULL == data) {
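
For anyone who wants to see the idea behind the patch in isolation, here is a minimal standalone sketch. It is not Open MPI code: demo_value_t, demo_store_u64 and demo_store_u32 are hypothetical names made up for this illustration, and the union is only a stand-in for the value being filled in by db_hash.c. It shows the same substitution as the patch: copy the bytes with memcpy instead of dereferencing a cast pointer, so the source pointer does not have to be aligned.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the value slot that gets filled in.
 * This is NOT the real Open MPI type, just enough for the demo. */
typedef union {
    uint64_t uint64;
    uint32_t uint32;
} demo_value_t;

/* Store a uint64_t read from src, which may not be 8-byte aligned.
 * memcpy places no alignment requirement on its arguments, whereas
 * "*(uint64_t *)src" can raise SIGBUS on strict-alignment CPUs
 * such as SPARC when src is misaligned. */
static void demo_store_u64(demo_value_t *dst, const void *src)
{
    assert(dst != NULL && src != NULL);
    memcpy(&dst->uint64, src, sizeof(dst->uint64));
}

/* Same idea for a 4-byte value. */
static void demo_store_u32(demo_value_t *dst, const void *src)
{
    assert(dst != NULL && src != NULL);
    memcpy(&dst->uint32, src, sizeof(dst->uint32));
}
```

To exercise it, put a value at an odd offset in an unsigned char buffer (for example buf + 1) and pass that address to demo_store_u64; the stored value should match on both x86_64 and SPARC. The sketch uses sizeof on the destination field rather than a literal byte count, which is only a stylistic choice here and does not change the behavior compared with the patch.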