Hi Ralph,

Your commit r32459 fixed the bus error by correcting
opal/dss/dss_copy.c. It's OK for trunk because mca_dstore_hash
calls dss to copy data. But it's insufficient for v1.8 because
mca_db_hash doesn't call dss and copies data itself.

The attached patch is the minimal fix for v1.8. It doesn't call
dss; it copies the data with memcpy instead. I have confirmed it on
SPARC64/Linux.
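
For illustration only, here is a minimal standalone sketch of the
technique. It is not the db_hash.c code; the names load_u64_unaligned
and roundtrip_ok are hypothetical. It assumes the source pointer may
not be suitably aligned for a direct uint64_t load, which is the
situation that raises a bus error on SPARC.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a uint64_t from an address that may be
 * misaligned. memcpy places no alignment requirement on its arguments,
 * so it is safe where *(const uint64_t *)src could raise SIGBUS. */
static uint64_t load_u64_unaligned(const void *src)
{
    uint64_t value;
    memcpy(&value, src, sizeof(value));
    return value;
}

/* Hypothetical round-trip check: store a value at offset 4 of a byte
 * buffer (so it is not guaranteed to be 8-byte aligned), then read it
 * back through the helper. Returns 1 if the value survives. */
static int roundtrip_ok(void)
{
    unsigned char buf[16] = {0};
    const uint64_t original = 0x0123456789abcdefULL;

    memcpy(buf + 4, &original, sizeof(original));
    return load_u64_unaligned(buf + 4) == original;
}
```

Using sizeof on the destination rather than a literal byte count keeps
the copy size tied to the type, which is a matter of taste for
fixed-width types but matters for int and unsigned int.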

Sorry to respond so late.

Regards,
Takahiro Kawashima,
MPI development team,
Fujitsu

> Siegmar, Ralph,
> 
> I'm sorry for not responding sooner; it has been since last week.
> 
> Ralph fixed the problem in r32459 and it was merged to v1.8
> in r32474. But v1.8 needs an additional custom patch
> because the db/dstore source code differs between trunk
> and v1.8.
> 
> I'm preparing and testing the custom patch right now.
> Please wait a moment.
> 
> Takahiro Kawashima,
> MPI development team,
> Fujitsu
> 
> > Hi,
> > 
> > thank you very much to everybody who tried to solve my bus
> > error problem on Solaris 10 Sparc. I thought that you had found
> > and fixed it, so I installed openmpi-1.8.2rc4r32485 on
> > my machines (Solaris 10 Sparc (tyr), Solaris 10 x86_64 (sunpc1),
> > openSUSE Linux 12.1 x86_64 (linpc1)) with gcc-4.9.0. A small
> > program works on my x86_64 architectures, but still breaks
> > with a bus error on my Sparc system.
> > 
> > linpc1 fd1026 106 mpiexec -np 1 init_finalize
> > Hello!
> > linpc1 fd1026 106 exit
> > logout
> > tyr small_prog 113 ssh sunpc1
> > sunpc1 fd1026 101 mpiexec -np 1 init_finalize
> > Hello!
> > sunpc1 fd1026 102 exit
> > logout
> > tyr small_prog 114 mpiexec -np 1 init_finalize
> > [tyr:21109] *** Process received signal ***
> > [tyr:21109] Signal: Bus Error (10)
> > ...
> > 
> > 
> > gdb shows the following backtrace.
> > 
> > tyr small_prog 122 /usr/local/gdb-7.6.1_64_gcc/bin/gdb 
> > /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec
> > GNU gdb (GDB) 7.6.1
> > ...
> > (gdb) run -np 1 init_finalize
> > Starting program: /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec -np 1 
> > init_finalize
> > [Thread debugging using libthread_db enabled]
> > [New Thread 1 (LWP 1)]
> > [New LWP    2        ]
> > [tyr:21158] *** Process received signal ***
> > [tyr:21158] Signal: Bus Error (10)
> > [tyr:21158] Signal code: Invalid address alignment (1)
> > [tyr:21158] Failing at address: ffffffff7fffd224
> > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_backtrace_print+0x2c
> > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:0xcd130
> > /lib/sparcv9/libc.so.1:0xd8b98
> > /lib/sparcv9/libc.so.1:0xcc70c
> > /lib/sparcv9/libc.so.1:0xcc918
> > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_db_hash.so:0x3ee8
> >  [ Signal 10 (BUS)]
> > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_db_base_store+0xc8
> > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_util_decode_pidmap+0x798
> > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_util_nidmap_init+0x3cc
> > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_ess_env.so:0x226c
> > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_init+0x308
> > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:ompi_mpi_init+0x31c
> > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:MPI_Init+0x2a8
> > /home/fd1026/SunOS/sparc/bin/init_finalize:main+0x10
> > /home/fd1026/SunOS/sparc/bin/init_finalize:_start+0x7c
> > [tyr:21158] *** End of error message ***
> > --------------------------------------------------------------------------
> > mpiexec noticed that process rank 0 with PID 21158 on node tyr exited on 
> > signal 10 (Bus Error).
> > --------------------------------------------------------------------------
> > [LWP    2         exited]
> > [New Thread 2        ]
> > [Switching to Thread 1 (LWP 1)]
> > sol_thread_fetch_registers: td_ta_map_id2thr: no thread can be found to 
> > satisfy query
> > (gdb) bt
> > #0  0xffffffff7f6173d0 in rtld_db_dlactivity () from 
> > /usr/lib/sparcv9/ld.so.1
> > #1  0xffffffff7f6175a8 in rd_event () from /usr/lib/sparcv9/ld.so.1
> > #2  0xffffffff7f618950 in lm_delete () from /usr/lib/sparcv9/ld.so.1
> > #3  0xffffffff7f6226bc in remove_so () from /usr/lib/sparcv9/ld.so.1
> > #4  0xffffffff7f624574 in remove_hdl () from /usr/lib/sparcv9/ld.so.1
> > #5  0xffffffff7f61d97c in dlclose_core () from /usr/lib/sparcv9/ld.so.1
> > #6  0xffffffff7f61d9d4 in dlclose_intn () from /usr/lib/sparcv9/ld.so.1
> > #7  0xffffffff7f61db0c in dlclose () from /usr/lib/sparcv9/ld.so.1
> > #8  0xffffffff7ec7748c in vm_close () from 
> > /usr/local/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6
> > #9  0xffffffff7ec74a6c in lt_dlclose () from 
> > /usr/local/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6
> > #10 0xffffffff7ec99b90 in ri_destructor (obj=0x1001ead30)
> >     at 
> > ../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_component_repository.c:391
> > #11 0xffffffff7ec984a8 in opal_obj_run_destructors (object=0x1001ead30)
> >     at ../../../../openmpi-1.8.2rc4r32485/opal/class/opal_object.h:446
> > #12 0xffffffff7ec9940c in mca_base_component_repository_release (
> >     component=0xffffffff7b023df0 <mca_oob_tcp_component>)
> >     at 
> > ../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_component_repository.c:244
> > #13 0xffffffff7ec9b754 in mca_base_component_unload (
> >     component=0xffffffff7b023df0 <mca_oob_tcp_component>, output_id=-1)
> >     at 
> > ../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_components_close.c:47
> > #14 0xffffffff7ec9b7e8 in mca_base_component_close (
> >     component=0xffffffff7b023df0 <mca_oob_tcp_component>, output_id=-1)
> >     at 
> > ../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_components_close.c:60
> > #15 0xffffffff7ec9b8bc in mca_base_components_close (output_id=-1, 
> >     components=0xffffffff7f12b930 <orte_oob_base_framework+80>, skip=0x0)
> >     at 
> > ../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_components_close.c:86
> > #16 0xffffffff7ec9b824 in mca_base_framework_components_close (
> >     framework=0xffffffff7f12b8e0 <orte_oob_base_framework>, skip=0x0)
> >     at 
> > ../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_components_close.c:66
> > #17 0xffffffff7efae21c in orte_oob_base_close ()
> >     at 
> > ../../../../openmpi-1.8.2rc4r32485/orte/mca/oob/base/oob_base_frame.c:94
> > #18 0xffffffff7ecb28cc in mca_base_framework_close (
> >     framework=0xffffffff7f12b8e0 <orte_oob_base_framework>)
> >     at 
> > ../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_framework.c:187
> > #19 0xffffffff7bf078c0 in rte_finalize ()
> >     at 
> > ../../../../../openmpi-1.8.2rc4r32485/orte/mca/ess/hnp/ess_hnp_module.c:858
> > #20 0xffffffff7ef30a44 in orte_finalize ()
> >     at ../../openmpi-1.8.2rc4r32485/orte/runtime/orte_finalize.c:65
> > #21 0x00000001000070c4 in orterun (argc=4, argv=0xffffffff7fffe0d8)
> >     at ../../../../openmpi-1.8.2rc4r32485/orte/tools/orterun/orterun.c:1096
> > #22 0x0000000100003d70 in main (argc=4, argv=0xffffffff7fffe0d8)
> >     at ../../../../openmpi-1.8.2rc4r32485/orte/tools/orterun/main.c:13
> > (gdb) 
> > 
> > 
> > Is this a new problem? I would be grateful if somebody could
> > fix it. Thank you very much for any help in advance.
> > 
> > Kind regards
> > 
> > Siegmar
Index: opal/mca/db/hash/db_hash.c
===================================================================
--- opal/mca/db/hash/db_hash.c	(revision 32498)
+++ opal/mca/db/hash/db_hash.c	(working copy)
@@ -249,7 +249,8 @@
             return OPAL_ERR_BAD_PARAM;
         }
         kv->type = OPAL_UINT64;
-        kv->data.uint64 = *(uint64_t*)(data);
+        /* to avoid alignment issues */
+        memcpy(&kv->data.uint64, data, 8);
         break;
     case OPAL_UINT32:
         if (NULL == data) {
@@ -257,7 +258,8 @@
             return OPAL_ERR_BAD_PARAM;
         }
         kv->type = OPAL_UINT32;
-        kv->data.uint32 = *(uint32_t*)data;
+        /* to avoid alignment issues */
+        memcpy(&kv->data.uint32, data, 4);
         break;
     case OPAL_UINT16:
         if (NULL == data) {
@@ -265,7 +267,8 @@
             return OPAL_ERR_BAD_PARAM;
         }
         kv->type = OPAL_UINT16;
-        kv->data.uint16 = *(uint16_t*)(data);
+        /* to avoid alignment issues */
+        memcpy(&kv->data.uint16, data, 2);
         break;
     case OPAL_INT:
         if (NULL == data) {
@@ -273,7 +276,8 @@
             return OPAL_ERR_BAD_PARAM;
         }
         kv->type = OPAL_INT;
-        kv->data.integer = *(int*)(data);
+        /* to avoid alignment issues */
+        memcpy(&kv->data.integer, data, sizeof(int));
         break;
     case OPAL_UINT:
         if (NULL == data) {
@@ -281,7 +285,8 @@
             return OPAL_ERR_BAD_PARAM;
         }
         kv->type = OPAL_UINT;
-        kv->data.uint = *(unsigned int*)(data);
+        /* to avoid alignment issues */
+        memcpy(&kv->data.uint, data, sizeof(unsigned int));
         break;
     case OPAL_FLOAT:
         if (NULL == data) {
