Hi Siegmar,

From the JVM logs, there is an alignment error in native_get_attr, but I could not find it by reading the source code.
Could you please do

ulimit -c unlimited
mpiexec ...

and then

gdb <your path to java>/bin/java core

and run "bt" on all threads until you get a line number in native_get_attr.

Thanks,

Gilles

Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:

>Hi,
>
>today I installed openmpi-dev-178-ga16c1e4 on Solaris 10 Sparc
>with gcc-4.9.1 and Java 8. Now a very simple Java program works
>as expected, but other Java programs still break. I removed the
>warnings about "shmem.jar" and used the following configure
>command.
>
>tyr openmpi-dev-178-ga16c1e4-SunOS.sparc.64_gcc 406 head config.log \
>    | grep openmpi
>$ ../openmpi-dev-178-ga16c1e4/configure
>    --prefix=/usr/local/openmpi-1.9.0_64_gcc
>    --libdir=/usr/local/openmpi-1.9.0_64_gcc/lib64
>    --with-jdk-bindir=/usr/local/jdk1.8.0/bin
>    --with-jdk-headers=/usr/local/jdk1.8.0/include
>    JAVA_HOME=/usr/local/jdk1.8.0
>    LDFLAGS=-m64 CC=gcc CXX=g++ FC=gfortran
>    CFLAGS=-m64 -D_REENTRANT CXXFLAGS=-m64 FCFLAGS=-m64
>    CPP=cpp CXXCPP=cpp
>    CPPFLAGS= -D_REENTRANT CXXCPPFLAGS=
>    --enable-mpi-cxx --enable-cxx-exceptions --enable-mpi-java
>    --enable-mpi-thread-multiple --with-threads=posix
>    --with-hwloc=internal
>    --without-verbs --with-wrapper-cflags=-std=c11 -m64
>    --with-wrapper-cxxflags=-m64 --enable-debug
>
>
>tyr java 290 ompi_info | grep -e "Open MPI repo revision:" -e "C compiler version:"
>    Open MPI repo revision: dev-178-ga16c1e4
>    C compiler version: 4.9.1
>
>
>> > regarding the BUS error reported by Siegmar, I also committed
>> > 62bde1fcb554079143030bb305512c236672386f
>> > in order to fix it (this is based on code review only, I have no sparc64
>> > hardware to test it is enough)
>>
>> I'll test it when a new nightly snapshot is available for the trunk.
>
>
>tyr java 291 mpijavac InitFinalizeMain.java
>tyr java 292 mpiexec -np 1 java InitFinalizeMain
>Hello!
>
>tyr java 293 mpijavac BcastIntMain.java
>tyr java 294 mpiexec -np 2 java BcastIntMain
>#
># A fatal error has been detected by the Java Runtime Environment:
>#
>#  SIGBUS (0xa) at pc=0xfffffffee3210bfc, pid=24792, tid=2
>...
>
>
>tyr java 296 /usr/local/gdb-7.6.1_64_gcc/bin/gdb mpiexec
>...
>(gdb) run -np 2 java BcastIntMain
>Starting program: /usr/local/openmpi-1.9.0_64_gcc/bin/mpiexec -np 2 java BcastIntMain
>[Thread debugging using libthread_db enabled]
>[New Thread 1 (LWP 1)]
>[New LWP 2]
>#
># A fatal error has been detected by the Java Runtime Environment:
>#
>#  SIGBUS (0xa) at pc=0xfffffffee3210bfc, pid=24814, tid=2
>#
># JRE version: Java(TM) SE Runtime Environment (8.0-b132) (build 1.8.0-b132)
># Java VM: Java HotSpot(TM) 64-Bit Server VM (25.0-b70 mixed mode solaris-sparc compressed oops)
># Problematic frame:
># C  [mca_pmix_native.so+0x10bfc]  native_get_attr+0x3000
>#
># Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
>#
># An error report file with more information is saved as:
># /home/fd1026/work/skripte/master/parallel/prog/mpi/java/hs_err_pid24814.log
>#
># A fatal error has been detected by the Java Runtime Environment:
>#
>#  SIGBUS (0xa) at pc=0xfffffffee3210bfc, pid=24812, tid=2
>#
># JRE version: Java(TM) SE Runtime Environment (8.0-b132) (build 1.8.0-b132)
># Java VM: Java HotSpot(TM) 64-Bit Server VM (25.0-b70 mixed mode solaris-sparc compressed oops)
># Problematic frame:
># C  [mca_pmix_native.so+0x10bfc]  native_get_attr+0x3000
>#
># Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
>#
># An error report file with more information is saved as:
># /home/fd1026/work/skripte/master/parallel/prog/mpi/java/hs_err_pid24812.log
>#
># If you would like to submit a bug report, please visit:
>#   http://bugreport.sun.com/bugreport/crash.jsp
># The crash happened outside the Java Virtual Machine in native code.
># See problematic frame for where to report the bug.
>#
>[tyr:24814] *** Process received signal ***
>[tyr:24814] Signal: Abort (6)
>[tyr:24814] Signal code:  (-1)
>/export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0.0.0:opal_backtrace_print+0x2c
>/export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0.0.0:0xdc2d4
>/lib/sparcv9/libc.so.1:0xd8b98
>/lib/sparcv9/libc.so.1:0xcc70c
>/lib/sparcv9/libc.so.1:0xcc918
>/lib/sparcv9/libc.so.1:0xdd2d0 [ Signal 6 (ABRT)]
>/lib/sparcv9/libc.so.1:_thr_sigsetmask+0x1c4
>/lib/sparcv9/libc.so.1:sigprocmask+0x28
>/lib/sparcv9/libc.so.1:_sigrelse+0x5c
>/lib/sparcv9/libc.so.1:abort+0xc0
>/export2/prog/SunOS_sparc/jdk1.8.0/jre/lib/sparcv9/server/libjvm.so:0xb3cb90
>/export2/prog/SunOS_sparc/jdk1.8.0/jre/lib/sparcv9/server/libjvm.so:0xd97a04
>/export2/prog/SunOS_sparc/jdk1.8.0/jre/lib/sparcv9/server/libjvm.so:JVM_handle_solaris_signal+0xc0c
>/export2/prog/SunOS_sparc/jdk1.8.0/jre/lib/sparcv9/server/libjvm.so:0xb44e84
>/lib/sparcv9/libc.so.1:0xd8b98
>/lib/sparcv9/libc.so.1:0xcc70c
>/lib/sparcv9/libc.so.1:0xcc918
>/export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/openmpi/mca_pmix_native.so:0x10bfc [ Signal 10 (BUS)]
>/export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/openmpi/mca_ess_pmi.so:0x33dc
>/export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-rte.so.0.0.0:orte_init+0x67c
>/export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libmpi.so.0.0.0:ompi_mpi_init+0x374
>/export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libmpi.so.0.0.0:PMPI_Init+0x2a8
>/export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libmpi_java.so.0.0.0:Java_mpi_MPI_Init_1jni+0x1a0
>0xffffffff6b810730
>0xffffffff6b8106d4
>0xffffffff6b8078a8
>0xffffffff6b8078a8
>0xffffffff6b80024c
>/export2/prog/SunOS_sparc/jdk1.8.0/jre/lib/sparcv9/server/libjvm.so:0x6fd4e8
>/export2/prog/SunOS_sparc/jdk1.8.0/jre/lib/sparcv9/server/libjvm.so:0x79331c
>/export2/prog/SunOS_sparc/jdk1.8.0/lib/sparcv9/jli/libjli.so:0x7290
>/lib/sparcv9/libc.so.1:0xd8a6c
>[tyr:24814] *** End of error message ***
>--------------------------------------------------------------------------
>mpiexec noticed that process rank 1 with PID 0 on node tyr exited on signal 6 (Abort).
>--------------------------------------------------------------------------
>[LWP 2 exited]
>[New Thread 2]
>[Switching to Thread 1 (LWP 1)]
>sol_thread_fetch_registers: td_ta_map_id2thr: no thread can be found to satisfy query
>(gdb) bt
>#0  0xffffffff7f6173d0 in rtld_db_dlactivity () from /usr/lib/sparcv9/ld.so.1
>#1  0xffffffff7f6175a8 in rd_event () from /usr/lib/sparcv9/ld.so.1
>#2  0xffffffff7f618950 in lm_delete () from /usr/lib/sparcv9/ld.so.1
>#3  0xffffffff7f6226bc in remove_so () from /usr/lib/sparcv9/ld.so.1
>#4  0xffffffff7f624574 in remove_hdl () from /usr/lib/sparcv9/ld.so.1
>#5  0xffffffff7f61d97c in dlclose_core () from /usr/lib/sparcv9/ld.so.1
>#6  0xffffffff7f61d9d4 in dlclose_intn () from /usr/lib/sparcv9/ld.so.1
>#7  0xffffffff7f61db0c in dlclose () from /usr/lib/sparcv9/ld.so.1
>#8  0xffffffff7ec87ca0 in vm_close ()
>    from /usr/local/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0
>#9  0xffffffff7ec85274 in lt_dlclose ()
>    from /usr/local/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0
>#10 0xffffffff7ecaa5dc in ri_destructor (obj=0x100187b70)
>    at ../../../../openmpi-dev-178-ga16c1e4/opal/mca/base/mca_base_component_repository.c:382
>#11 0xffffffff7eca8fd8 in opal_obj_run_destructors (object=0x100187b70)
>    at ../../../../openmpi-dev-178-ga16c1e4/opal/class/opal_object.h:446
>#12 0xffffffff7eca9eac in mca_base_component_repository_release (
>    component=0xffffffff7b1236f0 <mca_oob_tcp_component>)
>    at ../../../../openmpi-dev-178-ga16c1e4/opal/mca/base/mca_base_component_repository.c:240
>#13 0xffffffff7ecac17c in mca_base_component_unload (
>    component=0xffffffff7b1236f0 <mca_oob_tcp_component>, output_id=-1)
>    at ../../../../openmpi-dev-178-ga16c1e4/opal/mca/base/mca_base_components_close.c:47
>#14 0xffffffff7ecac210 in mca_base_component_close (
>    component=0xffffffff7b1236f0 <mca_oob_tcp_component>, output_id=-1)
>    at ../../../../openmpi-dev-178-ga16c1e4/opal/mca/base/mca_base_components_close.c:60
>#15 0xffffffff7ecac2e4 in mca_base_components_close (output_id=-1,
>    components=0xffffffff7f14bc58 <orte_oob_base_framework+80>, skip=0x0)
>    at ../../../../openmpi-dev-178-ga16c1e4/opal/mca/base/mca_base_components_close.c:86
>#16 0xffffffff7ecac24c in mca_base_framework_components_close (
>    framework=0xffffffff7f14bc08 <orte_oob_base_framework>, skip=0x0)
>    at ../../../../openmpi-dev-178-ga16c1e4/opal/mca/base/mca_base_components_close.c:66
>#17 0xffffffff7efcaf80 in orte_oob_base_close ()
>    at ../../../../openmpi-dev-178-ga16c1e4/orte/mca/oob/base/oob_base_frame.c:112
>#18 0xffffffff7ecc0d74 in mca_base_framework_close (
>    framework=0xffffffff7f14bc08 <orte_oob_base_framework>)
>    at ../../../../openmpi-dev-178-ga16c1e4/opal/mca/base/mca_base_framework.c:187
>#19 0xffffffff7be07858 in rte_finalize ()
>    at ../../../../../openmpi-dev-178-ga16c1e4/orte/mca/ess/hnp/ess_hnp_module.c:857
>#20 0xffffffff7ef338bc in orte_finalize ()
>    at ../../openmpi-dev-178-ga16c1e4/orte/runtime/orte_finalize.c:66
>#21 0x000000010000723c in orterun (argc=5, argv=0xffffffff7fffe0d8)
>    at ../../../../openmpi-dev-178-ga16c1e4/orte/tools/orterun/orterun.c:1103
>#22 0x0000000100003e80 in main (argc=5, argv=0xffffffff7fffe0d8)
>    at ../../../../openmpi-dev-178-ga16c1e4/orte/tools/orterun/main.c:13
>(gdb)
>
>
>I get the same error for C programs, if they use more than
>MPI_Init and MPI_Finalize.
>
>tyr small_prog 301 mpicc init_finalize.c
>tyr small_prog 302 mpiexec -np 1 a.out
>Hello!
>tyr small_prog 303 mpicc column_int.c
>tyr small_prog 306 /usr/local/gdb-7.6.1_64_gcc/bin/gdb mpiexec
>...
>(gdb) run -np 4 a.out
>Starting program: /usr/local/openmpi-1.9.0_64_gcc/bin/mpiexec -np 4 a.out
>[Thread debugging using libthread_db enabled]
>[New Thread 1 (LWP 1)]
>[New LWP 2]
>[tyr:24880] *** Process received signal ***
>[tyr:24880] Signal: Bus Error (10)
>[tyr:24880] Signal code: Invalid address alignment (1)
>[tyr:24880] Failing at address: ffffffff7bd1c10c
>/export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0.0.0:opal_backtrace_print+0x2c
>/export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0.0.0:0xdc2d4
>/lib/sparcv9/libc.so.1:0xd8b98
>/lib/sparcv9/libc.so.1:0xcc70c
>/lib/sparcv9/libc.so.1:0xcc918
>/export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/openmpi/mca_pmix_native.so:0x10684 [ Signal 10 (BUS)]
>/export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/openmpi/mca_ess_pmi.so:0x33dc
>/export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-rte.so.0.0.0:orte_init+0x67c
>/export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libmpi.so.0.0.0:ompi_mpi_init+0x374
>/export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libmpi.so.0.0.0:PMPI_Init+0x2a8
>/home/fd1026/work/skripte/master/parallel/prog/mpi/small_prog/a.out:main+0x20
>/home/fd1026/work/skripte/master/parallel/prog/mpi/small_prog/a.out:_start+0x7c
>[tyr:24880] *** End of error message ***
>[tyr:24876] *** Process received signal ***
>[tyr:24876] Signal: Bus Error (10)
>[tyr:24876] Signal code: Invalid address alignment (1)
>[tyr:24876] Failing at address: ffffffff7bd1c10c
>/export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0.0.0:opal_backtrace_print+0x2c
>/export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0.0.0:0xdc2d4
>/lib/sparcv9/libc.so.1:0xd8b98
>/lib/sparcv9/libc.so.1:0xcc70c
>/lib/sparcv9/libc.so.1:0xcc918
>/export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/openmpi/mca_pmix_native.so:0x10684 [ Signal 10 (BUS)]
>/export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/openmpi/mca_ess_pmi.so:0x33dc
>/export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libopen-rte.so.0.0.0:orte_init+0x67c
>/export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libmpi.so.0.0.0:ompi_mpi_init+0x374
>/export2/prog/SunOS_sparc/openmpi-1.9.0_64_gcc/lib64/libmpi.so.0.0.0:PMPI_Init+0x2a8
>/home/fd1026/work/skripte/master/parallel/prog/mpi/small_prog/a.out:main+0x20
>/home/fd1026/work/skripte/master/parallel/prog/mpi/small_prog/a.out:_start+0x7c
>[tyr:24876] *** End of error message ***
>--------------------------------------------------------------------------
>mpiexec noticed that process rank 2 with PID 0 on node tyr exited on signal 10 (Bus Error).
>--------------------------------------------------------------------------
>[LWP 2 exited]
>[New Thread 2]
>[Switching to Thread 1 (LWP 1)]
>sol_thread_fetch_registers: td_ta_map_id2thr: no thread can be found to satisfy query
>(gdb) bt
>#0  0xffffffff7f6173d0 in rtld_db_dlactivity () from /usr/lib/sparcv9/ld.so.1
>#1  0xffffffff7f6175a8 in rd_event () from /usr/lib/sparcv9/ld.so.1
>#2  0xffffffff7f618950 in lm_delete () from /usr/lib/sparcv9/ld.so.1
>#3  0xffffffff7f6226bc in remove_so () from /usr/lib/sparcv9/ld.so.1
>#4  0xffffffff7f624574 in remove_hdl () from /usr/lib/sparcv9/ld.so.1
>#5  0xffffffff7f61d97c in dlclose_core () from /usr/lib/sparcv9/ld.so.1
>#6  0xffffffff7f61d9d4 in dlclose_intn () from /usr/lib/sparcv9/ld.so.1
>#7  0xffffffff7f61db0c in dlclose () from /usr/lib/sparcv9/ld.so.1
>#8  0xffffffff7ec87ca0 in vm_close ()
>    from /usr/local/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0
>#9  0xffffffff7ec85274 in lt_dlclose ()
>    from /usr/local/openmpi-1.9.0_64_gcc/lib64/libopen-pal.so.0
>#10 0xffffffff7ecaa5dc in ri_destructor (obj=0x100187ae0)
>    at ../../../../openmpi-dev-178-ga16c1e4/opal/mca/base/mca_base_component_repository.c:382
>#11 0xffffffff7eca8fd8 in opal_obj_run_destructors (object=0x100187ae0)
>    at ../../../../openmpi-dev-178-ga16c1e4/opal/class/opal_object.h:446
>#12 0xffffffff7eca9eac in mca_base_component_repository_release (
>    component=0xffffffff7b0236f0 <mca_oob_tcp_component>)
>    at ../../../../openmpi-dev-178-ga16c1e4/opal/mca/base/mca_base_component_repository.c:240
>#13 0xffffffff7ecac17c in mca_base_component_unload (
>    component=0xffffffff7b0236f0 <mca_oob_tcp_component>, output_id=-1)
>    at ../../../../openmpi-dev-178-ga16c1e4/opal/mca/base/mca_base_components_close.c:47
>#14 0xffffffff7ecac210 in mca_base_component_close (
>    component=0xffffffff7b0236f0 <mca_oob_tcp_component>, output_id=-1)
>    at ../../../../openmpi-dev-178-ga16c1e4/opal/mca/base/mca_base_components_close.c:60
>#15 0xffffffff7ecac2e4 in mca_base_components_close (output_id=-1,
>    components=0xffffffff7f14bc58 <orte_oob_base_framework+80>, skip=0x0)
>    at ../../../../openmpi-dev-178-ga16c1e4/opal/mca/base/mca_base_components_close.c:86
>#16 0xffffffff7ecac24c in mca_base_framework_components_close (
>    framework=0xffffffff7f14bc08 <orte_oob_base_framework>, skip=0x0)
>    at ../../../../openmpi-dev-178-ga16c1e4/opal/mca/base/mca_base_components_close.c:66
>#17 0xffffffff7efcaf80 in orte_oob_base_close ()
>    at ../../../../openmpi-dev-178-ga16c1e4/orte/mca/oob/base/oob_base_frame.c:112
>#18 0xffffffff7ecc0d74 in mca_base_framework_close (
>    framework=0xffffffff7f14bc08 <orte_oob_base_framework>)
>    at ../../../../openmpi-dev-178-ga16c1e4/opal/mca/base/mca_base_framework.c:187
>#19 0xffffffff7bd07858 in rte_finalize ()
>    at ../../../../../openmpi-dev-178-ga16c1e4/orte/mca/ess/hnp/ess_hnp_module.c:857
>#20 0xffffffff7ef338bc in orte_finalize ()
>    at ../../openmpi-dev-178-ga16c1e4/orte/runtime/orte_finalize.c:66
>#21 0x000000010000723c in orterun (argc=4, argv=0xffffffff7fffe0e8)
>    at ../../../../openmpi-dev-178-ga16c1e4/orte/tools/orterun/orterun.c:1103
>#22 0x0000000100003e80 in main (argc=4, argv=0xffffffff7fffe0e8)
>    at ../../../../openmpi-dev-178-ga16c1e4/orte/tools/orterun/main.c:13
>(gdb)
>
>
>Do you need any other information?
>
>
>Kind regards
>
>Siegmar
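For reference, the core-file workflow Gilles asks for at the top of the thread would look roughly like this. The java path follows Siegmar's --with-jdk-bindir setting and the test case is his BcastIntMain; both are placeholders for whatever actually crashed:

```shell
# Allow core dumps in this shell, then reproduce the crash.
ulimit -c unlimited
mpiexec -np 2 java BcastIntMain

# Open the resulting core file against the java binary that crashed.
gdb /usr/local/jdk1.8.0/bin/java core

# Inside gdb, backtrace every thread and look for a frame with a
# source line number in native_get_attr (mca_pmix_native.so):
#   (gdb) thread apply all bt
```

"thread apply all bt" is the standard gdb way to dump all thread stacks at once, which matters here because the SIGBUS is raised on a secondary JVM thread (tid=2).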