I will have a look at it today. How did you configure Open MPI?
Cheers,

Gilles

On Thursday, July 7, 2016, Gundram Leifert <gundram.leif...@uni-rostock.de> wrote:
> Hello Gilles,
>
> thank you for your hints! I made 3 changes, but unfortunately the same error occurs:
>
> update ompi:
> commit ae8444682f0a7aa158caea08800542ce9874455e
> Author: Ralph Castain <r...@open-mpi.org>
> Date: Tue Jul 5 20:07:16 2016 -0700
>
> update java:
> java version "1.8.0_92"
> Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
> Java HotSpot(TM) Server VM (build 25.92-b14, mixed mode)
>
> delete the hashcode lines.
>
> Now I get this error message every time (100%), after a varying number of iterations (15-300):
>
>  0/ 3:length = 100000000
>  0/ 3:bcast length done (length = 100000000)
>  1/ 3:bcast length done (length = 100000000)
>  2/ 3:bcast length done (length = 100000000)
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> # SIGSEGV (0xb) at pc=0x00002b3d022fcd24, pid=16578, tid=0x00002b3d29716700
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_92-b14) (build 1.8.0_92-b14)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.92-b14 mixed mode linux-amd64 compressed oops)
> # Problematic frame:
> # V [libjvm.so+0x414d24] ciEnv::get_field_by_index(ciInstanceKlass*, int)+0x94
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
> #
> # An error report file with more information is saved as:
> # /home/gl069/ompi/bin/executor/hs_err_pid16578.log
> #
> # Compiler replay data is saved as:
> # /home/gl069/ompi/bin/executor/replay_pid16578.log
> #
> # If you would like to submit a bug report, please visit:
> # http://bugreport.java.com/bugreport/crash.jsp
> #
> [titan01:16578] *** Process received signal ***
> [titan01:16578] Signal: Aborted (6)
> [titan01:16578] Signal code: (-6)
> [titan01:16578] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b3d01500100]
> [titan01:16578] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b3d01b5c5f7]
> [titan01:16578] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b3d01b5dce8]
> [titan01:16578] [ 3] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91e605)[0x2b3d02806605]
> [titan01:16578] [ 4] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0xabda63)[0x2b3d029a5a63]
> [titan01:16578] [ 5] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x14f)[0x2b3d0280be2f]
> [titan01:16578] [ 6] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91a5c3)[0x2b3d028025c3]
> [titan01:16578] [ 7] /usr/lib64/libc.so.6(+0x35670)[0x2b3d01b5c670]
> [titan01:16578] [ 8] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x414d24)[0x2b3d022fcd24]
> [titan01:16578] [ 9] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x43c5ae)[0x2b3d023245ae]
> [titan01:16578] [10] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x369ade)[0x2b3d02251ade]
> [titan01:16578] [11] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36eda0)[0x2b3d02256da0]
> [titan01:16578] [12] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
> [titan01:16578] [13] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
> [titan01:16578] [14] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
> [titan01:16578] [15] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
> [titan01:16578] [16] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
> [titan01:16578] [17] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
> [titan01:16578] [18] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
> [titan01:16578] [19] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
> [titan01:16578] [20] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
> [titan01:16578] [21] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
> [titan01:16578] [22] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3708c2)[0x2b3d022588c2]
> [titan01:16578] [23] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3724e7)[0x2b3d0225a4e7]
> [titan01:16578] [24] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a817)[0x2b3d02262817]
> [titan01:16578] [25] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a92f)[0x2b3d0226292f]
> [titan01:16578] [26] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x358edb)[0x2b3d02240edb]
> [titan01:16578] [27] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35929e)[0x2b3d0224129e]
> [titan01:16578] [28] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3593ce)[0x2b3d022413ce]
> [titan01:16578] [29] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35973e)[0x2b3d0224173e]
> [titan01:16578] *** End of error message ***
> -------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that process rank 2 with PID 0 on node titan01 exited on
> signal 6 (Aborted).
> --------------------------------------------------------------------------
>
> I don't know whether it is a problem of Java or of Open MPI, but for the last few years Java has worked on my machine without problems...
>
> Thank you for your tips in advance!
> Gundram
>
> On 07/06/2016 03:10 PM, Gilles Gouaillardet wrote:
>
> Note that a race condition in MPI_Init was fixed yesterday in the master.
> Can you please update your Open MPI and try again?
>
> Hopefully the hang will disappear.
>
> Can you reproduce the crash with a simpler (and ideally deterministic) version of your program?
> The crash occurs in hashcode, and this makes little sense to me. Can you also update your JDK?
>
> Cheers,
>
> Gilles
>
> On Wednesday, July 6, 2016, Gundram Leifert <gundram.leif...@uni-rostock.de> wrote:
>
>> Hello Jason,
>>
>> thanks for your response! I think it is another problem. I am trying to send
>> 100 MB of bytes, so there are not many tries (between 10 and 30). I realized
>> that executing this code can result in 3 different errors:
>>
>> 1. Most often, the posted error message occurs.
>>
>> 2. In <10% of the cases I get a livelock. I can see 3 Java processes, one
>> with 200% and two with 100% processor utilization. After ~15 minutes
>> without new system output, this error occurs.
>>
>> [thread 47499823949568 also had an error]
>> # A fatal error has been detected by the Java Runtime Environment:
>> #
>> # Internal Error (safepoint.cpp:317), pid=24256, tid=47500347131648
>> # guarantee(PageArmed == 0) failed: invariant
>> #
>> # JRE version: 7.0_25-b15
>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode linux-amd64 compressed oops)
>> # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
>> #
>> # An error report file with more information is saved as:
>> # /home/gl069/ompi/bin/executor/hs_err_pid24256.log
>> #
>> # If you would like to submit a bug report, please visit:
>> # http://bugreport.sun.com/bugreport/crash.jsp
>> #
>> [titan01:24256] *** Process received signal ***
>> [titan01:24256] Signal: Aborted (6)
>> [titan01:24256] Signal code: (-6)
>> [titan01:24256] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b336a324100]
>> [titan01:24256] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b336a9815f7]
>> [titan01:24256] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b336a982ce8]
>> [titan01:24256] [ 3] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b336b44fac5]
>> [titan01:24256] [ 4] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b336b5af137]
>> [titan01:24256] [ 5] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x407262)[0x2b336b114262]
>> [titan01:24256] [ 6] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x7c6c34)[0x2b336b4d3c34]
>> [titan01:24256] [ 7] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a9c17)[0x2b336b5b6c17]
>> [titan01:24256] [ 8] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8aa2c0)[0x2b336b5b72c0]
>> [titan01:24256] [ 9] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x744270)[0x2b336b451270]
>> [titan01:24256] [10] /usr/lib64/libpthread.so.0(+0x7dc5)[0x2b336a31cdc5]
>> [titan01:24256] [11] /usr/lib64/libc.so.6(clone+0x6d)[0x2b336aa4228d]
>> [titan01:24256] *** End of error message ***
>> -------------------------------------------------------
>> Primary job terminated normally, but 1 process returned
>> a non-zero exit code. Per user-direction, the job has been aborted.
>> -------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 0 with PID 0 on node titan01 exited on
>> signal 6 (Aborted).
>> --------------------------------------------------------------------------
>>
>> 3. In <10% of the cases I get a deadlock during MPI.Init. It stays there for
>> more than 15 minutes without returning an error message...
>>
>> Can I enable some debug flags to see what happens on the C / Open MPI side?
>>
>> Thanks in advance for your help!
>> Gundram Leifert
>>
>> On 07/05/2016 06:05 PM, Jason Maldonis wrote:
>>
>> After reading your thread, it looks like it may be related to an issue I had
>> a few weeks ago (I'm a novice though). Maybe my thread will be of help:
>> https://www.open-mpi.org/community/lists/users/2016/06/29425.php
>>
>> When you say "After a specific number of repetitions the process either
>> hangs up or returns with a SIGSEGV," do you mean that a single call
>> hangs, or that at some point during the for loop a call hangs? If you mean
>> the latter, then it might relate to my issue.
>> Otherwise my thread probably won't be helpful.
>>
>> Jason Maldonis
>> Research Assistant of Professor Paul Voyles
>> Materials Science Grad Student
>> University of Wisconsin, Madison
>> 1509 University Ave, Rm M142
>> Madison, WI 53706
>> maldo...@wisc.edu
>> 608-295-5532
>>
>> On Tue, Jul 5, 2016 at 9:58 AM, Gundram Leifert <gundram.leif...@uni-rostock.de> wrote:
>>
>>> Hello,
>>>
>>> I am trying to send many byte arrays via broadcast. After a specific number of
>>> repetitions the process either hangs up or returns with a SIGSEGV. Can anyone
>>> help me solve the problem?
>>>
>>> ########## The code:
>>>
>>> import java.util.Random;
>>> import mpi.*;
>>>
>>> public class TestSendBigFiles {
>>>
>>>     public static void log(String msg) {
>>>         try {
>>>             System.err.println(String.format("%2d/%2d:%s", MPI.COMM_WORLD.getRank(), MPI.COMM_WORLD.getSize(), msg));
>>>         } catch (MPIException ex) {
>>>             System.err.println(String.format("%2s/%2s:%s", "?", "?", msg));
>>>         }
>>>     }
>>>
>>>     private static int hashcode(byte[] bytearray) {
>>>         if (bytearray == null) {
>>>             return 0;
>>>         }
>>>         int hash = 39;
>>>         for (int i = 0; i < bytearray.length; i++) {
>>>             byte b = bytearray[i];
>>>             hash = hash * 7 + (int) b;
>>>         }
>>>         return hash;
>>>     }
>>>
>>>     public static void main(String args[]) throws MPIException {
>>>         log("start main");
>>>         MPI.Init(args);
>>>         try {
>>>             log("initialized done");
>>>             byte[] saveMem = new byte[100000000];
>>>             MPI.COMM_WORLD.barrier();
>>>             Random r = new Random();
>>>             r.nextBytes(saveMem);
>>>             if (MPI.COMM_WORLD.getRank() == 0) {
>>>                 for (int i = 0; i < 1000; i++) {
>>>                     saveMem[r.nextInt(saveMem.length)]++;
>>>                     log("i = " + i);
>>>                     int[] lengthData = new int[]{saveMem.length};
>>>                     log("object hash = " + hashcode(saveMem));
>>>                     log("length = " + lengthData[0]);
>>>                     MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
>>>                     log("bcast length done (length = " + lengthData[0] + ")");
>>>                     MPI.COMM_WORLD.barrier();
>>>                     MPI.COMM_WORLD.bcast(saveMem, lengthData[0], MPI.BYTE, 0);
>>>                     log("bcast data done");
>>>                     MPI.COMM_WORLD.barrier();
>>>                 }
>>>                 MPI.COMM_WORLD.bcast(new int[]{0}, 1, MPI.INT, 0);
>>>             } else {
>>>                 while (true) {
>>>                     int[] lengthData = new int[1];
>>>                     MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
>>>                     log("bcast length done (length = " + lengthData[0] + ")");
>>>                     if (lengthData[0] == 0) {
>>>                         break;
>>>                     }
>>>                     MPI.COMM_WORLD.barrier();
>>>                     saveMem = new byte[lengthData[0]];
>>>                     MPI.COMM_WORLD.bcast(saveMem, saveMem.length, MPI.BYTE, 0);
>>>                     log("bcast data done");
>>>                     MPI.COMM_WORLD.barrier();
>>>                     log("object hash = " + hashcode(saveMem));
>>>                 }
>>>             }
>>>             MPI.COMM_WORLD.barrier();
>>>         } catch (MPIException ex) {
>>>             System.out.println("caught error." + ex);
>>>             log(ex.getMessage());
>>>         } catch (RuntimeException ex) {
>>>             System.out.println("caught error." + ex);
>>>             log(ex.getMessage());
>>>         } finally {
>>>             MPI.Finalize();
>>>         }
>>>
>>>     }
>>>
>>> }
>>>
>>>
>>> ############ The Error (if it does not just hang up):
>>>
>>> #
>>> # A fatal error has been detected by the Java Runtime Environment:
>>> #
>>> # SIGSEGV (0xb) at pc=0x00002b7e9c86e3a1, pid=1172, tid=47822674495232
>>> #
>>> #
>>> # A fatal error has been detected by the Java Runtime Environment:
>>> # JRE version: 7.0_25-b15
>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode linux-amd64 compressed oops)
>>> # Problematic frame:
>>> # #
>>> # SIGSEGV (0xb) at pc=0x00002af69c0693a1, pid=1173, tid=47238546896640
>>> #
>>> # JRE version: 7.0_25-b15
>>> J de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
>>> #
>>> # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
>>> #
>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode linux-amd64 compressed oops)
>>> # Problematic frame:
>>> # J de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
>>> #
>>> # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
>>> #
>>> # An error report file with more information is saved as:
>>> # /home/gl069/ompi/bin/executor/hs_err_pid1172.log
>>> # An error report file with more information is saved as:
>>> # /home/gl069/ompi/bin/executor/hs_err_pid1173.log
>>> #
>>> # If you would like to submit a bug report, please visit:
>>> # http://bugreport.sun.com/bugreport/crash.jsp
>>> #
>>> #
>>> # If you would like to submit a bug report, please visit:
>>> # http://bugreport.sun.com/bugreport/crash.jsp
>>> #
>>> [titan01:01172] *** Process received signal ***
>>> [titan01:01172] Signal: Aborted (6)
>>> [titan01:01172] Signal code: (-6)
>>> [titan01:01173] *** Process received signal ***
>>> [titan01:01173] Signal: Aborted (6)
>>> [titan01:01173] Signal code: (-6)
>>> [titan01:01172] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b7e9596a100]
>>> [titan01:01172] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b7e95fc75f7]
>>> [titan01:01172] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b7e95fc8ce8]
>>> [titan01:01172] [ 3] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b7e96a95ac5]
>>> [titan01:01172] [ 4] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b7e96bf5137]
>>> [titan01:01172] [ 5] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2b7e96a995e0]
>>> [titan01:01172] [ 6] [titan01:01173] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2af694ded100]
>>> [titan01:01173] [ 1] /usr/lib64/libc.so.6(+0x35670)[0x2b7e95fc7670]
>>> [titan01:01172] [ 7] [0x2b7e9c86e3a1]
>>> [titan01:01172] *** End of error message ***
>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2af69544a5f7]
>>> [titan01:01173] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2af69544bce8]
>>> [titan01:01173] [ 3] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2af695f18ac5]
>>> [titan01:01173] [ 4] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2af696078137]
>>> [titan01:01173] [ 5] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2af695f1c5e0]
>>> [titan01:01173] [ 6] /usr/lib64/libc.so.6(+0x35670)[0x2af69544a670]
>>> [titan01:01173] [ 7] [0x2af69c0693a1]
>>> [titan01:01173] *** End of error message ***
>>> -------------------------------------------------------
>>> Primary job terminated normally, but 1 process returned
>>> a non-zero exit code. Per user-direction, the job has been aborted.
>>> -------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 1 with PID 0 on node titan01 exited on
>>> signal 6 (Aborted).
>>>
>>> ######## CONFIGURATION:
>>> I used the ompi master sources from github:
>>> commit 267821f0dd405b5f4370017a287d9a49f92e734a
>>> Author: Gilles Gouaillardet <gil...@rist.or.jp>
>>> Date: Tue Jul 5 13:47:50 2016 +0900
>>>
>>> ./configure --enable-mpi-java --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25 --disable-dlopen --disable-mca-dso
>>>
>>> Thanks a lot for your help!
>>> Gundram
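
For reference, below is a minimal sketch of the kind of simpler, deterministic reproducer Gilles asks for above. It keeps only the bcast/barrier pattern of TestSendBigFiles and drops the hashcode and the length handshake; the class name, the fixed seed, and the iteration count are placeholders chosen for illustration, not taken from the thread.

import java.util.Random;
import mpi.*;

// Hypothetical stripped-down variant of TestSendBigFiles: fixed seed,
// fixed payload size, fixed iteration count, no hashcode and no length
// handshake, so every run performs the same sequence of collectives.
public class TestBcastDeterministic {

    private static final int SIZE = 100000000; // same 100 MB payload as the original test
    private static final int ITERATIONS = 50;  // arbitrary; raise it if the crash needs more repetitions

    public static void main(String[] args) throws MPIException {
        MPI.Init(args);
        try {
            int rank = MPI.COMM_WORLD.getRank();
            byte[] buf = new byte[SIZE];
            new Random(42).nextBytes(buf); // deterministic buffer contents on every rank
            for (int i = 0; i < ITERATIONS; i++) {
                // rank 0 broadcasts the full buffer, then all ranks synchronize
                MPI.COMM_WORLD.bcast(buf, SIZE, MPI.BYTE, 0);
                MPI.COMM_WORLD.barrier();
                if (rank == 0) {
                    System.err.println("iteration " + i + " done");
                }
            }
        } finally {
            MPI.Finalize();
        }
    }
}

It should build and run the same way as the original test, e.g. compiled with mpijavac and launched with something like "mpirun -np 3 java TestBcastDeterministic".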