I will have a look at it today

How did you configure OpenMPI?

Cheers,

Gilles

On Thursday, July 7, 2016, Gundram Leifert <gundram.leif...@uni-rostock.de>
wrote:

> Hello Gilles,
>
> thank you for your hints! I made 3 changes, but unfortunately the same
> error occurs:
>
> update ompi:
> commit ae8444682f0a7aa158caea08800542ce9874455e
> Author: Ralph Castain <r...@open-mpi.org>
> Date:   Tue Jul 5 20:07:16 2016 -0700
>
> update java:
> java version "1.8.0_92"
> Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
> Java HotSpot(TM) Server VM (build 25.92-b14, mixed mode)
>
> delete hashcode-lines.
>
> Now I get this error message every time, after a varying number of
> iterations (15-300):
>
>  0/ 3:length = 100000000
>  0/ 3:bcast length done (length = 100000000)
>  1/ 3:bcast length done (length = 100000000)
>  2/ 3:bcast length done (length = 100000000)
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x00002b3d022fcd24, pid=16578,
> tid=0x00002b3d29716700
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_92-b14) (build
> 1.8.0_92-b14)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.92-b14 mixed mode
> linux-amd64 compressed oops)
> # Problematic frame:
> # V  [libjvm.so+0x414d24]  ciEnv::get_field_by_index(ciInstanceKlass*,
> int)+0x94
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core
> dumping, try "ulimit -c unlimited" before starting Java again
> #
> # An error report file with more information is saved as:
> # /home/gl069/ompi/bin/executor/hs_err_pid16578.log
> #
> # Compiler replay data is saved as:
> # /home/gl069/ompi/bin/executor/replay_pid16578.log
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.java.com/bugreport/crash.jsp
> #
> [titan01:16578] *** Process received signal ***
> [titan01:16578] Signal: Aborted (6)
> [titan01:16578] Signal code:  (-6)
> [titan01:16578] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b3d01500100]
> [titan01:16578] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b3d01b5c5f7]
> [titan01:16578] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b3d01b5dce8]
> [titan01:16578] [ 3]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91e605)[0x2b3d02806605]
> [titan01:16578] [ 4]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0xabda63)[0x2b3d029a5a63]
> [titan01:16578] [ 5]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x14f)[0x2b3d0280be2f]
> [titan01:16578] [ 6]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91a5c3)[0x2b3d028025c3]
> [titan01:16578] [ 7] /usr/lib64/libc.so.6(+0x35670)[0x2b3d01b5c670]
> [titan01:16578] [ 8]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x414d24)[0x2b3d022fcd24]
> [titan01:16578] [ 9]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x43c5ae)[0x2b3d023245ae]
> [titan01:16578] [10]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x369ade)[0x2b3d02251ade]
> [titan01:16578] [11]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36eda0)[0x2b3d02256da0]
> [titan01:16578] [12]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
> [titan01:16578] [13]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
> [titan01:16578] [14]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
> [titan01:16578] [15]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
> [titan01:16578] [16]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
> [titan01:16578] [17]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
> [titan01:16578] [18]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
> [titan01:16578] [19]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
> [titan01:16578] [20]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
> [titan01:16578] [21]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
> [titan01:16578] [22]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3708c2)[0x2b3d022588c2]
> [titan01:16578] [23]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3724e7)[0x2b3d0225a4e7]
> [titan01:16578] [24]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a817)[0x2b3d02262817]
> [titan01:16578] [25]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a92f)[0x2b3d0226292f]
> [titan01:16578] [26]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x358edb)[0x2b3d02240edb]
> [titan01:16578] [27]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35929e)[0x2b3d0224129e]
> [titan01:16578] [28]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3593ce)[0x2b3d022413ce]
> [titan01:16578] [29]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35973e)[0x2b3d0224173e]
> [titan01:16578] *** End of error message ***
> -------------------------------------------------------
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that process rank 2 with PID 0 on node titan01 exited on
> signal 6 (Aborted).
> --------------------------------------------------------------------------
>
> I don't know if it is a problem of Java or OMPI, but in recent years
> Java has worked without problems on my machine...
>
> Thank you for your tips in advance!
> Gundram
>
> On 07/06/2016 03:10 PM, Gilles Gouaillardet wrote:
>
> Note that a race condition in MPI_Init was fixed yesterday in master.
> Can you please update your OpenMPI and try again?
>
> Hopefully the hang will disappear.
>
> Can you reproduce the crash with a simpler (and ideally deterministic)
> version of your program? The crash occurs in hashcode, which makes little
> sense to me. Can you also update your JDK?
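> As a quick experiment (my suggestion, not a confirmed fix), the hand-rolled
> hash loop could also be swapped for the JDK's own java.util.Arrays.hashCode,
> which computes the same kind of polynomial hash, to rule that one method out:

```java
import java.util.Arrays;

public class HashcodeCheck {
    public static void main(String[] args) {
        byte[] data = new byte[]{1, 2, 3};
        // Arrays.hashCode(byte[]) is the library equivalent of the
        // hand-rolled loop: a polynomial hash with multiplier 31.
        System.out.println(Arrays.hashCode(data));          // 30817
        // Defined corner cases: null maps to 0, an empty array to 1.
        System.out.println(Arrays.hashCode((byte[]) null)); // 0
        System.out.println(Arrays.hashCode(new byte[0]));   // 1
    }
}
```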
>
> Cheers,
>
> Gilles
>
> On Wednesday, July 6, 2016, Gundram Leifert <
> gundram.leif...@uni-rostock.de> wrote:
>
>> Hello Jason,
>>
>> thanks for your response! I think it is a different problem. I try to send
>> 100 MB of bytes, so there are not many iterations (between 10 and 30). I
>> realized that executing this code can result in 3 different errors:
>>
>> 1. Most often, the posted error message occurs.
>>
>> 2. In <10% of the cases I get a livelock. I can see 3 Java processes, one
>> at 200% and two at 100% CPU utilization. After ~15 minutes without new
>> output, this error occurs.
>>
>>
>> [thread 47499823949568 also had an error]
>> # A fatal error has been detected by the Java Runtime Environment:
>> #
>> #  Internal Error (safepoint.cpp:317), pid=24256, tid=47500347131648
>> #  guarantee(PageArmed == 0) failed: invariant
>> #
>> # JRE version: 7.0_25-b15
>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode
>> linux-amd64 compressed oops)
>> # Failed to write core dump. Core dumps have been disabled. To enable
>> core dumping, try "ulimit -c unlimited" before starting Java again
>> #
>> # An error report file with more information is saved as:
>> # /home/gl069/ompi/bin/executor/hs_err_pid24256.log
>> #
>> # If you would like to submit a bug report, please visit:
>> #   http://bugreport.sun.com/bugreport/crash.jsp
>> #
>> [titan01:24256] *** Process received signal ***
>> [titan01:24256] Signal: Aborted (6)
>> [titan01:24256] Signal code:  (-6)
>> [titan01:24256] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b336a324100]
>> [titan01:24256] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b336a9815f7]
>> [titan01:24256] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b336a982ce8]
>> [titan01:24256] [ 3]
>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b336b44fac5]
>> [titan01:24256] [ 4]
>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b336b5af137]
>> [titan01:24256] [ 5]
>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x407262)[0x2b336b114262]
>> [titan01:24256] [ 6]
>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x7c6c34)[0x2b336b4d3c34]
>> [titan01:24256] [ 7]
>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a9c17)[0x2b336b5b6c17]
>> [titan01:24256] [ 8]
>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8aa2c0)[0x2b336b5b72c0]
>> [titan01:24256] [ 9]
>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x744270)[0x2b336b451270]
>> [titan01:24256] [10] /usr/lib64/libpthread.so.0(+0x7dc5)[0x2b336a31cdc5]
>> [titan01:24256] [11] /usr/lib64/libc.so.6(clone+0x6d)[0x2b336aa4228d]
>> [titan01:24256] *** End of error message ***
>> -------------------------------------------------------
>> Primary job  terminated normally, but 1 process returned
>> a non-zero exit code. Per user-direction, the job has been aborted.
>> -------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 0 with PID 0 on node titan01 exited on
>> signal 6 (Aborted).
>> --------------------------------------------------------------------------
>>
>>
>> 3. In <10% of the cases I get a deadlock during MPI.Init. It hangs for
>> more than 15 minutes without returning an error message...
>>
>> Can I enable some debug flags to see what happens on the C / OpenMPI side?
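>> A few candidates (my guesses at the relevant knobs; `ompi_info --param all
>> all` lists the actual MCA parameter names):

```shell
# Enable core dumps first; the JVM reported they are disabled.
ulimit -c unlimited

# Ask Open MPI's collective and transport layers for verbose output
# (MCA parameter names assumed; verify with ompi_info):
mpirun -np 3 --mca coll_base_verbose 100 --mca btl_base_verbose 100 \
    java TestSendBigFiles

# The SIGSEGV points into JIT-compiled code, so interpreter-only mode
# (-Xint) would show whether JIT compilation is part of the problem:
mpirun -np 3 java -Xint TestSendBigFiles

# For the livelock case, jstack (shipped with the JDK) can dump the
# spinning processes' thread stacks (the PID below is an example):
jstack 24256
```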
>>
>> Thanks in advance for your help!
>> Gundram Leifert
>>
>>
>> On 07/05/2016 06:05 PM, Jason Maldonis wrote:
>>
>> After reading your thread, it looks like it may be related to an issue I
>> had a few weeks ago (I'm a novice though). Maybe my thread will be of help:
>> https://www.open-mpi.org/community/lists/users/2016/06/29425.php
>>
>> When you say "After a specific number of repetitions the process either
>> hangs up or returns with a SIGSEGV," do you mean that a single call
>> hangs, or that at some point during the for loop a call hangs? If you mean
>> the latter, then it might relate to my issue; otherwise my thread probably
>> won't be helpful.
>>
>> Jason Maldonis
>> Research Assistant of Professor Paul Voyles
>> Materials Science Grad Student
>> University of Wisconsin, Madison
>> 1509 University Ave, Rm M142
>> Madison, WI 53706
>> maldo...@wisc.edu
>> 608-295-5532
>>
>> On Tue, Jul 5, 2016 at 9:58 AM, Gundram Leifert <
>> gundram.leif...@uni-rostock.de> wrote:
>>
>>> Hello,
>>>
>>> I try to send many byte arrays via broadcast. After a specific number of
>>> repetitions the process either hangs or returns with a SIGSEGV. Can
>>> anyone help me solve the problem?
>>>
>>> ########## The code:
>>>
>>> import java.util.Random;
>>> import mpi.*;
>>>
>>> public class TestSendBigFiles {
>>>
>>>     public static void log(String msg) {
>>>         try {
>>>             System.err.println(String.format("%2d/%2d:%s",
>>> MPI.COMM_WORLD.getRank(), MPI.COMM_WORLD.getSize(), msg));
>>>         } catch (MPIException ex) {
>>>             System.err.println(String.format("%2s/%2s:%s", "?", "?",
>>> msg));
>>>         }
>>>     }
>>>
>>>     private static int hashcode(byte[] bytearray) {
>>>         if (bytearray == null) {
>>>             return 0;
>>>         }
>>>         int hash = 39;
>>>         for (int i = 0; i < bytearray.length; i++) {
>>>             byte b = bytearray[i];
>>>             hash = hash * 7 + (int) b;
>>>         }
>>>         return hash;
>>>     }
>>>
>>>     public static void main(String args[]) throws MPIException {
>>>         log("start main");
>>>         MPI.Init(args);
>>>         try {
>>>             log("initialized done");
>>>             byte[] saveMem = new byte[100000000];
>>>             MPI.COMM_WORLD.barrier();
>>>             Random r = new Random();
>>>             r.nextBytes(saveMem);
>>>             if (MPI.COMM_WORLD.getRank() == 0) {
>>>                 for (int i = 0; i < 1000; i++) {
>>>                     saveMem[r.nextInt(saveMem.length)]++;
>>>                     log("i = " + i);
>>>                     int[] lengthData = new int[]{saveMem.length};
>>>                     log("object hash = " + hashcode(saveMem));
>>>                     log("length = " + lengthData[0]);
>>>                     MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
>>>                     log("bcast length done (length = " + lengthData[0] +
>>> ")");
>>>                     MPI.COMM_WORLD.barrier();
>>>                     MPI.COMM_WORLD.bcast(saveMem, lengthData[0],
>>> MPI.BYTE, 0);
>>>                     log("bcast data done");
>>>                     MPI.COMM_WORLD.barrier();
>>>                 }
>>>                 MPI.COMM_WORLD.bcast(new int[]{0}, 1, MPI.INT, 0);
>>>             } else {
>>>                 while (true) {
>>>                     int[] lengthData = new int[1];
>>>                     MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
>>>                     log("bcast length done (length = " + lengthData[0] +
>>> ")");
>>>                     if (lengthData[0] == 0) {
>>>                         break;
>>>                     }
>>>                     MPI.COMM_WORLD.barrier();
>>>                     saveMem = new byte[lengthData[0]];
>>>                     MPI.COMM_WORLD.bcast(saveMem, saveMem.length,
>>> MPI.BYTE, 0);
>>>                     log("bcast data done");
>>>                     MPI.COMM_WORLD.barrier();
>>>                     log("object hash = " + hashcode(saveMem));
>>>                 }
>>>             }
>>>             MPI.COMM_WORLD.barrier();
>>>         } catch (MPIException ex) {
>>>             System.out.println("caught error: " + ex);
>>>             log(ex.getMessage());
>>>         } catch (RuntimeException ex) {
>>>             System.out.println("caught error: " + ex);
>>>             log(ex.getMessage());
>>>         } finally {
>>>             MPI.Finalize();
>>>         }
>>>
>>>     }
>>>
>>> }
>>>
>>>
>>> ############ The Error (if it does not just hang up):
>>>
>>> #
>>> # A fatal error has been detected by the Java Runtime Environment:
>>> #
>>> #  SIGSEGV (0xb) at pc=0x00002b7e9c86e3a1, pid=1172, tid=47822674495232
>>> #
>>> #
>>> # A fatal error has been detected by the Java Runtime Environment:
>>> # JRE version: 7.0_25-b15
>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode
>>> linux-amd64 compressed oops)
>>> # Problematic frame:
>>> # #
>>> #  SIGSEGV (0xb) at pc=0x00002af69c0693a1, pid=1173, tid=47238546896640
>>> #
>>> # JRE version: 7.0_25-b15
>>> J  de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
>>> #
>>> # Failed to write core dump. Core dumps have been disabled. To enable
>>> core dumping, try "ulimit -c unlimited" before starting Java again
>>> #
>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode
>>> linux-amd64 compressed oops)
>>> # Problematic frame:
>>> # J  de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
>>> #
>>> # Failed to write core dump. Core dumps have been disabled. To enable
>>> core dumping, try "ulimit -c unlimited" before starting Java again
>>> #
>>> # An error report file with more information is saved as:
>>> # /home/gl069/ompi/bin/executor/hs_err_pid1172.log
>>> # An error report file with more information is saved as:
>>> # /home/gl069/ompi/bin/executor/hs_err_pid1173.log
>>> #
>>> # If you would like to submit a bug report, please visit:
>>> #   http://bugreport.sun.com/bugreport/crash.jsp
>>> #
>>> #
>>> # If you would like to submit a bug report, please visit:
>>> #   http://bugreport.sun.com/bugreport/crash.jsp
>>> #
>>> [titan01:01172] *** Process received signal ***
>>> [titan01:01172] Signal: Aborted (6)
>>> [titan01:01172] Signal code:  (-6)
>>> [titan01:01173] *** Process received signal ***
>>> [titan01:01173] Signal: Aborted (6)
>>> [titan01:01173] Signal code:  (-6)
>>> [titan01:01172] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b7e9596a100]
>>> [titan01:01172] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b7e95fc75f7]
>>> [titan01:01172] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b7e95fc8ce8]
>>> [titan01:01172] [ 3]
>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b7e96a95ac5]
>>> [titan01:01172] [ 4]
>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b7e96bf5137]
>>> [titan01:01172] [ 5]
>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2b7e96a995e0]
>>> [titan01:01172] [ 6] [titan01:01173] [ 0]
>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2af694ded100]
>>> [titan01:01173] [ 1] /usr/lib64/libc.so.6(+0x35670)[0x2b7e95fc7670]
>>> [titan01:01172] [ 7] [0x2b7e9c86e3a1]
>>> [titan01:01172] *** End of error message ***
>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2af69544a5f7]
>>> [titan01:01173] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2af69544bce8]
>>> [titan01:01173] [ 3]
>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2af695f18ac5]
>>> [titan01:01173] [ 4]
>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2af696078137]
>>> [titan01:01173] [ 5]
>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2af695f1c5e0]
>>> [titan01:01173] [ 6] /usr/lib64/libc.so.6(+0x35670)[0x2af69544a670]
>>> [titan01:01173] [ 7] [0x2af69c0693a1]
>>> [titan01:01173] *** End of error message ***
>>> -------------------------------------------------------
>>> Primary job  terminated normally, but 1 process returned
>>> a non-zero exit code. Per user-direction, the job has been aborted.
>>> -------------------------------------------------------
>>>
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 1 with PID 0 on node titan01 exited on
>>> signal 6 (Aborted).
>>>
>>>
>>> ########CONFIGURATION:
>>> I used the ompi master sources from github:
>>> commit 267821f0dd405b5f4370017a287d9a49f92e734a
>>> Author: Gilles Gouaillardet <gil...@rist.or.jp>
>>> Date:   Tue Jul 5 13:47:50 2016 +0900
>>>
>>> ./configure --enable-mpi-java --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25
>>> --disable-dlopen --disable-mca-dso
>>>
>>> Thanks a lot for your help!
>>> Gundram
>>>
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/users/2016/07/29584.php
>>>
>>
>>
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/users/2016/07/29585.php
>>
>>
>>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2016/07/29587.php
>
>
>
