Hello,

I am trying to broadcast many large byte arrays. After a certain number of iterations the process either hangs or dies with a SIGSEGV. Can anyone help me track down the problem?

########## The code:

import java.util.Random;
import mpi.*;

public class TestSendBigFiles {

    public static void log(String msg) {
        try {
            System.err.println(String.format("%2d/%2d:%s", MPI.COMM_WORLD.getRank(), MPI.COMM_WORLD.getSize(), msg));
        } catch (MPIException ex) {
            System.err.println(String.format("%2s/%2s:%s", "?", "?", msg));
        }
    }

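    // simple rolling hash so sender and receivers can compare payload contents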
    private static int hashcode(byte[] bytearray) {
        if (bytearray == null) {
            return 0;
        }
        int hash = 39;
        for (int i = 0; i < bytearray.length; i++) {
            byte b = bytearray[i];
            hash = hash * 7 + (int) b;
        }
        return hash;
    }

    public static void main(String[] args) throws MPIException {
        log("start main");
        MPI.Init(args);
        try {
            log("initialized done");
            byte[] saveMem = new byte[100000000];
            MPI.COMM_WORLD.barrier();
            Random r = new Random();
            r.nextBytes(saveMem);
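            // rank 0 broadcasts the length first, then the payload;
            // a length of 0 (after the loop) tells the other ranks to stop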
            if (MPI.COMM_WORLD.getRank() == 0) {
                for (int i = 0; i < 1000; i++) {
                    saveMem[r.nextInt(saveMem.length)]++;
                    log("i = " + i);
                    int[] lengthData = new int[]{saveMem.length};
                    log("object hash = " + hashcode(saveMem));
                    log("length = " + lengthData[0]);
                    MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
log("bcast length done (length = " + lengthData[0] + ")");
                    MPI.COMM_WORLD.barrier();
                    MPI.COMM_WORLD.bcast(saveMem, lengthData[0], MPI.BYTE, 0);
                    log("bcast data done");
                    MPI.COMM_WORLD.barrier();
                }
                MPI.COMM_WORLD.bcast(new int[]{0}, 1, MPI.INT, 0);
            } else {
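                // receivers loop until rank 0 broadcasts a length of 0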
                while (true) {
                    int[] lengthData = new int[1];
                    MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
log("bcast length done (length = " + lengthData[0] + ")");
                    if (lengthData[0] == 0) {
                        break;
                    }
                    MPI.COMM_WORLD.barrier();
                    saveMem = new byte[lengthData[0]];
                    MPI.COMM_WORLD.bcast(saveMem, saveMem.length, MPI.BYTE, 0);
                    log("bcast data done");
                    MPI.COMM_WORLD.barrier();
                    log("object hash = " + hashcode(saveMem));
                }
            }
            MPI.COMM_WORLD.barrier();
        } catch (MPIException ex) {
            System.out.println("caught error: " + ex);
            log(ex.getMessage());
        } catch (RuntimeException ex) {
            System.out.println("caught error: " + ex);
            log(ex.getMessage());
        } finally {
            MPI.Finalize();
        }

    }

}
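One idea, in case it helps narrow things down: the crash is always in hashcode([B)I, i.e. while the JVM is walking the 100 MB heap array, so I wonder whether the bindings keep a native pointer into the byte[] that the garbage collector then moves. Below is an untested sketch (class name is mine) of the same broadcast using an off-heap direct buffer from MPI.newByteBuffer, which the GC cannot relocate:

import java.nio.ByteBuffer;
import mpi.*;

public class TestSendBigFilesDirect {

    public static void main(String[] args) throws MPIException {
        MPI.Init(args);
        try {
            int length = 100000000;                 // ~100 MB, as in the array version
            // direct (off-heap) buffer: the GC never moves this memory, so a
            // native pointer taken by the bindings stays valid during bcast
            ByteBuffer buf = MPI.newByteBuffer(length);
            if (MPI.COMM_WORLD.getRank() == 0) {
                for (int i = 0; i < length; i++) {
                    buf.put(i, (byte) i);           // fill the root's buffer with dummy data
                }
            }
            for (int i = 0; i < 1000; i++) {
                // every rank passes the same direct buffer to bcast
                MPI.COMM_WORLD.bcast(buf, length, MPI.BYTE, 0);
            }
            System.err.println(MPI.COMM_WORLD.getRank() + ": done, first byte = " + buf.get(0));
        } finally {
            MPI.Finalize();
        }
    }
}

If this variant survives the 1000 iterations, the problem would seem to be specific to the heap-array path.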


############ The error (when it does not simply hang):

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00002b7e9c86e3a1, pid=1172, tid=47822674495232
#
# JRE version: 7.0_25-b15
# Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# J  de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid1172.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.sun.com/bugreport/crash.jsp
#

(The second process, pid=1173, tid=47238546896640, prints the identical report: SIGSEGV at pc=0x00002af69c0693a1, same problematic frame hashcode([B)I, error report saved as /home/gl069/ompi/bin/executor/hs_err_pid1173.log.)
[titan01:01172] *** Process received signal ***
[titan01:01172] Signal: Aborted (6)
[titan01:01172] Signal code:  (-6)
[titan01:01172] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b7e9596a100]
[titan01:01172] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b7e95fc75f7]
[titan01:01172] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b7e95fc8ce8]
[titan01:01172] [ 3] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b7e96a95ac5]
[titan01:01172] [ 4] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b7e96bf5137]
[titan01:01172] [ 5] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2b7e96a995e0]
[titan01:01172] [ 6] /usr/lib64/libc.so.6(+0x35670)[0x2b7e95fc7670]
[titan01:01172] [ 7] [0x2b7e9c86e3a1]
[titan01:01172] *** End of error message ***
[titan01:01173] *** Process received signal ***
[titan01:01173] Signal: Aborted (6)
[titan01:01173] Signal code:  (-6)
[titan01:01173] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2af694ded100]
[titan01:01173] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2af69544a5f7]
[titan01:01173] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2af69544bce8]
[titan01:01173] [ 3] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2af695f18ac5]
[titan01:01173] [ 4] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2af696078137]
[titan01:01173] [ 5] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2af695f1c5e0]
[titan01:01173] [ 6] /usr/lib64/libc.so.6(+0x35670)[0x2af69544a670]
[titan01:01173] [ 7] [0x2af69c0693a1]
[titan01:01173] *** End of error message ***
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node titan01 exited on signal 6 (Aborted).


########## Configuration:
I used the Open MPI master sources from GitHub:
commit 267821f0dd405b5f4370017a287d9a49f92e734a
Author: Gilles Gouaillardet <gil...@rist.or.jp>
Date:   Tue Jul 5 13:47:50 2016 +0900

./configure --enable-mpi-java --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25 --disable-dlopen --disable-mca-dso
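
For completeness: I start the test through the mpirun Java launcher with two ranks (matching the two PIDs in the output above), along the lines of

mpirun -np 2 java TestSendBigFiles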

Thanks a lot for your help!
Gundram
