Hello,
I am trying to send many byte arrays via broadcast. After a certain number
of repetitions, the process either hangs or crashes with a SIGSEGV. Can
anyone help me solve the problem?
########## The code:
import java.util.Random;
import mpi.*;

public class TestSendBigFiles {

    public static void log(String msg) {
        try {
            System.err.println(String.format("%2d/%2d:%s",
                    MPI.COMM_WORLD.getRank(), MPI.COMM_WORLD.getSize(), msg));
        } catch (MPIException ex) {
            System.err.println(String.format("%2s/%2s:%s", "?", "?", msg));
        }
    }

    private static int hashcode(byte[] bytearray) {
        if (bytearray == null) {
            return 0;
        }
        int hash = 39;
        for (int i = 0; i < bytearray.length; i++) {
            byte b = bytearray[i];
            hash = hash * 7 + (int) b;
        }
        return hash;
    }

    public static void main(String args[]) throws MPIException {
        log("start main");
        MPI.Init(args);
        try {
            log("initialized done");
            byte[] saveMem = new byte[100000000];
            MPI.COMM_WORLD.barrier();
            Random r = new Random();
            r.nextBytes(saveMem);
            if (MPI.COMM_WORLD.getRank() == 0) {
                for (int i = 0; i < 1000; i++) {
                    saveMem[r.nextInt(saveMem.length)]++;
                    log("i = " + i);
                    int[] lengthData = new int[]{saveMem.length};
                    log("object hash = " + hashcode(saveMem));
                    log("length = " + lengthData[0]);
                    MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
                    log("bcast length done (length = " + lengthData[0]
                            + ")");
                    MPI.COMM_WORLD.barrier();
                    MPI.COMM_WORLD.bcast(saveMem, lengthData[0],
                            MPI.BYTE, 0);
                    log("bcast data done");
                    MPI.COMM_WORLD.barrier();
                }
                MPI.COMM_WORLD.bcast(new int[]{0}, 1, MPI.INT, 0);
            } else {
                while (true) {
                    int[] lengthData = new int[1];
                    MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
                    log("bcast length done (length = " + lengthData[0]
                            + ")");
                    if (lengthData[0] == 0) {
                        break;
                    }
                    MPI.COMM_WORLD.barrier();
                    saveMem = new byte[lengthData[0]];
                    MPI.COMM_WORLD.bcast(saveMem, saveMem.length,
                            MPI.BYTE, 0);
                    log("bcast data done");
                    MPI.COMM_WORLD.barrier();
                    log("object hash = " + hashcode(saveMem));
                }
            }
            MPI.COMM_WORLD.barrier();
        } catch (MPIException ex) {
            System.out.println("caught error." + ex);
            log(ex.getMessage());
        } catch (RuntimeException ex) {
            System.out.println("caught error." + ex);
            log(ex.getMessage());
        } finally {
            MPI.Finalize();
        }
    }
}
############ The error (if it does not just hang):
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00002b7e9c86e3a1, pid=1172, tid=47822674495232
#
#
# A fatal error has been detected by the Java Runtime Environment:
# JRE version: 7.0_25-b15
# Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode
linux-amd64 compressed oops)
# Problematic frame:
# #
# SIGSEGV (0xb) at pc=0x00002af69c0693a1, pid=1173, tid=47238546896640
#
# JRE version: 7.0_25-b15
J de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
#
# Failed to write core dump. Core dumps have been disabled. To enable
core dumping, try "ulimit -c unlimited" before starting Java again
#
# Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode
linux-amd64 compressed oops)
# Problematic frame:
# J de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
#
# Failed to write core dump. Core dumps have been disabled. To enable
core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid1172.log
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid1173.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.sun.com/bugreport/crash.jsp
#
#
# If you would like to submit a bug report, please visit:
# http://bugreport.sun.com/bugreport/crash.jsp
#
[titan01:01172] *** Process received signal ***
[titan01:01172] Signal: Aborted (6)
[titan01:01172] Signal code: (-6)
[titan01:01173] *** Process received signal ***
[titan01:01173] Signal: Aborted (6)
[titan01:01173] Signal code: (-6)
[titan01:01172] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b7e9596a100]
[titan01:01172] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b7e95fc75f7]
[titan01:01172] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b7e95fc8ce8]
[titan01:01172] [ 3]
/home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b7e96a95ac5]
[titan01:01172] [ 4]
/home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b7e96bf5137]
[titan01:01172] [ 5]
/home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2b7e96a995e0]
[titan01:01172] [ 6] [titan01:01173] [ 0]
/usr/lib64/libpthread.so.0(+0xf100)[0x2af694ded100]
[titan01:01173] [ 1] /usr/lib64/libc.so.6(+0x35670)[0x2b7e95fc7670]
[titan01:01172] [ 7] [0x2b7e9c86e3a1]
[titan01:01172] *** End of error message ***
/usr/lib64/libc.so.6(gsignal+0x37)[0x2af69544a5f7]
[titan01:01173] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2af69544bce8]
[titan01:01173] [ 3]
/home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2af695f18ac5]
[titan01:01173] [ 4]
/home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2af696078137]
[titan01:01173] [ 5]
/home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2af695f1c5e0]
[titan01:01173] [ 6] /usr/lib64/libc.so.6(+0x35670)[0x2af69544a670]
[titan01:01173] [ 7] [0x2af69c0693a1]
[titan01:01173] *** End of error message ***
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node titan01 exited on
signal 6 (Aborted).
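Both crash reports above name hashcode([B)I as the problematic frame. That
method only reads the array, so it can be exercised on its own, without MPI,
to see whether it misbehaves in isolation. The following is a minimal sketch
under stated assumptions, not part of the test program: the class name
HashcodeCheck, the sample inputs, and the reduced buffer size are made up for
illustration, and it needs nothing beyond a plain JDK.

```java
import java.util.Random;

// Standalone copy of the hashcode method from the test program above,
// with no MPI calls involved. The class name HashcodeCheck is hypothetical.
final class HashcodeCheck {

    // Same arithmetic as TestSendBigFiles.hashcode: start at 39, then
    // multiply by 7 and add each sign-extended byte; int overflow wraps.
    static int hashcode(byte[] bytearray) {
        if (bytearray == null) {
            return 0;
        }
        int hash = 39;
        for (int i = 0; i < bytearray.length; i++) {
            hash = hash * 7 + (int) bytearray[i];
        }
        return hash;
    }

    public static void main(String[] args) {
        // Small inputs whose results can be checked by hand.
        System.out.println(hashcode(null));                 // 0
        System.out.println(hashcode(new byte[0]));          // 39
        System.out.println(hashcode(new byte[]{1, 2, 3}));  // 13443

        // A random buffer, smaller than the 100000000 bytes used in the
        // test program (size chosen arbitrarily for this sketch).
        byte[] big = new byte[1000000];
        new Random().nextBytes(big);
        System.out.println(hashcode(big));
    }
}
```

If this runs cleanly in a loop, the method alone is probably not the cause,
and the array handed to it after bcast becomes the more likely suspect. That
is only a guess, not something the logs above prove.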
########CONFIGURATION:
I used the Open MPI (ompi) master sources from GitHub:
commit 267821f0dd405b5f4370017a287d9a49f92e734a
Author: Gilles Gouaillardet <gil...@rist.or.jp>
Date: Tue Jul 5 13:47:50 2016 +0900
./configure --enable-mpi-java --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25
--disable-dlopen --disable-mca-dso
Thanks a lot for your help!
Gundram