Both issues have been fixed. The trouble with CReqops.java was a problem with the test itself; a fixed version has been pushed to the ompi-java-tests repo. The fix for the compare_and_swap issue has been merged on master and should be in the 2.0.2 release.
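For reference, the call at the heart of both failing tests is MPI_Compare_and_swap through the Java Win binding. A minimal sketch of the call pattern follows; the MPI.newIntBuffer allocations are assumed helpers and the window creation is omitted because its arguments are not shown in this thread (only the compareAndSwap signature is taken from the 1.8.8 compile error quoted further down). The call also has to sit inside an access epoch (fence or lock) on the window. Until the fix lands, the workaround mentioned below (export OMPI_MCA_osc=pt2pt) applies to runs of such a test on 2.0.x.

import java.nio.IntBuffer;
import mpi.*;

public class CompareAndSwapSketch {
    // One atomic compare-and-swap against rank `target` on an already-created window `win`.
    // Window creation is omitted here because its constructor arguments do not appear in
    // this thread; the window is assumed to expose at least one int per rank.
    static void casOnce(Win win, int target) throws MPIException {
        int rank = MPI.COMM_WORLD.getRank();
        IntBuffer origin  = MPI.newIntBuffer(1); // value to install at the target (assumed helper)
        IntBuffer compare = MPI.newIntBuffer(1); // value we expect to find there
        IntBuffer result  = MPI.newIntBuffer(1); // receives the value actually found
        origin.put(0, rank + 1);
        compare.put(0, rank);
        // Signature as reported by the 1.8.8 compile error quoted below:
        // compareAndSwap(IntBuffer, IntBuffer, IntBuffer, Datatype, int, int)
        win.compareAndSwap(origin, compare, result, MPI.INT, target, 0);
    }
}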
Let me know if you have any other issues.

-Nathan

--
Nathaniel Graham
HPC-DES
Los Alamos National Laboratory
________________________________
From: users <users-boun...@lists.open-mpi.org> on behalf of Graham, Nathaniel Richard <ngra...@lanl.gov>
Sent: Wednesday, September 14, 2016 12:55 PM
To: Open MPI Users
Subject: Re: [OMPI users] Java-OpenMPI returns with SIGSEGV

Thanks for reporting this! There are a number of things going on here.

It seems there may be a problem with the Java bindings checked by CReqops.java, because the C test passes. I'll take a look at that. The issue can be found at: https://github.com/open-mpi/ompi/issues/2081

MPI_Compare_and_swap is failing on master, and therefore on the release branches. You can get around the issue for now by doing:

export OMPI_MCA_osc=pt2pt

I submitted an issue to track it at: https://github.com/open-mpi/ompi/issues/2080

These tests exercise code I added last summer that did not make it into 1.8. I know it is all in the 2.0 series though.

-Nathan

--
Nathaniel Graham
HPC-DES
Los Alamos National Laboratory
________________________________
From: users <users-boun...@lists.open-mpi.org> on behalf of Gundram Leifert <gundram.leif...@uni-rostock.de>
Sent: Wednesday, September 14, 2016 4:02 AM
To: users@lists.open-mpi.org
Subject: Re: [OMPI users] Java-OpenMPI returns with SIGSEGV

In short: yes, we compiled with mpijavac and mpicc and ran with mpirun -np 2.

In detail, we tested the following setups:

a) without Java, with Open MPI 2.0.1, the C test

[mw314@titan01 mpi_test]$ module list
Currently Loaded Modulefiles:
  1) openmpi/gcc/2.0.1
[mw314@titan01 mpi_test]$ mpirun -np 2 ./a.out
[titan01:18460] *** An error occurred in MPI_Compare_and_swap
[titan01:18460] *** reported by process [3535667201,1]
[titan01:18460] *** on win rdma window 3
[titan01:18460] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:18460] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[titan01:18460] ***    and potentially your MPI job)
[titan01.service:18454] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[titan01.service:18454] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

b) without Java, with Open MPI 1.8.8, the C test

[mw314@titan01 mpi_test2]$ module list
Currently Loaded Modulefiles:
  1) openmpi/gcc/1.8.8
[mw314@titan01 mpi_test2]$ mpirun -np 2 ./a.out
No Errors
[mw314@titan01 mpi_test2]$

c) with Open MPI 1.8.8, the JDK, and the Java test suite

[mw314@titan01 onesided]$ mpijavac TestMpiRmaCompareAndSwap.java
TestMpiRmaCompareAndSwap.java:49: error: cannot find symbol
        win.compareAndSwap(next, iBuffer, result, MPI.INT, rank, 0);
           ^
  symbol:   method compareAndSwap(IntBuffer,IntBuffer,IntBuffer,Datatype,int,int)
  location: variable win of type Win
TestMpiRmaCompareAndSwap.java:53: error: cannot find symbol

>> these Java methods are not supported in 1.8.8

d) Open MPI 2.0.1, the JDK, and the test suite

[mw314@titan01 ~]$ module list
Currently Loaded Modulefiles:
  1) openmpi/gcc/2.0.1   2) java/jdk1.8.0_102
[mw314@titan01 ~]$ cd ompi-java-test/
[mw314@titan01 ompi-java-test]$ ./autogen.sh
autoreconf: Entering directory `.'
autoreconf: configure.ac: not using Gettext
autoreconf: running: aclocal --force
autoreconf: configure.ac: tracing
autoreconf: configure.ac: not using Libtool
autoreconf: running: /usr/bin/autoconf --force
autoreconf: configure.ac: not using Autoheader
autoreconf: running: automake --add-missing --copy --force-missing
autoreconf: Leaving directory `.'
[mw314@titan01 ompi-java-test]$ ./configure
Configuring Open Java test suite
checking for a BSD-compatible install... /bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking whether make supports nested variables... (cached) yes
checking for mpijavac... yes
checking if checking MPI API params... yes
checking that generated files are newer than configure... done
configure: creating ./config.status
config.status: creating reporting/OmpitestConfig.java
config.status: creating Makefile
[mw314@titan01 ompi-java-test]$ cd onesided/
[mw314@titan01 onesided]$ ./make_onesided &> result

cat result:
<crop.....>
===========================
CReqops
===========================
[titan01:32155] *** An error occurred in MPI_Rput
[titan01:32155] *** reported by process [3879534593,1]
[titan01:32155] *** on win rdma window 3
[titan01:32155] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:32155] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[titan01:32155] ***    and potentially your MPI job)
<...crop....>
===========================
TestMpiRmaCompareAndSwap
===========================
[titan01:32703] *** An error occurred in MPI_Compare_and_swap
[titan01:32703] *** reported by process [3843162113,0]
[titan01:32703] *** on win rdma window 3
[titan01:32703] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:32703] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[titan01:32703] ***    and potentially your MPI job)
[titan01.service:32698] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[titan01.service:32698] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
< ... end crop>

It also fails if we start it this way:

[mw314@titan01 onesided]$ mpijavac TestMpiRmaCompareAndSwap.java OmpitestError.java OmpitestProgress.java OmpitestConfig.java
[mw314@titan01 onesided]$ mpiexec -np 2 java TestMpiRmaCompareAndSwap
[titan01:22877] *** An error occurred in MPI_Compare_and_swap
[titan01:22877] *** reported by process [3287285761,0]
[titan01:22877] *** on win rdma window 3
[titan01:22877] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:22877] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[titan01:22877] ***    and potentially your MPI job)
[titan01.service:22872] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[titan01.service:22872] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

On 09/13/2016 08:06 PM, Graham, Nathaniel Richard wrote:
Since you are getting the same errors with C as you are with Java, this is an issue with C, not the Java bindings. However, in the most recent output, you are using ./a.out to run the test. Did you use mpirun to run the test in Java or C? The command should be something along the lines of:

mpirun -np 2 java TestMpiRmaCompareAndSwap
mpirun -np 2 ./a.out

Also, are you compiling with the ompi wrappers? It should be:

mpijavac TestMpiRmaCompareAndSwap.java
mpicc compare_and_swap.c

In the meantime, I will try to reproduce this on a similar system.
-Nathan

--
Nathaniel Graham
HPC-DES
Los Alamos National Laboratory
________________________________
From: users <users-boun...@lists.open-mpi.org> on behalf of Gundram Leifert <gundram.leif...@uni-rostock.de>
Sent: Tuesday, September 13, 2016 12:46 AM
To: users@lists.open-mpi.org
Subject: Re: [OMPI users] Java-OpenMPI returns with SIGSEGV

Hey,

it seems to be a problem of ompi 2.x. The C version also produces this output with 2.0.1 (the same whether built from source or from the 2.0.1 release):

[mw314@node108 mpi_test]$ ./a.out
[node108:2949] *** An error occurred in MPI_Compare_and_swap
[node108:2949] *** reported by process [1649420396,0]
[node108:2949] *** on win rdma window 3
[node108:2949] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[node108:2949] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[node108:2949] ***    and potentially your MPI job)

But the test works with 1.8.x! In fact our cluster does not have shared memory, so it has to fall back to the default methods.

Gundram

On 09/07/2016 06:49 PM, Graham, Nathaniel Richard wrote:
Hello Gundram,

It looks like the test that is failing is TestMpiRmaCompareAndSwap.java. Is that the one that is crashing? If so, could you try to run the C test from:

http://git.mpich.org/mpich.git/blob/c77631474f072e86c9fe761c1328c3d4cb8cc4a5:/test/mpi/rma/compare_and_swap.c#l1

There are a couple of header files you will need for that test, but they are in the same repo as the test (up a few folders and in an include folder). This should let us know whether it is an issue related to Java or not.

If it is another test, let me know and I'll see if I can get you the C version (most or all of the Java tests are translations of C tests).

-Nathan

--
Nathaniel Graham
HPC-DES
Los Alamos National Laboratory
________________________________
From: users <users-boun...@lists.open-mpi.org> on behalf of Gundram Leifert <gundram.leif...@uni-rostock.de>
Sent: Wednesday, September 7, 2016 9:23 AM
To: users@lists.open-mpi.org
Subject: Re: [OMPI users] Java-OpenMPI returns with SIGSEGV

Hello,

I still have the same errors on our cluster - and even one more. Maybe the new one helps us find a solution.

I get this error if I run "make_onesided" from the ompi-java-test repo. CReqops and TestMpiRmaCompareAndSwap report (pretty deterministically - in all of my 30 runs) this error:

[titan01:5134] *** An error occurred in MPI_Compare_and_swap
[titan01:5134] *** reported by process [2392850433,1]
[titan01:5134] *** on win rdma window 3
[titan01:5134] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:5134] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[titan01:5134] ***    and potentially your MPI job)
[titan01.service:05128] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[titan01.service:05128] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Sometimes I also get the SIGSEGV error.
System:
- compiler: gcc/5.2.0
- java: jdk1.8.0_102
- kernel modules: mlx4_core mlx4_en mlx4_ib
- Linux version 3.10.0-327.13.1.el7.x86_64 (buil...@kbuilder.dev.centos.org) (gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) ) #1 SMP
- Open MPI v2.0.1, package: Open MPI Distribution, ident: 2.0.1, repo rev: v2.0.0-257-gee86e07, Sep 02, 2016
- InfiniBand openib: OpenSM 3.3.19

limits: ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 256554
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 100000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

Thanks,
Gundram

On 07/12/2016 11:08 AM, Gundram Leifert wrote:
Hello Gilles, Howard,

I configured without --disable-dlopen - same error.

I tested these classes on another cluster and: IT WORKS! So it is a problem of the cluster configuration. Thank you all very much for all your help! When the admin solves the problem, I will let you know what he changed.

Cheers
Gundram

On 07/08/2016 04:19 PM, Howard Pritchard wrote:
Hi Gundram

Could you configure without the disable-dlopen option and retry?

Howard

On Friday, July 8, 2016, Gilles Gouaillardet wrote:
The JVM sets its own signal handlers, and it is important Open MPI does not override them. This is what previously happened with PSM (InfiniPath), but that has since been solved.

You might be linking with a third-party library that hijacks signal handlers and causes the crash (which would explain why I cannot reproduce the issue).

The master branch has a revamped memory patcher (compared to v2.x or v1.10), and that could have some bad interactions with the JVM, so you might also give v2.x a try.

Cheers,
Gilles

On Friday, July 8, 2016, Gundram Leifert <gundram.leif...@uni-rostock.de> wrote:
You made the best of it... thanks a lot! Without MPI it runs. Just adding MPI.Init() causes the crash!

Maybe I installed something wrong...

- install the newest automake, autoconf, m4, libtoolize in the right order and with the same prefix
- check out ompi, autogen
- configure with the same prefix, pointing to the same JDK I later use
- make
- make install

I will test some different configurations of ./configure...

On 07/08/2016 01:40 PM, Gilles Gouaillardet wrote:
I am running out of ideas ...

What if you do not run within SLURM? What if you do not use '-cp executor.jar'? Or what if you configure without --disable-dlopen --disable-mca-dso?

If you mpirun -np 1 ..., then MPI_Bcast and MPI_Barrier are basically no-ops, so it is really weird your program is still crashing. Another test is to comment out MPI_Bcast and MPI_Barrier and try again with -np 1.

Cheers,
Gilles

On Friday, July 8, 2016, Gundram Leifert <gundram.leif...@uni-rostock.de> wrote:
In all cases the same error. This is my code:

salloc -n 3
export IPATH_NO_BACKTRACE
ulimit -s 10240
mpirun -np 3 java -cp executor.jar de.uros.citlab.executor.test.TestSendBigFiles2

Also with 1 or 2 cores, the process crashes.

On 07/08/2016 12:32 PM, Gilles Gouaillardet wrote:
You can try
export IPATH_NO_BACKTRACE
before invoking mpirun (that should not be needed though).

Another test is to
ulimit -s 10240
before invoking mpirun.
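As an aside, the "-np 1, comment out MPI_Bcast and MPI_Barrier" test suggested above, together with the earlier report that "just adding MPI.Init() causes the crash", reduces the reproducer to a bare init/finalize program along these lines (a sketch only, using just the calls that appear in the test code later in the thread):

import mpi.*;

public class InitOnly {
    // Bare-bones reproducer: if this already crashes under mpirun -np 1,
    // the problem is in MPI.Init/Finalize rather than in the bcast/barrier logic.
    public static void main(String[] args) throws MPIException {
        MPI.Init(args);
        System.out.println("rank " + MPI.COMM_WORLD.getRank()
                + " of " + MPI.COMM_WORLD.getSize() + " initialized");
        MPI.Finalize();
    }
}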
btw, do you use mpirun or srun?

Can you reproduce the crash with 1 or 2 tasks?

Cheers,
Gilles

On Friday, July 8, 2016, Gundram Leifert <gundram.leif...@uni-rostock.de> wrote:
Hello,

configure:
./configure --enable-mpi-java --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25 --disable-dlopen --disable-mca-dso

1 node with 3 cores. I use SLURM to allocate one node. I changed --mem, but it has no effect.
salloc -n 3

core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 256564
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 100000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

uname -a
Linux titan01.service 3.10.0-327.13.1.el7.x86_64 #1 SMP Thu Mar 31 16:04:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

cat /etc/system-release
CentOS Linux release 7.2.1511 (Core)

What else do you need?

Cheers,
Gundram

On 07/07/2016 10:05 AM, Gilles Gouaillardet wrote:
Gundram,

can you please provide more information on your environment:
- configure command line
- OS
- memory available
- ulimit -a
- number of nodes
- number of tasks used
- interconnect used (if any)
- batch manager (if any)

Cheers,
Gilles

On 7/7/2016 4:17 PM, Gundram Leifert wrote:
Hello Gilles,

I tried your code and it crashes after 3-15 iterations (see (1)). It is always the same error (only the "94" varies). Meanwhile I think Java and MPI use the same memory, because when I delete the hash call, the program sometimes runs more than 9k iterations. When it crashes, the lines differ (see (2) and (3)). The crashes also occur on rank 0.

##### (1) #####
# Problematic frame:
# J 94 C2 de.uros.citlab.executor.test.TestSendBigFiles2.hashcode([BI)I (42 bytes) @ 0x00002b03242dc9c4 [0x00002b03242dc860+0x164]

##### (2) #####
# Problematic frame:
# V  [libjvm.so+0x68d0f6]  JavaCallWrapper::JavaCallWrapper(methodHandle, Handle, JavaValue*, Thread*)+0xb6

##### (3) #####
# Problematic frame:
# V  [libjvm.so+0x4183bf]  ThreadInVMfromNative::ThreadInVMfromNative(JavaThread*)+0x4f

Any more ideas?

On 07/07/2016 03:00 AM, Gilles Gouaillardet wrote:
Gundram,

fwiw, I cannot reproduce the issue on my box
- centos 7
- java version "1.8.0_71"
  Java(TM) SE Runtime Environment (build 1.8.0_71-b15)
  Java HotSpot(TM) 64-Bit Server VM (build 25.71-b15, mixed mode)

I noticed that on non-zero ranks saveMem is allocated at each iteration. Ideally, the garbage collector can take care of that and this should not be an issue.

Would you mind giving the attached file a try?

Cheers,
Gilles

On 7/7/2016 7:41 AM, Gilles Gouaillardet wrote:
I will have a look at it today. How did you configure OpenMPI?

Cheers,
Gilles

On Thursday, July 7, 2016, Gundram Leifert <gundram.leif...@uni-rostock.de> wrote:
Hello Gilles,

thank you for your hints! I made 3 changes; unfortunately the same error occurs:

update ompi:
commit ae8444682f0a7aa158caea08800542ce9874455e
Author: Ralph Castain <r...@open-mpi.org>
Date:   Tue Jul 5 20:07:16 2016 -0700

update java:
java version "1.8.0_92"
Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
Java HotSpot(TM) Server VM (build 25.92-b14, mixed mode)

delete the hashcode lines.
Now I get this error message 100% of the time, after a varying number of iterations (15-300):

 0/ 3:length = 100000000
 0/ 3:bcast length done (length = 100000000)
 1/ 3:bcast length done (length = 100000000)
 2/ 3:bcast length done (length = 100000000)
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00002b3d022fcd24, pid=16578, tid=0x00002b3d29716700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_92-b14) (build 1.8.0_92-b14)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.92-b14 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x414d24]  ciEnv::get_field_by_index(ciInstanceKlass*, int)+0x94
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid16578.log
#
# Compiler replay data is saved as:
# /home/gl069/ompi/bin/executor/replay_pid16578.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#
[titan01:16578] *** Process received signal ***
[titan01:16578] Signal: Aborted (6)
[titan01:16578] Signal code:  (-6)
[titan01:16578] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b3d01500100]
[titan01:16578] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b3d01b5c5f7]
[titan01:16578] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b3d01b5dce8]
[titan01:16578] [ 3] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91e605)[0x2b3d02806605]
[titan01:16578] [ 4] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0xabda63)[0x2b3d029a5a63]
[titan01:16578] [ 5] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x14f)[0x2b3d0280be2f]
[titan01:16578] [ 6] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91a5c3)[0x2b3d028025c3]
[titan01:16578] [ 7] /usr/lib64/libc.so.6(+0x35670)[0x2b3d01b5c670]
[titan01:16578] [ 8] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x414d24)[0x2b3d022fcd24]
[titan01:16578] [ 9] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x43c5ae)[0x2b3d023245ae]
[titan01:16578] [10] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x369ade)[0x2b3d02251ade]
[titan01:16578] [11] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36eda0)[0x2b3d02256da0]
[titan01:16578] [12] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
[titan01:16578] [13] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
[titan01:16578] [14] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
[titan01:16578] [15] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
[titan01:16578] [16] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
[titan01:16578] [17] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
[titan01:16578] [18] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
[titan01:16578] [19] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
[titan01:16578] [20] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
[titan01:16578] [21] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
[titan01:16578] [22]
/home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3708c2)[0x2b3d022588c2]
[titan01:16578] [23] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3724e7)[0x2b3d0225a4e7]
[titan01:16578] [24] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a817)[0x2b3d02262817]
[titan01:16578] [25] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a92f)[0x2b3d0226292f]
[titan01:16578] [26] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x358edb)[0x2b3d02240edb]
[titan01:16578] [27] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35929e)[0x2b3d0224129e]
[titan01:16578] [28] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3593ce)[0x2b3d022413ce]
[titan01:16578] [29] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35973e)[0x2b3d0224173e]
[titan01:16578] *** End of error message ***
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node titan01 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

I don't know if it is a problem of Java or ompi - but in recent years Java has worked with no problems on my machine...

Thank you for your tips in advance!
Gundram

On 07/06/2016 03:10 PM, Gilles Gouaillardet wrote:
Note that a race condition in MPI_Init was fixed yesterday on master. Can you please update your OpenMPI and try again? Hopefully the hang will disappear.

Can you reproduce the crash with a simpler (and ideally deterministic) version of your program? The crash occurs in hashcode, and that makes little sense to me.

Can you also update your JDK?

Cheers,
Gilles

On Wednesday, July 6, 2016, Gundram Leifert wrote:
Hello Jason,

thanks for your response! I think it is another problem. I am trying to send 100 MB of bytes, so there are not many tries (between 10 and 30). I realized that running this code can result in 3 different errors:

1. Most often, the posted error message occurs.

2. In <10% of the cases I get a livelock. I can see 3 Java processes, one with 200% and two with 100% processor utilization. After ~15 minutes without new output this error occurs:

[thread 47499823949568 also had an error]
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (safepoint.cpp:317), pid=24256, tid=47500347131648
#  guarantee(PageArmed == 0) failed: invariant
#
# JRE version: 7.0_25-b15
# Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode linux-amd64 compressed oops)
# Failed to write core dump. Core dumps have been disabled.
To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid24256.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.sun.com/bugreport/crash.jsp
#
[titan01:24256] *** Process received signal ***
[titan01:24256] Signal: Aborted (6)
[titan01:24256] Signal code:  (-6)
[titan01:24256] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b336a324100]
[titan01:24256] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b336a9815f7]
[titan01:24256] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b336a982ce8]
[titan01:24256] [ 3] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b336b44fac5]
[titan01:24256] [ 4] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b336b5af137]
[titan01:24256] [ 5] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x407262)[0x2b336b114262]
[titan01:24256] [ 6] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x7c6c34)[0x2b336b4d3c34]
[titan01:24256] [ 7] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a9c17)[0x2b336b5b6c17]
[titan01:24256] [ 8] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8aa2c0)[0x2b336b5b72c0]
[titan01:24256] [ 9] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x744270)[0x2b336b451270]
[titan01:24256] [10] /usr/lib64/libpthread.so.0(+0x7dc5)[0x2b336a31cdc5]
[titan01:24256] [11] /usr/lib64/libc.so.6(clone+0x6d)[0x2b336aa4228d]
[titan01:24256] *** End of error message ***
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node titan01 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

3. In <10% of the cases I get a deadlock during MPI.Init. It stays there for more than 15 minutes without returning an error message...

Can I enable some debug flags to see what happens on the C / OpenMPI side?

Thanks in advance for your help!
Gundram Leifert

On 07/05/2016 06:05 PM, Jason Maldonis wrote:
After reading your thread, it looks like it may be related to an issue I had a few weeks ago (I'm a novice though). Maybe my thread will be of help: https://www.open-mpi.org/community/lists/users/2016/06/29425.php

When you say "After a specific number of repetitions the process either hangs up or returns with a SIGSEGV." do you mean that a single call hangs, or that at some point during the for loop a call hangs? If you mean the latter, then it might relate to my issue. Otherwise my thread probably won't be helpful.

Jason Maldonis
Research Assistant of Professor Paul Voyles
Materials Science Grad Student
University of Wisconsin, Madison
1509 University Ave, Rm M142
Madison, WI 53706
maldo...@wisc.edu
608-295-5532

On Tue, Jul 5, 2016 at 9:58 AM, Gundram Leifert <gundram.leif...@uni-rostock.de> wrote:
Hello,

I try to send many byte arrays via broadcast. After a specific number of repetitions the process either hangs up or returns with a SIGSEGV.
Can anyone help me solve the problem?

##########
The code:

import java.util.Random;
import mpi.*;

public class TestSendBigFiles {

    public static void log(String msg) {
        try {
            System.err.println(String.format("%2d/%2d:%s", MPI.COMM_WORLD.getRank(), MPI.COMM_WORLD.getSize(), msg));
        } catch (MPIException ex) {
            System.err.println(String.format("%2s/%2s:%s", "?", "?", msg));
        }
    }

    private static int hashcode(byte[] bytearray) {
        if (bytearray == null) {
            return 0;
        }
        int hash = 39;
        for (int i = 0; i < bytearray.length; i++) {
            byte b = bytearray[i];
            hash = hash * 7 + (int) b;
        }
        return hash;
    }

    public static void main(String args[]) throws MPIException {
        log("start main");
        MPI.Init(args);
        try {
            log("initialized done");
            byte[] saveMem = new byte[100000000];
            MPI.COMM_WORLD.barrier();
            Random r = new Random();
            r.nextBytes(saveMem);
            if (MPI.COMM_WORLD.getRank() == 0) {
                for (int i = 0; i < 1000; i++) {
                    saveMem[r.nextInt(saveMem.length)]++;
                    log("i = " + i);
                    int[] lengthData = new int[]{saveMem.length};
                    log("object hash = " + hashcode(saveMem));
                    log("length = " + lengthData[0]);
                    MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
                    log("bcast length done (length = " + lengthData[0] + ")");
                    MPI.COMM_WORLD.barrier();
                    MPI.COMM_WORLD.bcast(saveMem, lengthData[0], MPI.BYTE, 0);
                    log("bcast data done");
                    MPI.COMM_WORLD.barrier();
                }
                MPI.COMM_WORLD.bcast(new int[]{0}, 1, MPI.INT, 0);
            } else {
                while (true) {
                    int[] lengthData = new int[1];
                    MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
                    log("bcast length done (length = " + lengthData[0] + ")");
                    if (lengthData[0] == 0) {
                        break;
                    }
                    MPI.COMM_WORLD.barrier();
                    saveMem = new byte[lengthData[0]];
                    MPI.COMM_WORLD.bcast(saveMem, saveMem.length, MPI.BYTE, 0);
                    log("bcast data done");
                    MPI.COMM_WORLD.barrier();
                    log("object hash = " + hashcode(saveMem));
                }
            }
            MPI.COMM_WORLD.barrier();
        } catch (MPIException ex) {
            System.out.println("caught error." + ex);
            log(ex.getMessage());
        } catch (RuntimeException ex) {
            System.out.println("caught error." + ex);
            log(ex.getMessage());
        } finally {
            MPI.Finalize();
        }
    }
}

############
The error (if it does not just hang), interleaved from two processes:

#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00002b7e9c86e3a1, pid=1172, tid=47822674495232
#
#
# A fatal error has been detected by the Java Runtime Environment:
# JRE version: 7.0_25-b15
# Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode linux-amd64 compressed oops)
# Problematic frame:
#
#
# SIGSEGV (0xb) at pc=0x00002af69c0693a1, pid=1173, tid=47238546896640
#
# JRE version: 7.0_25-b15 J de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# J de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
#
# Failed to write core dump. Core dumps have been disabled.
To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid1172.log
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid1173.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.sun.com/bugreport/crash.jsp
#
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.sun.com/bugreport/crash.jsp
#
[titan01:01172] *** Process received signal ***
[titan01:01172] Signal: Aborted (6)
[titan01:01172] Signal code:  (-6)
[titan01:01173] *** Process received signal ***
[titan01:01173] Signal: Aborted (6)
[titan01:01173] Signal code:  (-6)
[titan01:01172] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b7e9596a100]
[titan01:01172] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b7e95fc75f7]
[titan01:01172] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b7e95fc8ce8]
[titan01:01172] [ 3] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b7e96a95ac5]
[titan01:01172] [ 4] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b7e96bf5137]
[titan01:01172] [ 5] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2b7e96a995e0]
[titan01:01172] [ 6]
[titan01:01173] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2af694ded100]
[titan01:01173] [ 1]
/usr/lib64/libc.so.6(+0x35670)[0x2b7e95fc7670]
[titan01:01172] [ 7] [0x2b7e9c86e3a1]
[titan01:01172] *** End of error message ***
/usr/lib64/libc.so.6(gsignal+0x37)[0x2af69544a5f7]
[titan01:01173] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2af69544bce8]
[titan01:01173] [ 3] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2af695f18ac5]
[titan01:01173] [ 4] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2af696078137]
[titan01:01173] [ 5] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2af695f1c5e0]
[titan01:01173] [ 6] /usr/lib64/libc.so.6(+0x35670)[0x2af69544a670]
[titan01:01173] [ 7] [0x2af69c0693a1]
[titan01:01173] *** End of error message ***
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node titan01 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

######## CONFIGURATION:

I used the ompi master sources from GitHub:

commit 267821f0dd405b5f4370017a287d9a49f92e734a
Author: Gilles Gouaillardet <gil...@rist.or.jp>
Date:   Tue Jul 5 13:47:50 2016 +0900

./configure --enable-mpi-java --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25 --disable-dlopen --disable-mca-dso

Thanks a lot for your help!
Gundram
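One closing note on the receiver loop in the code above: Gilles pointed out earlier in the thread that on non-zero ranks saveMem is reallocated on every iteration. A variant that allocates the receive buffer once and reuses it looks roughly like the sketch below; this is only an illustration of that remark, not necessarily what Gilles's attached file did, and MAX_LEN is a hypothetical upper bound on the payload size.

import mpi.*;

public class ReceiverLoopSketch {
    // Hypothetical upper bound on the broadcast payload (the test always sends 100000000 bytes).
    static final int MAX_LEN = 100000000;

    // Receiver side of the length-then-data broadcast protocol from TestSendBigFiles,
    // with the buffer allocated once instead of once per iteration.
    static void receiveLoop() throws MPIException {
        byte[] saveMem = new byte[MAX_LEN];           // allocated once, reused below
        while (true) {
            int[] lengthData = new int[1];
            MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
            if (lengthData[0] == 0) {                 // length 0 is the stop signal sent by rank 0
                break;
            }
            MPI.COMM_WORLD.barrier();
            MPI.COMM_WORLD.bcast(saveMem, lengthData[0], MPI.BYTE, 0); // receive into the reused buffer
            MPI.COMM_WORLD.barrier();
        }
    }
}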