Gotcha; thanks.

> On Aug 14, 2015, at 2:12 PM, Howard Pritchard <hpprit...@gmail.com> wrote:
>
> Hi Jeff,
>
> I don't know why Gilles keeps picking on the persistent request problem and
> mixing it up with this user bug. I do think for this user the psm probably
> is the problem.
>
> They don't have anything to do with each other.
>
> I can reproduce the persistent request problem on hopper consistently. As I
> said on the telecon last week it has something to do with memory corruption
> with the receive buffer that is associated with the persistent request.
>
> Howard
>
>
> 2015-08-14 11:21 GMT-06:00 Jeff Squyres (jsquyres) <jsquy...@cisco.com>:
> Hmm. Oscar's not around to ask any more, but I'd be greatly surprised if he
> had InfiniPath on his systems where he ran into this segv issue...?
>
>
> > On Aug 14, 2015, at 1:08 PM, Howard Pritchard <hpprit...@gmail.com> wrote:
> >
> > Hi Gilles,
> >
> > Good catch! Nate, we hadn't been testing on an infinipath system.
> >
> > Howard
> >
> >
> > 2015-08-14 0:20 GMT-06:00 Gilles Gouaillardet <gil...@rist.or.jp>:
> > Nate,
> >
> > i could get rid of the problem by not using the psm mtl.
> > the infinipath library (used by the psm mtl) sets some signal handlers that
> > conflict with the JVM
> > that can be seen by running
> > mpirun -np 1 java -Xcheck:jni MPITestBroke data/
> >
> > so instead of running
> > mpirun -np 1 java MPITestBroke data/
> > please run
> > mpirun --mca mtl ^psm -np 1 java MPITestBroke data/
> >
> > that solved the issue for me
> >
> > Cheers,
> >
> > Gilles
> >
> > On 8/13/2015 9:19 AM, Nate Chambers wrote:
> >> I appreciate you trying to help! I put the Java and its compiled .class
> >> file on Dropbox.
> >> The directory contains the .java and .class files, as well as a data/
> >> directory:
> >>
> >> http://www.dropbox.com/sh/pds5c5wecfpb2wk/AAAcz17UTDQErmrUqp2SPjpqa?dl=0
> >>
> >> You can run it with and without MPI:
> >>
> >> > java MPITestBroke data/
> >> > mpirun -np 1 java MPITestBroke data/
> >>
> >> Attached is a text file of what I see when I run it with mpirun and your
> >> debug flag. Lots of debug lines.
> >>
> >>
> >> Nate
> >>
> >>
> >> On Wed, Aug 12, 2015 at 11:09 AM, Howard Pritchard <hpprit...@gmail.com>
> >> wrote:
> >> Hi Nate,
> >>
> >> Sorry for the delay in getting back to you.
> >> We're somewhat stuck on how to help you, but here are two suggestions.
> >>
> >> Could you add the following to your launch command line
> >>
> >> --mca odls_base_verbose 100
> >>
> >> so we can see exactly what arguments are being fed to java when launching
> >> your app.
> >>
> >> Also, if you could put your MPITestBroke.class file somewhere (like google
> >> drive) where we could get it and try to run locally or at NERSC, that might
> >> help us narrow down the problem. Better yet, if you have the class or jar
> >> file for the entire app plus some data sets, we could try that out as well.
> >>
> >> All the config outputs, etc. you've sent so far indicate a correct
> >> installation of open mpi.
> >>
> >> Howard
> >>
> >>
> >> On Aug 6, 2015 1:54 PM, "Nate Chambers" <ncham...@usna.edu> wrote:
> >> Howard,
> >>
> >> I tried the nightly build openmpi-dev-2223-g731cfe3 and it still segfaults
> >> as before. I must admit I am new to MPI, so is it possible I'm just
> >> configuring or running incorrectly? Let me list my steps for you, and
> >> maybe something will jump out? Also attached is my config.log.
> >>
> >>
> >> CONFIGURE
> >> ./configure --prefix=<install-dir> --enable-mpi-java CC=gcc
> >>
> >> MAKE
> >> make all install
> >>
> >> RUN
> >> <install-dir>/mpirun -np 1 java MPITestBroke twitter/
> >>
> >>
> >> DEFAULT JAVA AND GCC
> >>
> >> $ java -version
> >> java version "1.7.0_21"
> >> Java(TM) SE Runtime Environment (build 1.7.0_21-b11)
> >> Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode)
> >>
> >> $ gcc --v
> >> Using built-in specs.
> >> Target: x86_64-redhat-linux
> >> Configured with: ../configure --prefix=/usr --mandir=/usr/share/man
> >> --infodir=/usr/share/info
> >> --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap
> >> --enable-shared --enable-threads=posix --enable-checking=release
> >> --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions
> >> --enable-gnu-unique-object
> >> --enable-languages=c,c++,objc,obj-c++,java,fortran,ada
> >> --enable-java-awt=gtk --disable-dssi
> >> --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/jre
> >> --enable-libgcj-multifile --enable-java-maintainer-mode
> >> --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libjava-multilib
> >> --with-ppl --with-cloog --with-tune=generic --with-arch_32=i686
> >> --build=x86_64-redhat-linux
> >> Thread model: posix
> >> gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC)
> >>
> >>
> >> On Thu, Aug 6, 2015 at 7:58 AM, Howard Pritchard <hpprit...@gmail.com>
> >> wrote:
> >> Hi Nate,
> >>
> >> We're trying this out on a mac running mavericks and a cray xc system.
> >> the mac has java 8 while the cray xc has java 7.
> >>
> >> We could not get the code to run just using the java launch command,
> >> although we noticed if you add
> >>
> >> catch(NoClassDefFoundError e) {
> >>
> >> System.out.println("Not using MPI its out to lunch for now");
> >>
> >> }
> >>
> >> as one of the catches after the try for firing up MPI, you can get further.
> >>
> >> Instead we tried on the two systems using
> >>
> >> mpirun -np 1 java MPITestBroke tweets repeat.txt
> >>
> >> and, you guessed it, we can't reproduce the error, at least using master.
> >>
> >> Would you mind trying to get a copy of nightly master build off of
> >>
> >> http://www.open-mpi.org/nightly/master/
> >> and install that version and give it a try.
> >>
> >> If that works, then I'd suggest using master (or v2.0) for now.
> >> Howard
> >>
> >>
> >> 2015-08-05 14:41 GMT-06:00 Nate Chambers <ncham...@usna.edu>:
> >> Howard,
> >>
> >> Thanks for looking at all this. Adding System.gc() did not cause it to
> >> segfault. The segfault still comes much later in the processing.
> >>
> >> I was able to reduce my code to a single test file without other
> >> dependencies. It is attached. This code simply opens a text file and reads
> >> its lines, one by one. Once finished, it closes and opens the same file
> >> and reads the lines again. On my system, it does this about 4 times until
> >> the segfault fires. Obviously this code makes no sense, but it's based on
> >> our actual code that reads millions of lines of data and does various
> >> processing to it.
> >>
> >> Attached is a tweets.tgz file that you can uncompress to have an input
> >> directory. The text file is just the same line over and over again. Run it
> >> as:
> >>
> >> java MPITestBroke tweets/
> >>
> >>
> >> Nate
> >>
> >>
> >> On Wed, Aug 5, 2015 at 8:29 AM, Howard Pritchard <hpprit...@gmail.com>
> >> wrote:
> >> Hi Nate,
> >>
> >> Sorry for the delay in getting back. Thanks for the sanity check. You
> >> may have a point about the args string to MPI.init -
> >> there's nothing the Open MPI is needing from this but that is a difference
> >> with your use case - your app has an argument.
> >>
> >> Would you mind adding a
> >>
> >> System.gc()
> >>
> >> call immediately after MPI.init call and see if the gc blows up with a
> >> segfault?
> >>
> >> Also, may be interesting to add the -verbose:jni to your command line.
> >>
> >> We'll do some experiments here with the init string arg.
> >>
> >> Is your app open source where we could download it and try to reproduce
> >> the problem locally?
> >>
> >> thanks,
> >>
> >> Howard
> >>
> >>
> >> 2015-08-04 18:52 GMT-06:00 Nate Chambers <ncham...@usna.edu>:
> >> Sanity checks pass. Both Hello and Ring.java run correctly with the
> >> expected program's output.
> >>
> >> Does MPI.init(args) expect anything from those command-line args?
> >>
> >>
> >> Nate
> >>
> >>
> >> On Tue, Aug 4, 2015 at 12:26 PM, Howard Pritchard <hpprit...@gmail.com>
> >> wrote:
> >> Hello Nate,
> >>
> >> As a sanity check of your installation, could you try to compile the
> >> examples/*.java codes using the mpijavac you've installed and see that
> >> those run correctly?
> >> I'd be just interested in the Hello.java and Ring.java?
> >>
> >> Howard
> >>
> >>
> >> 2015-08-04 14:34 GMT-06:00 Nate Chambers <ncham...@usna.edu>:
> >> Sure, I reran the configure with CC=gcc and then make install. I think
> >> that's the proper way to do it. Attached is my config log. The behavior
> >> when running our code appears to be the same. The output is the same error
> >> I pasted in my email above. It occurs when calling MPI.init().
> >>
> >> I'm not great at debugging this sort of stuff, but happy to try things out
> >> if you need me to.
> >>
> >> Nate
> >>
> >>
> >> On Tue, Aug 4, 2015 at 5:09 AM, Howard Pritchard <hpprit...@gmail.com>
> >> wrote:
> >> Hello Nate,
> >>
> >> As a first step to addressing this, could you please try using gcc rather
> >> than the Intel compilers to build Open MPI?
> >>
> >> We've been doing a lot of work recently on the java bindings, etc. but
> >> have never tried using any compilers other
> >> than gcc when working with the java bindings.
> >>
> >> Thanks,
> >>
> >> Howard
> >>
> >>
> >> 2015-08-03 17:36 GMT-06:00 Nate Chambers <ncham...@usna.edu>:
> >> We've been struggling with this error for a while, so hoping someone more
> >> knowledgeable can help!
> >>
> >> Our java MPI code exits with a segfault during its normal operation, but
> >> the segfault occurs before our code ever uses MPI functionality like
> >> sending/receiving. We've removed all message calls and any use of
> >> MPI.COMM_WORLD from the code. The segfault occurs if we call
> >> MPI.init(args) in our code, and does not if we comment that line out.
> >> Further vexing us, the crash doesn't happen at the point of the MPI.init
> >> call, but later on in the program. I don't have an easy-to-run example
> >> here because our non-MPI code is so large and complicated. We have run
> >> simpler test programs with MPI and the segfault does not occur.
> >>
> >> We have isolated the line where the segfault occurs. However, if we
> >> comment that out, the program will run longer, but then randomly (but
> >> deterministically) segfault later on in the code. Does anyone have tips on
> >> how to debug this? We have tried several flags with mpirun, but no good
> >> clues.
> >>
> >> We have also tried several MPI versions, including stable 1.8.7 and the
> >> most recent 1.8.8rc1
> >>
> >>
> >> ATTACHED
> >> - config.log from installation
> >> - output from `ompi_info -all`
> >>
> >>
> >> OUTPUT FROM RUNNING
> >>
> >> > mpirun -np 2 java -mx4g FeaturizeDay datadir/ days.txt
> >> ...
> >> some normal output from our code
> >> ...
> >> --------------------------------------------------------------------------
> >> mpirun noticed that process rank 0 with PID 29646 on node r9n69 exited on
> >> signal 11 (Segmentation fault).
> >> --------------------------------------------------------------------------
> >>
> >> _______________________________________________
> >> users mailing list
> >> us...@open-mpi.org
> >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> >> Link to this post:
> >> http://www.open-mpi.org/community/lists/users/2015/08/27386.php
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
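
For readers without the attachments: the reduced test case described in the thread (open a text file, read it line by line, close it, reopen it, repeat) together with Howard's suggestion to tolerate missing MPI classes could look roughly like the sketch below. This is not the attached MPITestBroke.java, which is not reproduced in the thread. The class name ReadLoopSketch, the helper names, taking a single file rather than a directory, and the use of reflection (so the file compiles with plain javac and runs without mpi.jar) are all assumptions made for illustration. The thread writes MPI.init; the sketch assumes the Open MPI Java bindings expose it as mpi.MPI.Init(String[]) and mpi.MPI.Finalize().

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical stand-in for the reduced reproducer described in the thread.
final class ReadLoopSketch {

    /** Reads every line of the file once and returns how many lines were seen. */
    static long countLines(Path file) throws IOException {
        long count = 0;
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            while (in.readLine() != null) {
                count++;
            }
        }
        return count;
    }

    /** Reopens and rereads the same file `passes` times; returns the total lines read. */
    static long rereadFile(Path file, int passes) throws IOException {
        long total = 0;
        for (int i = 0; i < passes; i++) {
            total += countLines(file);
        }
        return total;
    }

    /**
     * Tries to call mpi.MPI.Init(args) via reflection (assumed API name).
     * Returns false instead of failing when the MPI classes cannot be loaded,
     * in the spirit of the catch(NoClassDefFoundError) suggestion in the thread.
     */
    static boolean tryInitMpi(String[] args) {
        try {
            Class<?> mpi = Class.forName("mpi.MPI");
            mpi.getMethod("Init", String[].class).invoke(null, (Object) args);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            System.out.println("MPI classes not available; continuing without MPI");
            return false;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("MPI.Init could not be called", e);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("usage: java ReadLoopSketch <text-file>");
            return;
        }
        boolean usingMpi = tryInitMpi(args);
        // The thread reports the crash after roughly four passes; 10 is an arbitrary margin.
        long total = rereadFile(Paths.get(args[0]), 10);
        System.out.println("lines read in total: " + total);
        if (usingMpi) {
            Class.forName("mpi.MPI").getMethod("Finalize").invoke(null);
        }
    }
}
```

Provided the MPI classes are on the classpath, it would be launched the same way as the commands quoted above, for example `mpirun --mca mtl ^psm -np 1 java ReadLoopSketch <text-file>`, and comparing a run with and without `--mca mtl ^psm` is the experiment Gilles describes.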