All,

It works!! Gilles with the fix!
I ran it with his suggested flags:

    mpirun --mca mtl ^psm -np 1 java MPITestBroke data/

The test code now runs without the segfault that used to occur around the 5th loop. It will be a while before I can put this back into our bigger code that first caused our segfault, but for now this is looking very promising. I will keep you posted.

Thanks again.
Nate

On Sat, Aug 15, 2015 at 6:30 PM, Howard Pritchard <hpprit...@gmail.com> wrote:
> Gilles,
>
> On hopper there aren't any psm libraries - it's an infiniband/infinipath-free system - at least on the compute nodes.
>
> For my own work, I never use things like the platform files, I just do ./configure --prefix=blahblah --enable-mpi-java (and whatever else I want to test this time).
>
> Thanks for the ideas though,
>
> Howard
>
> 2015-08-14 19:20 GMT-06:00 Gilles Gouaillardet <gilles.gouaillar...@gmail.com>:
>> Howard,
>>
>> I have no infinipath hardware, but the infinipath libraries are installed. I tried to run with --mca mtl_psm_priority 0 instead of --mca mtl ^psm, but that did not work. Without the psm mtl, I was unable to reproduce the persistent communication issue, so I concluded there was only one issue here.
>>
>> Do you configure with --disable-dlopen on hopper? I wonder whether --mca mtl ^psm is effective if dlopen is disabled.
>>
>> Cheers,
>>
>> Gilles
>>
>> On Saturday, August 15, 2015, Howard Pritchard <hpprit...@gmail.com> wrote:
>>> Hi Jeff,
>>>
>>> I don't know why Gilles keeps picking on the persistent request problem and mixing it up with this user bug. I do think for this user the psm probably is the problem.
>>>
>>> They don't have anything to do with each other.
>>>
>>> I can reproduce the persistent request problem on hopper consistently. As I said on the telecon last week, it has something to do with memory corruption of the receive buffer that is associated with the persistent request.
>>>
>>> Howard
>>>
>>> 2015-08-14 11:21 GMT-06:00 Jeff Squyres (jsquyres) <jsquy...@cisco.com>:
>>>> Hmm. Oscar's not around to ask any more, but I'd be greatly surprised if he had InfiniPath on his systems where he ran into this segv issue...?
>>>>
>>>> > On Aug 14, 2015, at 1:08 PM, Howard Pritchard <hpprit...@gmail.com> wrote:
>>>> >
>>>> > Hi Gilles,
>>>> >
>>>> > Good catch! Nate, we hadn't been testing on an infinipath system.
>>>> >
>>>> > Howard
>>>> >
>>>> > 2015-08-14 0:20 GMT-06:00 Gilles Gouaillardet <gil...@rist.or.jp>:
>>>> > Nate,
>>>> >
>>>> > I could get rid of the problem by not using the psm mtl. The infinipath library (used by the psm mtl) sets some signal handlers that conflict with the JVM. That can be seen by running
>>>> >
>>>> >     mpirun -np 1 java -Xcheck:jni MPITestBroke data/
>>>> >
>>>> > So instead of running
>>>> >
>>>> >     mpirun -np 1 java MPITestBroke data/
>>>> >
>>>> > please run
>>>> >
>>>> >     mpirun --mca mtl ^psm -np 1 java MPITestBroke data/
>>>> >
>>>> > That solved the issue for me.
>>>> >
>>>> > Cheers,
>>>> >
>>>> > Gilles
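(For reference, the same MCA selection can also be made without editing every mpirun command line. These are the standard Open MPI mechanisms for setting MCA parameters, assuming a default installation; the values below just restate Gilles' --mca mtl ^psm workaround:

    export OMPI_MCA_mtl=^psm        # environment-variable form, picked up by mpirun

    # or add this line to $HOME/.openmpi/mca-params.conf
    mtl = ^psm

Either form is equivalent to passing --mca mtl ^psm on the mpirun command line.)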
>>>> > On 8/13/2015 9:19 AM, Nate Chambers wrote:
>>>> >> I appreciate you trying to help! I put the Java and its compiled .class file on Dropbox. The directory contains the .java and .class files, as well as a data/ directory:
>>>> >>
>>>> >> http://www.dropbox.com/sh/pds5c5wecfpb2wk/AAAcz17UTDQErmrUqp2SPjpqa?dl=0
>>>> >>
>>>> >> You can run it with and without MPI:
>>>> >>
>>>> >> > java MPITestBroke data/
>>>> >> > mpirun -np 1 java MPITestBroke data/
>>>> >>
>>>> >> Attached is a text file of what I see when I run it with mpirun and your debug flag. Lots of debug lines.
>>>> >>
>>>> >> Nate
>>>> >>
>>>> >> On Wed, Aug 12, 2015 at 11:09 AM, Howard Pritchard <hpprit...@gmail.com> wrote:
>>>> >> Hi Nate,
>>>> >>
>>>> >> Sorry for the delay in getting back to you. We're somewhat stuck on how to help you, but here are two suggestions.
>>>> >>
>>>> >> Could you add the following to your launch command line
>>>> >>
>>>> >>     --mca odls_base_verbose 100
>>>> >>
>>>> >> so we can see exactly what arguments are being fed to java when launching your app.
>>>> >>
>>>> >> Also, if you could put your MPITestBroke.class file somewhere (like google drive) where we could get it and try to run locally or at NERSC, that might help us narrow down the problem. Better yet, if you have the class or jar file for the entire app plus some data sets, we could try that out as well.
>>>> >>
>>>> >> All the config outputs, etc. you've sent so far indicate a correct installation of Open MPI.
>>>> >>
>>>> >> Howard
>>>> >>
>>>> >> On Aug 6, 2015 1:54 PM, "Nate Chambers" <ncham...@usna.edu> wrote:
>>>> >> Howard,
>>>> >>
>>>> >> I tried the nightly build openmpi-dev-2223-g731cfe3 and it still segfaults as before. I must admit I am new to MPI, so is it possible I'm just configuring or running incorrectly? Let me list my steps for you, and maybe something will jump out? Also attached is my config.log.
>>>> >>
>>>> >> CONFIGURE
>>>> >> ./configure --prefix=<install-dir> --enable-mpi-java CC=gcc
>>>> >>
>>>> >> MAKE
>>>> >> make all install
>>>> >>
>>>> >> RUN
>>>> >> <install-dir>/mpirun -np 1 java MPITestBroke twitter/
>>>> >>
>>>> >> DEFAULT JAVA AND GCC
>>>> >>
>>>> >> $ java -version
>>>> >> java version "1.7.0_21"
>>>> >> Java(TM) SE Runtime Environment (build 1.7.0_21-b11)
>>>> >> Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode)
>>>> >>
>>>> >> $ gcc --v
>>>> >> Using built-in specs.
>>>> >> Target: x86_64-redhat-linux
>>>> >> Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/jre --enable-libgcj-multifile --enable-java-maintainer-mode --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libjava-multilib --with-ppl --with-cloog --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
>>>> >> Thread model: posix
>>>> >> gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC)
>>>> >> On Thu, Aug 6, 2015 at 7:58 AM, Howard Pritchard <hpprit...@gmail.com> wrote:
>>>> >> Hi Nate,
>>>> >>
>>>> >> We're trying this out on a Mac running Mavericks and a Cray XC system. The Mac has Java 8 while the Cray XC has Java 7.
>>>> >>
>>>> >> We could not get the code to run just using the java launch command, although we noticed that if you add
>>>> >>
>>>> >>     catch(NoClassDefFoundError e) {
>>>> >>         System.out.println("Not using MPI its out to lunch for now");
>>>> >>     }
>>>> >>
>>>> >> as one of the catches after the try for firing up MPI, you can get further.
>>>> >>
>>>> >> Instead we tried on the two systems using
>>>> >>
>>>> >>     mpirun -np 1 java MPITestBroke tweets repeat.txt
>>>> >>
>>>> >> and, you guessed it, we can't reproduce the error, at least using master.
>>>> >>
>>>> >> Would you mind trying to get a copy of the nightly master build from
>>>> >>
>>>> >>     http://www.open-mpi.org/nightly/master/
>>>> >>
>>>> >> and install that version and give it a try?
>>>> >>
>>>> >> If that works, then I'd suggest using master (or v2.0) for now.
>>>> >>
>>>> >> Howard
>>>> >>
>>>> >> 2015-08-05 14:41 GMT-06:00 Nate Chambers <ncham...@usna.edu>:
>>>> >> Howard,
>>>> >>
>>>> >> Thanks for looking at all this. Adding System.gc() did not cause it to segfault. The segfault still comes much later in the processing.
>>>> >>
>>>> >> I was able to reduce my code to a single test file without other dependencies. It is attached. This code simply opens a text file and reads its lines, one by one. Once finished, it closes and opens the same file and reads the lines again. On my system, it does this about 4 times until the segfault fires. Obviously this code makes no sense, but it's based on our actual code that reads millions of lines of data and does various processing to it.
>>>> >>
>>>> >> Attached is a tweets.tgz file that you can uncompress to have an input directory. The text file is just the same line over and over again. Run it as:
>>>> >>
>>>> >>     java MPITestBroke tweets/
>>>> >>
>>>> >> Nate
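(For readers without the attachment, a rough sketch of the kind of loop being described follows. This is not the attached MPITestBroke.java, just an illustration; it assumes the standard Open MPI Java bindings (mpi.MPI on the classpath, compiled with mpijavac) and a hypothetical repeat.txt file inside the input directory.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import mpi.MPI;

    public class ReadLoopSketch {
        public static void main(String[] args) throws Exception {
            MPI.Init(args);                          // init only; no sends or receives
            String path = args[0] + "/repeat.txt";   // hypothetical file name
            for (int pass = 0; pass < 10; pass++) {  // close and re-open the same file each pass
                BufferedReader reader = new BufferedReader(new FileReader(path));
                long count = 0;
                while (reader.readLine() != null) {
                    count++;                         // no real work, just read every line
                }
                reader.close();
                System.out.println("pass " + pass + ": read " + count + " lines");
            }
            MPI.Finalize();
        }
    }

The reported crash happens several passes into a loop like this, well after MPI.Init returns.)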
>>>> >> On Wed, Aug 5, 2015 at 8:29 AM, Howard Pritchard <hpprit...@gmail.com> wrote:
>>>> >> Hi Nate,
>>>> >>
>>>> >> Sorry for the delay in getting back. Thanks for the sanity check. You may have a point about the args string to MPI.init - there's nothing Open MPI needs from it, but that is a difference with your use case - your app has an argument.
>>>> >>
>>>> >> Would you mind adding a
>>>> >>
>>>> >>     System.gc()
>>>> >>
>>>> >> call immediately after the MPI.init call and see if the gc blows up with a segfault?
>>>> >>
>>>> >> Also, it may be interesting to add -verbose:jni to your command line.
>>>> >>
>>>> >> We'll do some experiments here with the init string arg.
>>>> >>
>>>> >> Is your app open source where we could download it and try to reproduce the problem locally?
>>>> >>
>>>> >> thanks,
>>>> >>
>>>> >> Howard
>>>> >>
>>>> >> 2015-08-04 18:52 GMT-06:00 Nate Chambers <ncham...@usna.edu>:
>>>> >> Sanity checks pass. Both Hello and Ring.java run correctly with the expected output.
>>>> >>
>>>> >> Does MPI.init(args) expect anything from those command-line args?
>>>> >>
>>>> >> Nate
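(For concreteness, the start-up experiment described above - a System.gc() right after MPI.Init, plus the NoClassDefFoundError catch mentioned earlier in the thread - might look roughly like this. It is an illustrative fragment only, not code from Nate's app:

    import mpi.MPI;
    import mpi.MPIException;

    public class InitGcCheck {
        public static void main(String[] args) throws MPIException {
            try {
                MPI.Init(args);    // the app's own arguments are passed straight through
                System.gc();       // force a collection right after init, per the suggestion above
            } catch (NoClassDefFoundError e) {
                // the extra catch Howard mentioned: keep going without MPI
                System.out.println("MPI classes not found; continuing without MPI");
                return;
            }
            // ... the rest of the program would go here ...
            MPI.Finalize();
        }
    }
)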
>>>> >> On Tue, Aug 4, 2015 at 12:26 PM, Howard Pritchard <hpprit...@gmail.com> wrote:
>>>> >> Hello Nate,
>>>> >>
>>>> >> As a sanity check of your installation, could you try to compile the examples/*.java codes using the mpijavac you've installed and see that those run correctly? I'd just be interested in Hello.java and Ring.java.
>>>> >>
>>>> >> Howard
>>>> >>
>>>> >> 2015-08-04 14:34 GMT-06:00 Nate Chambers <ncham...@usna.edu>:
>>>> >> Sure, I reran the configure with CC=gcc and then make install. I think that's the proper way to do it. Attached is my config log. The behavior when running our code appears to be the same. The output is the same error I pasted in my email above. It occurs when calling MPI.init().
>>>> >>
>>>> >> I'm not great at debugging this sort of stuff, but happy to try things out if you need me to.
>>>> >>
>>>> >> Nate
>>>> >>
>>>> >> On Tue, Aug 4, 2015 at 5:09 AM, Howard Pritchard <hpprit...@gmail.com> wrote:
>>>> >> Hello Nate,
>>>> >>
>>>> >> As a first step to addressing this, could you please try using gcc rather than the Intel compilers to build Open MPI?
>>>> >>
>>>> >> We've been doing a lot of work recently on the java bindings, etc. but have never tried using any compilers other than gcc when working with the java bindings.
>>>> >>
>>>> >> Thanks,
>>>> >>
>>>> >> Howard
>>>> >>
>>>> >> 2015-08-03 17:36 GMT-06:00 Nate Chambers <ncham...@usna.edu>:
>>>> >> We've been struggling with this error for a while, so hoping someone more knowledgeable can help!
>>>> >>
>>>> >> Our Java MPI code exits with a segfault during its normal operation, but the segfault occurs before our code ever uses MPI functionality like sending/receiving. We've removed all message calls and any use of MPI.COMM_WORLD from the code. The segfault occurs if we call MPI.init(args) in our code, and does not if we comment that line out. Further vexing us, the crash doesn't happen at the point of the MPI.init call, but later on in the program. I don't have an easy-to-run example here because our non-MPI code is so large and complicated. We have run simpler test programs with MPI and the segfault does not occur.
>>>> >>
>>>> >> We have isolated the line where the segfault occurs. However, if we comment that out, the program will run longer, but then randomly (but deterministically) segfault later on in the code. Does anyone have tips on how to debug this? We have tried several flags with mpirun, but no good clues. We have also tried several MPI versions, including stable 1.8.7 and the most recent 1.8.8rc1.
>>>> >>
>>>> >> ATTACHED
>>>> >> - config.log from installation
>>>> >> - output from `ompi_info -all`
>>>> >>
>>>> >> OUTPUT FROM RUNNING
>>>> >>
>>>> >> > mpirun -np 2 java -mx4g FeaturizeDay datadir/ days.txt
>>>> >> ...
>>>> >> some normal output from our code
>>>> >> ...
>>>> >> --------------------------------------------------------------------------
>>>> >> mpirun noticed that process rank 0 with PID 29646 on node r9n69 exited on signal 11 (Segmentation fault).
>>>> >> --------------------------------------------------------------------------