All,

It works!! Gilles with the fix!

I ran it with his suggested flags:
mpirun --mca mtl ^psm -np 1 java MPITestBroke data/

The test code now runs without the segfault that used to occur around the 5th
loop. It will be a while before I can put this back into the bigger code that
first caused our segfault, but for now this is looking very promising. I
will keep you posted. Thanks again.
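
(A side note in case it helps someone else hitting this: if I'm reading the
Open MPI docs right, the same selection can be made without retyping the flag,
either by exporting OMPI_MCA_mtl=^psm in the environment before running

mpirun -np 1 java MPITestBroke data/

or by adding a line like "mtl = ^psm" to $HOME/.openmpi/mca-params.conf.)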


Nate



On Sat, Aug 15, 2015 at 6:30 PM, Howard Pritchard <hpprit...@gmail.com>
wrote:

> Gilles,
>
> On hopper there aren't any psm libraries - it's an infiniband/infinipath-free
> system - at least on the compute nodes.
>
> For my own work, I never use things like the platform files; I just do
> ./configure --prefix=blahblah --enable-mpi-java (and whatever else I want
> to test this time).
>
> Thanks for the ideas though,
>
> Howard
>
>
> 2015-08-14 19:20 GMT-06:00 Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com>:
>
>> Howard,
>>
>> I have no infinipath hardware, but the infinipath libraries are installed.
>> I tried to run with --mca mtl_psm_priority 0 instead of --mca mtl ^psm
>> but that did not work.
>> Without the psm mtl, I was unable to reproduce the persistent communication
>> issue, so I concluded there was only one issue here.
>>
>> Do you configure with --disable-dlopen on hopper?
>> I wonder whether --mca mtl ^psm is effective when dlopen is disabled.
>>
>> Cheers,
>>
>> Gilles
>>
>> On Saturday, August 15, 2015, Howard Pritchard <hpprit...@gmail.com>
>> wrote:
>>
>>> Hi Jeff,
>>>
>>> I don't know why Gilles keeps bringing up the persistent request problem
>>> and mixing it up with this user's bug.  I do think that for this user psm
>>> is probably the problem.
>>>
>>>
>>> They don't have anything to do with each other.
>>>
>>> I can reproduce the persistent request problem on hopper consistently.
>>> As I said on the telecon last week, it has something to do with memory
>>> corruption of the receive buffer that is associated with the persistent
>>> request.
>>>
>>> Howard
>>>
>>>
>>> 2015-08-14 11:21 GMT-06:00 Jeff Squyres (jsquyres) <jsquy...@cisco.com>:
>>>
>>>> Hmm.  Oscar's not around to ask any more, but I'd be greatly surprised
>>>> if he had InfiniPath on his systems where he ran into this segv issue...?
>>>>
>>>>
>>>> > On Aug 14, 2015, at 1:08 PM, Howard Pritchard <hpprit...@gmail.com>
>>>> wrote:
>>>> >
>>>> > Hi Gilles,
>>>> >
>>>> > Good catch!  Nate, we hadn't been testing on an infinipath system.
>>>> >
>>>> > Howard
>>>> >
>>>> >
>>>> > 2015-08-14 0:20 GMT-06:00 Gilles Gouaillardet <gil...@rist.or.jp>:
>>>> > Nate,
>>>> >
>>>> > I could get rid of the problem by not using the psm mtl.
>>>> > The infinipath library (used by the psm mtl) sets some signal handlers
>>>> > that conflict with the JVM; this can be seen by running
>>>> > mpirun -np 1 java -Xcheck:jni MPITestBroke data/
>>>> >
>>>> > So instead of running
>>>> > mpirun -np 1 java MPITestBroke data/
>>>> > please run
>>>> > mpirun --mca mtl ^psm -np 1 java MPITestBroke data/
>>>> >
>>>> > That solved the issue for me.
>>>> >
>>>> > Cheers,
>>>> >
>>>> > Gilles
>>>> >
>>>> > On 8/13/2015 9:19 AM, Nate Chambers wrote:
>>>> >> I appreciate you trying to help! I put the Java and its compiled
>>>> .class file on Dropbox. The directory contains the .java and .class files,
>>>> as well as a data/ directory:
>>>> >>
>>>> >>
>>>> http://www.dropbox.com/sh/pds5c5wecfpb2wk/AAAcz17UTDQErmrUqp2SPjpqa?dl=0
>>>> >>
>>>> >> You can run it with and without MPI:
>>>> >>
>>>> >> >  java MPITestBroke data/
>>>> >> >  mpirun -np 1 java MPITestBroke data/
>>>> >>
>>>> >> Attached is a text file of what I see when I run it with mpirun and
>>>> your debug flag. Lots of debug lines.
>>>> >>
>>>> >>
>>>> >> Nate
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >> On Wed, Aug 12, 2015 at 11:09 AM, Howard Pritchard <
>>>> hpprit...@gmail.com> wrote:
>>>> >> Hi Nate,
>>>> >>
>>>> >> Sorry for the delay in getting back to you.
>>>> >> We're somewhat stuck on how to help you, but here are two
>>>> suggestions.
>>>> >>
>>>> >> Could you add the following to your launch command line
>>>> >>
>>>> >> --mca odls_base_verbose 100
>>>> >>
>>>> >> so we can see exactly what arguments are being fed to java when launching
>>>> >> your app?
>>>> >>
>>>> >> Also, if you could put your MPITestBroke.class file somewhere (like
>>>> google drive)
>>>> >> where we could get it and try to run locally or at NERSC, that might
>>>> help us
>>>> >> narrow down the problem.    Better yet, if you have the class or jar
>>>> file for
>>>> >> the entire app plus some data sets, we could try that out as well.
>>>> >>
>>>> >> All the config outputs, etc. you've sent so far indicate a correct
>>>> >> installation of Open MPI.
>>>> >>
>>>> >> Howard
>>>> >>
>>>> >>
>>>> >> On Aug 6, 2015 1:54 PM, "Nate Chambers" <ncham...@usna.edu> wrote:
>>>> >> Howard,
>>>> >>
>>>> >> I tried the nightly build openmpi-dev-2223-g731cfe3 and it still
>>>> segfaults as before. I must admit I am new to MPI, so is it possible I'm
>>>> just configuring or running incorrectly? Let me list my steps for you, and
>>>> maybe something will jump out? Also attached is my config.log.
>>>> >>
>>>> >>
>>>> >> CONFIGURE
>>>> >> ./configure --prefix=<install-dir> --enable-mpi-java CC=gcc
>>>> >>
>>>> >> MAKE
>>>> >> make all install
>>>> >>
>>>> >> RUN
>>>> >> <install-dir>/mpirun -np 1 java MPITestBroke twitter/
>>>> >>
>>>> >>
>>>> >> DEFAULT JAVA AND GCC
>>>> >>
>>>> >> $ java -version
>>>> >> java version "1.7.0_21"
>>>> >> Java(TM) SE Runtime Environment (build 1.7.0_21-b11)
>>>> >> Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode)
>>>> >>
>>>> >> $ gcc -v
>>>> >> Using built-in specs.
>>>> >> Target: x86_64-redhat-linux
>>>> >> Configured with: ../configure --prefix=/usr --mandir=/usr/share/man
>>>> --infodir=/usr/share/info --with-bugurl=
>>>> http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared
>>>> --enable-threads=posix --enable-checking=release --with-system-zlib
>>>> --enable-__cxa_atexit --disable-libunwind-exceptions
>>>> --enable-gnu-unique-object
>>>> --enable-languages=c,c++,objc,obj-c++,java,fortran,ada
>>>> --enable-java-awt=gtk --disable-dssi
>>>> --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/jre
>>>> --enable-libgcj-multifile --enable-java-maintainer-mode
>>>> --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libjava-multilib
>>>> --with-ppl --with-cloog --with-tune=generic --with-arch_32=i686
>>>> --build=x86_64-redhat-linux
>>>> >> Thread model: posix
>>>> >> gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC)
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >> On Thu, Aug 6, 2015 at 7:58 AM, Howard Pritchard <
>>>> hpprit...@gmail.com> wrote:
>>>> >> HI Nate,
>>>> >>
>>>> >> We're trying this out on a Mac running Mavericks and a Cray XC
>>>> >> system.  The Mac has Java 8 while the Cray XC has Java 7.
>>>> >>
>>>> >> We could not get the code to run just using the java launch command,
>>>> >> although we noticed that if you add
>>>> >>
>>>> >>     catch(NoClassDefFoundError e) {
>>>> >>
>>>> >>       System.out.println("Not using MPI its out to lunch for now");
>>>> >>
>>>> >>     }
>>>> >>
>>>> >> as one of the catches after the try for firing up MPI, you can get
>>>> further.
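
For reference, a rough, self-contained sketch of the guarded startup described
above (a sketch only, assuming the standard mpi.MPI Java bindings that ship
with Open MPI; the class name and messages are illustrative):

    import mpi.*;

    public class GuardedInit {
        public static void main(String[] args) {
            boolean haveMPI = true;
            try {
                // fire up MPI inside the try block
                MPI.Init(args);
            } catch (MPIException e) {
                haveMPI = false;
                System.out.println("MPI init failed: " + e.getMessage());
            } catch (NoClassDefFoundError e) {
                // typically thrown when the mpi classes are not on the
                // classpath, e.g. when launched with plain "java"
                haveMPI = false;
                System.out.println("Not using MPI, it's out to lunch for now");
            }

            // ... rest of the application, with or without MPI ...

            if (haveMPI) {
                try { MPI.Finalize(); } catch (MPIException e) { }
            }
        }
    }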
>>>> >>
>>>> >> Instead we tried on the two systems using
>>>> >>
>>>> >> mpirun -np 1 java MPITestBroke tweets repeat.txt
>>>> >>
>>>> >> and, you guessed it, we can't reproduce the error, at least using
>>>> master.
>>>> >>
>>>> >> Would you mind trying to get a copy of the nightly master build off of
>>>> >>
>>>> >> http://www.open-mpi.org/nightly/master/
>>>> >> and install that version and give it a try?
>>>> >>
>>>> >> If that works, then I'd suggest using master (or v2.0) for now.
>>>> >> Howard
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >> 2015-08-05 14:41 GMT-06:00 Nate Chambers <ncham...@usna.edu>:
>>>> >> Howard,
>>>> >>
>>>> >> Thanks for looking at all this. Adding System.gc() did not cause it
>>>> to segfault. The segfault still comes much later in the processing.
>>>> >>
>>>> >> I was able to reduce my code to a single test file without other
>>>> dependencies. It is attached. This code simply opens a text file and reads
>>>> its lines, one by one. Once finished, it closes and opens the same file and
>>>> reads the lines again. On my system, it does this about 4 times until the
>>>> segfault fires. Obviously this code makes no sense, but it's based on our
>>>> actual code that reads millions of lines of data and does various
>>>> processing to it.
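
For readers without the attachment, the structure described above is roughly
the following. This is only a sketch reconstructed from the description, not
the attached file; the class name MPITestBroke and the directory argument come
from the thread, while the file name, loop count, and everything else are
illustrative:

    import mpi.*;
    import java.io.*;

    public class MPITestBroke {
        public static void main(String[] args) throws Exception {
            MPI.Init(args);   // the crash only appears when this call is present

            // Repeatedly open, read, and close the same text file;
            // no MPI communication is ever performed.
            File input = new File(args[0], "repeat.txt");   // file name is illustrative
            for (int pass = 0; pass < 10; pass++) {         // segfault reportedly around the 4th-5th pass
                BufferedReader reader = new BufferedReader(new FileReader(input));
                long lines = 0;
                while (reader.readLine() != null)
                    lines++;
                reader.close();
                System.out.println("pass " + pass + ": " + lines + " lines");
            }

            MPI.Finalize();
        }
    }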
>>>> >>
>>>> >> Attached is a tweets.tgz file that you can uncompress to have an
>>>> input directory. The text file is just the same line over and over again.
>>>> Run it as:
>>>> >>
>>>> >> java MPITestBroke tweets/
>>>> >>
>>>> >>
>>>> >> Nate
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >> On Wed, Aug 5, 2015 at 8:29 AM, Howard Pritchard <
>>>> hpprit...@gmail.com> wrote:
>>>> >> Hi Nate,
>>>> >>
>>>> >> Sorry for the delay in getting back.  Thanks for the sanity check.
>>>> >> You may have a point about the args string to MPI.init -
>>>> >> there's nothing Open MPI needs from it, but that is a difference
>>>> >> with your use case - your app has an argument.
>>>> >>
>>>> >> Would you mind adding a
>>>> >>
>>>> >> System.gc()
>>>> >>
>>>> >> call immediately after the MPI.init call and seeing if the gc blows up
>>>> >> with a segfault?
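
That is, something like the following fragment right after init (a minimal
sketch assuming the mpi.MPI Java bindings; nothing else changed):

    MPI.Init(args);    // initialize MPI as before
    System.gc();       // request a GC right away; a segfault here would point at init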
>>>> >>
>>>> >> Also, it may be interesting to add the -verbose:jni flag to your command
>>>> >> line.
>>>> >>
>>>> >> We'll do some experiments here with the init string arg.
>>>> >>
>>>> >> Is your app open source, so that we could download it and try to
>>>> >> reproduce the problem locally?
>>>> >>
>>>> >> thanks,
>>>> >>
>>>> >> Howard
>>>> >>
>>>> >>
>>>> >> 2015-08-04 18:52 GMT-06:00 Nate Chambers <ncham...@usna.edu>:
>>>> >> Sanity checks pass. Both Hello.java and Ring.java run correctly with the
>>>> >> expected program output.
>>>> >>
>>>> >> Does MPI.init(args) expect anything from those command-line args?
>>>> >>
>>>> >>
>>>> >> Nate
>>>> >>
>>>> >>
>>>> >> On Tue, Aug 4, 2015 at 12:26 PM, Howard Pritchard <
>>>> hpprit...@gmail.com> wrote:
>>>> >> Hello Nate,
>>>> >>
>>>> >> As a sanity check of your installation, could you try to compile the
>>>> >> examples/*.java codes using the mpijavac you've installed and see that
>>>> >> those run correctly?
>>>> >> I'd just be interested in Hello.java and Ring.java.
>>>> >>
>>>> >> Howard
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >> 2015-08-04 14:34 GMT-06:00 Nate Chambers <ncham...@usna.edu>:
>>>> >> Sure, I reran the configure with CC=gcc and then make install. I
>>>> think that's the proper way to do it. Attached is my config log. The
>>>> behavior when running our code appears to be the same. The output is the
>>>> same error I pasted in my email above. It occurs when calling MPI.init().
>>>> >>
>>>> >> I'm not great at debugging this sort of stuff, but happy to try
>>>> things out if you need me to.
>>>> >>
>>>> >> Nate
>>>> >>
>>>> >>
>>>> >> On Tue, Aug 4, 2015 at 5:09 AM, Howard Pritchard <
>>>> hpprit...@gmail.com> wrote:
>>>> >> Hello Nate,
>>>> >>
>>>> >> As a first step to addressing this, could you please try using gcc
>>>> rather than the Intel compilers to build Open MPI?
>>>> >>
>>>> >> We've been doing a lot of work recently on the java bindings, etc.
>>>> but have never tried using any compilers other
>>>> >> than gcc when working with the java bindings.
>>>> >>
>>>> >> Thanks,
>>>> >>
>>>> >> Howard
>>>> >>
>>>> >>
>>>> >> 2015-08-03 17:36 GMT-06:00 Nate Chambers <ncham...@usna.edu>:
>>>> >> We've been struggling with this error for a while, so hoping someone
>>>> more knowledgeable can help!
>>>> >>
>>>> >> Our Java MPI code exits with a segfault during its normal operation,
>>>> >> but the segfault occurs before our code ever uses MPI functionality like
>>>> >> sending/receiving. We've removed all message calls and any use of
>>>> >> MPI.COMM_WORLD from the code. The segfault occurs if we call MPI.init(args)
>>>> >> in our code, and does not if we comment that line out. Further vexing us,
>>>> >> the crash doesn't happen at the point of the MPI.init call, but later on in
>>>> >> the program. I don't have an easy-to-run example here because our non-MPI
>>>> >> code is so large and complicated. We have run simpler test programs with
>>>> >> MPI and the segfault does not occur.
>>>> >>
>>>> >> We have isolated the line where the segfault occurs. However, if we
>>>> comment that out, the program will run longer, but then randomly (but
>>>> deterministically) segfault later on in the code. Does anyone have tips on
>>>> how to debug this? We have tried several flags with mpirun, but no good
>>>> clues.
>>>> >>
>>>> >> We have also tried several MPI versions, including stable 1.8.7 and
>>>> >> the most recent 1.8.8rc1.
>>>> >>
>>>> >>
>>>> >> ATTACHED
>>>> >> - config.log from installation
>>>> >> - output from `ompi_info -all`
>>>> >>
>>>> >>
>>>> >> OUTPUT FROM RUNNING
>>>> >>
>>>> >> > mpirun -np 2 java -mx4g FeaturizeDay datadir/ days.txt
>>>> >> ...
>>>> >> some normal output from our code
>>>> >> ...
>>>> >>
>>>> --------------------------------------------------------------------------
>>>> >> mpirun noticed that process rank 0 with PID 29646 on node r9n69
>>>> exited on signal 11 (Segmentation fault).
>>>> >>
>>>> --------------------------------------------------------------------------
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>>
>>>>
>>>> --
>>>> Jeff Squyres
>>>> jsquy...@cisco.com
>>>> For corporate legal information go to:
>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>
>>>>
>>>
>>>
>>
>
>
>
