We did update ROMIO at some point in there, so it is possible this is a ROMIO bug that we have picked up. I've asked someone to check upstream about it.
On Jan 17, 2014, at 12:02 PM, Ronald Cohen <rhco...@lbl.gov> wrote:

> Sorry, too many entries in this thread, I guess. My general goal is to get a working parallel hdf5 with openmpi on Mac OS X Mavericks. At one point in the saga I had romio disabled, which naturally doesn't work for hdf5 (which is trying to read/write files in parallel), so the hdf5 tests would of course fail. I subsequently had link errors with hdf5 because I was building openmpi with --disable-static, whereas the default (and recommended) option for hdf5 is to disable shared and build static. My most recent attempts were with openmpi built with --enable-static and --disable-dlopen. In that case, with openmpi 1.7.4rc1, hdf5 1.8.12 configured and built successfully, but make check-p produced many errors in its t_mpi tests, with messages like "proc 4: found data error at [2140143616+0], expect -7, got 6". The errors were reproduced by the HDF5 testing team with openmpi 1.7.4rc1, but not with 1.7.3 (which I am now building).

> Hopefully that is an adequate summary.

> Ron

> On Fri, Jan 17, 2014 at 11:44 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> Can you specify exactly which issue you're referring to?

> - test failing when you had ROMIO disabled
> - test (sometimes) failing when you had ROMIO disabled
> - compiling / linking issues

> ?

> On Jan 17, 2014, at 1:50 PM, Ronald Cohen <rhco...@lbl.gov> wrote:

> > Hello Ralph and others, I just got the following back from the HDF5 support group, suggesting an ompi bug. So I should either try 1.7.3 or a recent nightly 1.7.4. I will likely opt for 1.7.3, but hopefully someone at openmpi can look at the problem for 1.7.4. In short, the challenge is to get a parallel hdf5 that passes make check-p with 1.7.4.

> > ------------------
> > Hi Ron,

> > I had sent your message to the developer and he can reproduce the issue. Here is what he says:

> > ---
> > I replicated this on Jam with ompi 1.7.4rc1. I saw the same error he is seeing. Note that this is an unstable release for ompi. I tried ompi 1.7.3 (a feature release, a little more stable) and didn't see the problems there.

> > So this is an ompi bug. He can report it to the ompi list. He can just point them to the t_mpi.c tests in our test suite in testpar/ and say it occurs with their 1.7.4rc1.
> > ---

> > -Barbara

> > On Fri, Jan 17, 2014 at 9:39 AM, Ronald Cohen <rhco...@lbl.gov> wrote:
> > Thanks, I've just gotten an email with some suggestions (and promise of more help) from the HDF5 support team. I will report back here, as it may be of interest to others trying to build hdf5 on Mavericks.

> > On Fri, Jan 17, 2014 at 9:08 AM, Ralph Castain <r...@open-mpi.org> wrote:
> > Afraid I have no idea, but hopefully someone else here with experience with HDF5 can chime in?

> > On Jan 17, 2014, at 9:03 AM, Ronald Cohen <rhco...@lbl.gov> wrote:

> >> Still a timely response, thank you. The particular problem I noted hasn't recurred; for reasons I will explain shortly I had to rebuild openmpi again, and this time Sample_mpio.c compiled and ran successfully from the start.

> >> But now my problem is trying to get parallel HDF5 to run. In my first attempt to build HDF5 it failed in the link stage because of unsatisfied externals from openmpi, and I deduced the problem was having built openmpi with --disable-static. So I rebuilt with --enable-static and --disable-dlopen (emulating a successful openmpi + hdf5 combination I had built on Snow Leopard). Once again openmpi passed its make checks, and as noted above the Sample_mpio.c test compiled and ran fine. The parallel hdf5 configure and make steps also ran successfully. But when I ran make check for hdf5, the serial tests passed but none of the parallel tests did. Over a million test failures! Error messages like:

> >> Proc 0: *** MPIO File size range test...
> >> --------------------------------
> >> MPI_Offset is signed 8 bytes integeral type
> >> MPIO GB file write test MPItest.h5
> >> MPIO GB file write test MPItest.h5
> >> MPIO GB file write test MPItest.h5
> >> MPIO GB file write test MPItest.h5
> >> MPIO GB file write test MPItest.h5
> >> MPIO GB file write test MPItest.h5
> >> MPIO GB file read test MPItest.h5
> >> MPIO GB file read test MPItest.h5
> >> MPIO GB file read test MPItest.h5
> >> MPIO GB file read test MPItest.h5
> >> proc 3: found data error at [2141192192+0], expect -6, got 5
> >> proc 3: found data error at [2141192192+1], expect -6, got 5

> >> And the specific errors reported (which processor, which location, and the total number of errors) change if I rerun make check.

> >> I've sent configure, make and make check logs to the HDF5 help desk but haven't gotten a response.
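For context on what these failures appear to be checking: the HDF5 parallel tests seem to boil down to a write-then-verify pattern over MPI-IO, where each rank writes a known pattern into a shared file and later reads part of it back and compares it byte by byte. The sketch below only illustrates that pattern; it is not the actual testpar/t_mpi.c code. The file name, block size, and expected-value formula are invented here, and error checking is omitted.

/*
 * Illustrative sketch (not the actual HDF5 testpar/t_mpi.c): each rank
 * writes a known pattern into its own block of a shared file, the file
 * is closed and reopened, and then each rank reads a neighbour's block
 * back and compares it byte by byte.  File name, block size, and the
 * pattern are invented for this example; error checking is omitted.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define BLOCK 1024   /* bytes per rank; the real test works at GB-scale offsets */

int main(int argc, char *argv[])
{
    int rank, nprocs, i, peer, nerrs = 0;
    signed char buf[BLOCK];
    MPI_File fh;
    MPI_Offset off;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Phase 1: every rank writes its pattern into its own block. */
    MPI_File_open(MPI_COMM_WORLD, "verify.data",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);
    for (i = 0; i < BLOCK; i++)
        buf[i] = (signed char)(rank + 1);
    off = (MPI_Offset)rank * BLOCK;
    MPI_File_write_at(fh, off, buf, BLOCK, MPI_SIGNED_CHAR, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);            /* close + reopen makes the data visible to all ranks */
    MPI_Barrier(MPI_COMM_WORLD);

    /* Phase 2: read back the block a neighbouring rank wrote and verify it. */
    peer = (rank + 1) % nprocs;
    off = (MPI_Offset)peer * BLOCK;
    MPI_File_open(MPI_COMM_WORLD, "verify.data",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    MPI_File_read_at(fh, off, buf, BLOCK, MPI_SIGNED_CHAR, MPI_STATUS_IGNORE);
    for (i = 0; i < BLOCK; i++) {
        signed char expect = (signed char)(peer + 1);
        if (buf[i] != expect) {
            if (nerrs++ < 10)       /* report only the first few mismatches */
                printf("proc %d: found data error at [%lld+%d], expect %d, got %d\n",
                       rank, (long long)off, i, expect, buf[i]);
        }
    }
    MPI_File_close(&fh);

    MPI_Finalize();
    return nerrs ? EXIT_FAILURE : EXIT_SUCCESS;
}

A "found data error at [offset+i], expect X, got Y" line corresponds to the comparison loop at the end: the byte read back from the file does not match what was supposed to have been written at that offset.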
> >> I am now configuring openmpi (still 1.7.4rc1) with:

> >> ./configure --prefix=/usr/local/openmpi CC=gcc CXX=g++ FC=gfortran F77=gfortran --enable-static --with-pic --disable-dlopen --enable-mpirun-prefix-by-default

> >> and configuring HDF5 (version 1.8.12) with:

> >> configure --prefix=/usr/local/hdf5/par CC=mpicc CFLAGS=-fPIC FC=mpif90 FCFLAGS=-fPIC CXX=mpicxx CXXFLAGS=-fPIC --enable-parallel --enable-fortran

> >> This is the combination that worked for me with Snow Leopard (though with earlier versions of both openmpi and hdf5).

> >> If it matters, the gcc is the stock one that comes with Mavericks' Xcode, and gfortran is 4.9.0.

> >> (I just noticed that the mpi fortran wrapper is now mpifort, but I also see that mpif90 is still there and is just a link to mpifort.)

> >> Any suggestions?

> >> On Fri, Jan 17, 2014 at 8:14 AM, Ralph Castain <r...@open-mpi.org> wrote:
> >> Sorry for the delayed response - just getting back from travel. I don't know why you would get that behavior other than a race condition. Afraid that code path is foreign to me, but perhaps one of the folks in the MPI-IO area can respond.

> >> On Jan 15, 2014, at 4:26 PM, Ronald Cohen <rhco...@lbl.gov> wrote:

> >>> Update: I reconfigured with enable_io_romio=yes, and this time the test using Sample_mpio.c mostly passes. Oddly, the very first time I tried it I got errors:

> >>> % mpirun -np 2 sampleio
> >>> Proc 1: hostname=Ron-Cohen-MBP.local
> >>> Testing simple C MPIO program with 2 processes accessing file ./mpitest.data
> >>> (Filename can be specified via program argument)
> >>> Proc 0: hostname=Ron-Cohen-MBP.local
> >>> Proc 1: read data[0:1] got 0, expect 1
> >>> Proc 1: read data[0:2] got 0, expect 2
> >>> Proc 1: read data[0:3] got 0, expect 3
> >>> Proc 1: read data[0:4] got 0, expect 4
> >>> Proc 1: read data[0:5] got 0, expect 5
> >>> Proc 1: read data[0:6] got 0, expect 6
> >>> Proc 1: read data[0:7] got 0, expect 7
> >>> Proc 1: read data[0:8] got 0, expect 8
> >>> Proc 1: read data[0:9] got 0, expect 9
> >>> Proc 1: read data[1:0] got 0, expect 10
> >>> Proc 1: read data[1:1] got 0, expect 11
> >>> Proc 1: read data[1:2] got 0, expect 12
> >>> Proc 1: read data[1:3] got 0, expect 13
> >>> Proc 1: read data[1:4] got 0, expect 14
> >>> Proc 1: read data[1:5] got 0, expect 15
> >>> Proc 1: read data[1:6] got 0, expect 16
> >>> Proc 1: read data[1:7] got 0, expect 17
> >>> Proc 1: read data[1:8] got 0, expect 18
> >>> Proc 1: read data[1:9] got 0, expect 19
> >>> --------------------------------------------------------------------------
> >>> MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode 1.

> >>> But when I reran the same mpirun command, the test was successful. And after deleting the executable, recompiling, and again running the same mpirun command, the test was also successful. Can someone explain that?
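One possible explanation for an intermittent "got 0, expect N" on the first run, offered as a guess rather than a diagnosis: MPI-IO's default consistency semantics do not guarantee that data written by one process is visible to a read issued by another process on the same open file unless the two synchronize, for example by closing and reopening the file, enabling atomic mode, or calling MPI_File_sync on both sides of a barrier (the "sync-barrier-sync" rule). Whether Sample_mpio.c already does one of these is not visible from this thread. The sketch below shows the sync-barrier-sync form, with an invented file name and layout.

/*
 * Minimal sketch of MPI-IO's "sync-barrier-sync" rule, assuming a layout
 * where rank 0 writes N integers that every rank then reads back and
 * verifies.  The file name and layout are invented; this is not taken
 * from Sample_mpio.c.
 */
#include <mpi.h>
#include <stdio.h>

#define N 20

int main(int argc, char *argv[])
{
    int rank, i, buf[N];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File_open(MPI_COMM_WORLD, "./mpitest.data",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

    /* Rank 0 writes the reference data at the start of the file. */
    if (rank == 0) {
        for (i = 0; i < N; i++)
            buf[i] = i + 1;
        MPI_File_write_at(fh, 0, buf, N, MPI_INT, MPI_STATUS_IGNORE);
    }

    /* Without this collective sync / barrier / sync sequence, a read on
     * another rank may legitimately return stale data or zeros. */
    MPI_File_sync(fh);
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_File_sync(fh);

    /* Every rank reads the data back and verifies it. */
    MPI_File_read_at(fh, 0, buf, N, MPI_INT, MPI_STATUS_IGNORE);
    for (i = 0; i < N; i++)
        if (buf[i] != i + 1)
            printf("Proc %d: read data[%d] got %d, expect %d\n",
                   rank, i, buf[i], i + 1);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}

If that sequence were missing from the test, the value returned by the cross-process read would be unspecified, which could look exactly like a run that fails once and then passes when repeated.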
> >>> On Wed, Jan 15, 2014 at 1:16 PM, Ronald Cohen <rhco...@lbl.gov> wrote:
> >>> Aha. I guess I didn't know what the io-romio option does. If you look at my config.log you will see my configure line included --disable-io-romio. Guess I should change --disable to --enable.

> >>> You seem to imply that the nightly build is stable enough that I should probably switch to that rather than 1.7.4rc1. Am I reading between the lines correctly?

> >>> On Wed, Jan 15, 2014 at 10:56 AM, Ralph Castain <r...@open-mpi.org> wrote:
> >>> Oh, a word of caution on those config params - you might need to check to ensure I don't disable romio in them. I don't normally build it, as I don't use it. Since that is what you are trying to use, just change the "no" to "yes" (or delete that line altogether) and it will build.

> >>> On Wed, Jan 15, 2014 at 10:53 AM, Ralph Castain <r...@open-mpi.org> wrote:
> >>> You can find my configure options in the OMPI distribution at contrib/platform/intel/bend/mac. You are welcome to use them - just configure --with-platform=intel/bend/mac.

> >>> I work on the developer's trunk, of course, but also run the head of the 1.7.4 branch (essentially the nightly tarball) on a fairly regular basis.

> >>> As for the opal_bitmap test: it wouldn't surprise me if that one was stale. I can check on it later tonight, but I'd suspect that the test is bad, as we use that class in the code base and haven't seen an issue.

> >>> On Wed, Jan 15, 2014 at 10:49 AM, Ronald Cohen <rhco...@lbl.gov> wrote:
> >>> Ralph,

> >>> I just sent out another post with the C file attached.

> >>> If you can get that to work, and even if you can't, can you tell me what configure options you use and what version of open-mpi? Thanks.

> >>> Ron
> >>> On Wed, Jan 15, 2014 at 10:36 AM, Ralph Castain <r...@open-mpi.org> wrote:
> >>> BTW: could you send me your sample test code?

> >>> On Wed, Jan 15, 2014 at 10:34 AM, Ralph Castain <r...@open-mpi.org> wrote:
> >>> I regularly build on Mavericks and run without problem, though I haven't tried a parallel IO app. I'll give yours a try later, when I get back to my Mac.

> >>> On Wed, Jan 15, 2014 at 10:04 AM, Ronald Cohen <rhco...@lbl.gov> wrote:
> >>> I have been struggling to get a usable build of openmpi on Mac OS X Mavericks (10.9.1). I can get openmpi to configure and build without error, but have problems after that which depend on the openmpi version.

> >>> With 1.6.5, make check fails the opal_datatype_test, ddt_test, and ddt_raw tests. The various atomic_* tests pass. See checklogs_1.6.5, attached as a .gz file.

> >>> Following suggestions from openmpi discussions I tried openmpi version 1.7.4rc1. In this case make check indicates all tests passed. But when I proceeded to try to build a parallel code (parallel HDF5), it failed. Following an email exchange with the HDF5 support people, they suggested I try to compile and run the attached bit of simple code, Sample_mpio.c (which they supplied), which does not use any HDF5 but just attempts a parallel write to a file and a parallel read. That test failed when requesting more than 1 processor, which they say indicates a failure of the openmpi installation. The error message was:

> >>> MPI_INIT: argc 1
> >>> MPI_INIT: argc 1
> >>> Testing simple C MPIO program with 2 processes accessing file ./mpitest.data
> >>> (Filename can be specified via program argument)
> >>> Proc 0: hostname=Ron-Cohen-MBP.local
> >>> Proc 1: hostname=Ron-Cohen-MBP.local
> >>> MPI_BARRIER[0]: comm MPI_COMM_WORLD
> >>> MPI_BARRIER[1]: comm MPI_COMM_WORLD
> >>> Proc 0: MPI_File_open with MPI_MODE_EXCL failed (MPI_ERR_FILE: invalid file)
> >>> MPI_ABORT[0]: comm MPI_COMM_WORLD errorcode 1
> >>> MPI_BCAST[1]: buffer 7fff5a483048 count 1 datatype MPI_INT root 0 comm MPI_COMM_WORLD

> >>> I then went back to my openmpi directories and tried running some of the individual tests in the test and examples directories. In particular, in test/class I found one test that does not seem to be run as part of make check and which failed, even with one processor: opal_bitmap. Not sure if this is because 1.7.4rc1 is incomplete, or there is something wrong with the installation, or maybe a 32 vs 64 bit thing? The error message is:

> >>> mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

> >>> Process name: [[48805,1],0]
> >>> Exit code: 255

> >>> Any suggestions?

> >>> More generally, has anyone out there gotten an openmpi build on Mavericks to work with sufficient success that they can get the attached Sample_mpio.c (or better yet, parallel HDF5) to build?

> >>> Details: Running Mac OS X 10.9.1 on a mid-2009 MacBook Pro with 4 GB memory; tried openmpi 1.6.5 and 1.7.4rc1. Built openmpi against the stock gcc that comes with Xcode 5.0.2, and gfortran 4.9.0.

> >>> Files attached: config.log.gz, openmpialllog.gz (output of running ompi_info --all), checklog2.gz (output of make check in the top openmpi directory).

> >>> I am not attaching logs of make and install since those seem to have been successful, but can generate those if that would be helpful.
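A note on the "MPI_File_open with MPI_MODE_EXCL failed" line, since the message itself is easy to misread: the thread later traces this particular failure to openmpi having been built with --disable-io-romio, but MPI_MODE_CREATE | MPI_MODE_EXCL is also required by the MPI standard to fail whenever the target file already exists, so a stale ./mpitest.data left behind by an earlier aborted run can produce a similar error. The sketch below shows a typical delete-then-create-exclusive sequence; the file name and error handling are invented and not taken from Sample_mpio.c.

/*
 * Illustrative sketch of an exclusive-create open, assuming a test that
 * wants to be sure it starts from a fresh file.  Nothing here is taken
 * from Sample_mpio.c; the file name and error handling are invented.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, rc, errlen;
    char errstr[MPI_MAX_ERROR_STRING];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Remove any leftover file so the exclusive create can succeed.
     * MPI_File_delete returns an error if the file does not exist;
     * that is harmless here, so the return code is ignored. */
    if (rank == 0)
        MPI_File_delete("./mpitest.data", MPI_INFO_NULL);
    MPI_Barrier(MPI_COMM_WORLD);

    /* File errors default to MPI_ERRORS_RETURN, so a failed open can be
     * reported by the caller instead of aborting inside the library. */
    rc = MPI_File_open(MPI_COMM_WORLD, "./mpitest.data",
                       MPI_MODE_CREATE | MPI_MODE_EXCL | MPI_MODE_RDWR,
                       MPI_INFO_NULL, &fh);
    if (rc != MPI_SUCCESS) {
        MPI_Error_string(rc, errstr, &errlen);
        printf("Proc %d: MPI_File_open with MPI_MODE_EXCL failed (%s)\n",
               rank, errstr);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* ... the parallel write / read / verify steps would go here ... */

    MPI_File_close(&fh);
    if (rank == 0)
        MPI_File_delete("./mpitest.data", MPI_INFO_NULL);
    MPI_Finalize();
    return 0;
}

With ROMIO compiled out (and no other usable io component), the same open would presumably fail regardless of whether the file exists, which matches the behavior reported here before the rebuild with enable_io_romio=yes.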