[OMPI users] 4.1 mpi-io test failures on lustre
I tried the MPI-IO tests from MPICH 3.4 with Open MPI 4.1 on the AC922 system that I understand was used to fix the ompio problems on Lustre, and I'm puzzled that I still see failures. I don't know why there are disjoint sets of tests in MPICH's test/mpi/io and src/mpi/romio/test, but I ran all the non-Fortran ones with the MCA io defaults across two nodes. In src/mpi/romio/test, atomicity failed (ignoring error and syshints); in test/mpi/io, the failures were setviewcur, tst_fileview, external32_derived_dtype, i_bigtype, and i_setviewcur. tst_fileview was probably killed by the 100-second timeout.

It may be that some of these tests are only appropriate for ROMIO, but no one has said so before, and they presumably shouldn't segfault or report libc errors in any case.

I built against UCX 1.9 with CUDA support. I realize that has problems on ppc64le, with no action on the issue, but there's a limit to what I can do. CUDA looks relevant, since one test crashes while apparently trying to register CUDA memory; that's presumably not ompio's fault, but we need CUDA.
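A minimal sketch of the kind of invocation in question, assuming test binaries built from the MPICH suite; the host names, slot counts, and Lustre scratch path are placeholders:

# Run one of the romio tests across two nodes with the default MCA io settings:
mpirun -np 8 --host node1:4,node2:4 ./atomicity -fname /lustre/scratch/testfile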
[OMPI users] bad defaults with ucx
Why does 4.1 still not use the right defaults with UCX? Without specifying osc=ucx, IMB-RMA crashes just as it did with 4.0.5. I haven't checked what else UCX says you must set for Open MPI to avoid memory corruption, at least, but I guess that won't be right by default either. Users surely shouldn't have to trawl through a fundamental library's release notes to be able to run even IMB.
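A hedged example of the explicit selection currently needed to run IMB-RMA (from the Intel MPI Benchmarks); the process count is a placeholder:

# Explicitly select the UCX point-to-point and one-sided components instead of
# relying on the defaults:
mpirun -np 2 --mca pml ucx --mca osc ucx ./IMB-RMA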
Re: [OMPI users] 4.1 mpi-io test failures on lustre
I will have a look at those tests. The recent fixes were performance fixes, not correctness fixes. Nevertheless, we used to pass the MPICH tests; I admit it is not a test suite that we run regularly, so I will have a look at them. The atomicity tests are expected to fail, since atomicity is the one chapter of MPI I/O that is not implemented in ompio.

Thanks
Edgar
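For codes that need atomicity in the meantime, a possible workaround (a sketch, not from the reply above) is to select the ROMIO component instead of ompio; the component name varies by release, so check ompi_info first:

# List the io components available in this build:
ompi_info | grep "MCA io"

# Select ROMIO instead of the default ompio; romio321 is the name in 4.0.x-era
# builds and is an assumption here, so substitute whatever ompi_info reports:
mpirun -np 8 --mca io romio321 ./atomicity -fname /lustre/scratch/testfile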
Re: [OMPI users] bad defaults with ucx
Good question. I've filed https://github.com/open-mpi/ompi/issues/8379 so that we can track this.

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI users] bad defaults with ucx
"Jeff Squyres (jsquyres)" writes: > Good question. I've filed > https://github.com/open-mpi/ompi/issues/8379 so that we can track > this. For the benefit of the list: I mis-remembered that osc=ucx was general advice. The UCX docs just say you need to avoid the uct btl, which can cause memory corruption, but OMPI 4.1 still builds and uses it by default. (The UCX doc also suggests other changes to parameters, but for performance rather than correctness.) Anyway, I can get at least IMB-RMA to run on this Summit-like hardware just with --mca btl ^uct (though there are failures with other tests which seem to be specific to UCX on ppc64le, and not to OMPI).
[OMPI users] Error with building OMPI with PGI
Hello,

I'm having an error when trying to build OMPI 4.0.3 (also tried 4.1) with PGI 20.1:

./configure CPP=cpp CC=pgcc CXX=pgc++ F77=pgf77 FC=pgf90 --prefix=$PREFIX --with-ucx=$UCX_HOME --with-slurm --with-pmi=/opt/slurm/cluster/ibex/install --with-cuda=$CUDATOOLKIT_HOME

In the make install step:

make[4]: Leaving directory `/tmp/openmpi-4.0.3/opal/mca/pmix/pmix3x'
make[3]: Leaving directory `/tmp/openmpi-4.0.3/opal/mca/pmix/pmix3x'
make[2]: Leaving directory `/tmp/openmpi-4.0.3/opal/mca/pmix/pmix3x'
Making install in mca/pmix/s1
make[2]: Entering directory `/tmp/openmpi-4.0.3/opal/mca/pmix/s1'
  CCLD     mca_pmix_s1.la
pgcc-Error-Unknown switch: -pthread
make[2]: *** [mca_pmix_s1.la] Error 1
make[2]: Leaving directory `/tmp/openmpi-4.0.3/opal/mca/pmix/s1'
make[1]: *** [install-recursive] Error 1
make[1]: Leaving directory `/tmp/openmpi-4.0.3/opal'
make: *** [install-recursive] Error 1

Please advise.

All the best,
Passant
Re: [OMPI users] Error with building OMPI with PGI
Hi Passant, list,

This is an old problem with PGI. There are many threads about it in the Open MPI mailing list archives, with workarounds; the simplest is to use FC="pgf90 -noswitcherror".

Here are two out of many threads ... well, not pthreads! :)

https://www.mail-archive.com/users@lists.open-mpi.org/msg08962.html
https://www.mail-archive.com/users@lists.open-mpi.org/msg10375.html

I hope this helps,
Gus Correa
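A sketch of that workaround applied to the configure line from the original message. The failing CCLD step shown there used pgcc, so the same flag may be needed for CC as well; that part is an assumption, not from the reply above:

# -noswitcherror makes the PGI drivers warn about, rather than reject, switches
# they do not recognize, such as the -pthread flag libtool passes when linking.
./configure CPP=cpp CC="pgcc -noswitcherror" CXX=pgc++ F77=pgf77 \
    FC="pgf90 -noswitcherror" \
    --prefix=$PREFIX --with-ucx=$UCX_HOME --with-slurm \
    --with-pmi=/opt/slurm/cluster/ibex/install --with-cuda=$CUDATOOLKIT_HOME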