On Oct 27, 2005, at 1:30 PM, Troy Telford wrote:
I've been running a number of benchmarks & tests with OpenMPI 1.0rc4.
I've run into a few issues that I believe are related to OpenMPI; if
they aren't, I'd appreciate the education. :)
No, unfortunately, they're probably bugs. I think we've fixed a bunch
of them, but I'm sure that some still remain. For example, we still
have an issue where HPL will not complete if you use the internal
datatype support (Tim and George are still analyzing this). If you
tell HPL to turn off datatypes, runs should complete.
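(For reference -- and this is from memory, so please check it against
the comments in HPL's own Make.UNKNOWN before relying on it -- the
usual way to turn that off is a compile-time define in HPL's
Make.<arch>, along these lines:

    # Hedged sketch: -DHPL_NO_MPI_DATATYPE makes HPL copy/pack buffers
    # itself instead of building MPI user-defined datatypes.
    HPL_OPTS = -DHPL_NO_MPI_DATATYPE

and then rebuild HPL/HPCC.)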
The attached tarball does not have the MPICH variant results (the
tarball is 87 kb as it is)
I can run the same tests with MVAPICH, MPICH-GM, and MPICH-MX with no
problems. The benchmarks were built from source rpm's (that I
maintain), so I can say the build procedure for the benchmarks is
essentially identical from one MPI to another.
Excellent. Many thanks for all your diligence here -- this is
extremely helpful to us!
A short summary:
[snipped]
Quick summary of results:
HPC Challenge:
* Never completed an entire run on any interconnect
- MVAPI came close; crashed after the HPL section.
- Error messages:
[n60:21912] *** An error occurred in MPI_Reduce
[n60:21912] *** on communicator MPI_COMM_WORLD
[n60:21912] *** MPI_ERR_OP: invalid reduce operation
- GM wedges itself in the HPL section
- MX crashes during the PTRANS test (the first test performed)
(See earlier thread on this list about OpenMPI wedging itself; I did
apply that workaround).
A bunch of fixes have been committed post-rc4 on the 1.0 branch that
may help with this. Two notes:
1. I'm concerned about the MPI_Reduce error -- that one shouldn't be
happening at all. We have table lookups for the MPI_Op/MPI_Datatype
combinations that are supposed to work; the fact that you're getting
this error means that HPCC is using a combination that falls outside
the pairs that are defined in the MPI standard. Sigh. But it's HPCC,
so we should support it ;-).
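To make that concrete, here's a tiny illustration of the kind of
pairing I mean -- note this is a made-up example of an op/datatype
combination that falls outside the MPI standard, not necessarily the
one HPCC actually uses:

    /* Illustration only: MPI_SUM on MPI_DOUBLE is a combination the
     * standard defines; MPI_MAXLOC on a bare MPI_DOUBLE is not
     * (MAXLOC is only defined on the value/index pair types such as
     * MPI_DOUBLE_INT), so a conforming MPI may reject it -- e.g. with
     * an MPI_ERR_OP like the one above. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        double in = 1.0, out = 0.0;
        MPI_Init(&argc, &argv);

        /* Defined by the standard: fine. */
        MPI_Reduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, 0,
                   MPI_COMM_WORLD);

        /* Not defined by the standard: erroneous, and may abort with
         * MPI_ERR_OP. */
        MPI_Reduce(&in, &out, 1, MPI_DOUBLE, MPI_MAXLOC, 0,
                   MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }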
I just committed a fix to both the trunk and the v1.0 branch that
gives a somewhat more helpful error message when this happens -- it
reports which MPI datatype (if it's intrinsic or named) and which
MPI_Op were used.
Brian, one of the OMPI developers, thinks that he's seen something
similar from back in our LAM days -- he has a dim recollection that it
might be in the Random test. Can you grab either the latest v1.0 SVN
or the v1.0 snapshot and give it a whirl?
(I just initiated the creation of new snapshots rather than wait until
midnight tonight -- should be up on the web site in ~30 minutes -- look
for r7924 at http://www.open-mpi.org/nightly/v1.0/ )
2. Some of the fixes that we committed were deep within the voodoo of
what could loosely be called the "main progression engine" (i.e., the
ob1 and teg PML components). These bugs could well have caused
problems across the board (i.e., regardless of interconnect). I still
don't think we have all the kinks worked out yet, but if you've got an
automated testing procedure, could you kick it off with the newest
stuff? We'd appreciate knowing whether you get any further.
HPL:
- MX gives an error: MX: assertion: <<not yet implemented>> failed
at line 281, file ../mx__shmem.c
Wow -- that's neat. Passing this one on to George, who did our MX
support...
IMB:
* Only completes with one interconnect:
Presta:
* Completes with varying degrees of success
These are quite symptomatic of the errors that we had in the
progression engine. It's amazing how tiny logic errors can hose an
entire MPI implementation ;-). (Read: these were not serious errors,
just run-of-the-mill typos/mistakes, but they unfortunately sit within
the central innards of the whole implementation, making it look like
the entire implementation was wonky.)
Hopefully now you'll be able to get a bit further in the tests...?
--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/