On Oct 27, 2005, at 1:30 PM, Troy Telford wrote:

I've been running a number of benchmarks & tests with OpenMPI 1.0rc4. I've run into a few issues that I believe are related to OpenMPI; if they aren't, I'd appreciate the education. :)

No, they're unfortunately probably bugs. I think we've fixed a bunch of them, but I'm sure that some still remain. For example, we still have an outstanding issue where HPL will not complete if you use the internal MPI datatype support (Tim and George are still analyzing this). If you tell HPL to turn off datatypes, runs should complete.
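
For the "turn off datatypes" part: if memory serves, HPL -- and the copy of HPL embedded in HPCC -- has a compile-time option for exactly this. The macro and variable names below are from recollection, so double-check them against the comments in HPL's Make.<arch> before relying on them:

    # in Make.<arch>; rebuild HPL/HPCC after changing this
    HPL_OPTS = -DHPL_NO_MPI_DATATYPE   # avoid MPI derived datatypes entirely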

The attached tarball does not have the MPICH variant results (the tarball is 87 kb as it is)

I can run the same tests with MVAPICH, MPICH-GM, and MPICH-MX with no problems. The benchmarks were built from source RPMs (which I maintain), so I can say the build procedure for the benchmarks is essentially identical from one MPI to another.

Excellent. Many thanks for all your diligence here -- this is extremely helpful to us!

A short summary:
[snipped]

Quick summary of results:

HPC Challenge:
* Never completed an entire run on any interconnect
        - MVAPI came close; crashed after the HPL section.
                - Error messages:
                [n60:21912] *** An error occurred in MPI_Reduce
                [n60:21912] *** on communicator MPI_COMM_WORLD
                [n60:21912] *** MPI_ERR_OP: invalid reduce operation
        - GM wedges itself in the HPL section
        - MX crashes during the PTRANS test (the first test performed)
(See earlier thread on this list about OpenMPI wedging itself; I did apply that workaround).

A bunch of fixes have been committed post-rc4 on the 1.0 branch that may help with this. Two notes:

1. I'm concerned about the MPI_Reduce error -- that one shouldn't be happening at all. We have table lookups for the MPI_Op/MPI_Datatype combinations that are supposed to work; the fact that you're getting this error means that HPCC is using a combination that falls outside the pairs that are defined in the MPI standard. Sigh. But it's HPCC, so we should support it ;-).
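
To make that concrete, here is a minimal sketch of a valid vs. an undefined pairing (illustration only -- this is not the call HPCC is actually making, which we don't know yet):

    /* Sketch: valid vs. undefined MPI_Op / MPI_Datatype pairings. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int  isend = 1, irecv = 0;
        char bsend = 1, brecv = 0;

        MPI_Init(&argc, &argv);

        /* Valid: MPI_SUM is defined for C integer types such as MPI_INT. */
        MPI_Reduce(&isend, &irecv, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        /* Outside the standard's tables: MPI_BYTE is only defined for the
           bitwise ops (MPI_BAND / MPI_BOR / MPI_BXOR), so MPI_SUM on
           MPI_BYTE is exactly the kind of pairing that can raise
           MPI_ERR_OP. */
        MPI_Reduce(&bsend, &brecv, 1, MPI_BYTE, MPI_SUM, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }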

I just committed a fix to both the trunk and the v1.0 branch that gives a somewhat more helpful error message when this happens -- it tells you which MPI datatype (if it's intrinsic or named) and which MPI_Op were used. Brian, one of the OMPI developers, thinks he's seen something similar back in our LAM days -- he has a dim recollection that it might be in the Random test. Can you grab either the latest v1.0 SVN or the v1.0 snapshot and give it a whirl?
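
Independent of that fix, if it's useful to see from the application side exactly which call trips the error, you can make errors on MPI_COMM_WORLD returnable and print the error string yourself. This is just a minimal sketch with standard MPI calls (nothing HPCC- or Open MPI-specific):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int err, len, in = 1, out = 0;
        char msg[MPI_MAX_ERROR_STRING];

        MPI_Init(&argc, &argv);

        /* Return error codes to the caller instead of aborting. */
        MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        err = MPI_Reduce(&in, &out, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (err != MPI_SUCCESS) {
            MPI_Error_string(err, msg, &len);
            printf("MPI_Reduce failed: %s\n", msg);
        }

        MPI_Finalize();
        return 0;
    }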

(I just initiated the creation of new snapshots rather than wait until midnight tonight -- should be up on the web site in ~30 minutes -- look for r7924 at http://www.open-mpi.org/nightly/v1.0/ )

2. Some of the fixes that we committed were deep within the voodoo of what could loosely be called the "main progression engine" (i.e., the ob1 and teg PML components). These bugs could well have caused problems across the board (i.e., regardless of interconnect). I don't think we have all the kinks worked out yet, but if you've got an automated testing procedure, could you kick it off with the newest stuff? We'd appreciate knowing whether you get further.
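
One more knob that may help isolate things: you can force a particular PML at run time with the usual MCA mechanism and compare the two engines (which components are actually available depends on how your build was configured):

    mpirun --mca pml ob1 -np 16 <benchmark>
    mpirun --mca pml teg -np 16 <benchmark>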

HPL:
- MX gives an error: MX: assertion: <<not yet implemented>> failed at line 281, file ../mx__shmem.c

Wow -- that's neat. Passing this one on to George, who did our MX support...

IMB:
* Only completes with one interconnect:
Presta:
* Completes with varying degrees of success

These are quite symptomatic of the errors that we had in the progression engine. It's amazing how tiny logic errors can hose an entire MPI implementation ;-). (Read: these were not serious errors, just run-of-the-mill typos/mistakes, but they unfortunately live in the central innards of the whole implementation, which made it look like the entire implementation was wonky.)

Hopefully now you'll be able to get a bit further in the tests...?

--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/
