On Oct 27, 2005, at 1:30 PM, Troy Telford wrote:
I've been running a number of benchmarks & tests with OpenMPI 1.0rc4.
I've run into a few issues that I believe are related to OpenMPI; if
they aren't, I'd appreciate the education. :)
No, unfortunately, they're probably bugs. I think we've fixed a bunch
of them, but I'm sure that some still remain. For example, we still
have an issue where HPL will not complete if you use the internal
datatype support (Tim and George are still analyzing this). If you
tell HPL to turn off datatypes, runs should complete.
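(For reference -- and this is from memory, so please check it against
the comments in HPL's own Make.UNKNOWN before relying on it -- the
usual way to turn that off is a compile-time define in HPL's
Make.<arch>, along these lines:

    # Hedged sketch: -DHPL_NO_MPI_DATATYPE makes HPL copy/pack buffers
    # itself instead of building MPI user-defined datatypes.
    HPL_OPTS = -DHPL_NO_MPI_DATATYPE

and then rebuild HPL/HPCC.)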
The attached tarball does not have the MPICH variant results (the
tarball is 87 kb as it is)
I can run the same tests with MVAPICH, MPICH-GM, and MPICH-MX with no
problems. The benchmarks were built from source rpm's (that I
maintain), so I can say the build procedure for the benchmarks is
essentially identical from one MPI to another.
Excellent. Many thanks for all your diligence here -- this is
extremely helpful to us!
A short summary:
[snipped]
Quick summary of results:
HPC Challenge:
* Never completed an entire run on any interconnect
- MVAPI came close; crashed after the HPL section.
- Error messages:
[n60:21912] *** An error occurred in MPI_Reduce
[n60:21912] *** on communicator MPI_COMM_WORLD
[n60:21912] *** MPI_ERR_OP: invalid reduce operation
- GM wedges itself in the HPL section
- MX crashes during the PTRANS test (the first test performed)
(See earlier thread on this list about OpenMPI wedging itself; I did
apply that workaround).
A bunch of fixes have been committed post-rc4 on the 1.0 branch that
may help with this. Two notes:
1. I'm concerned about the MPI_Reduce error -- that one shouldn't be
happening at all. We have table lookups for the MPI_Op/MPI_Datatype
combinations that are supposed to work; the fact that you're getting
this error means that HPCC is using a combination that falls outside
the pairs that are defined in the MPI standard. Sigh. But it's HPCC,
so we should support it ;-).
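To make that concrete, here's a tiny illustration of the kind of
pairing I mean -- note this is a made-up example of an op/datatype
combination that falls outside the MPI standard, not necessarily the
one HPCC actually uses:

    /* Illustration only: MPI_SUM on MPI_DOUBLE is a combination the
     * standard defines; MPI_MAXLOC on a bare MPI_DOUBLE is not
     * (MAXLOC is only defined on the value/index pair types such as
     * MPI_DOUBLE_INT), so a conforming MPI may reject it -- e.g. with
     * an MPI_ERR_OP like the one above. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        double in = 1.0, out = 0.0;
        MPI_Init(&argc, &argv);

        /* Defined by the standard: fine. */
        MPI_Reduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, 0,
                   MPI_COMM_WORLD);

        /* Not defined by the standard: erroneous, and may abort with
         * MPI_ERR_OP. */
        MPI_Reduce(&in, &out, 1, MPI_DOUBLE, MPI_MAXLOC, 0,
                   MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }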
I just committed a fix to both the trunk and the v1.0 branch that
gives a somewhat more helpful error message when this happens -- it
reports which MPI datatype (if it's intrinsic or named) and which
MPI_Op were used.
Brian, one of the OMPI developers, thinks that he's seen something
similar from back in our LAM days -- he has a dim recollection that it
might be in the Random test. Can you grab either the latest v1.0 SVN
or the v1.0 snapshot and give it a whirl?
(I just initiated the creation of new snapshots rather than wait until
midnight tonight -- should be up on the web site in ~30 minutes -- look
for r7924 at http://www.open-mpi.org/nightly/v1.0/ )
2. Some of the fixes that we committed were deep within the voodoo of
what could loosely be called the "main progression engine" (i.e., the
ob1 and teg PML components). These bugs could well have caused
problems across the board (i.e., regardless of interconnect). I still
don't think we have all the kinks worked out yet, but if you've got an
automated testing procedure, could you kick it off with the newest
stuff? We'd appreciate knowing whether you get any further.
HPL:
- MX gives an error: MX: assertion: <<not yet implemented>> failed
at line 281, file ../mx__shmem.c
Wow -- that's neat. Passing this one on to George, who did our MX
support...
IMB:
* Only completes with one interconnect:
Presta:
* Completes with varying degrees of success
These are quite symptomatic of the errors that we had in the
progression engine. It's amazing how tiny logic errors can hose an
entire MPI implementation ;-). (Read: these were not serious errors,
just run-of-the-mill typos/mistakes, but they unfortunately sit within
the central innards of the whole implementation, making it look like
the entire implementation was wonky.)
Hopefully now you'll be able to get a bit further in the tests...?
--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/