I've been running a number of benchmarks & tests with OpenMPI 1.0rc4. I've run into a few issues that I believe are related to OpenMPI; if they aren't, I'd appreciate the education. :)

The attached tarball does not include the results from the MPICH variants (it is already 87 KB as it is).

I can run the same tests with MVAPICH, MPICH-GM, and MPICH-MX with no problems. The benchmarks were built from source RPMs (which I maintain), so the build procedure for the benchmarks is essentially identical from one MPI to another.

A short summary:
* Identical hardware, except for the interconnect.
* Linux, SLES 9 SP2, kernel 2.6.5-7.201-smp (SLES binary)
* Opteron 248s, two CPUs per node, 4 GB of RAM per node.
* Four nodes in every test run.

I used the following interconnects/drivers:
* Myrinet     (GM 2.0.22 and MX 1.0.3)
* InfiniBand  (Mellanox "IB Gold" 1.8)

And the following benchmarks/tests:
* HPC Challenge (v1.0)
* HPL (v1.0)
* Intel MPI Benchmark (IMB, formerly PALLAS) v2.3
* Presta MPI Benchmarks

Quick summary of results:

HPC Challenge:
* Never completed an entire run on any interconnect
        - MVAPI came close; it crashed after the HPL section.
                - Error messages (see the MPI_Reduce sketch after this list):
                [n60:21912] *** An error occurred in MPI_Reduce
                [n60:21912] *** on communicator MPI_COMM_WORLD
                [n60:21912] *** MPI_ERR_OP: invalid reduce operation
        - GM wedges itself in the HPL section
        - MX crashes during the PTRANS test (the first test performed)
(See the earlier thread on this list about OpenMPI wedging itself; I did apply that workaround.)
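
Since the HPCC failure surfaces as MPI_ERR_OP out of MPI_Reduce, here is a minimal test of a reduction with a user-defined MPI_Op, which is the kind of call that error usually points at. This is just a sketch I put together, not HPCC's code, and all the names in it are mine:

/*
 * Minimal sketch (not HPCC's code): a reduction with a user-defined
 * MPI_Op.  If OpenMPI rejects the op, it should abort with the same
 * "MPI_ERR_OP: invalid reduce operation" message shown above.
 */
#include <stdio.h>
#include <mpi.h>

/* Element-wise maximum of doubles, used as a custom reduction. */
static void dbl_max(void *in, void *inout, int *len, MPI_Datatype *dtype)
{
    double *a = (double *) in;
    double *b = (double *) inout;
    int i;
    (void) dtype;   /* unused */
    for (i = 0; i < *len; i++)
        if (a[i] > b[i])
            b[i] = a[i];
}

int main(int argc, char **argv)
{
    int rank;
    double local, global = 0.0;
    MPI_Op op;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Op_create(dbl_max, 1 /* commutative */, &op);

    local = (double) rank;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, op, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("reduce with user-defined op: result = %g\n", global);

    MPI_Op_free(&op);
    MPI_Finalize();
    return 0;
}

If this aborts with the same "invalid reduce operation" message over the mvapi btl, then HPCC itself is probably off the hook.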

HPL:
* Only completes with one interconnect:
        - The MVAPI mca btl works fine.
        - GM wedges itself, similar to HPCC.
        - MX gives an error: MX: assertion: <<not yet implemented>> failed at line 281, file ../mx__shmem.c

IMB:
* Only completes with one interconnect:
        - The MVAPI mca btl works fine.
        - GM fails, but differs in which portion of the benchmark it gets stuck at.
        - MX fails with both the error listed in the HPL section and:
          "mx_connect fail for 0th remote address key deadbeef (error Operation timed-out)"

Presta:
* Completes with varying degrees of success
        - MVAPI: Completes successfully.
                - But the 'all reduction' test is 173 times slower than the same test on GM, and 360 times slower than with MX (see the MPI_Allreduce sketch after this list).
        - GM: Does not complete the 'com' test; it simply stops at the same point every time (included in my logs).
        - MX: Completes successfully, but I do receive the "mx_connect fail for 0th remote address key deadbeef (error Operation timed-out)" message.
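
To chase the 'all reduction' slowdown outside of Presta, a bare MPI_Allreduce timing loop like the one below should show whether the gap is in the collective itself or in something Presta-specific. Again, this is only a sketch of mine (not Presta's code), and the message size and iteration count are arbitrary:

/*
 * Minimal sketch (not Presta's code): time MPI_Allreduce on a small
 * vector of doubles.  COUNT and ITERS are arbitrary choices.
 */
#include <stdio.h>
#include <mpi.h>

#define COUNT 1024   /* doubles per reduction (arbitrary) */
#define ITERS 1000   /* timed iterations (arbitrary)      */

int main(int argc, char **argv)
{
    double in[COUNT], out[COUNT];
    double t0, t1;
    int i, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < COUNT; i++)
        in[i] = (double) i;

    /* Warm up once, then time ITERS allreduces. */
    MPI_Allreduce(in, out, COUNT, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);

    t0 = MPI_Wtime();
    for (i = 0; i < ITERS; i++)
        MPI_Allreduce(in, out, COUNT, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("MPI_Allreduce, %d doubles: %.3f us per call\n",
               COUNT, (t1 - t0) / ITERS * 1e6);

    MPI_Finalize();
    return 0;
}

Running the same binary over the mvapi, gm, and mx btls would make for a direct comparison.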


I hope I've provided enough information to be useful; if not, just ask and I'll help out as much as I can.

Attachment: openmpi.tar.bz2