I generally build Open MPI from a source RPM (and I'm the author of that
SRPM's spec file). That way, Open MPI is built consistently across Linux
distros.
I'm running into an issue: the same tests work on one distro but break on
another. I'd like to track down where the bug is (the distro or Open MPI).
Since one distro is still a prerelease version, I'm quite willing to believe
that it's a problem with the distro, but just in case...
I'm using InfiniBand (openib.org RC4), and presta's 'allred' and 'com'
tests. Open MPI, the IB libraries, and the tests are compiled from the
same set of source RPMs on each distro.
I've got one machine using Fedora Core 4 (GCC 4.0.0), a vanilla Linux
2.6.16 kernel, and Open MPI 1.0.2.
With FC4, things work fine (for a sufficiently small number of nodes --
see ticket #40):
'mpirun -np 4 -machinefile foo allred 10 10 10'
'mpirun -np 4 -machinefile foo com -o 100'
Distro X (a prerelease version; I don't want to violate any NDAs I
don't know about...) is using GCC 4.1.0, the distro's 2.6.16 kernel, and
Open MPI 1.0.2.
This time, when I try to run presta's 'allred', I receive the following:
[n1:04214] *** An error occurred in MPI_Gather
[n1:04214] *** on communicator MPI_COMM_WORLD
[n1:04214] *** MPI_ERR_ARG: invalid argument of some other kind
[n1:04214] *** MPI_ERRORS_ARE_FATAL (goodbye)
[n1:04215] *** An error occurred in MPI_Gather
[n1:04215] *** on communicator MPI_COMM_WORLD
[n1:04215] *** MPI_ERR_ARG: invalid argument of some other kind
[n1:04215] *** MPI_ERRORS_ARE_FATAL (goodbye)
The 'com' test ends with:
[n1:04941] *** An error occurred in MPI_Gather
[n1:04941] *** on communicator MPI_COMM_WORLD
[n1:04941] *** MPI_ERR_ARG: invalid argument of some other kind
[n1:04941] *** MPI_ERRORS_ARE_FATAL (goodbye)
Note: On distro X, the error is identical for TCP and openib.
Note: On FC4, openib works, TCP doesn't (see ticket #41).
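When comparing failing runs across distros and transports, it may help to
condense the error output into one record per process. What follows is only
a minimal sketch in Python, assuming the log lines have exactly the
"[host:pid] *** message" shape shown above; the function name
summarize_mpi_errors is hypothetical and not part of Open MPI or presta.

```python
import re
from collections import defaultdict

# Matches error-handler lines such as:
#   [n1:04214] *** An error occurred in MPI_Gather
# (assumed format, taken from the output quoted in this message)
LINE_RE = re.compile(r"^\[(?P<host>[^:\]]+):(?P<pid>\d+)\] \*\*\* (?P<msg>.*)$")

FUNC_PREFIX = "An error occurred in "
COMM_PREFIX = "on communicator "


def summarize_mpi_errors(log_text):
    """Group error-handler output by (host, pid).

    Returns a dict mapping (host, pid) to a dict that may contain the
    keys 'function', 'communicator', 'error_class' and 'text'.
    The pid is kept as a string so leading zeros are preserved.
    """
    summary = defaultdict(dict)
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not an error-handler line; skip it
        key = (match.group("host"), match.group("pid"))
        msg = match.group("msg")
        if msg.startswith(FUNC_PREFIX):
            summary[key]["function"] = msg[len(FUNC_PREFIX):]
        elif msg.startswith(COMM_PREFIX):
            summary[key]["communicator"] = msg[len(COMM_PREFIX):]
        elif msg.startswith("MPI_ERR_"):
            # e.g. "MPI_ERR_ARG: invalid argument of some other kind"
            error_class, _, text = msg.partition(": ")
            summary[key]["error_class"] = error_class
            summary[key]["text"] = text
        # Other lines (e.g. MPI_ERRORS_ARE_FATAL) are ignored.
    return dict(summary)
```

Feeding it the captured stderr of each run (one run per distro and
transport) gives a small table that is easier to diff than the raw logs.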
And yes, I'm going to try out the dev snapshots of 1.0.3 and 1.1... I'm
just not there yet...
(For those tracking tickets #40 and #41 -- I know it would be nice to see
if distro X has the same behavior I see with FC4, but I don't have the
hardware to do any sort of scale testing with distro X.)
--
Troy Telford