George Bosilca wrote:
Glen,
Thanks for spending the time benchmarking Open MPI and for sending us the
feedback. We know we have some issues in the 1.0.2 version, more precisely
with the collective communications. We just looked inside the CMAQ code, and
there are a lot of Reduce and Allreduce calls. As the collectives are used
intensively, it's normal that 1.0.2a4 is slower than MPICH (I expect the
same behaviour for both MPICH1 and MPICH2). The collectives are now fixed
in the nightly build, and we are working toward moving them into the next
stable release. Until then, if you can redo the benchmark with one of the
nightly builds, that would be very useful. I'm confident that the results
will improve considerably.
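(For context, here is a minimal sketch of the kind of collective-heavy
pattern being described. It is illustrative only, not taken from CMAQ;
the array size and iteration count are made up. The point is that when a
loop ends every step with a global reduction, the run time is dominated
by how well the MPI library implements the collective.)

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      double local[1024], global[1024];   /* sizes are made up */
      for (int i = 0; i < 1024; i++)
          local[i] = rank + i;

      /* Many timesteps, each ending in a global sum across all ranks. */
      for (int step = 0; step < 1000; step++)
          MPI_Allreduce(local, global, 1024, MPI_DOUBLE, MPI_SUM,
                        MPI_COMM_WORLD);

      if (rank == 0)
          printf("global[0] = %f\n", global[0]);

      MPI_Finalize();
      return 0;
  }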
Hi. You're a brave guy even looking at CMAQ. =)
Anyway, here are the times on a few runs I did with Open MPI 1.1a1r887.
Basically, what I'm seeing is that my jobs run OK when they're local to one
machine, but as soon as they're split up between multiple machines,
performance can vary:
4 cpu jobs:
2:16:27
2:01:35
1:24:20
1:03:55
1:22:43
1:31:53
8 cpu jobs:
1:02:53
1:08:52
1:46:25
1:17:39
0:43:44
1:02:31
And by the way, I was doing some maintenance work on my machines this
weekend, so absolutely everyone was kicked off. I'm positive nothing
else was interfering with these jobs.
Also, someone had asked what my setup was, so here it is basically:
HP Procurve 2848 gigabit ethernet switch.
Tyan K8S boards, with dual Opteron 246s, 2 GB of RAM, and built-in
Broadcom gigabit ethernet adapters.
Rocks 4.0, with the latest updates from Red Hat, running a
2.6.9-22.0.1.ELsmp kernel.
Network attached storage via NFS.
I don't think my setup is the problem, though, as these jobs have been
running fine for a while now with MPICH.
Glen