I recently switched to Open MPI (v1.1.1) from LAM/MPI. My application
runs at roughly a quarter of the speed it ran at under LAM. Let me
explain my setup.
The program runs as 16 processes on 8 dual-processor Apple Xserve
nodes, each with one gigabit NIC connected to a gigabit switch. The
application requires communication every 1 ms of model time (under
LAM the program used to run slightly faster than real time). When
communication occurs, every process needs information from each of
the other processes. The amount of data any given process has to
transmit varies from one int (4 bytes) to about 1200-1500 bytes (just
under one normal Ethernet frame). Jumbo frames are not supported by
the switch. The 4-50 byte case occurs more than 80% of the time.
The communication scheme I devised to reduce the traffic is this.
On each node, the higher-ranked process first transfers its data to
the lower-ranked process via shared memory. Then the lower-ranked
processes from the different nodes communicate in a treed round-robin
scheme (to avoid contention for the NIC and to minimise traffic); see
the pseudocode below. Finally, the lower-ranked process on each node
passes the merged result back to the higher-ranked process via shared
memory. Under both LAM and Open MPI the processes are distributed by
slot ("--byslot"), so each even/odd rank pair shares a node. And yes,
this scheme was ~3x faster than an MPI_Alltoallv or MPI_Allgatherv
under LAM. One more point: at each stage the transfers were
partitioned into 1500-byte packets and padded if necessary.
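For reference, the collective baseline I benchmarked against boils
down to a single call like the sketch below; the buffer names,
counts, and displacements here are placeholders, not lifted from my
code.

#include <mpi.h>

/* Illustrative only: each rank contributes sendcount bytes, and
 * recvcounts/displs say where each rank's contribution lands in
 * recvbuf (these would have to be agreed on beforehand). */
void baseline_exchange(char *sendbuf, int sendcount,
                       char *recvbuf, int *recvcounts, int *displs)
{
    MPI_Allgatherv(sendbuf, sendcount, MPI_BYTE,
                   recvbuf, recvcounts, displs, MPI_BYTE,
                   MPI_COMM_WORLD);
}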
Pseudocode for the treed round-robin scheme:
// share on node first: the odd (higher) rank hands its data to the
// even (lower) rank on the same node via shared memory
if (mpi_rank % 2 == 0) {
    MPI_Recv();                        // from mpi_rank + 1
    Merge_current_info_with_new_info;
} else {
    MPI_Send();                        // to mpi_rank - 1
}

// share between nodes: recursive doubling, exchanging the merged
// data with partner = rank XOR 2^i at step i
for (i = 1; i < ceil(log2(mpi_size)); i++) {
    share_partner = mpi_rank ^ (1 << i);
    if (share_partner < mpi_size) {    // does the partner exist?
        MPI_Isend();                   // to share_partner
        MPI_Irecv();                   // from share_partner
        MPI_Waitall();
        Merge_current_info_with_new_info;
    }
}

// share on node afterward: the even (lower) rank hands the fully
// merged data back to the odd (higher) rank via shared memory
if (mpi_rank % 2 == 0) {
    MPI_Send();                        // to mpi_rank + 1
} else {
    MPI_Recv();                        // from mpi_rank - 1
}
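To make that concrete, here is a stripped-down, compilable sketch of
the scheme. It is not my actual code: the names (FRAME_BYTES,
tree_exchange, the flat one-frame-per-rank buffer) are placeholders,
the real merge works on application data instead of just copying
frames, and I have written in the guard so that only the even rank on
each node uses the NIC (the pseudocode above leaves that implicit).
It assumes a power-of-two number of processes and --byslot placement.

#include <mpi.h>

#define FRAME_BYTES 1500   /* one padded, Ethernet-sized packet */

/* frames holds size * FRAME_BYTES bytes; frame i belongs to rank i.
 * On entry only our own frame is filled in; on exit all of them are. */
static void tree_exchange(char *frames, MPI_Comm comm)
{
    int rank, size, tag = 0;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Stage 1: on-node share.  With --byslot placement ranks 2k and
     * 2k+1 sit on the same node, so this goes over shared memory. */
    if (rank % 2 == 0) {
        MPI_Recv(frames + (rank + 1) * FRAME_BYTES, FRAME_BYTES, MPI_BYTE,
                 rank + 1, tag, comm, MPI_STATUS_IGNORE);
    } else {
        MPI_Send(frames + rank * FRAME_BYTES, FRAME_BYTES, MPI_BYTE,
                 rank - 1, tag, comm);
    }

    /* Stage 2: recursive doubling among the even ranks only (one NIC
     * user per node).  At step i each participant owns the aligned
     * block of 2^i consecutive frames containing its own, and swaps
     * that block with partner = rank XOR 2^i. */
    if (rank % 2 == 0) {
        for (int i = 1; (1 << i) < size; i++) {
            int nblk    = 1 << i;                  /* frames owned so far  */
            int my_base = rank & ~(nblk - 1);      /* start of my block    */
            int partner = rank ^ nblk;
            int pa_base = partner & ~(nblk - 1);   /* start of their block */
            MPI_Request req[2];
            MPI_Isend(frames + my_base * FRAME_BYTES, nblk * FRAME_BYTES,
                      MPI_BYTE, partner, tag, comm, &req[0]);
            MPI_Irecv(frames + pa_base * FRAME_BYTES, nblk * FRAME_BYTES,
                      MPI_BYTE, partner, tag, comm, &req[1]);
            MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        }
    }

    /* Stage 3: hand the complete set back to the odd rank on this
     * node, again over shared memory. */
    if (rank % 2 == 0) {
        MPI_Send(frames, size * FRAME_BYTES, MPI_BYTE, rank + 1, tag, comm);
    } else {
        MPI_Recv(frames, size * FRAME_BYTES, MPI_BYTE, rank - 1, tag, comm,
                 MPI_STATUS_IGNORE);
    }
}

With 16 ranks this works out to one shared-memory hop, three
inter-node steps, and one shared-memory hop back per model
millisecond.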
I know this is a detailed email, but it is important that I resolve
this (the faster the model runs, the faster I graduate). One more
interesting tidbit: under LAM this program scaled up to all 8 nodes
(with linear scaling up to 4 nodes). Under Open MPI the performance
is essentially flat beyond 1 node (2 processes).
Thanks for any help!!!
Karl Dockendorf
Research Fellow
Department of Biomedical Engineering
University of Florida
ka...@ufl.edu