I recently switched to Open MPI (v1.1.1) from LAM/MPI. My application
runs at roughly a quarter of the speed it ran at under LAM. Let me
explain my setup.
The program runs as 16 processes on 8 dual-processor Apple Xserve
nodes, each with one gigabit NIC connected to a gigabit switch. The
application requires communication every 1 ms of model time (under
LAM the program used to run slightly faster than real time). When
communication occurs, every process needs information from each of
the other processes. The amount of data any given process has to
transmit varies from one int (4 bytes) to about 1200-1500 bytes (just
under one normal Ethernet frame). Jumbo frames are not supported by
the switch. The 4-50 byte case occurs more than 80% of the time.
The communication scheme I devised to reduce the traffic is this.
On each node, the higher-ranked process first transfers its data to
the lower-ranked process via shared memory. Then the lower-ranked
processes from the different nodes communicate in a treed round-robin
scheme (to avoid contention for the NIC and to minimise traffic); see
the pseudocode below. Finally, the lower-ranked process on each node
passes the merged result back to the higher-ranked process via shared
memory. Under both LAM and Open MPI the processes are distributed by
slot ("--byslot"), so each even/odd rank pair shares a node. And yes,
this scheme was ~3x faster than an MPI_Alltoallv or MPI_Allgatherv
under LAM. One more point: at each stage the transfers were
partitioned into 1500-byte packets and padded if necessary.
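For reference, the collective baseline I benchmarked against boils
down to a single call like the sketch below; the buffer names,
counts, and displacements here are placeholders, not lifted from my
code.

#include <mpi.h>

/* Illustrative only: each rank contributes sendcount bytes, and
 * recvcounts/displs say where each rank's contribution lands in
 * recvbuf (these would have to be agreed on beforehand). */
void baseline_exchange(char *sendbuf, int sendcount,
                       char *recvbuf, int *recvcounts, int *displs)
{
    MPI_Allgatherv(sendbuf, sendcount, MPI_BYTE,
                   recvbuf, recvcounts, displs, MPI_BYTE,
                   MPI_COMM_WORLD);
}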
Pseudocode for the treed round-robin scheme:
// share on node first: the odd (higher) rank hands its data to the
// even (lower) rank on the same node via shared memory
if (mpi_rank % 2 == 0) {
    MPI_Recv();                        // from mpi_rank + 1
    Merge_current_info_with_new_info;
} else {
    MPI_Send();                        // to mpi_rank - 1
}

// share between nodes: recursive doubling, exchanging the merged
// data with partner = rank XOR 2^i at step i
for (i = 1; i < ceil(log2(mpi_size)); i++) {
    share_partner = mpi_rank ^ (1 << i);
    if (share_partner < mpi_size) {    // does the partner exist?
        MPI_Isend();                   // to share_partner
        MPI_Irecv();                   // from share_partner
        MPI_Waitall();
        Merge_current_info_with_new_info;
    }
}

// share on node afterward: the even (lower) rank hands the fully
// merged data back to the odd (higher) rank via shared memory
if (mpi_rank % 2 == 0) {
    MPI_Send();                        // to mpi_rank + 1
} else {
    MPI_Recv();                        // from mpi_rank - 1
}
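To make that concrete, here is a stripped-down, compilable sketch of
the scheme. It is not my actual code: the names (FRAME_BYTES,
tree_exchange, the flat one-frame-per-rank buffer) are placeholders,
the real merge works on application data instead of just copying
frames, and I have written in the guard so that only the even rank on
each node uses the NIC (the pseudocode above leaves that implicit).
It assumes a power-of-two number of processes and --byslot placement.

#include <mpi.h>

#define FRAME_BYTES 1500   /* one padded, Ethernet-sized packet */

/* frames holds size * FRAME_BYTES bytes; frame i belongs to rank i.
 * On entry only our own frame is filled in; on exit all of them are. */
static void tree_exchange(char *frames, MPI_Comm comm)
{
    int rank, size, tag = 0;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Stage 1: on-node share.  With --byslot placement ranks 2k and
     * 2k+1 sit on the same node, so this goes over shared memory. */
    if (rank % 2 == 0) {
        MPI_Recv(frames + (rank + 1) * FRAME_BYTES, FRAME_BYTES, MPI_BYTE,
                 rank + 1, tag, comm, MPI_STATUS_IGNORE);
    } else {
        MPI_Send(frames + rank * FRAME_BYTES, FRAME_BYTES, MPI_BYTE,
                 rank - 1, tag, comm);
    }

    /* Stage 2: recursive doubling among the even ranks only (one NIC
     * user per node).  At step i each participant owns the aligned
     * block of 2^i consecutive frames containing its own, and swaps
     * that block with partner = rank XOR 2^i. */
    if (rank % 2 == 0) {
        for (int i = 1; (1 << i) < size; i++) {
            int nblk    = 1 << i;                  /* frames owned so far  */
            int my_base = rank & ~(nblk - 1);      /* start of my block    */
            int partner = rank ^ nblk;
            int pa_base = partner & ~(nblk - 1);   /* start of their block */
            MPI_Request req[2];
            MPI_Isend(frames + my_base * FRAME_BYTES, nblk * FRAME_BYTES,
                      MPI_BYTE, partner, tag, comm, &req[0]);
            MPI_Irecv(frames + pa_base * FRAME_BYTES, nblk * FRAME_BYTES,
                      MPI_BYTE, partner, tag, comm, &req[1]);
            MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        }
    }

    /* Stage 3: hand the complete set back to the odd rank on this
     * node, again over shared memory. */
    if (rank % 2 == 0) {
        MPI_Send(frames, size * FRAME_BYTES, MPI_BYTE, rank + 1, tag, comm);
    } else {
        MPI_Recv(frames, size * FRAME_BYTES, MPI_BYTE, rank - 1, tag, comm,
                 MPI_STATUS_IGNORE);
    }
}

With 16 ranks this works out to one shared-memory hop, three
inter-node steps, and one shared-memory hop back per model
millisecond.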
I know this is a detailed email, but it is important that I resolve
this (the faster the model runs, the faster I graduate). One more
interesting tidbit: under LAM this program scaled up to all 8 nodes
(with linear scaling up to 4 nodes). Under Open MPI the performance
is essentially flat beyond 1 node (2 processes).
Thanks for any help!!!
Karl Dockendorf
Research Fellow
Department of Biomedical Engineering
University of Florida
ka...@ufl.edu