Swamy Kandadai wrote:

Jeff:

I'm not Jeff, but...

Linpack has different characteristics at different problem sizes. At small problem sizes, any number of different overheads could be the problem. At large problem sizes, one should approach the peak floating-point performance of the machine and the efficiency of one's DGEMM (and blocking one uses, etc.) should become the issues. So, one question is whether there is a difference in the overheads or whether the large-N performance is actually different.

I recommend measuring performance for a range of matrix sizes. The data should be able to tell you if there are performance differences at small N that disappear with sufficiently large N or if there is a performance difference that would persist regardless of how large one were to make N.
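For what it's worth, HPL's input file lets you sweep several problem sizes in a single run; a minimal fragment of HPL.dat might look like this (the particular N and NB values below are just placeholders, not a recommendation):

4                        # of problems sizes (N)
5000 10000 20000 40000   Ns
1                        # of NBs
168                      NBs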

Again, I think it's better to look at trends as a function of N rather than just looking at one data point. You can get better understanding that way. Plus, it's cheaper! (Run time grows as N^3, so it's faster to run many small Ns than to run one or two blockbuster Ns.)
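To put rough numbers on that: at a fixed GFlops rate, run time goes as N^3, so a single N = 40000 run costs on the order of (40000/10000)^3 = 64 times as much as an N = 10000 run; a whole sweep of small and medium Ns can be cheaper than one big run.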

Anyhow, one would think the data will indicate that large-N performance is independent of the MPI implementation -- so long as you use the same DGEMMs in both cases (and you say you're using MKL in both cases). But this is an important assumption to check.
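One cheap way to check that assumption, assuming both HPL binaries are dynamically linked (and using your xhpl_ompi name plus a hypothetical xhpl_impi for the Intel MPI build), is to confirm that they resolve to the same MKL libraries:

ldd ./xhpl_ompi | grep -i mkl
ldd ./xhpl_impi | grep -i mkl

If they were linked statically this won't show anything, but then whatever MKL was pulled in at link time is what matters.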

If it's a matter of small-N overheads taking the edge off your big-N performance, then you could maybe start profiling small-N runs.
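If you do go that route, a lightweight MPI profiler such as mpiP can be preloaded without relinking a dynamically linked binary; something along these lines (the library path is just a placeholder, and -x is how Open MPI's mpirun exports an environment variable to the ranks):

mpirun -n 8 --machinefile hf -x LD_PRELOAD=/path/to/libmpiP.so ./xhpl_ompi

That should give you a per-rank breakdown of time spent in MPI versus time spent computing at small N.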

I am running on a 2.66 GHz Nehalem node with turbo mode and hyperthreading enabled. When I run LINPACK with Intel MPI, I get 82.68 GFlops without much trouble.

When I ran with OpenMPI (I have OpenMPI 1.2.8, but my colleague was using 1.3.2), I used the same MKL libraries as with Intel MPI. But with OpenMPI, the best I have gotten so far is 80.22 GFlops, and I have never come close to what I get with Intel MPI.
Here are my options with OpenMPI:

mpirun -n 8 --machinefile hf --mca rmaps_rank_file_path rankfile --mca coll_sm_info_num_procs 8 --mca btl self,sm -mca mpi_leave_pinned 1 ./xhpl_ompi

Here is my rankfile:

cat rankfile
rank 0=i02n05 slot=0
rank 1=i02n05 slot=1
rank 2=i02n05 slot=2
rank 3=i02n05 slot=3
rank 4=i02n05 slot=4
rank 5=i02n05 slot=5
rank 6=i02n05 slot=6
rank 7=i02n05 slot=7

In this case the physical cores are 0-7, while the additional logical processors from hyperthreading are 8-15. With the "top" command, I could see that all 8 tasks were running on 8 different physical cores; I did not see two MPI tasks running on the same physical core. Also, the program is not paging, as the problem size fits in memory.
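To double-check the binding beyond top, I can also print each rank's CPU affinity list by PID (taskset is from util-linux):

for pid in $(pgrep -f xhpl_ompi); do taskset -cp $pid; done

Each rank should show only the core it was bound to, rather than the full 0-15 range, if the rankfile binding took effect.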

Do you have any ideas on how I can improve the performance so that it matches the Intel MPI performance?
Any suggestions will be greatly appreciated.
