Swamy Kandadai wrote:
Jeff:
I'm not Jeff, but...
Linpack has different characteristics at different problem sizes. At
small problem sizes, any number of different overheads could be the
problem. At large problem sizes, one should approach the peak
floating-point performance of the machine, and the efficiency of one's
DGEMM (and the blocking one uses, etc.) should become the dominant issues. So, one
question is whether there is a difference in the overheads or whether
the large-N performance is actually different.
I recommend measuring performance for a range of matrix sizes. The data
should be able to tell you if there are performance differences at small
N that disappear with sufficiently large N or if there is a performance
difference that would persist regardless of how large one were to make N.
Again, I think it's better to look at trends as a function of N rather
than just looking at one data point. You can get better understanding
that way. Plus, it's cheaper! (Run time grows as N^3, so it's faster
to run many small Ns than to run one or two blockbuster Ns.)
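Incidentally, HPL makes this kind of sweep easy: you can list several
problem sizes in one HPL.dat and it will run them all in a single job.
The relevant lines look something like this (the particular Ns and NB
below are just placeholders -- pick values that fit your memory):

  4             # of problems sizes (N)
  5000 10000 20000 40000  Ns
  1             # of NBs
  168           NBs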
Anyhow, one would think the data will indicate that large-N performance
is independent of the MPI implementation -- so long as you use the same
DGEMMs in both cases (and you say you're using MKL in both cases). But
this is an important assumption to check.
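A quick way to check it is to make sure both binaries really resolve to
the same MKL, e.g. (xhpl_impi below is just a stand-in for whatever your
Intel MPI build of HPL is called):

  ldd ./xhpl_ompi | grep -i mkl
  ldd ./xhpl_impi | grep -i mkl

If the two lists differ (or one binary is linked statically), that is
worth ruling out first.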
If it's a matter of small-N overheads taking the edge off your big-N
performance, then you could maybe start profiling small-N runs.
I am running on a 2.66 GHz Nehalem node. On this node, turbo mode and
hyperthreading are enabled.
When I run LINPACK with Intel MPI, I get 82.68 GFlops without much
trouble.
I also ran with OpenMPI (I have OpenMPI 1.2.8, but my colleague was
using 1.3.2), using the same MKL libraries with both OpenMPI and
Intel MPI. But with OpenMPI, the best I have gotten so far is 80.22 GFlops,
and I could never get close to what I am getting with Intel MPI.
Here are my options with OpenMPI:
mpirun -n 8 --machinefile hf --mca rmaps_rank_file_path rankfile --mca
coll_sm_info_num_procs 8 --mca btl self,sm -mca mpi_leave_pinned 1
./xhpl_ompi
Here is my rankfile:
cat rankfile
rank 0=i02n05 slot=0
rank 1=i02n05 slot=1
rank 2=i02n05 slot=2
rank 3=i02n05 slot=3
rank 4=i02n05 slot=4
rank 5=i02n05 slot=5
rank 6=i02n05 slot=6
rank 7=i02n05 slot=7
In this case the physical cores are 0-7 while the additional logical
processors with hyperthreading are 8-15.
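For reference, that numbering can be confirmed from sysfs (assuming a
Linux kernel that exposes the usual topology files), e.g.:

  for c in /sys/devices/system/cpu/cpu[0-9]*; do
    echo "$c: $(cat $c/topology/thread_siblings_list)"
  done

With the layout described above, each core should report a sibling pair
such as 0,8, 1,9, and so on.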
With "top" command, I could see all the 8 tasks are running on 8
different physical cores. I did not see
2 MPI tasks running on the same physical core. Also, the program is
not paging as the problem size
fits in the meory.
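I relied on top here; a more precise check would presumably be to look
at each rank's allowed-CPU mask directly (assuming this Open MPI build
actually sets the affinity mask), something like:

  for p in $(pgrep xhpl); do
    echo "$p: $(grep Cpus_allowed_list /proc/$p/status)"
  done

where each rank should show a single, distinct CPU in 0-7.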
Do you have any ideas on how I can improve the performance so that it
matches the Intel MPI performance?
Any suggestions will be greatly appreciated.