See my comments inline...

________________________________
 From: Yevgeny Kliteynik <klit...@dev.mellanox.co.il>
To: Randolph Pullen <randolph_pul...@yahoo.com.au> 
Cc: OpenMPI Users <us...@open-mpi.org> 
Sent: Sunday, 9 September 2012 6:18 PM
Subject: Re: [OMPI users] Infiniband performance Problem and stalling
 
Randolph,

On 9/7/2012 7:43 AM, Randolph Pullen wrote:
> Yevgeny,
> The ibstat results:
> CA 'mthca0'
> CA type: MT25208 (MT23108 compat mode)

What you have is an InfiniHost III HCA, which is a 4x SDR card.
This card has a theoretical signaling rate of 10 Gb/s, which after IB 8b/10b encoding
works out to about 1 GB/s of data.

> And more interestingly, ib_write_bw:
> Conflicting CPU frequency values detected: 1600.000000 != 3301.000000
> 
> What does Conflicting CPU frequency values mean?
> 
> Examining the /proc/cpuinfo file however shows:
> processor : 0
> cpu MHz : 3301.000
> processor : 1
> cpu MHz : 3301.000
> processor : 2
> cpu MHz : 1600.000
> processor : 3
> cpu MHz : 1600.000
> 
> Which seems weird to me...

You need all the cores running at their highest clock to get better numbers.
Maybe the power governor is not set for optimal performance on these machines.
Google "Linux CPU scaling governor" for more background on this subject, or
contact your system admin and ask him to take care of the CPU frequencies.
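
For example (a minimal sketch, assuming the standard cpufreq sysfs interface is
available on these nodes and running as root on each of them), something like this
should pin every core to its top frequency:

  for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
      echo performance > "$g"
  done
  # verify: every core should now report the top clock
  cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_cur_freq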

Once this is done, check all the pairs of your machines - make sure that you get
a good number with ib_write_bw.
Note that if you have a slower machine in the cluster, overall application
performance will suffer.
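
A minimal sketch of such a pairwise check, using the server/client model of the
perftest tools (hostnames as in the runs below):

  [root@vh1 ~]# ib_write_bw          # server side: start first, waits for the client
  [root@vh2 ~]# ib_write_bw -F vh1   # client side: connect to vh1; -F continues despite the CPU frequency warning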

I have anchored the clock speeds to:
[root@vh1 ~]#   cat /sys/devices/system/cpu/*/cpufreq/cpuinfo_cur_freq
3600000
3600000
3600000
3600000
3600000
3600000
3600000
3600000

[root@vh2 ~]#  cat /sys/devices/system/cpu/*/cpufreq/cpuinfo_cur_freq
3200000
3200000
3200000
3200000

However, /proc/cpuinfo still reports them incorrectly:
[deepcloud@vh2 c]$ grep MHz /proc/cpuinfo
cpu MHz         : 3300.000
cpu MHz         : 1600.000
cpu MHz         : 1600.000
cpu MHz         : 1600.000

I don't think this is the problem, so I used the -F option with ib_write_bw to push
ahead, i.e.:
[deepcloud@vh2 c]$  ib_write_bw -F vh1
------------------------------------------------------------------
                    RDMA_Write BW Test
 Number of qps   : 1
 Connection type : RC
 TX depth        : 300
 CQ Moderation   : 50
 Link type       : IB
 Mtu             : 2048
 Inline data is used up to 0 bytes message
 local address: LID 0x04 QPN 0xaa0408 PSN 0xf9c072 RKey 0x59260052 VAddr 
0x002b03a8af3000
 remote address: LID 0x03 QPN 0x8b0408 PSN 0xe4890d RKey 0x4a62003c VAddr 
0x002b8e44297000
------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]
Conflicting CPU frequency values detected: 3300.000000 != 1600.000000
Test integrity may be harmed !
Conflicting CPU frequency values detected: 3300.000000 != 1600.000000
Test integrity may be harmed !
Conflicting CPU frequency values detected: 3300.000000 != 1600.000000
Test integrity may be harmed !
Warning: measured timestamp frequency 3092.95 differs from nominal 3300 MHz
 65536     5000           937.61             937.60 
------------------------------------------------------------------


>  > On 8/31/2012 10:53 AM, Randolph Pullen wrote:
>  > > (reposted with consolidated information)
>  > > I have a test rig comprising 2 i7 systems, 8GB RAM, with Mellanox InfiniHost III HCA 10G cards
>  > > running CentOS 5.7, Kernel 2.6.18-274
>  > > Open MPI 1.4.3
>  > > MLNX_OFED_LINUX-1.5.3-1.0.0.2 (OFED-1.5.3-1.0.0.2):
>  > > On a Cisco 24-port switch
>  > > Normal performance is:
>  > > $ mpirun --mca btl openib,self -n 2 -hostfile mpi.hosts PingPong
>  > > results in:
>  > > Max rate = 958.388867 MB/sec Min latency = 4.529953 usec
>  > > and:
>  > > $ mpirun --mca btl tcp,self -n 2 -hostfile mpi.hosts PingPong
>  > > Max rate = 653.547293 MB/sec Min latency = 19.550323 usec

These numbers look fine - 958 MB/s on IB is close to the theoretical limit,
and 654 MB/s for IPoIB looks fine too.

>  > > My problem is I see better performance under IPoIB than I do on native IB (RDMA_CM).

I don't see this in your numbers. What am I missing?

Runs in 9 seconds:
mpirun --mca btl_openib_eager_limit 64 --mca mpi_leave_pinned 1 --mca btl openib,self -H vh2,vh1 -np 9 --bycore prog
mpirun --mca btl_openib_flags 2 --mca mpi_leave_pinned 1 --mca btl tcp,self -H vh2,vh1 -np 9 --bycore prog

Runs in 24 seconds or more:
mpirun --mca btl_openib_flags 2 --mca mpi_leave_pinned 1 --mca btl openib,self -H vh2,vh1 -np 9 --bycore prog
mpirun --mca btl_openib_flags 2 --mca mpi_leave_pinned 1 --mca btl openib,self,sm -H vh2,vh1 -np 9 --bycore prog
mpirun --mca btl_openib_eager_limit 64 --mca mpi_leave_pinned 1 --mca btl openib,self,sm -H vh2,vh1 -np 9 --bycore prog


Note:
- Adding sm to the fastest openib run results in a 13-second penalty.
- Subsequent runs with openib usually add at least 10 seconds per run, or stall
  (a BTL-verbosity check for this is sketched below).
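
One way to double-check which BTLs are actually being selected on each of these
runs (assuming the btl_base_verbose MCA parameter behaves the usual way under
Open MPI 1.4.3) is to raise the BTL verbosity, e.g.:

mpirun --mca btl_base_verbose 30 --mca btl openib,self,sm -H vh2,vh1 -np 9 --bycore prog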

>  > > My understanding is that IPoIB is limited to about 1G/s so I am at a loss to know why it is faster.

Again, I see IPoIB performance under 1 GB/s.

>  > > And this one produces similar run times but seems to degrade with repeated cycles:
>  > > mpirun --mca btl_openib_eager_limit 64 --mca mpi_leave_pinned 1 --mca btl openib,self -H vh2,vh1 -np 9 --bycore prog
> 
> You're running 9 ranks on two machines, but you're using IB for intra-node communication.
> Is that intentional? If not, you can add the "sm" btl and improve performance.

Also, don't forget to include the "sm" btl if you have more than 1 MPI rank per node.
See above: adding sm to the fastest openib run results in a 13-second penalty.

-- YK
