Hi Jeff,
Here is the code, with a warmup broadcast of 10K real values followed by the
actual broadcast of 100K real*8 values (different buffers):
[kmuriki@n0000 pub]$ more testbcast.f90
program em3d
implicit real*8 (a-h,o-z)
include 'mpif.h'
! em3d_inv main driver
! INITIALIZE MPI AND DETERMINE BOTH INDIVIDUAL PROCESSOR #
! AND THE TOTAL NUMBER OF PROCESSORS
!
integer:: Proc
real*8, allocatable:: dbuf(:)
real warmup(10000)
call MPI_INIT(ierror)
call MPI_COMM_RANK(MPI_COMM_WORLD,Proc,IERROR)
call MPI_COMM_SIZE(MPI_COMM_WORLD,Num_Proc,IERROR)
ndat=100000
!print*,'bcasting to no of tasks',num_proc
allocate(dbuf(ndat))
do i=1,ndat
dbuf(i)=dble(i)
enddo
do i=1,10000
warmup(i)=(i)
enddo
!print*, 'Making warmup BCAST',proc
call MPI_BCAST(warmup,10000, &
MPI_REAL,0,MPI_COMM_WORLD,ierror)
t1=MPI_WTIME()
call MPI_BCAST(dbuf,ndat, &
MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD,ierror)
!print*, 'Done with call to broadcast',proc
t2=MPI_WTIME()
write(*,*)'time for bcast',t2-t1
deallocate(dbuf)
call MPI_FINALIZE(IERROR)
end program em3d
[kmuriki@n0000 pub]$ !mpif90
mpif90 -o testbcast testbcast.f90
testbcast.f90(20): (col. 1) remark: LOOP WAS VECTORIZED.
testbcast.f90(24): (col. 1) remark: LOOP WAS VECTORIZED.
/global/software/centos-5.x86_64/modules/intel/fce/10.1.018/lib/libimf.so:
warning: warning: feupdateenv is not implemented and will always fail
[kmuriki@n0000 pub]$ !mpirun
mpirun -v -display-map -mca btl openib,self -mca mpi_leave_pinned 1
-hostfile ./hostfile.geophys -np 4 ./testbcast
[n0000.scs00:12909] Map for job: 1 Generated by mapping mode: byslot
Starting vpid: 0 Vpid range: 4 Num app_contexts: 1
Data for app_context: index 0 app: ./testbcast
Num procs: 4
Argv[0]: ./testbcast
Env[0]: OMPI_MCA_btl=openib,self
Env[1]: OMPI_MCA_mpi_leave_pinned=1
Env[2]: OMPI_MCA_rmaps_base_display_map=1
Env[3]: OMPI_MCA_rds_hostfile_path=./hostfile.geophys
Env[4]: OMPI_MCA_orte_precondition_transports=1e4532db63da3056-33551606203d9c19
Env[5]: OMPI_MCA_rds=proxy
Env[6]: OMPI_MCA_ras=proxy
Env[7]: OMPI_MCA_rmaps=proxy
Env[8]: OMPI_MCA_pls=proxy
Env[9]: OMPI_MCA_rmgr=proxy
Working dir: /global/home/users/kmuriki/sample_executables/pub (user: 0)
Num maps: 0
Num elements in nodes list: 2
Mapped node:
Cell: 0 Nodename: n0015.geophys Launch id: -1 Username: NULL
Daemon name:
Data type: ORTE_PROCESS_NAME Data Value: NULL
Oversubscribed: False Num elements in procs list: 2
Mapped proc:
Proc Name:
Data type: ORTE_PROCESS_NAME Data Value: [0,1,0]
Proc Rank: 0 Proc PID: 0 App_context index: 0
Mapped proc:
Proc Name:
Data type: ORTE_PROCESS_NAME Data Value: [0,1,1]
Proc Rank: 1 Proc PID: 0 App_context index: 0
Mapped node:
Cell: 0 Nodename: n0016.geophys Launch id: -1 Username: NULL
Daemon name:
Data type: ORTE_PROCESS_NAME Data Value: NULL
Oversubscribed: False Num elements in procs list: 2
Mapped proc:
Proc Name:
Data type: ORTE_PROCESS_NAME Data Value: [0,1,2]
Proc Rank: 2 Proc PID: 0 App_context index: 0
Mapped proc:
Proc Name:
Data type: ORTE_PROCESS_NAME Data Value: [0,1,3]
Proc Rank: 3 Proc PID: 0 App_context index: 0
time for bcast 5.556106567382812E-003
time for bcast 5.569934844970703E-003
time for bcast 2.491402626037598E-002
time for bcast 2.490019798278809E-002
[kmuriki@n0000 pub]$
If I reduce the warmup size from 10K to 1K, below is the output:
time for bcast 2.994060516357422E-003
time for bcast 2.840995788574219E-003
time for bcast 52.0005199909210
time for bcast 52.0438468456268
Maybe when I tried the 1K warmup, because the size is small it just used
copy-in/copy-out semantics and the RDMA buffers were not set up, so the
actual bcast was slow; when I used the 10K warmup it did set up the RDMA
buffers, and hence the actual bcast was quick.
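To try to separate this kind of one-time setup cost from plain
synchronization effects, I am thinking of timing the same 100K-element
bcast several times in a row, with a barrier before each timed call,
along the lines of the sketch below (untested; the program and variable
names are just placeholders):

program bcast_probe
  ! Sketch only: same 100K real*8 payload as testbcast, but the
  ! identical MPI_BCAST is timed nrep times with a barrier first.
  implicit real*8 (a-h,o-z)
  include 'mpif.h'
  integer :: proc
  real*8, allocatable :: dbuf(:)
  call MPI_INIT(ierror)
  call MPI_COMM_RANK(MPI_COMM_WORLD, proc, ierror)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, num_proc, ierror)
  ndat = 100000          ! same payload size as in testbcast
  nrep = 5               ! repeat the identical bcast a few times
  allocate(dbuf(ndat))
  do i = 1, ndat
     dbuf(i) = dble(i)
  enddo
  do irep = 1, nrep
     ! line the ranks up so the timing is not dominated by ranks
     ! arriving at the bcast at different times
     call MPI_BARRIER(MPI_COMM_WORLD, ierror)
     t1 = MPI_WTIME()
     call MPI_BCAST(dbuf, ndat, MPI_DOUBLE_PRECISION, 0, &
                    MPI_COMM_WORLD, ierror)
     t2 = MPI_WTIME()
     write(*,*) 'rank', proc, ' rep', irep, ' bcast time', t2 - t1
  enddo
  deallocate(dbuf)
  call MPI_FINALIZE(ierror)
end program bcast_probe

If only the first repetition is slow, that would point to a one-time
registration/connection-setup cost rather than the transfer itself.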
Is it possible to get more diagnostic output from the mpirun command with
any additional option, to see whether it is doing copy-in/copy-out, etc.?
With Myrinet, the mpirun '-v' option gives a lot of diagnostic output.
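For example, I was hoping for something along the lines of the following
(I am guessing at the verbosity parameter name here, so please correct me
if this is not the right knob):

mpirun -np 4 --mca btl openib,self --mca btl_base_verbose 100 \
    -hostfile ./hostfile.geophys ./testbcast

or perhaps "ompi_info --param btl openib" to see which openib BTL
parameters and protocol thresholds are in effect on our build.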
Finally, below are the numbers with IB and gige when I run Bcast in IMB,
which look good:
[kmuriki@n0000 runIMB]$ mpirun -v -np 4 --mca btl openib,self -hostfile
../pub/hostfile.geophys
/global/home/groups/scs/tests/IMB/IMB_3.1/src/IMB-MPI1 -npmin 4 bcast
#---------------------------------------------------
# Intel (R) MPI Benchmark Suite V3.1, MPI-1 part
#---------------------------------------------------
# Date : Wed Jan 14 11:37:54 2009
# Machine : x86_64
# System : Linux
# Release : 2.6.18-92.1.18.el5
# Version : #1 SMP Wed Nov 12 09:19:49 EST 2008
# MPI Version : 2.0
# MPI Thread Environment: MPI_THREAD_SINGLE
# Calling sequence was:
# /global/home/groups/scs/tests/IMB/IMB_3.1/src/IMB-MPI1 -npmin 4 bcast
# Minimum message length in bytes: 0
# Maximum message length in bytes: 4194304
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#
# List of Benchmarks to run:
# Bcast
#----------------------------------------------------------------
# Benchmarking Bcast
# #processes = 4
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.05 0.05 0.05
1 1000 11.73 11.75 11.75
2 1000 10.38 10.40 10.39
4 1000 10.26 10.28 10.27
8 1000 10.43 10.45 10.44
16 1000 10.26 10.28 10.27
32 1000 10.46 10.48 10.47
64 1000 10.47 10.49 10.48
128 1000 10.41 10.43 10.42
256 1000 11.13 11.15 11.14
512 1000 11.30 11.31 11.31
1024 1000 14.45 14.47 14.47
2048 1000 26.03 26.05 26.04
4096 1000 44.00 44.04 44.02
8192 1000 72.21 72.28 72.26
16384 1000 135.48 135.60 135.56
32768 1000 297.64 297.71 297.67
65536 640 579.20 579.37 579.28
131072 320 1174.31 1174.81 1174.57
262144 160 2484.21 2486.33 2485.28
524288 80 2686.47 2695.13 2690.80
1048576 40 5706.35 5740.59 5723.47
2097152 20 10705.90 10761.65 10742.98
4194304 10 21567.58 21678.50 21641.65
[kmuriki@n0000 runIMB]$ mpirun -v -np 4 --mca btl tcp,self -hostfile
../pub/hostfile.geophys
/global/home/groups/scs/tests/IMB/IMB_3.1/src/IMB-MPI1 -npmin 4 bcast
#---------------------------------------------------
# Intel (R) MPI Benchmark Suite V3.1, MPI-1 part
#---------------------------------------------------
# Date : Wed Jan 14 11:38:01 2009
# Machine : x86_64
# System : Linux
# Release : 2.6.18-92.1.18.el5
# Version : #1 SMP Wed Nov 12 09:19:49 EST 2008
# MPI Version : 2.0
# MPI Thread Environment: MPI_THREAD_SINGLE
# Calling sequence was:
# /global/home/groups/scs/tests/IMB/IMB_3.1/src/IMB-MPI1 -npmin 4 bcast
# Minimum message length in bytes: 0
# Maximum message length in bytes: 4194304
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#
# List of Benchmarks to run:
# Bcast
#----------------------------------------------------------------
# Benchmarking Bcast
# #processes = 4
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.05 0.06 0.05
1 1000 51.23 51.31 51.25
2 1000 49.98 50.08 50.01
4 1000 49.93 50.08 49.97
8 1000 51.23 51.39 51.27
16 1000 49.92 50.04 49.96
32 1000 49.88 50.02 49.93
64 1000 49.94 50.07 49.99
128 1000 50.03 50.19 50.08
256 1000 53.46 53.62 53.53
512 1000 62.36 62.52 62.41
1024 1000 74.82 75.05 74.89
2048 1000 190.87 191.09 190.98
4096 1000 215.01 215.29 215.20
8192 1000 285.16 285.41 285.28
16384 1000 426.49 426.79 426.64
32768 1000 680.94 681.29 681.16
65536 640 1148.72 1149.69 1149.34
131072 320 2511.92 2512.13 2512.03
262144 160 4716.58 4717.14 4716.86
524288 80 8010.99 8016.05 8013.21
1048576 40 16657.90 16676.32 16667.73
2097152 20 27720.20 27916.86 27825.34
4194304 10 54355.69 54781.70 54585.30
[kmuriki@n0000 runIMB]$
thanks,
Krishna.
On Wed, 14 Jan 2009, Jeff Squyres wrote:
> On Jan 13, 2009, at 3:32 PM, kmur...@lbl.gov wrote:
>
>>> With IB, there's also the issue of registered memory. Open MPI v1.2.x
>>> defaults to copy in/copy out semantics (with pre-registered memory) until
>>> the message reaches a certain size, and then it uses a pipelined
>>> register/RDMA protocol. However, even with copy in/out semantics of small
>>> messages, the resulting broadcast should still be much faster than over
>>> gige.
>>>
>>> Are you using the same buffer for the warmup bcast as the actual bcast?
>>> You might try using "--mca mpi_leave_pinned 1" to see if that helps as
>>> well (will likely only help with large messages).
>> I'm using different buffers for warmup and actual bcast. I tried the
>> mpi_leave_pinned 1, but did not see any difference in behaviour.
> In this case, you likely won't see much of a difference -- mpi_leave_pinned
> will generally only be a boost for long messages that use the same buffers
> repeatedly.
>> Maybe whenever Open MPI defaults to copy in/copy out semantics on my
>> cluster it performs very slowly (slower than gige), but not when it uses
>> RDMA.
> That would be quite surprising. I still think there's some kind of startup
> overhead going on here.
>> Surprisingly, just doing two consecutive 80K-byte MPI_BCASTs performs
>> very quickly (forget about warmup and actual broadcast), whereas a single
>> 80K broadcast is slow. Not sure if I'm missing anything!
> There's also the startup time and synchronization issues. Remember that
> although MPI_BCAST does not provide any synchronization guarantees, it
> could well be that the 1st bcast effectively synchronizes the processes
> and the 2nd one therefore runs much faster (because individual processes
> won't need to spend much time blocking waiting for messages because
> they're effectively operating in lock step after the first bcast).
> Benchmarking is a very tricky business; it can be extremely difficult to
> precisely measure exactly what you want to measure.
>> My main effort here is not to benchmark my cluster but to resolve a user
>> problem, wherein he complained that his bcasts are running very slowly.
>> I tried to recreate the situation with a simple Fortran program that just
>> performs a bcast of a size similar to the one in his code. It also
>> performed very slowly (slower than gige), so I started increasing and
>> decreasing the bcast size and observed that it performs slowly only in
>> the range of 8K to 100K bytes.
> Can you send your modified test program (with a warmup send)?
>
> What happens if you run a benchmark like the broadcast section of IMB on
> TCP and IB?
>
> --
> Jeff Squyres
> Cisco Systems