Hi Jeff,
Here is the code, with a warmup broadcast of 10K real values followed by the
actual broadcast of 100K real*8 values (different buffers):
[kmuriki@n0000 pub]$ more testbcast.f90
program em3d
implicit real*8 (a-h,o-z)
include 'mpif.h'
! em3d_inv main driver
! INITIALIZE MPI AND DETERMINE BOTH INDIVIDUAL PROCESSOR #
! AND THE TOTAL NUMBER OF PROCESSORS
!
integer:: Proc
real*8, allocatable:: dbuf(:)
real warmup(10000)
call MPI_INIT(ierror)
call MPI_COMM_RANK(MPI_COMM_WORLD,Proc,IERROR)
call MPI_COMM_SIZE(MPI_COMM_WORLD,Num_Proc,IERROR)
ndat=100000
!print*,'bcasting to no of tasks',num_proc
allocate(dbuf(ndat))
do i=1,ndat
dbuf(i)=dble(i)
enddo
do i=1,10000
warmup(i)=(i)
enddo
!print*, 'Making warmup BCAST',proc
call MPI_BCAST(warmup,10000, &
MPI_REAL,0,MPI_COMM_WORLD,ierror)
t1=MPI_WTIME()
call MPI_BCAST(dbuf,ndat, &
MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD,ierror)
!print*, 'Done with call to broadcast',proc
t2=MPI_WTIME()
write(*,*)'time for bcast',t2-t1
deallocate(dbuf)
call MPI_FINALIZE(IERROR)
end program em3d
[kmuriki@n0000 pub]$ !mpif90
mpif90 -o testbcast testbcast.f90
testbcast.f90(20): (col. 1) remark: LOOP WAS VECTORIZED.
testbcast.f90(24): (col. 1) remark: LOOP WAS VECTORIZED.
/global/software/centos-5.x86_64/modules/intel/fce/10.1.018/lib/libimf.so:
warning: warning: feupdateenv is not implemented and will always fail
[kmuriki@n0000 pub]$ !mpirun
mpirun -v -display-map -mca btl openib,self -mca mpi_leave_pinned 1
-hostfile ./hostfile.geophys -np 4 ./testbcast
[n0000.scs00:12909] Map for job: 1 Generated by mapping mode: byslot
Starting vpid: 0 Vpid range: 4 Num app_contexts: 1
Data for app_context: index 0 app: ./testbcast
Num procs: 4
Argv[0]: ./testbcast
Env[0]: OMPI_MCA_btl=openib,self
Env[1]: OMPI_MCA_mpi_leave_pinned=1
Env[2]: OMPI_MCA_rmaps_base_display_map=1
Env[3]: OMPI_MCA_rds_hostfile_path=./hostfile.geophys
Env[4]: OMPI_MCA_orte_precondition_transports=1e4532db63da3056-33551606203d9c19
Env[5]: OMPI_MCA_rds=proxy
Env[6]: OMPI_MCA_ras=proxy
Env[7]: OMPI_MCA_rmaps=proxy
Env[8]: OMPI_MCA_pls=proxy
Env[9]: OMPI_MCA_rmgr=proxy
Working dir: /global/home/users/kmuriki/sample_executables/pub (user: 0)
Num maps: 0
Num elements in nodes list: 2
Mapped node:
Cell: 0 Nodename: n0015.geophys Launch id: -1 Username: NULL
Daemon name:
Data type: ORTE_PROCESS_NAME Data Value: NULL
Oversubscribed: False Num elements in procs list: 2
Mapped proc:
Proc Name:
Data type: ORTE_PROCESS_NAME Data Value: [0,1,0]
Proc Rank: 0 Proc PID: 0 App_context index: 0
Mapped proc:
Proc Name:
Data type: ORTE_PROCESS_NAME Data Value: [0,1,1]
Proc Rank: 1 Proc PID: 0 App_context index: 0
Mapped node:
Cell: 0 Nodename: n0016.geophys Launch id: -1 Username: NULL
Daemon name:
Data type: ORTE_PROCESS_NAME Data Value: NULL
Oversubscribed: False Num elements in procs list: 2
Mapped proc:
Proc Name:
Data type: ORTE_PROCESS_NAME Data Value: [0,1,2]
Proc Rank: 2 Proc PID: 0 App_context index: 0
Mapped proc:
Proc Name:
Data type: ORTE_PROCESS_NAME Data Value: [0,1,3]
Proc Rank: 3 Proc PID: 0 App_context index: 0
time for bcast 5.556106567382812E-003
time for bcast 5.569934844970703E-003
time for bcast 2.491402626037598E-002
time for bcast 2.490019798278809E-002
[kmuriki@n0000 pub]$
If I reduce the warmup size from 10K to 1K, below is the output:
time for bcast 2.994060516357422E-003
time for bcast 2.840995788574219E-003
time for bcast 52.0005199909210
time for bcast 52.0438468456268
Maybe when I tried the 1K warmup, because the size is small it just used
copy-in/copy-out semantics and the RDMA buffers were not set up, so the
actual bcast was slow; when I used the 10K warmup it did set up the RDMA
buffers, and hence the actual bcast was quick.
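To try to separate this kind of one-time setup cost from plain
synchronization effects, I am thinking of timing the same 100K-element
bcast several times in a row, with a barrier before each timed call,
along the lines of the sketch below (untested; the program and variable
names are just placeholders):

program bcast_probe
  ! Sketch only: same 100K real*8 payload as testbcast, but the
  ! identical MPI_BCAST is timed nrep times with a barrier first.
  implicit real*8 (a-h,o-z)
  include 'mpif.h'
  integer :: proc
  real*8, allocatable :: dbuf(:)
  call MPI_INIT(ierror)
  call MPI_COMM_RANK(MPI_COMM_WORLD, proc, ierror)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, num_proc, ierror)
  ndat = 100000          ! same payload size as in testbcast
  nrep = 5               ! repeat the identical bcast a few times
  allocate(dbuf(ndat))
  do i = 1, ndat
     dbuf(i) = dble(i)
  enddo
  do irep = 1, nrep
     ! line the ranks up so the timing is not dominated by ranks
     ! arriving at the bcast at different times
     call MPI_BARRIER(MPI_COMM_WORLD, ierror)
     t1 = MPI_WTIME()
     call MPI_BCAST(dbuf, ndat, MPI_DOUBLE_PRECISION, 0, &
                    MPI_COMM_WORLD, ierror)
     t2 = MPI_WTIME()
     write(*,*) 'rank', proc, ' rep', irep, ' bcast time', t2 - t1
  enddo
  deallocate(dbuf)
  call MPI_FINALIZE(ierror)
end program bcast_probe

If only the first repetition is slow, that would point to a one-time
registration/connection-setup cost rather than the transfer itself.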
Is it possible to get more diagnostic output from the mpirun command with
any additional option, to see whether it is doing copy-in/copy-out, etc.?
With Myrinet, the mpirun '-v' option gives a lot of diagnostic output.
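For example, I was hoping for something along the lines of the following
(I am guessing at the verbosity parameter name here, so please correct me
if this is not the right knob):

mpirun -np 4 --mca btl openib,self --mca btl_base_verbose 100 \
    -hostfile ./hostfile.geophys ./testbcast

or perhaps "ompi_info --param btl openib" to see which openib BTL
parameters and protocol thresholds are in effect on our build.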
Finally, below are the numbers with IB and gige when I run Bcast in IMB,
which look good:
[kmuriki@n0000 runIMB]$ mpirun -v -np 4 --mca btl openib,self -hostfile
../pub/hostfile.geophys
/global/home/groups/scs/tests/IMB/IMB_3.1/src/IMB-MPI1 -npmin 4 bcast
#---------------------------------------------------
# Intel (R) MPI Benchmark Suite V3.1, MPI-1 part
#---------------------------------------------------
# Date : Wed Jan 14 11:37:54 2009
# Machine : x86_64
# System : Linux
# Release : 2.6.18-92.1.18.el5
# Version : #1 SMP Wed Nov 12 09:19:49 EST 2008
# MPI Version : 2.0
# MPI Thread Environment: MPI_THREAD_SINGLE
# Calling sequence was:
# /global/home/groups/scs/tests/IMB/IMB_3.1/src/IMB-MPI1 -npmin 4 bcast
# Minimum message length in bytes: 0
# Maximum message length in bytes: 4194304
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#
# List of Benchmarks to run:
# Bcast
#----------------------------------------------------------------
# Benchmarking Bcast
# #processes = 4
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.05 0.05 0.05
1 1000 11.73 11.75 11.75
2 1000 10.38 10.40 10.39
4 1000 10.26 10.28 10.27
8 1000 10.43 10.45 10.44
16 1000 10.26 10.28 10.27
32 1000 10.46 10.48 10.47
64 1000 10.47 10.49 10.48
128 1000 10.41 10.43 10.42
256 1000 11.13 11.15 11.14
512 1000 11.30 11.31 11.31
1024 1000 14.45 14.47 14.47
2048 1000 26.03 26.05 26.04
4096 1000 44.00 44.04 44.02
8192 1000 72.21 72.28 72.26
16384 1000 135.48 135.60 135.56
32768 1000 297.64 297.71 297.67
65536 640 579.20 579.37 579.28
131072 320 1174.31 1174.81 1174.57
262144 160 2484.21 2486.33 2485.28
524288 80 2686.47 2695.13 2690.80
1048576 40 5706.35 5740.59 5723.47
2097152 20 10705.90 10761.65 10742.98
4194304 10 21567.58 21678.50 21641.65
[kmuriki@n0000 runIMB]$ mpirun -v -np 4 --mca btl tcp,self -hostfile
../pub/hostfile.geophys
/global/home/groups/scs/tests/IMB/IMB_3.1/src/IMB-MPI1 -npmin 4 bcast
#---------------------------------------------------
# Intel (R) MPI Benchmark Suite V3.1, MPI-1 part
#---------------------------------------------------
# Date : Wed Jan 14 11:38:01 2009
# Machine : x86_64
# System : Linux
# Release : 2.6.18-92.1.18.el5
# Version : #1 SMP Wed Nov 12 09:19:49 EST 2008
# MPI Version : 2.0
# MPI Thread Environment: MPI_THREAD_SINGLE
# Calling sequence was:
# /global/home/groups/scs/tests/IMB/IMB_3.1/src/IMB-MPI1 -npmin 4 bcast
# Minimum message length in bytes: 0
# Maximum message length in bytes: 4194304
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#
# List of Benchmarks to run:
# Bcast
#----------------------------------------------------------------
# Benchmarking Bcast
# #processes = 4
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.05 0.06 0.05
1 1000 51.23 51.31 51.25
2 1000 49.98 50.08 50.01
4 1000 49.93 50.08 49.97
8 1000 51.23 51.39 51.27
16 1000 49.92 50.04 49.96
32 1000 49.88 50.02 49.93
64 1000 49.94 50.07 49.99
128 1000 50.03 50.19 50.08
256 1000 53.46 53.62 53.53
512 1000 62.36 62.52 62.41
1024 1000 74.82 75.05 74.89
2048 1000 190.87 191.09 190.98
4096 1000 215.01 215.29 215.20
8192 1000 285.16 285.41 285.28
16384 1000 426.49 426.79 426.64
32768 1000 680.94 681.29 681.16
65536 640 1148.72 1149.69 1149.34
131072 320 2511.92 2512.13 2512.03
262144 160 4716.58 4717.14 4716.86
524288 80 8010.99 8016.05 8013.21
1048576 40 16657.90 16676.32 16667.73
2097152 20 27720.20 27916.86 27825.34
4194304 10 54355.69 54781.70 54585.30
[kmuriki@n0000 runIMB]$
thanks,
Krishna.
On Wed, 14 Jan 2009, Jeff Squyres wrote:
> On Jan 13, 2009, at 3:32 PM, kmur...@lbl.gov wrote:
>
>>> With IB, there's also the issue of registered memory. Open MPI v1.2.x
>>> defaults to copy in/copy out semantics (with pre-registered memory) until
>>> the message reaches a certain size, and then it uses a pipelined
>>> register/RDMA protocol. However, even with copy in/out semantics of small
>>> messages, the resulting broadcast should still be much faster than over
>>> gige.
>>>
>>> Are you using the same buffer for the warmup bcast as the actual bcast?
>>> You might try using "--mca mpi_leave_pinned 1" to see if that helps as
>>> well (will likely only help with large messages).
>> I'm using different buffers for warmup and actual bcast. I tried the
>> mpi_leave_pinned 1, but did not see any difference in behaviour.
> In this case, you likely won't see much of a difference -- mpi_leave_pinned
> will generally only be a boost for long messages that use the same buffers
> repeatedly.
>> Maybe whenever Open MPI defaults to copy in/copy out semantics on my
>> cluster it performs very slowly (slower than gige), but not when it uses
>> RDMA.
> That would be quite surprising. I still think there's some kind of startup
> overhead going on here.
>> Surprisingly, just doing two consecutive 80K-byte MPI_BCASTs performs
>> very quickly (forget about warmup and actual broadcast), whereas a single
>> 80K broadcast is slow. Not sure if I'm missing anything!
> There's also the startup time and synchronization issues. Remember that
> although MPI_BCAST does not provide any synchronization guarantees, it
> could well be that the 1st bcast effectively synchronizes the processes
> and the 2nd one therefore runs much faster (because individual processes
> won't need to spend much time blocking waiting for messages because
> they're effectively operating in lock step after the first bcast).
> Benchmarking is a very tricky business; it can be extremely difficult to
> precisely measure exactly what you want to measure.
>> My main effort here is not to benchmark my cluster but to resolve a user
>> problem, wherein he complained that his bcasts are running very slowly.
>> I tried to recreate the situation with a simple Fortran program that just
>> performs a bcast of a size similar to the one in his code. It also
>> performed very slowly (slower than gige), so I started increasing and
>> decreasing the bcast size and observed that it performs slowly only in
>> the range of 8K to 100K bytes.
> Can you send your modified test program (with a warmup send)?
>
> What happens if you run a benchmark like the broadcast section of IMB on
> TCP and IB?
>
> --
> Jeff Squyres
> Cisco Systems