Alex,

For OpenIB + GM you are probably going to be limited by the memory bus.
Take the InfiniBand NIC: it peaks at, say, 900 MBytes/sec, while the Myrinet 2G card peaks at around 250 MBytes/sec. Unless you are doing direct DMAs from pre-registered host memory, you will not see 900 + 250 MBytes/sec of aggregate bandwidth. The reason is that either you copy into registered memory, in which case the copy is the bottleneck, or you register/unregister memory on demand, in which case the registration is the bottleneck.

So the "solution" for micro-benchmarks is to register the memory and leave it registered. Probably the best way to do this is to use MPI_ALLOC_MEM when allocating memory, this allows us to register the memory with all the available NICs.

For applications it is a bit more difficult to say whether there is a benefit. You may be able to decrease congestion on some systems. Some applications that use pack buffers and are bandwidth limited may also benefit; the pack buffer could be obtained via MPI_ALLOC_MEM.
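As a rough illustration of that pack-buffer case (the routine and its arguments are made up for the example; a real application would allocate the staging buffer once and reuse it):

#include <mpi.h>

/* Pack a strided column of a row-major matrix into a staging buffer
   before sending it; the staging buffer comes from MPI_Alloc_mem, so
   the btls can send straight out of registered memory. */
void send_column(const double *matrix, int rows, int cols, int col,
                 int dest, MPI_Comm comm)
{
    int pack_size = 0, pos = 0, r;
    char *packbuf;

    /* Upper bound on the packed size of one column. */
    MPI_Pack_size(rows, MPI_DOUBLE, comm, &pack_size);
    MPI_Alloc_mem(pack_size, MPI_INFO_NULL, &packbuf);

    for (r = 0; r < rows; r++) {
        MPI_Pack((void *)&matrix[r * cols + col], 1, MPI_DOUBLE,
                 packbuf, pack_size, &pos, comm);
    }

    MPI_Send(packbuf, pos, MPI_PACKED, dest, 0, comm);
    MPI_Free_mem(packbuf);
}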

I would also say that this is a very uncommon mode of operation; our architecture allows it, but it certainly isn't optimized for this case.

- Galen



On Feb 12, 2007, at 6:48 PM, Alex Tumanov wrote:

Can anyone else provide some feedback/comments on this issue? How
typical/widespread is the use of multiple interconnects in the HPC
community? Judging from the feedback I'm getting in this thread, it
appears to be fairly uncommon.

Thanks for your attention to this thread.

Alex.

On 2/8/07, Alex Tumanov <atuma...@gmail.com> wrote:
Thanks for your insight George.

Strange, the latency is supposed to be there too. Anyway, the latency is only used to determine which one is faster, in order to use it for small messages.

I searched the code base for mca parameter registering and did indeed
discover that latency setting is possible for tcp and tcp alone:
------------------------------------------------------------------------
[OMPISRCDIR]# grep -r param_register * |egrep -i "latency|bandwidth"
ompi/mca/btl/openib/btl_openib_component.c:    mca_btl_openib_param_register_int("bandwidth", "Approximate maximum bandwidth of interconnect",
ompi/mca/btl/tcp/btl_tcp_component.c:    btl->super.btl_bandwidth = mca_btl_tcp_param_register_int(param, 0);
ompi/mca/btl/tcp/btl_tcp_component.c:    btl->super.btl_latency = mca_btl_tcp_param_register_int(param, 0);
ompi/mca/btl/gm/btl_gm_component.c:    mca_btl_gm_param_register_int("bandwidth", 250);
ompi/mca/btl/mvapi/btl_mvapi_component.c:    mca_btl_mvapi_param_register_int("bandwidth", "Approximate maximum bandwidth of interconnect",
------------------------------------------------------------------------
For all others, btl_latency appears to be set to zero when the btl
module gets constructed. Would zero latency prevent message striping?

An interesting side-issue that surfaces as a result of this little
investigation is the inconsistency between the ompi_info output and
the actual mca param availability for tcp_latency:

[OMPISRCDIR]# ompi_info --param all all |egrep -i "latency|bandwidth"
                 MCA btl: parameter "btl_gm_bandwidth" (current value: "250")
                 MCA btl: parameter "btl_mvapi_bandwidth" (current value: "800")
                          Approximate maximum bandwidth of interconnect
                 MCA btl: parameter "btl_openib_bandwidth" (current value: "800")
                          Approximate maximum bandwidth of interconnect

You also mentioned the exclusivity factor. I looked through the code
for that, and it appears that the btl module developers set
exclusivity to various integer values. In one place, the comment
suggests that exclusivity is what gets used to prioritize
interconnects... So a) I'm not sure what to set exclusivity to, and b)
it's unclear whether latency or exclusivity determines the order.
According to btl.h and to you it's the latency; according to the
following, exclusivity has something to do with it as well:

btl/mx/btl_mx_component.c:
    mca_base_param_reg_int( (mca_base_component_t*)&mca_btl_mx_component, "exclusivity",
                            "Priority compared with the others devices (used only when several devices are available",
                            false, false, 50, (int*)&mca_btl_mx_module.super.btl_exclusivity );

What should exclusivity be set to in order to allow using multiple
interconnects?
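(For what it's worth, the current per-btl defaults can be dumped the same
way as the bandwidth values above, e.g.

[OMPISRCDIR]# ompi_info --param btl all | egrep -i exclusivity

but that still doesn't tell me what the right values would be.)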

Finally,
For bandwidth, what really matters is the relative ratio. We sum all the bandwidths and then we divide by the device bandwidth to find out how much data we should send over each interconnect (that's really close to what happens there).
That's precisely how I would've done it, and it makes perfect sense. Since
it's the relative ratio that matters and not the absolute value, why
then did my openib+gm test fail to deliver better bandwidth than
openib alone? I had bandwidth values set for both of those btls.
The expected behavior in my case would be to send roughly 1/4 of the
data (250/1050) across gm and 3/4 (800/1050) across openib. My hunch
is that something other than incorrect absolute bandwidth values is
preventing message striping here...

Thanks a lot for your feedback on this one. It gave me good pointers
to follow. Please do let me know if you can think of anything else
that I need to check.

Sincerely,
Alex.

On 2/8/07, George Bosilca <bosi...@cs.utk.edu> wrote:
In order to get any performance improvement from striping the
messages over multiple interconnects, one has to specify the latency
and bandwidth for these interconnects, and make sure that none of them
asks for exclusivity. I'm usually running over multiple TCP
interconnects, and here is my mca-params.conf file:
btl_tcp_if_include = eth0,eth1
btl_tcp_max_rdma_size = 524288

btl_tcp_latency_eth0 = 47
btl_tcp_bandwidth_eth0 = 587

btl_tcp_latency_eth1 = 51
btl_tcp_bandwidth_eth1 = 233

Something similar has to be done for openib and gm, in order to allow
us to stripe the messages correctly.
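A sketch of what I mean, reusing the btl_openib_bandwidth and
btl_gm_bandwidth parameters (the values below are just the component
defaults, so adjust them to what your cards actually deliver):

btl_openib_bandwidth = 800
btl_gm_bandwidth = 250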

   Thanks,
     george.

On Feb 8, 2007, at 12:02 PM, Alex Tumanov wrote:

Hello Jeff. Thanks for pointing out NetPipe to me. I've played around
with it a little in the hope of seeing clear evidence of message
striping in Open MPI. Unfortunately, what I saw is that the result of
running NPmpi over several interconnects is identical to running it
over the single fastest one :-( That was not the expected behavior,
and I'm hoping that I'm doing something wrong. I'm using NetPIPE_3.6.2
over OMPI 1.1.4. NetPipe was compiled by making sure Open MPI's mpicc
can be found and simply running 'make mpi' in the NetPIPE_3.6.2
directory.

I experimented with 3 interconnects: openib, gm, and gig-e.
Specifically, I found that the times (and, correspondingly, the
bandwidth) reported for openib+gm are pretty much identical to the
times reported for just openib. Here are the commands I used to
initiate the benchmark:

#  mpirun -H f0-0,c0-0 --prefix $MPIHOME --mca btl openib,gm,self ~/NPmpi > ~/testdir/ompi/netpipe/ompi_netpipe_openib+gm.log 2>&1
#  mpirun -H f0-0,c0-0 --prefix $MPIHOME --mca btl openib,self ~/NPmpi > ~/testdir/ompi/netpipe/ompi_netpipe_openib.log 2>&1

Similarly, for tcp+gm the reported times were identical to just
running the benchmark over gm alone. The commands were:
# mpirun -H f0-0,c0-0 --prefix $MPIHOME --mca btl tcp,gm,self --mca btl_tcp_if_exclude lo,ib0,ib1 ~/NPmpi
# mpirun -H f0-0,c0-0 --prefix $MPIHOME --mca btl gm,self ~/NPmpi

Orthogonally, I've also observed that trying to use any combination of
interconnects that includes openib (except using it exclusively) fails
as soon as the benchmark reaches trials with 1.5 MB message sizes: the
CPU load remained at 100% on the headnode, but no further output was
sent to the log file or the screen (see the tails below). This
behavior is fairly consistent and may be of interest to the Open MPI
development community. If anybody has tried using openib in
combination with other interconnects, please let me know what issues
you've encountered and what tips and tricks you could share in this
regard.

Many thanks. Keep up the good work!

Sincerely,
Alex.

