Re: [OMPI users] multiple LIDs

2006-12-06 Thread Chevchenkovic Chevchenkovic

Hi,
 Actually, I was wondering why there is a facility for having multiple
LIDs for the same port. This led me to the entire series of questions.
  It is still not very clear to me what the advantage of assigning
multiple LIDs to the same port is. Does it give some performance
advantage?
-Chev


On 12/5/06, Jeff Squyres  wrote:

There are two distinct layers of software being discussed here:

- the PML (basically the back-end to MPI_SEND and friends)
- the BTL (byte transfer layer, the back-end bit movers for the ob1
and dr PMLs -- this distinction is important because there is nothing
in the PML design that forces the use of BTLs; indeed, there is at
least one current PML that does not use BTLs as the back-end bit
mover [the cm PML])

The ob1 and dr PMLs know nothing about how the back-end bit movers
work (BTL components) -- the BTLs are given considerable freedom to
operate within their specific interface contracts.

Generally, ob1/dr queries each BTL component when Open MPI starts
up.  Each BTL responds with whether it wants to run or not.  If it
does, it gives back one or more modules (think of a module as an
"instance" of a component).  Typically, these modules correspond to
individual NICs / HCAs / network endpoints.  For example, if you have
2 ethernet cards, the tcp BTL will create and return 2 modules.  ob1/
dr will treat these as two paths to send data (reachability is
computed as well, of course -- ob1/dr will only send data down BTLs
for which the target peer is reachable).  In general, ob1/dr will
round-robin across all available BTL modules when sending large
messages (as Gleb has described).  See http://www.open-mpi.org/papers/
euro-pvmmpi-2006-hpc-protocols/ for a general description of the ob1/
dr protocols.
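
For concreteness, both layers are selected at run time via MCA
parameters on the mpirun command line.  A minimal sketch (component
names as in the 1.x series; "self" is the loopback BTL and ./a.out
stands in for your MPI program):

  # use the ob1 PML, restricted to the tcp and self BTLs
  mpirun --mca pml ob1 --mca btl tcp,self -np 4 ./a.out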

The openib BTL can return multiple modules if multiple LIDs are
available.  So ob1/dr doesn't know that these are not separate
physical devices -- it just treats each module as an equivalent
mechanism to send data.
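
To see exactly what the openib BTL exposes on a given build, its MCA
parameters can be listed with ompi_info; parameter names vary between
releases, so treat any LID-related knobs you find as release-specific:

  # list the openib BTL's MCA parameters
  ompi_info --param btl openib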

This is actually somewhat lame as a scheme, and we talked internally
about doing something more intelligent.  But we decided to hold off
and let people (like you!) with real-world apps and networks give
this stuff a try and see what really works (and what doesn't work)
before trying to implement anything else.

So -- all that explanation aside -- we'd love to hear your feedback
with regards to the multi-LID stuff in Open MPI.  :-)



On Dec 4, 2006, at 1:27 PM, Chevchenkovic Chevchenkovic wrote:

>  Thanks for that.
>
>  Suppose, if there are multiple interconnects, say ethernet and
> infiniband, and a million bytes of data are to be sent, then in this
> case the data will be sent through infiniband (since it's the fast
> path ... please correct me here if I'm wrong).
>
>   If there are multiple such sends, do you mean to say that each send
> will go through different BTLs in an RR manner if they are connected
> to the same port?
>
>  -chev
>
>
> On 12/4/06, Gleb Natapov  wrote:
>> On Mon, Dec 04, 2006 at 10:53:26PM +0530, Chevchenkovic
>> Chevchenkovic wrote:
>>> Hi,
>>>  It is not clear from the code you mentioned, in
>>> ompi/mca/pml/ob1/, where exactly the selection of the BTL bound to
>>> a particular LID occurs. Could you please specify the file/function
>>> name for the same?
>> There is no such code there. OB1 knows nothing about LIDs. It does RR
>> over all available interconnects. It can do RR between ethernet, IB,
>> and Myrinet, for instance. The BTL presents each LID as a different
>> virtual HCA to OB1, and OB1 does round-robin between them without
>> even knowing that they are the same port of the same HCA.
>>
>> Can you explain what you are trying to achieve?
>>
>>>  -chev
>>>
>>>
>>> On 12/4/06, Gleb Natapov  wrote:
 On Mon, Dec 04, 2006 at 01:07:08AM +0530, Chevchenkovic
 Chevchenkovic wrote:
> Also, could you please tell me which part of the openMPI code needs
> to be touched so that I can make some modifications to it to
> incorporate changes regarding LID selection...
>
 It depends on what you want to do. The part that does RR over all
 available LIDs is in the OB1 PML (ompi/mca/pml/ob1/), but the code
 isn't aware of the fact that it is doing RR over different LIDs
 rather than different NICs (yet?).

 The code that controls which LIDs will be used is in
 ompi/mca/btl/openib/btl_openib_component.c.

> On 12/4/06, Chevchenkovic Chevchenkovic wrote:
>> Is it possible to control the LID where the sends and recvs are
>> posted, on either end?
>>
>> On 12/3/06, Gleb Natapov wrote:
>>> On Sun, Dec 03, 2006 at 07:03:33PM +0530, Chevchenkovic
>>> Chevchenkovic wrote:
 Hi,
  I had this query. I hope some expert replies to it.
 I have 2 nodes connected point-to-point using an infiniband
 cable. There are multiple LIDs for each of the end node ports.
 When I give an MPI_Send, are the sends posted on

Re: [OMPI users] multiple LIDs

2006-12-06 Thread Gleb Natapov
On Wed, Dec 06, 2006 at 12:14:35PM +0530, Chevchenkovic Chevchenkovic wrote:
> Hi,
>   Actually, I was wondering why there is a facility for having multiple
> LIDs for the same port. This led me to the entire series of questions.
>   It is still not very clear to me what the advantage of assigning
> multiple LIDs to the same port is. Does it give some performance
> advantage?
Each LID has its own path through the fabric (ideally); this is the
way to lower congestion.
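
(Background: on InfiniBand this is the LMC -- LID Mask Control --
feature: a port with LMC = n answers to 2^n consecutive LIDs, and the
subnet manager may route each of those LIDs along a different path.  A
sketch, assuming OpenSM is the subnet manager; the option spelling
varies by version:

  # give every port 2^2 = 4 LIDs, i.e. up to 4 distinct routes per port
  opensm -l 2
)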


Re: [OMPI users] multiple LIDs

2006-12-06 Thread Jeff Squyres

On Dec 6, 2006, at 2:05 AM, Gleb Natapov wrote:

>>   Actually, I was wondering why there is a facility for having
>> multiple LIDs for the same port. This led me to the entire series of
>> questions. It is still not very clear to me what the advantage of
>> assigning multiple LIDs to the same port is. Does it give some
>> performance advantage?
>
> Each LID has its own path through the fabric (ideally); this is the
> way to lower congestion.


More specifically, multi-LID support is best for networks where there
are multiple paths between the same pair of peers (e.g., a fat tree
network with multiple core switches).  If you don't have a multi-route
topology on your IB network, multi-LID support from the same port is
not likely to be useful.


To answer your specific question: multi-LID support from a single  
port is not a bandwidth multiplier because you're still sending and  
receiving from a single port on the host (which has a fixed bandwidth  
capability).


--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems



Re: [OMPI users] OpenMPI for 32/64 bit IB drivers

2006-12-06 Thread Jeff Squyres

On Dec 5, 2006, at 7:12 PM, Aaron McDonough wrote:


> We have a mix of i686 and x86_64 SLES9 nodes, some with IB interfaces
> and some without. Ideally, we want users to be able to run the same
> binary on any node. Can I build a common OpenMPI for both platforms
> that will work with either 32 or 64 bit IB drivers (Topspin)? Or do I
> have to use all 32-bit IB drivers? Any advice is appreciated.


A few things:

1. OMPI's heterogeneity support in the 1.1 series has some known
issues (e.g., mixing 32 and 64 bit executables in a single
MPI_COMM_WORLD).  The development head (and therefore the upcoming
1.2 series) has many bug fixes in this area (but is still not perfect
-- there are a few open tickets about heterogeneity that are actively
being worked on).


2. In theory, using OMPI with 32 bit IB support in the same  
MPI_COMM_WORLD with OMPI with 64 bit IB support *should* be ok, but I  
don't know if anyone has tested this configuration (and subject to  
the constraints of #1).  Brad/IBM -- have you guys done so?


3. Depending on your needs, Cisco is recommending moving away from
the Topspin drivers and to the OFED IB driver stack for HPC
clusters.  The VAPI (i.e., Topspin drivers) support in Open MPI is
pretty static; we'll do critical bug fixes for it, but no new
features, and very little functionality testing is occurring there.
All new work is being done on the OpenIB (a.k.a. OpenFabrics a.k.a.
OFED) drivers for Open MPI.
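
If the goal is a single binary that runs on every node, the usual
approach is a 32-bit build that the x86_64 nodes can also execute.  A
minimal sketch, not a tested recipe -- the prefix and the Topspin path
are illustrative, and 32-bit IB libraries must be installed on the
x86_64 nodes too:

  # 32-bit Open MPI build usable on both i686 and x86_64 SLES9 nodes
  ./configure --prefix=/opt/openmpi32 \
      CFLAGS=-m32 CXXFLAGS=-m32 FFLAGS=-m32 \
      --with-mvapi=/usr/local/topspin
  make all install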


--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems



Re: [OMPI users] running with the dr pml.

2006-12-06 Thread Brock Palen



>> Are there any gotchas in using the dr pml?
>> Also, if the dr pml is finding errors and is resending data, can I
>> have it tell me when that happens?  Like a verbose mode?
>
> Unfortunately you will need to update the source and recompile; try
> updating this file: topdir/ompi/mca/pml/dr/pml_dr.h:245:
>   #define MCA_PML_DR_DEBUG_LEVEL -1
> and change MCA_PML_DR_DEBUG_LEVEL to 0.
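
That edit amounts to the following one-liner (a sketch -- the line
number 245 comes from the snippet above and may drift between versions;
on BSD/OS X sed, use -i '' instead of -i):

  # flip the dr PML's compile-time debug level from -1 (off) to 0
  sed -i 's/#define MCA_PML_DR_DEBUG_LEVEL -1/#define MCA_PML_DR_DEBUG_LEVEL 0/' \
      ompi/mca/pml/dr/pml_dr.h
  # ...then rebuild and reinstall Open MPI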

Well, I did this, to no avail; I still get nothing on STDOUT. I do get
the usual OMPI messages when I kill it, though.
Also, if I let it run long enough that it should complete, it does not
finish.

Should I try anything else before filing a bug for 1.2b1 dr on OSX PPC?
Try a nightly snapshot?
Brock




>> I also tried the following (just praying) that it might work:
>>
>> $ mpirun --mca pml dr --mca btl ^gm -np 4 ./xhpl
>
> You are telling Open MPI not to use GM in this case...
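
(For reference: BTL selection is either inclusive or exclusive, and the
two forms cannot be mixed; "self" is the loopback BTL:

  mpirun --mca btl gm,self -np 4 ./xhpl   # use ONLY the listed BTLs
  mpirun --mca btl ^gm -np 4 ./xhpl       # use everything EXCEPT gm
)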


I still get no output to the screen.

This is on a G5 Xserve, with 1.2b1 OMPI.

Brock Palen
Center for Advanced Computing
bro...@umich.edu
(734)936-1985








Re: [OMPI users] running with the dr pml.

2006-12-06 Thread Galen Shipman


> The problem is that, when running HPL, he sees failed residuals. When
> running HPL under MPICH-GM, he does not.
>
> I have tried running HPCC (HPL plus other benchmarks) using OMPI with
> GM on 32-bit Xeons and 64-bit Opterons. I do not see any failed
> residuals. I am trying to get access to a couple of OSX machines to
> replicate Brock's setup.



I wonder if we can narrow this down a bit to perhaps a PML protocol issue.

Start by disabling RDMA by using:
-mca btl_gm_flags 1
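
(Why this helps: in the 1.x source the BTL "flags" value is a bitmask --
1 = send/recv, 2 = RDMA put, 4 = RDMA get; treat the exact bit values as
an assumption to verify against your tree.  Setting the GM BTL's flags
to 1 keeps only send/recv, so no RDMA is attempted over GM:

  mpirun --mca pml dr --mca btl_gm_flags 1 -np 4 ./xhpl
)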

Let's see if that helps things out at all.

- Galen



Scott
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] running with the dr pml.

2006-12-06 Thread Brock Palen


> I wonder if we can narrow this down a bit to perhaps a PML protocol
> issue.
>
> Start by disabling RDMA by using:
>   -mca btl_gm_flags 1


This helps some: I at least now see the start-up of HPL, but I never
get a single pass; output ends at:

- Computational tests pass if scaled residuals are less than 16.0

On the other hand, with OB1, using btl_gm_flags 1 fixed the error
problem with OMPI!  Which is a great first step.

mpirun -np 4 --mca btl_gm_flags 1 ./xhpl

allowed HPL to run with no errors.  I verified the performance was
better than when run without gm (i.e., with --mca btl ^gm added).

So there is still a problem with DR, which I don't need, but I'm
willing to help test it.

Scott,

Can we look into why leaving RDMA on is causing a problem?

Brock








Re: [OMPI users] running with the dr pml.

2006-12-06 Thread Scott Atchley

On Dec 6, 2006, at 2:29 PM, Brock Palen wrote:



> On the other hand, with OB1, using btl_gm_flags 1 fixed the error
> problem with OMPI!  Which is a great first step.
> [...]
> So there is still a problem with DR, which I don't need, but I'm
> willing to help test it.
>
> Can we look into why leaving RDMA on is causing a problem?


Brock and Galen,

We are willing to assist. Our best guess is that OMPI is using the GM
API in a way different from how MPICH-GM does. One of our other
developers, who is more comfortable with the GM API, is looking into it.


Testing with HPCC, in addition to the HPL failed residuals, I am also
seeing these messages:

[3]: ERROR: from right: expected 2 and 3 as first and last byte, but
got 2 and 5 instead
[3]: ERROR: from right: expected 3 and 4 as first and last byte, but
got 3 and 7 instead
[1]: ERROR: from right: expected 4 and 5 as first and last byte, but
got 4 and 3 instead
[1]: ERROR: from right: expected 7 and 8 as first and last byte, but
got 7 and 5 instead

which come from $HPCC/src/bench_lat_bw_1.5.2.c.

Scott


Re: [OMPI users] OpenMPE build failure

2006-12-06 Thread Ryan Thompson

Hi Anthony,

I made some progress; however, I still get the same trace_API.h
error, although I'm not certain if it is important.

It appears that the binaries are built regardless, and the
installcheck-all appears to pass on all tests.

As requested, I've attached a gzip'd tarball of the output of my
configure, make, make install, & make installcheck-all.

Also, here are my configure arguments, as they appear in my
'do-configure' shell script...


# MPE do.sh

JAVA="/opt/sun-jdk-1.5.0.08"
MPERUN="/opt/openmpi/bin/mpiexec -n 4"

./configure --prefix=/opt/openmpi \
--enable-logging=yes \
--disable-f77 \
--enable-slog2=build \
--enable-collchk=no \
--enable-graphics=no \
--with-mpicc="/opt/openmpi/bin/mpicc" \
--with-mpiinc="-I/opt/openmpi/include" \
--with-mpilibs="-L/opt/openmpi/lib" \
--with-java=$JAVA

Thanks,
Ryan



[Attachment: mpe_build.tgz]



--
Ryan Thompson
HPC & Systems Admin
Zymeworks, Inc.
r...@zymeworks.com


On 5-Dec-06, at 2:37 PM, Anthony Chan wrote:




On Tue, 5 Dec 2006, Ryan Thompson wrote:


> I'm attempting to build MPE without success. When I try to make it, I
> receive the error:
>
> trace_input.c:23:23: error: trace_API.h: No such file or directory


I just built the related mpe2 subpackage, slog2sdk, on an AMD64
(Ubuntu 6.06.1) machine with gcc-4.0, and I don't see the strange
errors that you observed...  I put the latest mpe2 on our ftp server:

ftp://ftp.mcs.anl.gov/pub/mpi/mpe/beta/mpe2-1.0.5b2.tar.gz

which contains various bugfixes over mpe2-1.0.4. I have tested
mpe2-1.0.5b2 with openmpi-1.1.2 on an IA32 linux box; everything seems
to be working fine.
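
(To try that tarball, something like the following should work -- a
sketch, assuming wget is available and that the tarball unpacks into a
directory named after itself:

  wget ftp://ftp.mcs.anl.gov/pub/mpi/mpe/beta/mpe2-1.0.5b2.tar.gz
  tar xzf mpe2-1.0.5b2.tar.gz
  cd mpe2-1.0.5b2
)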


Where is this file supposed to come from?

Here are my configure arguments...

JAVA="/opt/sun-jdk-1.5.0.08"

./configure --prefix=/opt/mpe \
--sharedstatedir=/var/mpe \
--localstatedir=/com/mpe \
--enable-misc=yes \
--enable-logging=yes \
--enable-f77=no \
--enable-wrappers=yes \
--enable-slog2=build \
--enable-collchk=no \
--enable-graphics=no \
--with-mpicc="/opt/openmpi/bin/mpicc" \
--with-mpiinc="-I/opt/openmpi/include" \
--with-mpilibs="-L/opt/openmpi/lib" \
--includedir=$JAVA/include \
--with-java=$JAVA



mpe2 does not use sharedstatedir or localstatedir, so you don't need
to specify --sharedstatedir and --localstatedir.  The only configure
option I see a problem with is --includedir=$JAVA/include, which will
force mpe2 to install the mpe include files to
/opt/sun-jdk-1.5.0.08/include.  I believe it is a mistake.

FYI: here is my configure command to build mpe2 for openmpi:

mkdir <build-dir>
cd <build-dir>
<mpe2-src-dir>/configure CC=<C-compiler> \
                         F77=<Fortran-77-compiler> \
                         MPI_CC=<MPI-install-dir>/bin/mpicc \
                         MPI_F77=<MPI-install-dir>/bin/mpif90 \
                         --with-java=/opt/sun-jdk-1.5.0.08 \
                         --prefix=/opt/mpe \
                         MPERUN="<MPI-install-dir>/bin/mpiexec -n 4"

make
make install
make installcheck-all

If you don't need fortran support, don't use F77 and MPI_F77, and add
--disable-f77.  The configure option MPERUN="..." and
"make installcheck-all" enable a series of tests for the typical
features of MPE2.  Let me know if you see any problem.  If you do, send
me the configure output as seen on your screen (not config.log), i.e.
c.txt from the following command.

With a csh-like shell:
configure ... |& tee c.txt

or

With a bourne-like shell:
configure ... 2>&1 | tee c.txt

A.Chan





Re: [OMPI users] OpenMPE build failure

2006-12-06 Thread Anthony Chan


On Wed, 6 Dec 2006, Ryan Thompson wrote:

> Hi Anthony,
>
> I made some progress, however, I still get the same trace_API.h
> error, although I'm not certain if it is important.

trace_sample is a sample TRACE-API implementation for SLOG2, e.g., for
people who write their own trace format and want to generate their own
SLOG2 files.  trace_rlog is MPICH2's internal logging format (standalone
MPE does not distinguish which MPI implementation you are using).  If you
don't need them, add --disable-rlog and --disable-sample to your
configure command.
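
With Ryan's do-configure script from above, that would look something
like this (a sketch; only the last two flags are new relative to his
script):

  ./configure --prefix=/opt/openmpi \
       --enable-logging=yes \
       --disable-f77 \
       --enable-slog2=build \
       --enable-collchk=no \
       --enable-graphics=no \
       --with-mpicc="/opt/openmpi/bin/mpicc" \
       --with-mpiinc="-I/opt/openmpi/include" \
       --with-mpilibs="-L/opt/openmpi/lib" \
       --with-java=$JAVA \
       --disable-rlog --disable-sample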

> It appears that the binaries are built regardless and the
> installcheck-all appears to pass on all tests.

I don't see any errors in any of the output files that you sent me.
Everything appears to finish successfully.  Did you pipe both standard
output and standard error to the files?


A.Chan