Re: [OMPI users] multiple LIDs
Hi,

Actually I was wondering why there is a facility for having multiple LIDs for the same port. This led me to the entire series of questions. It is still not very clear to me what the advantage of assigning multiple LIDs to the same port is. Does it give some performance advantage?

-Chev

On 12/5/06, Jeff Squyres wrote:
> There are two distinct layers of software being discussed here:
>
>  - the PML (basically the back-end to MPI_SEND and friends)
>  - the BTL (byte transfer layer, the back-end bit movers for the ob1 and dr PMLs -- this distinction is important because there is nothing in the PML design that forces the use of BTLs; indeed, there is at least one current PML that does not use BTLs as the back-end bit mover [the cm PML])
>
> The ob1 and dr PMLs know nothing about how the back-end bit movers (the BTL components) work -- the BTLs are given considerable freedom to operate within their specific interface contracts.
>
> Generally, ob1/dr queries each BTL component when Open MPI starts up. Each BTL responds with whether it wants to run or not. If it does, it gives back one or more modules (think of a module as an "instance" of a component). Typically, these modules correspond to multiple NICs / HCAs / network endpoints. For example, if you have 2 ethernet cards, the tcp BTL will create and return 2 modules. ob1/dr will treat these as two paths to send data (reachability is computed as well, of course -- ob1/dr will only send data down BTLs for which the target peer is reachable). In general, ob1/dr will round-robin across all available BTL modules when sending large messages (as Gleb has described). See http://www.open-mpi.org/papers/euro-pvmmpi-2006-hpc-protocols/ for a general description of the ob1/dr protocols.
>
> The openib BTL can return multiple modules if multiple LIDs are available, so ob1/dr doesn't know that these are not separate physical devices -- it just treats each module as an equivalent mechanism to send data.
>
> This is actually somewhat lame as a scheme, and we talked internally about doing something more intelligent. But we decided to hold off and let people (like you!) with real-world apps and networks give this stuff a try and see what really works (and what doesn't work) before trying to implement anything else.
>
> So -- all that explanation aside -- we'd love to hear your feedback with regards to the multi-LID stuff in Open MPI. :-)
>
> On Dec 4, 2006, at 1:27 PM, Chevchenkovic Chevchenkovic wrote:
> > Thanks for that.
> >
> > Suppose there are multiple interconnects, say ethernet and infiniband, and a million bytes of data are to be sent; in this case the data will be sent through infiniband (since it is the fast path -- please correct me here if I am wrong).
> >
> > If there are multiple such sends, do you mean to say that each send will go through a different BTL in a round-robin manner if they are connected to the same port?
> >
> > -chev
> >
> > On 12/4/06, Gleb Natapov wrote:
> > > On Mon, Dec 04, 2006 at 10:53:26PM +0530, Chevchenkovic Chevchenkovic wrote:
> > > > Hi,
> > > > It is not clear from the code in ompi/mca/pml/ob1/, as mentioned by you, where exactly the selection of the BTL bound to a particular LID occurs. Could you please specify the file/function name for the same?
> > > There is no such code there. OB1 knows nothing about LIDs. It does round-robin over all available interconnects; it can do round-robin between ethernet, IB and Myrinet, for instance. The BTL presents each LID as a different virtual HCA to OB1, and OB1 round-robins between them without even knowing that they are the same port of the same HCA.
> > >
> > > Can you explain what you are trying to achieve?
> > >
> > > > -chev
> > > >
> > > > On 12/4/06, Gleb Natapov wrote:
> > > > > On Mon, Dec 04, 2006 at 01:07:08AM +0530, Chevchenkovic Chevchenkovic wrote:
> > > > > > Also could you please tell me which part of the Open MPI code needs to be touched so that I can do some modifications in it to incorporate changes regarding LID selection...
> > > > > It depends on what you want to do. The part that does round-robin over all available LIDs is in the OB1 PML (ompi/mca/pml/ob1/), but that code is not aware of the fact that it is doing round-robin over different LIDs rather than different NICs (yet?). The code that controls which LIDs will be used is in ompi/mca/btl/openib/btl_openib_component.c.
> > > > > > On 12/4/06, Chevchenkovic Chevchenkovic wrote:
> > > > > > > Is it possible to control the LID where the sends and recvs are posted, on either end?
> > > > > > >
> > > > > > > On 12/3/06, Gleb Natapov wrote:
> > > > > > > > On Sun, Dec 03, 2006 at 07:03:33PM +0530, Chevchenkovic Chevchenkovic wrote:
> > > > > > > > > Hi, I had this query. I hope some expert replies to it. I have 2 nodes connected point-to-point using an infiniband cable. There are multiple LIDs for each of the end node ports. When I give an MPI_Send, are the sends posted on ...
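For reference, the PML/BTL selection Jeff describes can also be forced explicitly from the command line through MCA parameters. A minimal sketch (the process count and the ./a.out application name are placeholders; it assumes an Open MPI build that includes the openib BTL):

  # Force the ob1 PML and restrict the BTLs to openib (plus self for loopback).
  # ob1 then round-robins large messages across whatever openib modules the
  # BTL reports back -- one module per usable port/LID on the HCA.
  mpirun -np 4 --mca pml ob1 --mca btl openib,self ./a.out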
Re: [OMPI users] multiple LIDs
On Wed, Dec 06, 2006 at 12:14:35PM +0530, Chevchenkovic Chevchenkovic wrote:
> Hi,
> Actually I was wondering why there is a facility for having multiple LIDs for the same port. This led me to the entire series of questions. It is still not very clear to me what the advantage of assigning multiple LIDs to the same port is. Does it give some performance advantages?

Each LID has its own path through the fabric (ideally); this is a way to lower congestion.
Re: [OMPI users] multiple LIDs
On Dec 6, 2006, at 2:05 AM, Gleb Natapov wrote:
> > Actually I was wondering why there is a facility for having multiple LIDs for the same port. This led me to the entire series of questions. It is still not very clear to me what the advantage of assigning multiple LIDs to the same port is. Does it give some performance advantages?
> Each LID has its own path through the fabric (ideally); this is a way to lower congestion.

More specifically, multi-LID support is best for networks where there are multiple paths between the same pair of peers (e.g., a fat tree network with multiple core switches). If you don't have a multi-route topology on your IB network, multi-LID support from the same port is not likely to be useful.

To answer your specific question: multi-LID support from a single port is not a bandwidth multiplier, because you're still sending and receiving from a single port on the host (which has a fixed bandwidth capability).

--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems
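As background on where the extra LIDs come from: the subnet manager assigns each port a base LID and an LMC (LID Mask Control) value, and the port then answers to 2^LMC consecutive LIDs, each of which the subnet manager can route differently through the fabric. A rough way to check this on a node, assuming an OFED-style ibstat utility is installed (field names can differ between driver stacks):

  # Show the base LID and LMC of the local HCA ports.
  ibstat | grep -E 'Base lid|LMC'
  # e.g. "Base lid: 12" together with "LMC: 2" means the port owns LIDs 12..15
  # (2^2 = 4 LIDs), which the openib BTL can expose as separate modules.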
Re: [OMPI users] OpenMPI for 32/64 bit IB drivers
On Dec 5, 2006, at 7:12 PM, Aaron McDonough wrote:
> We have a mix of i686 and x86_64 SLES9 nodes, some with IB interfaces and some without. Ideally, we want users to be able to run the same binary on any node. Can I build a common OpenMPI for both platforms that will work with either 32 or 64 bit IB drivers (Topspin)? Or do I have to use all 32bit IB drivers? Any advice is appreciated.

A few things:

1. OMPI's heterogeneity support in the 1.1 series has some known issues (e.g., mixing 32 and 64 bit executables in a single MPI_COMM_WORLD). The development head (and therefore the upcoming 1.2 series) has many bug fixes in this area (but is still not perfect -- there are a few open tickets about heterogeneity that are actively being worked on).

2. In theory, using OMPI with 32 bit IB support in the same MPI_COMM_WORLD as OMPI with 64 bit IB support *should* be ok, but I don't know if anyone has tested this configuration (and it is subject to the constraints of #1). Brad/IBM -- have you guys done so?

3. Depending on your needs, Cisco is recommending moving away from the Topspin drivers and to the OFED IB driver stack for HPC clusters. The VAPI (i.e., Topspin drivers) support in Open MPI is pretty static; we'll do critical bug fixes for it, but no new features, and very little functionality testing is occurring there. All new work is being done on the OpenIB (a.k.a. OpenFabrics a.k.a. OFED) drivers for Open MPI.

--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems
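If the goal is a single binary that runs on every node, one approach worth considering (separate from the heterogeneity work discussed above) is to build Open MPI and the applications 32-bit everywhere, since the x86_64 nodes can execute i686 binaries. A rough configure sketch, assuming GNU-style compilers and 32-bit IB libraries present on the 64-bit nodes; the install prefix is a placeholder:

  # Build a 32-bit Open MPI, even on the x86_64 nodes.
  ./configure --prefix=/opt/openmpi-32bit \
      CFLAGS=-m32 CXXFLAGS=-m32 FFLAGS=-m32 FCFLAGS=-m32
  make all install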
Re: [OMPI users] running with the dr pml.
Are there any gotchas on using the dr pml? Also, if the dr pml is finding errors and is resending data, can I have it tell me when that happens? Like a verbose mode?

Unfortunately you will need to update the source and recompile. Try updating this file:

  topdir/ompi/mca/pml/dr/pml_dr.h:245: #define MCA_PML_DR_DEBUG_LEVEL -1

and change MCA_PML_DR_DEBUG_LEVEL to 0.

Well, I did this, to no avail; I still get nothing on STDOUT. I do get the usual OMPI messages when I kill it, though. Also, if I let it run long enough that it should complete, it does not finish. Should I try anything else before filing a bug for 1.2b1 dr on OSX PPC?

Try a nightly snapshot?

Brock

I also tried the following (just praying that it might work):

  $ mpirun --mca pml dr --mca btl ^gm -np 4 ./xhpl

You are telling Open MPI not to use GM in this case...

I still get no output to the screen. This is on a G5 Xserve, with 1.2b1 OMPI.

Brock Palen
Center for Advanced Computing
bro...@umich.edu
(734) 936-1985
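For reference, the debug-level change mentioned above amounts to editing a single #define and rebuilding. A minimal sketch, run from the top of the Open MPI source tree and assuming perl is available (editing pml_dr.h by hand works just as well):

  # Raise the dr PML debug level from -1 to 0, then rebuild and reinstall.
  perl -pi -e 's/#define MCA_PML_DR_DEBUG_LEVEL -1/#define MCA_PML_DR_DEBUG_LEVEL 0/' \
      ompi/mca/pml/dr/pml_dr.h
  make all install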
Re: [OMPI users] running with the dr pml.
The problem is that, when running HPL, he sees failed residuals. When running HPL under MPICH-GM, he does not.

I have tried running HPCC (HPL plus other benchmarks) using OMPI with GM on 32-bit Xeons and 64-bit Opterons. I do not see any failed residuals. I am trying to get access to a couple of OSX machines to replicate Brock's setup.

I wonder if we can narrow this down a bit to perhaps a PML protocol issue. Start by disabling RDMA by using:

  -mca btl_gm_flags 1

Let's see if that helps things out at all.

- Galen

Scott
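Spelled out as a full command line for the dr test (the process count and the ./xhpl binary are taken from Brock's earlier runs; btl_gm_flags is a bitmask of BTL capability flags, and a value of 1 leaves only send/receive enabled, i.e. no RDMA put/get):

  # Run HPL over the dr PML with GM RDMA disabled.
  mpirun -np 4 --mca pml dr --mca btl_gm_flags 1 ./xhpl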
Re: [OMPI users] running with the dr pml.
> I wonder if we can narrow this down a bit to perhaps a PML protocol issue. Start by disabling RDMA by using: -mca btl_gm_flags 1

This helps some; I at least now see the start-up of HPL, but I never get a single pass. Output ends at:

  - Computational tests pass if scaled residuals are less than 16.0

On the other hand, with OB1, using btl_gm_flags 1 fixed the error problem with OMPI! Which is a great first step.

  mpirun -np 4 --mca btl_gm_flags 1 ./xhpl

allowed HPL to run with no errors. I verified the performance was better than when run without GM (adding --mca btl ^gm). So there is still a problem with DR, which I don't need, but I'm willing to help test it.

Scott, can we look into why leaving RDMA on is causing a problem?

Brock
Re: [OMPI users] running with the dr pml.
On Dec 6, 2006, at 2:29 PM, Brock Palen wrote:
> > I wonder if we can narrow this down a bit to perhaps a PML protocol issue. Start by disabling RDMA by using: -mca btl_gm_flags 1
> On the other hand, with OB1, using btl_gm_flags 1 fixed the error problem with OMPI! Which is a great first step. "mpirun -np 4 --mca btl_gm_flags 1 ./xhpl" allowed HPL to run with no errors. I verified the performance was better than when run without GM (adding --mca btl ^gm). So there is still a problem with DR, which I don't need, but I'm willing to help test it.
>
> Scott, can we look into why leaving RDMA on is causing a problem?
>
> Brock

Brock and Galen,

We are willing to assist. Our best guess is that OMPI is using the code in a way different than MPICH-GM does. One of our other developers, who is more comfortable with the GM API, is looking into it.

Testing with HPCC, in addition to the HPL failed residuals, I am also seeing these messages:

  [3]: ERROR: from right: expected 2 and 3 as first and last byte, but got 2 and 5 instead
  [3]: ERROR: from right: expected 3 and 4 as first and last byte, but got 3 and 7 instead
  [1]: ERROR: from right: expected 4 and 5 as first and last byte, but got 4 and 3 instead
  [1]: ERROR: from right: expected 7 and 8 as first and last byte, but got 7 and 5 instead

which come from $HPCC/src/bench_lat_bw_1.5.2.c.

Scott
Re: [OMPI users] OpenMPE build failure
Hi Anthony,

I made some progress; however, I still get the same trace_API.h error, although I'm not certain if it is important. It appears that the binaries are built regardless, and installcheck-all appears to pass all tests.

As requested, I've attached a gzip'd tarball of my configure, make, make install, and make installcheck-all output.

Also, here are my configure arguments, as they appear in my 'do-configure' shell script...

  # MPE do.sh

  JAVA="/opt/sun-jdk-1.5.0.08"
  MPERUN="/opt/openmpi/bin/mpiexec -n 4"

  ./configure --prefix=/opt/openmpi \
      --enable-logging=yes \
      --disable-f77 \
      --enable-slog2=build \
      --enable-collchk=no \
      --enable-graphics=no \
      --with-mpicc="/opt/openmpi/bin/mpicc" \
      --with-mpiinc="-I/opt/openmpi/include" \
      --with-mpilibs="-L/opt/openmpi/lib" \
      --with-java=$JAVA

Thanks,
Ryan

[attachment: mpe_build.tgz]

--
Ryan Thompson
HPC & Systems Admin
Zymeworks, Inc.
r...@zymeworks.com

On 5-Dec-06, at 2:37 PM, Anthony Chan wrote:
> On Tue, 5 Dec 2006, Ryan Thompson wrote:
> > I'm attempting to build MPE without success. When I try to make it, I receive the error:
> >   trace_input.c:23:23: error: trace_API.h: No such file or directory
>
> I just built the related mpe2 subpackage, slog2sdk, on an AMD64 (Ubuntu 6.06.1) with gcc-4.0 and I don't see the strange errors that you observed... I put the latest mpe2 on our ftp server: ftp://ftp.mcs.anl.gov/pub/mpi/mpe/beta/mpe2-1.0.5b2.tar.gz which contains various bugfixes over mpe2-1.0.4. I have tested mpe2-1.0.5b2 with openmpi-1.1.2 on an IA32 linux box, and everything seems to be working fine.
>
> > Where is this file supposed to come from? Here are my configure arguments...
> >   JAVA="/opt/sun-jdk-1.5.0.08"
> >   ./configure --prefix=/opt/mpe \
> >       --sharedstatedir=/var/mpe \
> >       --localstatedir=/com/mpe \
> >       --enable-misc=yes \
> >       --enable-logging=yes \
> >       --enable-f77=no \
> >       --enable-wrappers=yes \
> >       --enable-slog2=build \
> >       --enable-collchk=no \
> >       --enable-graphics=no \
> >       --with-mpicc="/opt/openmpi/bin/mpicc" \
> >       --with-mpiinc="-I/opt/openmpi/include" \
> >       --with-mpilibs="-L/opt/openmpi/lib" \
> >       --includedir=$JAVA/include \
> >       --with-java=$JAVA
>
> mpe2 does not use sharedstatedir and localstatedir, so you don't need to specify --sharedstatedir and --localstatedir. The only configure option I see a problem with is --includedir=$JAVA/include, which will force mpe2 to install its include files to /opt/sun-jdk-1.5.0.08/include. I believe that is a mistake.
>
> FYI: here is my configure command to build mpe2 for openmpi:
>
>   mkdir <builddir>
>   cd <builddir>
>   <mpe2-srcdir>/configure CC=<C compiler> F77=<F77 compiler> \
>       MPI_CC=<openmpi-installdir>/bin/mpicc \
>       MPI_F77=<openmpi-installdir>/bin/mpif90 \
>       --with-java=/opt/sun-jdk-1.5.0.08 \
>       --prefix=/opt/mpe \
>       MPERUN="<openmpi-installdir>/bin/mpiexec -n 4"
>   make
>   make install
>   make installcheck-all
>
> If you don't need fortran support, don't use F77 and MPI_F77 and add --disable-f77. The configure option MPERUN="..." and "make installcheck-all" enable a series of tests for the typical features of MPE2.
>
> Let me know if you see any problem. If you do, send me the configure output as seen on your screen (not config.log), i.e. c.txt from the following command:
>
>   With a csh-like shell:     configure ... |& tee c.txt
>   With a bourne-like shell:  configure ... 2>&1 | tee c.txt
>
> A.Chan
Re: [OMPI users] OpenMPE build failure
On Wed, 6 Dec 2006, Ryan Thompson wrote:
> Hi Anthony,
>
> I made some progress; however, I still get the same trace_API.h error, although I'm not certain if it is important.

trace_sample is a sample TRACE-API implementation for SLOG2, e.g. for people who write their own trace and want to generate their own SLOG2 files. trace_rlog is MPICH2's internal logging format (standalone MPE does not distinguish which MPI implementation you are using). If you don't need them, add --disable-rlog and --disable-sample to your configure command.

> It appears that the binaries are built regardless, and installcheck-all appears to pass all tests.

I don't see any errors in any of the output files that you sent me. Everything appears to finish successfully. Did you pipe both standard output and standard error to the files?

A.Chan
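Folding the --disable-rlog and --disable-sample suggestion back into the do-configure script quoted earlier would look roughly like this (a sketch only: the paths and the remaining options are simply copied from Ryan's script above, with the two disable flags Anthony names added):

  # MPE do.sh (with rlog and sample tracing disabled)

  JAVA="/opt/sun-jdk-1.5.0.08"
  MPERUN="/opt/openmpi/bin/mpiexec -n 4"

  ./configure --prefix=/opt/openmpi \
      --enable-logging=yes \
      --disable-f77 \
      --disable-rlog \
      --disable-sample \
      --enable-slog2=build \
      --enable-collchk=no \
      --enable-graphics=no \
      --with-mpicc="/opt/openmpi/bin/mpicc" \
      --with-mpiinc="-I/opt/openmpi/include" \
      --with-mpilibs="-L/opt/openmpi/lib" \
      --with-java=$JAVA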