Steve, thanks for the details.

What is the command line that you use to run the benchmark?
Can you try adding the following MCA parameters to your command line:
"--mca btl openib,sm,self --mca btl_openib_max_btls 1"

Thanks,
Pasha

Repsher, Stephen J wrote:
Thanks for keeping on this.... Hopefully this answers all the questions:

The cluster has some blades with XRC, others without.  I've tested on both with 
the same results.  For MVAPICH, a flag is set to turn on XRC; I'm not sure how 
OpenMPI handles it, but my build is configured with --enable-openib-connectx-xrc.

OpenMPI is built on a head node with a 2-port HCA (1 active) and installed on a 
shared file system.  The compute blades I'm using are InfiniHost IIIs, 1-port 
HCAs.

As for nRepeats in bounce, I could increase it, but if that were the problem 
then I'd expect MVAPICH to report sporadic results as well.

I just downloaded the OSU benchmarks and tried osu_latency.... It reports ~40 
microsecs for OpenMPI and ~3 microsecs for MVAPICH.  Still puzzled...

Steve


-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Pavel Shamis (Pasha)
Sent: Thursday, February 18, 2010 3:33 AM
To: Open MPI Users
Subject: Re: [OMPI users] Bad Infiniband latency with subounce

Hey,
I can only add that XRC and RC have the same latency.
What is the command line that you use to run this benchmark?
What is the system configuration (one HCA, one active port)?
Any additional information about the system configuration, MPI command line, etc. 
will help us analyze your issue.

Regards,
Pasha (Mellanox guy :-) )

Jeff Squyres wrote:
I'll defer to the Mellanox guys to reply in more detail, but here are a few 
thoughts:

- Is MVAPICH using XRC? (I never played with XRC much; it would surprise me if it caused instability on the order of up to 100 micros -- I ask just to see if it is an apples-to-apples comparison)

- The nRepeats value in this code is only 10, meaning that it seems to do just 
10 iterations at each message size.  For small sizes, this might well not be 
enough to be accurate.  Have you tried increasing it?  Or using a different 
benchmark app, such as NetPIPE, osu_latency, etc.?



On Feb 16, 2010, at 8:49 AM, Repsher, Stephen J wrote:

Well the "good" news is I can end your debate over binding here...setting 
mpi_paffinity_alone 1 did nothing. (And personally as a user, I don't care what the 
default is so long as info is readily apparent in the main docs...and I did see the FAQs 
on it).

It did lead me to try another parameter though, -mca mpi_preconnect_all 1, 
which seems to reliably reduce the latency measured by subounce, but it's still 
sporadic and on the order of ~10-100 microseconds.  It leads me to think that 
OpenMPI has issues with the method of measurement, which is simply to send 
progressively larger blocking messages right after calling MPI_Init (starting 
at 0 bytes, which it times as the latency).  OpenMPI's lazy connections clearly 
mess with this.
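
To be concrete, the kind of run meant here would look something like the sketch 
below (the hostfile and subounce binary names are placeholders):

  # without pre-connection: the openib connection is set up lazily during the
  # first (timed) message, which can inflate the measured 0-byte latency
  mpirun -np 2 --hostfile ./hosts ./subounce
  # with pre-connection: connections are established during MPI_Init,
  # before the timing starts
  mpirun -np 2 --hostfile ./hosts --mca mpi_preconnect_all 1 ./subounce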

But still not consistently 1-2 microsecs...

Steve


-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Monday, February 15, 2010 11:21 PM
To: Open MPI Users
Subject: Re: [OMPI users] Bad Infiniband latency with subounce


On Feb 15, 2010, at 8:44 PM, Terry Frankcombe wrote:

On Mon, 2010-02-15 at 20:18 -0700, Ralph Castain wrote:
Did you run it with -mca mpi_paffinity_alone 1? Given this is 1.4.1, you can 
set the bindings to -bind-to-socket or -bind-to-core. Either will give you 
improved performance.
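
For example (a sketch only; the process count and executable name are 
placeholders):

  # rely on the processor-affinity MCA parameter...
  mpirun -np 8 --mca mpi_paffinity_alone 1 ./my_app
  # ...or, with 1.4.1, use the explicit binding options
  mpirun -np 8 -bind-to-core ./my_app
  mpirun -np 8 -bind-to-socket ./my_app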

IIRC, MVAPICH defaults to -bind-to-socket. OMPI defaults to no binding.
Is this sensible? Won't most users want processes bound? OMPI's supposed to "do the right thing" out of the box, right?
Well, that depends on how you look at it. Been the subject of a lot of debate 
within the devel community. If you bind by default and it is a shared node 
cluster, then you can really mess people up. On the other hand, if you don't 
bind by default, then people that run benchmarks without looking at the options 
can get bad numbers. Unfortunately, there is no automated way to tell if the 
cluster is configured for shared use or dedicated nodes.

I honestly don't know that "most users want processes bound". One installation I was at set binding by default using the system mca param file, and got yelled at by a group of users that had threaded apps - and most definitely did -not- want their processes bound. After a while, it became clear that nothing we could do would make everyone happy :-/

I doubt there is a right/wrong answer - at least, we sure can't find one. So we 
don't bind by default, in order to "do no harm", and we put out FAQs, man pages, 
mpirun option help messages, etc. that explain the situation and tell you 
when/how to bind.
