Re: [OMPI users] OpenMPI slowdown in latency bound application

2019-08-28 Thread Peter Kjellström via users
On Tue, 27 Aug 2019 14:36:54 -0500
Cooper Burns via users  wrote:

> Hello all,
> 
> I have been doing some MPI benchmarking on an Infiniband cluster.
> 
> Specs are:
> 12 cores/node
> 2.9 GHz/core
> Infiniband interconnect (TCP also available)
> 
> Some runtime numbers:
> 192 cores total: (16 nodes)
> IntelMPI:
> 0.4 seconds
> OpenMPI 3.1.3 (--mca btl ^tcp):
> 2.5 seconds
> OpenMPI 3.1.3 (--mca btl ^openib):
> 26 seconds

5x is quite a difference...

Here are a few possible reasons I can think of:

1) The app was placed/pinned differently by the two MPIs. Often this
would probably not cause such a big difference.

2) Bad luck wrt collective performance. Different MPIs have different
weak spots across the parameter space of
numranks,transfersize,mpi-collective.

3) You're not on Mellanox infiniband but Qlogic/Intel (Truescale)
infiniband. Using openib there is better than tcp but not ideal (it
uses psm for native transport).

4) You changed more than the MPI. For example Intel compilers +
intel-mpi vs OpenMPI + gcc.

/Peter K 


Re: [OMPI users] OpenMPI slowdown in latency bound application

2019-08-28 Thread Cooper Burns via users
Peter,

Thanks for your input!
I tried some things:

*1) The app was placed/pinned differently by the two MPIs. Often this would
probably not cause such a big difference.*
I agree this is unlikely the cause; however, I tried various configurations
of map-by, bind-to, etc., and none of them had any measurable impact, which
points to this not being the cause (as you suspected).
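
(For concreteness, the kind of invocations I tried were along these lines;
"./solver" is just a stand-in for our binary:

 mpirun -np 192 --map-by ppr:12:node --bind-to core ./solver
 mpirun -np 192 --map-by socket --bind-to socket ./solver

and similar combinations.)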


*2) Bad luck wrt collective performance. Different MPIs have different weak
spots across the parameter space of numranks,transfersize,mpi-collective.*
This is possible... But the magnitude of the runtime difference seems too
large to me... Are there any options we can give to OMPI to cause it to use
different collective algorithms so that we can test this theory?




*3) You're not on Mellanox infiniband but Qlogic/Intel (Truescale)
infiniband. Using openib there is better than tcp but not ideal (it uses
psm for native transport).*
I double-checked: the cluster is using Mellanox InfiniBand.
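
(For reference, a quick way to check is something along the lines of

 ibv_devinfo | grep hca_id

which lists Mellanox HCAs as mlx4_*/mlx5_*, while Truescale would typically
show up as qib*.)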


*4) You changed more than the MPI. For example Intel compilers + intel-mpi
vs OpenMPI + gcc.*
This is correct; however, I also ran a few other combinations:
IntelMPI + Clangcc:
0.3 seconds
IntelMPI + Intelcc:
0.4 seconds
MPICH MPI + Clangcc:
1 second
OpenMPI + Clangcc:
2.6 seconds

So it looks like the compiler is not the issue.

Any other ideas?
Thanks,
Cooper
Cooper Burns
Senior Research Engineer
(608) 230-1551
convergecfd.com


Re: [OMPI users] OpenMPI slowdown in latency bound application

2019-08-28 Thread Peter Kjellström via users
On Wed, 28 Aug 2019 09:45:15 -0500
Cooper Burns  wrote:

> Peter,
> 
> Thanks for your input!
> I tried some things:
> 
> *1) The app was placed/pinned differently by the two MPIs. Often this
> would probably not cause such a big difference.*
> I agree this is unlikely the cause; however, I tried various
> configurations of map-by, bind-to, etc., and none of them had any
> measurable impact, which points to this not being the cause
> (as you suspected).

OK, there's still one thing to rule out: which rank was placed on which
node.

For OpenMPI you can pass "-report-bindings" and verify that the first N
ranks are placed on the first node (for N cores or ranks per node).

node0: r0 r4 r8 ...
node1: r1 ...
node2: r2 ...
node3: r3 ...

vs

node0: r0 r1 r2 r3 ...
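
For example (untested; "./solver" is just a placeholder for your binary),
something like this should both force node-by-node filling and print the
resulting bindings:

 mpirun -np 192 --map-by ppr:12:node --bind-to core \
        --report-bindings ./solver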

> *2) Bad luck wrt collective performance. Different MPIs have
> different weak spots across the parameter space of
> numranks,transfersize,mpi-collective.* This is possible... But the
> magnitude of the runtime difference seems too large to me... Are
> there any options we can give to OMPI to cause it to use different
> collective algorithms so that we can test this theory?

It can certainly cause the observed difference. I've seen very large
differences...

To get collective tunables from OpenMPI do something like:

 ompi_info --param coll all --level 5

But it will really help to know or suspect which collectives the
application depends on.

For example, if you suspected alltoall to be a factor you could sweep
all valid alltoall algorithms by setting:

 -mca coll_tuned_alltoall_algorithm X

Where X is 0..6 in my case (ompi_info returned: 0 ignore, 1 basic
linear, 2 bruck, 3 recursive doubling, 4 ring, 5 neighbor exchange, 6:
two proc only.)
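
An (untested) sweep could then look something like this; "./solver" is a
placeholder, and if I remember right the tuned component also needs dynamic
rules enabled before a forced algorithm takes effect:

 for alg in 0 1 2 3 4 5 6; do
   mpirun -np 192 --mca coll_tuned_use_dynamic_rules 1 \
          --mca coll_tuned_alltoall_algorithm $alg ./solver
 done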

/Peter


Re: [OMPI users] OpenMPI slowdown in latency bound application

2019-08-28 Thread Cooper Burns via users
Peter,

It looks like:
Node0:
rank0, rank1, rank2, etc..
Node1:
rank12, rank13, etc
etc

So the mapping looks good to me.

Thanks,
Cooper


Re: [OMPI users] OpenMPI slowdown in latency bound application

2019-08-28 Thread Nathan Hjelm via users
Is this overall runtime or solve time? The former is essentially meaningless as
it includes all the startup time (launch, connections, etc.), especially since
we are talking about seconds here.

-Nathan


Re: [OMPI users] OpenMPI slowdown in latency bound application

2019-08-28 Thread Cooper Burns via users
Nathan,
Our application runs many 'cycles' during a single run. Each cycle advances
the time slightly and then re-solves the appropriate equations, so every
cycle does effectively the same work.

The times I provided were an approximate average time per cycle for the
first ~10 cycles (I just eyeballed it, but they are quite consistent, and I
ran each MPI multiple times with the same results).

They therefore do not include any time for setup/teardown or any I/O time.

Thanks,
Cooper
