Nathan,
Unfortunately, '--mca memory_linux_disable 1' does not help with this issue - it does not change the behaviour at all. Note that the pathological behaviour is present in Open MPI 2.0.2 as well as in 1.10.x, and that only nodes with an Intel OmniPath (OPA) network are affected.
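(For reference, a minimal sketch of how such a flag is passed on the command line; the application name and process count below are just placeholders:

mpirun --mca memory_linux_disable 1 -np 24 ./your_app
)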

The known workaround is to disable the InfiniBand fallback with '--mca btl ^tcp,openib' on nodes with an OPA network. (On IB nodes, the same tweak led to a 5% performance improvement on single-node jobs; but obviously disabling IB on nodes connected via IB is not a solution for multi-node jobs, huh).
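For completeness, a sketch of the workaround invocation; the same value can also be set via the OMPI_MCA_btl environment variable so that it applies to every mpirun call in a job script (application name and process count are again placeholders):

mpirun --mca btl ^tcp,openib -np 24 ./your_app
export OMPI_MCA_btl='^tcp,openib'    # equivalent, picked up by subsequent mpirun calls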


On 03/07/17 20:22, Nathan Hjelm wrote:
If this is with 1.10.x or older, run with --mca memory_linux_disable 1. There is 
a bad interaction between ptmalloc2 and psm2 support. This problem is not 
present in v2.0.x and newer.

-Nathan

On Mar 7, 2017, at 10:30 AM, Paul Kapinos <kapi...@itc.rwth-aachen.de> wrote:

Hi Dave,


On 03/06/17 18:09, Dave Love wrote:
I've been looking at a new version of an application (cp2k, for what
it's worth) which is calling mpi_alloc_mem/mpi_free_mem, and I don't

Welcome to the club! :o)
In our measurements we see some 70% of the run time spent in 'mpi_free_mem'... and a 
15x performance loss when using Open MPI vs. Intel MPI. So it goes.

https://www.mail-archive.com/users@lists.open-mpi.org//msg30593.html
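
To make the call pattern concrete, here is a minimal sketch (my own toy code, not cp2k) of an MPI_Alloc_mem/MPI_Free_mem loop; the buffer size and iteration count are arbitrary, and the point is only to compare runs with and without the openib btl loaded:

/* alloc_free_bench.c - time repeated MPI_Alloc_mem/MPI_Free_mem calls.
 * Build: mpicc alloc_free_bench.c -o alloc_free_bench
 * Run:   mpirun -np 1 ./alloc_free_bench
 *        mpirun --mca btl ^tcp,openib -np 1 ./alloc_free_bench
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int iterations = 10000;       /* arbitrary */
    const MPI_Aint size  = 1 << 20;     /* 1 MiB per allocation, arbitrary */
    void *buf;
    double t_alloc = 0.0, t_free = 0.0, t0;

    MPI_Init(&argc, &argv);

    for (int i = 0; i < iterations; i++) {
        /* accumulate time spent inside the allocation call */
        t0 = MPI_Wtime();
        MPI_Alloc_mem(size, MPI_INFO_NULL, &buf);
        t_alloc += MPI_Wtime() - t0;

        /* accumulate time spent inside the free call */
        t0 = MPI_Wtime();
        MPI_Free_mem(buf);
        t_free += MPI_Wtime() - t0;
    }

    printf("MPI_Alloc_mem total: %.3f s   MPI_Free_mem total: %.3f s\n",
           t_alloc, t_free);

    MPI_Finalize();
    return 0;
}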


think it did so in the previous version I looked at.  I found on an
IB-based system it's spending about half its time in those allocation
routines (according to its own profiling) -- a tad surprising.

It turns out that's due to some pathological interaction with openib,
and just having openib loaded.  It shows up on a single-node run iff I
don't suppress the openib btl, and doesn't for multi-node PSM runs iff I
suppress openib (on a mixed Mellanox/Infinipath system).

We're lucky - our issue is on the Intel OmniPath (OPA) network (and we will junk the IB 
hardware in the near future, I think) - so we disabled the IB transport fallback:
--mca btl ^tcp,openib

For single-node jobs this will likely also help on plain IB nodes (you can 
disable IB if you do not use it).
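
If you want the setting to apply to every run without touching job scripts, the parameter can also go into the per-user MCA configuration file (a sketch; this is the standard Open MPI per-user location):

# $HOME/.openmpi/mca-params.conf
btl = ^tcp,openib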


Can anyone say why, and whether there's a workaround?  (I can't easily
diagnose what it's up to as ptrace is turned off on the system
concerned, and I can't find anything relevant in archives.)

I had the idea to try libfabric instead for multi-node jobs, and that
doesn't show the pathological behaviour iff openib is suppressed.
However, it requires ompi 1.10, not 1.8, which I was trying to use.
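
(A hedged sketch, in case it helps: as far as we understand, the libfabric path in Open MPI 1.10 is selected via the cm PML and the ofi MTL, e.g.

mpirun --mca pml cm --mca mtl ofi --mca btl ^tcp,openib -np 24 ./your_app

where application name and process count are placeholders.)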







--
Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
RWTH Aachen University, IT Center
Seffenter Weg 23,  D 52074  Aachen (Germany)
Tel: +49 241/80-24915


_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
