Hi Jingchao,
My bad, I should have read your thread more closely. The problem is indeed that CP2K calls MPI_Alloc_mem to allocate memory for practically everything, all the time. This somehow escaped our earlier profiling runs, perhaps because we were too focused on finding a communication issue. We profiled the program again with a different tool, and it showed 70% of the run time spent in memory allocation. Disabling the openib BTL prevents the memory registration and solves the issue. It appears we will be disabling the openib BTL on the entire Omni-Path partition.

Regards,
Hristo

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Jingchao Zhang
Sent: Wednesday, February 08, 2017 6:40 PM
To: users@lists.open-mpi.org
Subject: Re: [OMPI users] Severe performance issue with PSM2 and single-node CP2K jobs

Hi Hristo,

We have a similar problem here; I started a thread a few days ago:
https://mail-archive.com/users@lists.open-mpi.org/msg30581.html

Regards,
Jingchao

_____
From: users <users-boun...@lists.open-mpi.org> on behalf of Iliev, Hristo <il...@itc.rwth-aachen.de>
Sent: Wednesday, February 8, 2017 10:43:54 AM
To: users@lists.open-mpi.org
Subject: [OMPI users] Severe performance issue with PSM2 and single-node CP2K jobs

Hi,

While trying to debug a severe performance regression of CP2K runs with Open MPI 1.10.4 on our new cluster, after reproducing the problem with single-node jobs too, we found that the root cause is the presence of Intel Omni-Path hardware, which triggers the use of the cm PML and consequently the psm2 MTL for shared-memory communication instead of the sm BTL. As subsequent tests with NetPIPE on a single socket showed (see the attached graph), the ping-pong latency of PSM2's shared-memory implementation is significantly higher (20-60%) everywhere except for a relatively narrow range of message lengths (10-100 KiB), for which it is faster. Tests with processes on two sockets show that sm outperforms psm2 for smaller message sizes and psm2 outperforms sm for larger ones, at least up to 32 MiB.

The real problem, though, is that the ScaLAPACK routines used by CP2K further amplify the difference, which results in orders of magnitude slower execution. We tested with both MKL and with ScaLAPACK (and even BLAS) from Netlib, in order to exclude possible performance regressions in MKL when used with Open MPI, which is our default configuration.

While disabling the psm2 MTL or enforcing the ob1 PML is a viable workaround for single-node jobs, it is not really a solution to our problem in general, as utilising Omni-Path via its InfiniBand interface results in high latency and poor network bandwidth. As expected, disabling the "shm" device of PSM2 crashes the program.

My actual question is whether it is currently possible for several PMLs to coexist and be used at the same time: ideally, ob1 driving the sm BTL for intra-node communication and cm driving the psm2 MTL for inter-node communication. From my limited understanding of the Open MPI source code, that doesn't really seem possible.

While the psm2 MTL appears to be a relatively thin wrapper around the PSM2 API, and therefore the problem might not really be in Open MPI but in the PSM2 library itself, it somehow does not affect Intel MPI. It also seems to be a CP2K-specific problem, as different software (Quantum ESPRESSO built with ScaLAPACK) runs fine, but that could simply be due to different ScaLAPACK routines being used.
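For reference, the workarounds discussed above and in the reply at the top of this thread are plain MCA component selections; a minimal sketch, with "app" standing in as a placeholder for the actual application:

    # force the ob1 PML (and with it the sm BTL) instead of cm/psm2:
    mpiexec --mca pml ob1 -n 16 app

    # alternatively, exclude just the psm2 MTL ("^" negates a component list):
    mpiexec --mca mtl ^psm2 -n 16 app

    # disable the openib BTL, so that MPI_Alloc_mem no longer registers memory:
    mpiexec --mca btl ^openib -n 16 app

The same settings can also be made site-wide defaults via etc/openmpi-mca-params.conf under the Open MPI installation prefix.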
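Since the culprit identified in the reply at the top is MPI_Alloc_mem, here is a minimal C sketch of the call pattern in question (the size and usage are illustrative, not taken from CP2K): with the openib BTL loaded, such an allocation may be pinned/registered with the interconnect, which pays off for long-lived RDMA buffers but becomes very expensive when done for practically every temporary buffer.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Requesting a work buffer through MPI_Alloc_mem rather than
           malloc(); with the openib BTL active, this memory can be
           registered on allocation, so doing it repeatedly for
           short-lived buffers can dominate the run time. */
        double *buf;
        MPI_Alloc_mem(1048576 * sizeof(double), MPI_INFO_NULL, &buf);

        buf[0] = 0.0;   /* ... use buf like any other buffer ... */

        MPI_Free_mem(buf);
        MPI_Finalize();
        return 0;
    }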
The attached graphs show the ratio of the MPI ping-pong latency as measured by NetPIPE, run as follows with and without --mca pml ob1:

    mpiexec -n 2 --map-by core/socket --bind-to core NPmpi -a -I -l 1 -u 33554432

I also performed tests with Linux CMA support in PSM2 switched on and off (it is on by default), which did not change much. Our default Open MPI is built without CMA support.

Has anyone successfully run ScaLAPACK applications, and CP2K in particular, on systems with Intel Omni-Path? Perhaps I'm missing something here?

I'm sorry if this has already been discussed on the list. I went through the archives but couldn't find anything; if it has been, I would be grateful for pointers to the relevant thread(s).

Kind regards,
Hristo

--
Hristo Iliev, PhD
JARA-HPC CSG "Parallel Efficiency"
IT Center
Group: High Performance Computing
Division: Computational Science and Engineering
RWTH Aachen University
Seffenter Weg 23
52074 Aachen, Germany
Tel: +49 (241) 80-24367
Fax: +49 (241) 80-624367
il...@itc.rwth-aachen.de
http://www.itc.rwth-aachen.de