Odd. I wonder if it is something affected by your session directory. It might be worth moving the segment to /dev/shm. I don’t expect it will have an impact, but you could try the following patch:
diff --git a/ompi/mca/osc/sm/osc_sm_component.c b/ompi/mca/osc/sm/osc_sm_component.c
index f7211cd93c..bfc26b39f2 100644
--- a/ompi/mca/osc/sm/osc_sm_component.c
+++ b/ompi/mca/osc/sm/osc_sm_component.c
@@ -262,8 +262,8 @@ component_select(struct ompi_win_t *win, void **base, size_t size, int disp_unit
     posts_size += OPAL_ALIGN_PAD_AMOUNT(posts_size, 64);
     if (0 == ompi_comm_rank (module->comm)) {
         char *data_file;
-        if (asprintf(&data_file, "%s"OPAL_PATH_SEP"shared_window_%d.%s",
-                     ompi_process_info.proc_session_dir,
+        if (asprintf(&data_file, "/dev/shm/%d.shared_window_%d.%s",
+                     ompi_process_info.my_name.jobid,
                      ompi_comm_get_cid(module->comm),
                      ompi_process_info.nodename) < 0) {
             return OMPI_ERR_OUT_OF_RESOURCE;

> On May 23, 2018, at 6:11 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
> 
> I tested with Open MPI 3.1.0 and Open MPI 3.0.0, both compiled with GCC 7.1.0 on the Bull Cluster. I only ran on a single node but haven't tested what happens if more than one node is involved.
> 
> Joseph
> 
> On 05/23/2018 02:04 PM, Nathan Hjelm wrote:
>> What Open MPI version are you using? Does this happen when you run on a single node or multiple nodes?
>> 
>> -Nathan
>> 
>>> On May 23, 2018, at 4:45 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
>>> 
>>> All,
>>> 
>>> We are observing some strange/interesting performance issues when accessing memory that has been allocated through MPI_Win_allocate. I am attaching our test case, which allocates memory for 100M integer values on each process, both through malloc and through MPI_Win_allocate, and writes to the local ranges sequentially.
>>> 
>>> On different systems (incl. SuperMUC and a Bull Cluster) we see that accessing the memory allocated through MPI is significantly slower than accessing the malloc'ed memory if multiple processes run on a single node, with the effect growing as the number of processes per node increases. As an example, with 24 processes per node and the attached example, the operations on the malloc'ed memory take ~0.4 s while the MPI-allocated memory takes up to 10 s.
>>> 
>>> After some experiments, I think there are two factors involved:
>>> 
>>> 1) Initialization: it appears that the first iteration is significantly slower than any subsequent accesses (1.1 s vs 0.4 s with 12 processes on a single socket). Excluding the first iteration from the timing or memsetting the range leads to comparable performance. I assume that this is due to page faults that stem from first accessing the mmap'ed memory that backs the shared memory used in the window. The effect of presetting the malloc'ed memory seems smaller (0.4 s vs 0.6 s).
>>> 
>>> 2) NUMA effects: even with proper initialization, running on two sockets still leads to fluctuating performance degradation for the MPI window memory, up to 20x in extreme cases. The performance of accessing the malloc'ed memory is rather stable. The difference seems to get smaller (but does not disappear) with an increasing number of repetitions. I am not sure what causes these effects, as each process should first-touch its local memory.
>>> 
>>> Are these known issues? Does anyone have any thoughts on my analysis?
>>> 
>>> It is problematic for us that replacing local memory allocation with MPI memory allocation leads to performance degradation, as we rely on this mechanism in our distributed data structures.
>>> While we can ensure proper initialization of the memory to mitigate 1) for performance measurements, I don't see a way to control the NUMA effects. If there is one, I'd be happy about any hints :)
>>> 
>>> I should note that we also tested MPICH-based implementations, which showed similar effects (as they also mmap their window memory). Not surprisingly, using MPI_Alloc_mem and attaching that memory to a dynamic window does not cause these effects, while using shared memory windows does. I ran my experiments using Open MPI 3.1.0 with the following command lines:
>>> 
>>> - 12 cores / 1 socket:
>>>   mpirun -n 12 --bind-to socket --map-by ppr:12:socket
>>> - 24 cores / 2 sockets:
>>>   mpirun -n 24 --bind-to socket
>>> 
>>> and verified the binding using --report-bindings.
>>> 
>>> Any help or comment would be much appreciated.
>>> 
>>> Cheers
>>> Joseph
>>> 
>>> -- 
>>> Dipl.-Inf. Joseph Schuchart
>>> High Performance Computing Center Stuttgart (HLRS)
>>> Nobelstr. 19
>>> D-70569 Stuttgart
>>> 
>>> Tel.: +49(0)711-68565890
>>> Fax: +49(0)711-6856832
>>> E-Mail: schuch...@hlrs.de
>>> 
>>> <mpiwin_vs_malloc.c>
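For readers without the original attachment, here is a minimal sketch of the kind of benchmark described above. It is not the attached mpiwin_vs_malloc.c; the repetition count, the rank-0-only output, and the commented-out memset pre-fault are illustrative assumptions. Each rank writes sequentially to 100M integers allocated with malloc and to the same amount allocated with MPI_Win_allocate, timing each pass.

/* malloc vs MPI_Win_allocate sequential-write sketch (illustrative, not the
 * original attachment). Build with: mpicc -O2 win_vs_malloc_sketch.c */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NELEMS (100 * 1000 * 1000)  /* 100M ints per process, as in the report */
#define NREPS  10                   /* assumed repetition count */

/* Write the buffer sequentially and return the elapsed time in seconds. */
static double time_fill(int *buf, size_t n)
{
    double start = MPI_Wtime();
    for (size_t i = 0; i < n; ++i) {
        buf[i] = (int)i;
    }
    return MPI_Wtime() - start;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Plain local allocation. */
    int *local = malloc((size_t)NELEMS * sizeof(int));
    if (NULL == local) {
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Window allocation of the same size per process; backed by shared
     * memory when all ranks run on one node. */
    int *winbuf;
    MPI_Win win;
    MPI_Win_allocate((MPI_Aint)NELEMS * sizeof(int), sizeof(int),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &winbuf, &win);

    /* Optional mitigation for point 1) above: pre-fault the window pages so
     * the first timed iteration does not include page-fault cost. */
    /* memset(winbuf, 0, (size_t)NELEMS * sizeof(int)); */

    for (int rep = 0; rep < NREPS; ++rep) {
        double t_malloc = time_fill(local, NELEMS);
        double t_win    = time_fill(winbuf, NELEMS);
        if (0 == rank) {   /* rank 0 only, to keep the output short */
            printf("rep %d: malloc %.3f s, MPI_Win_allocate %.3f s\n",
                   rep, t_malloc, t_win);
        }
    }

    MPI_Win_free(&win);
    free(local);
    MPI_Finalize();
    return 0;
}

Run it with the bindings from the quoted message, e.g. mpirun -n 24 --bind-to socket --report-bindings ./a.out, and compare the per-repetition times for the two allocations.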