PR is up: https://github.com/open-mpi/ompi/pull/5193

-Nathan
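For context, here is a minimal, self-contained sketch of the two backing
strategies discussed in this thread: a shared mapping backed by a regular file
in a session/tmp directory (which may live on a disk partition) versus a POSIX
shared-memory segment, which on Linux lives on tmpfs under /dev/shm. This is
not the code from the PR or from Open MPI; the paths, names, and sizes are
illustrative only. On older glibc, link with -lrt for shm_open.

/* Sketch: file-backed shared mapping vs. POSIX shm (tmpfs) backing.
 * Illustrative only; not the Open MPI osc/sm implementation. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Shared mapping backed by a regular file (e.g. under /tmp on an SSD). */
static void *map_file_backed(const char *path, size_t size) {
    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, (off_t)size) < 0) { perror("file backing"); exit(1); }
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); exit(1); }
    close(fd);
    unlink(path);   /* unlinked right away, as the osc/sm code also does */
    return p;
}

/* POSIX shared-memory segment; glibc places it under /dev/shm (tmpfs). */
static void *map_posix_shm(const char *name, size_t size) {
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, (off_t)size) < 0) { perror("shm backing"); exit(1); }
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); exit(1); }
    close(fd);
    shm_unlink(name);
    return p;
}

int main(void) {
    size_t size = 100UL * 1000 * 1000 * sizeof(int);   /* 100M ints, as in the report */
    int *a = map_file_backed("/tmp/shared_window_demo", size);
    int *b = map_posix_shm("/shared_window_demo", size);
    /* First-touch both mappings. Dirty pages of the file-backed mapping may be
     * written back to the underlying filesystem; the tmpfs pages stay in memory. */
    for (size_t i = 0; i < size / sizeof(int); ++i) { a[i] = (int)i; b[i] = (int)i; }
    munmap(a, size);
    munmap(b, size);
    return 0;
}

Nathan's patch and the PR move the osc/sm backing file from the session
directory to /dev/shm, i.e. from the first kind of location to the second,
which takes disk I/O out of the picture for the window pages.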
> On May 24, 2018, at 7:09 AM, Nathan Hjelm <hje...@me.com> wrote:
> 
> Ok, thanks for testing that. I will open a PR for master changing the
> default backing location to /dev/shm on Linux. It will be PR'd to v3.0.x
> and v3.1.x.
> 
> -Nathan
> 
>> On May 24, 2018, at 6:46 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
>> 
>> Thank you all for your input!
>> 
>> Nathan: thanks for that hint, this seems to be the culprit: with your
>> patch, I do not observe a difference in performance between the two
>> memory allocations. I remembered that Open MPI allows changing the shmem
>> allocator on the command line. Using vanilla Open MPI 3.1.0 and raising
>> the priority of the POSIX shmem implementation with
>> `--mca shmem_posix_priority 100` leads to good performance, too. The
>> reason could be that on the Bull machine /tmp is mounted on a disk
>> partition (SSD, iirc). Maybe there is actual I/O involved that hurts
>> performance if the shm backing file is located on a disk (even though
>> the file is unlinked before the memory is accessed)?
>> 
>> Regarding the other hints: I tried using MPI_Win_allocate_shared with
>> the noncontig hint. Using POSIX shmem, I do not observe a difference in
>> performance compared to the other two options. With the disk-backed
>> shmem file, performance fluctuations are similar to MPI_Win_allocate.
>> 
>> On this machine /proc/sys/kernel/numa_balancing is not available, so I
>> assume that it is not the cause in this case. It's good to know for the
>> future that this might become an issue on other systems.
>> 
>> Cheers
>> Joseph
>> 
>> On 05/23/2018 02:26 PM, Nathan Hjelm wrote:
>>> Odd. I wonder if it is something affected by your session directory. It
>>> might be worth moving the segment to /dev/shm. I don't expect it will
>>> have an impact, but you could try the following patch:
>>> 
>>> diff --git a/ompi/mca/osc/sm/osc_sm_component.c b/ompi/mca/osc/sm/osc_sm_component.c
>>> index f7211cd93c..bfc26b39f2 100644
>>> --- a/ompi/mca/osc/sm/osc_sm_component.c
>>> +++ b/ompi/mca/osc/sm/osc_sm_component.c
>>> @@ -262,8 +262,8 @@ component_select(struct ompi_win_t *win, void **base, size_t size, int disp_unit
>>>          posts_size += OPAL_ALIGN_PAD_AMOUNT(posts_size, 64);
>>>          if (0 == ompi_comm_rank (module->comm)) {
>>>              char *data_file;
>>> -            if (asprintf(&data_file, "%s"OPAL_PATH_SEP"shared_window_%d.%s",
>>> -                         ompi_process_info.proc_session_dir,
>>> +            if (asprintf(&data_file, "/dev/shm/%d.shared_window_%d.%s",
>>> +                         ompi_process_info.my_name.jobid,
>>>                           ompi_comm_get_cid(module->comm),
>>>                           ompi_process_info.nodename) < 0) {
>>>                  return OMPI_ERR_OUT_OF_RESOURCE;
>>> 
>>>> On May 23, 2018, at 6:11 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
>>>> 
>>>> I tested with Open MPI 3.1.0 and Open MPI 3.0.0, both compiled with GCC
>>>> 7.1.0 on the Bull cluster. I only ran on a single node but haven't
>>>> tested what happens if more than one node is involved.
>>>> 
>>>> Joseph
>>>> 
>>>> On 05/23/2018 02:04 PM, Nathan Hjelm wrote:
>>>>> What Open MPI version are you using? Does this happen when you run on
>>>>> a single node or on multiple nodes?
>>>>> 
>>>>> -Nathan
>>>>> 
>>>>>> On May 23, 2018, at 4:45 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
>>>>>> 
>>>>>> All,
>>>>>> 
>>>>>> We are observing some strange/interesting performance issues when
>>>>>> accessing memory that has been allocated through MPI_Win_allocate.
>>>>>> I am attaching our test case, which allocates memory for 100M integer
>>>>>> values on each process, both through malloc and MPI_Win_allocate, and
>>>>>> writes to the local ranges sequentially.
>>>>>> 
>>>>>> On different systems (incl. SuperMUC and a Bull cluster), we see that
>>>>>> accessing the memory allocated through MPI is significantly slower
>>>>>> than accessing the malloc'ed memory if multiple processes run on a
>>>>>> single node, with the effect growing as the number of processes per
>>>>>> node increases. As an example, running 24 processes per node with the
>>>>>> attached example, we see the operations on the malloc'ed memory take
>>>>>> ~0.4s while the MPI-allocated memory takes up to 10s.
>>>>>> 
>>>>>> After some experiments, I think there are two factors involved:
>>>>>> 
>>>>>> 1) Initialization: it appears that the first iteration is
>>>>>> significantly slower than any subsequent accesses (1.1s vs 0.4s with
>>>>>> 12 processes on a single socket). Excluding the first iteration from
>>>>>> the timing or memsetting the range beforehand leads to comparable
>>>>>> performance. I assume this is due to page faults that stem from first
>>>>>> accessing the mmap'ed memory that backs the shared memory used in the
>>>>>> window. The effect of presetting the malloc'ed memory seems smaller
>>>>>> (0.4s vs 0.6s).
>>>>>> 
>>>>>> 2) NUMA effects: given proper initialization, running on two sockets
>>>>>> still leads to fluctuating performance degradation for the MPI window
>>>>>> memory, ranging up to 20x in extreme cases. The performance of
>>>>>> accessing the malloc'ed memory is rather stable. The difference seems
>>>>>> to get smaller (but does not disappear) with an increasing number of
>>>>>> repetitions. I am not sure what causes these effects, as each process
>>>>>> should first-touch its local memory.
>>>>>> 
>>>>>> Are these known issues? Does anyone have any thoughts on my analysis?
>>>>>> 
>>>>>> It is problematic for us that replacing local memory allocation with
>>>>>> MPI memory allocation leads to performance degradation, as we rely on
>>>>>> this mechanism in our distributed data structures. While we can ensure
>>>>>> proper initialization of the memory to mitigate 1) for performance
>>>>>> measurements, I don't see a way to control the NUMA effects. If there
>>>>>> is one, I'd be happy about any hints :)
>>>>>> 
>>>>>> I should note that we also tested MPICH-based implementations, which
>>>>>> showed similar effects (as they also mmap their window memory). Not
>>>>>> surprisingly, using MPI_Alloc_mem and attaching that memory to a
>>>>>> dynamic window does not cause these effects, while using shared memory
>>>>>> windows does. I ran my experiments using Open MPI 3.1.0 with the
>>>>>> following command lines:
>>>>>> 
>>>>>> - 12 cores / 1 socket:
>>>>>>   mpirun -n 12 --bind-to socket --map-by ppr:12:socket
>>>>>> - 24 cores / 2 sockets:
>>>>>>   mpirun -n 24 --bind-to socket
>>>>>> 
>>>>>> and verified the binding using --report-bindings.
>>>>>> 
>>>>>> Any help or comment would be much appreciated.
>>>>>> 
>>>>>> Cheers
>>>>>> Joseph
>>>>>> 
>>>>>> --
>>>>>> Dipl.-Inf. Joseph Schuchart
>>>>>> High Performance Computing Center Stuttgart (HLRS)
>>>>>> Nobelstr. 19
>>>>>> D-70569 Stuttgart
>>>>>> 
>>>>>> Tel.: +49(0)711-68565890
>>>>>> Fax: +49(0)711-6856832
>>>>>> E-Mail: schuch...@hlrs.de
>>>>>> 
>>>>>> <mpiwin_vs_malloc.c>
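The attached test case (<mpiwin_vs_malloc.c>) is not reproduced in the archive
text, so the following is only a rough sketch reconstructed from the
description above: 100M ints per process, allocated once with malloc and once
with MPI_Win_allocate, written sequentially, and timed per repetition. The
repetition count, output format, and program name are assumptions, not taken
from the original program.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N    (100 * 1000 * 1000)  /* 100M ints per process, as in the report */
#define REPS 10                   /* repetition count is an assumption */

/* Sequentially write the local range and return the elapsed time. */
static double touch(int *buf) {
    double t = MPI_Wtime();
    for (long i = 0; i < N; ++i) buf[i] = (int)i;
    return MPI_Wtime() - t;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Plain local allocation. */
    int *local = malloc(N * sizeof(int));
    if (local == NULL) MPI_Abort(MPI_COMM_WORLD, 1);

    /* Window allocation: backed by shared memory when ranks share a node. */
    int *winbuf;
    MPI_Win win;
    MPI_Win_allocate(N * sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &winbuf, &win);

    for (int r = 0; r < REPS; ++r) {
        double t_malloc = touch(local);
        double t_win    = touch(winbuf);
        if (rank == 0)
            printf("rep %d: malloc %.3fs  win_allocate %.3fs\n",
                   r, t_malloc, t_win);
    }

    MPI_Win_free(&win);
    free(local);
    MPI_Finalize();
    return 0;
}

Run it as in the thread (e.g. mpirun -n 24 --bind-to socket ./a.out) and
compare a second run with --mca shmem_posix_priority 100, or with a build
containing the patch above, to check whether the backing location accounts for
the gap.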
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users