Ok, thanks for testing that. I will open a PR for master changing the default backing location to /dev/shm on Linux. It will also be PR'd to the v3.0.x and v3.1.x branches.

-Nathan
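Until that change lands, the interim workaround discussed in the quoted reply below (raising the priority of the POSIX shmem component) can be selected per run through an MCA parameter. A sketch of such an invocation; the process count, binding options, and executable name are illustrative placeholders, not part of the original thread:

mpirun --mca shmem_posix_priority 100 -n 24 --bind-to socket ./mpiwin_vs_malloc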
> On May 24, 2018, at 6:46 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
>
> Thank you all for your input!
>
> Nathan: thanks for that hint, this seems to be the culprit: with your patch I do not observe a difference in performance between the two memory allocations. I remembered that Open MPI allows changing the shmem allocator on the command line. Using vanilla Open MPI 3.1.0 and increasing the priority of the POSIX shmem implementation with `--mca shmem_posix_priority 100` also leads to good performance. The reason could be that on the Bull machine /tmp is mounted on a disk partition (SSD, iirc). Maybe there is actual I/O involved that hurts performance if the shm backing file is located on a disk (even though the file is unlinked before the memory is accessed)?
>
> Regarding the other hints: I tried using MPI_Win_allocate_shared with the noncontig hint. Using POSIX shmem, I do not observe a difference in performance compared to the other two options. With the disk-backed shmem file, the performance fluctuations are similar to MPI_Win_allocate.
>
> On this machine /proc/sys/kernel/numa_balancing is not available, so I assume that it is not the cause in this case. It is good to know for the future that this might become an issue on other systems.
>
> Cheers
> Joseph
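For reference, the noncontig hint mentioned above is passed to MPI_Win_allocate_shared through an MPI_Info object. A minimal sketch, assuming all ranks of MPI_COMM_WORLD run on a single node (as in the tests above); the element count follows the test case, everything else is illustrative:

/* Sketch only: noncontig hint for a shared-memory window. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* request non-contiguous placement of the per-rank segments */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "alloc_shared_noncontig", "true");

    const long nelems = 100000000L;   /* 100M ints per rank, as in the test case */
    int *baseptr;
    MPI_Win win;
    /* requires that all ranks of the communicator share a node */
    MPI_Win_allocate_shared((MPI_Aint)(nelems * sizeof(int)), sizeof(int),
                            info, MPI_COMM_WORLD, &baseptr, &win);
    MPI_Info_free(&info);

    /* each rank touches only its own range; synchronization with other ranks omitted */
    for (long i = 0; i < nelems; ++i)
        baseptr[i] = (int)i;

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}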
> On 05/23/2018 02:26 PM, Nathan Hjelm wrote:
>> Odd. I wonder if it is something affected by your session directory. It might be worth moving the segment to /dev/shm. I don't expect it will have an impact, but you could try the following patch:
>>
>> diff --git a/ompi/mca/osc/sm/osc_sm_component.c b/ompi/mca/osc/sm/osc_sm_component.c
>> index f7211cd93c..bfc26b39f2 100644
>> --- a/ompi/mca/osc/sm/osc_sm_component.c
>> +++ b/ompi/mca/osc/sm/osc_sm_component.c
>> @@ -262,8 +262,8 @@ component_select(struct ompi_win_t *win, void **base, size_t size, int disp_unit
>>      posts_size += OPAL_ALIGN_PAD_AMOUNT(posts_size, 64);
>>      if (0 == ompi_comm_rank (module->comm)) {
>>          char *data_file;
>> -        if (asprintf(&data_file, "%s"OPAL_PATH_SEP"shared_window_%d.%s",
>> -                     ompi_process_info.proc_session_dir,
>> +        if (asprintf(&data_file, "/dev/shm/%d.shared_window_%d.%s",
>> +                     ompi_process_info.my_name.jobid,
>>                       ompi_comm_get_cid(module->comm),
>>                       ompi_process_info.nodename) < 0) {
>>              return OMPI_ERR_OUT_OF_RESOURCE;
>>
>>> On May 23, 2018, at 6:11 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
>>>
>>> I tested with Open MPI 3.1.0 and Open MPI 3.0.0, both compiled with GCC 7.1.0 on the Bull Cluster. I only ran on a single node but haven't tested what happens if more than one node is involved.
>>>
>>> Joseph
>>>
>>> On 05/23/2018 02:04 PM, Nathan Hjelm wrote:
>>>> What Open MPI version are you using? Does this happen when you run on a single node or on multiple nodes?
>>>>
>>>> -Nathan
>>>>
>>>>> On May 23, 2018, at 4:45 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
>>>>>
>>>>> All,
>>>>>
>>>>> We are observing some strange/interesting performance issues when accessing memory that has been allocated through MPI_Win_allocate. I am attaching our test case, which allocates memory for 100M integer values on each process, both through malloc and through MPI_Win_allocate, and writes to the local ranges sequentially.
>>>>>
>>>>> On different systems (incl. SuperMUC and a Bull Cluster), we see that accessing the memory allocated through MPI is significantly slower than accessing the malloc'ed memory if multiple processes run on a single node, and the effect increases with the number of processes per node. As an example, running 24 processes per node with the attached example, the operations on the malloc'ed memory take ~0.4 s while the same operations on the MPI-allocated memory take up to 10 s.
>>>>>
>>>>> After some experiments, I think there are two factors involved:
>>>>>
>>>>> 1) Initialization: it appears that the first iteration is significantly slower than any subsequent accesses (1.1 s vs. 0.4 s with 12 processes on a single socket). Excluding the first iteration from the timing, or memsetting the range beforehand, leads to comparable performance. I assume that this is due to page faults that stem from first accessing the mmap'ed memory that backs the shared memory used in the window. The effect of presetting the malloc'ed memory seems smaller (0.4 s vs. 0.6 s).
>>>>>
>>>>> 2) NUMA effects: given proper initialization, running on two sockets still leads to fluctuating performance degradation for the MPI window memory, which ranges up to 20x in extreme cases. The performance of accessing the malloc'ed memory is rather stable. The difference seems to get smaller (but does not disappear) with an increasing number of repetitions. I am not sure what causes these effects, as each process should first-touch its local memory.
>>>>>
>>>>> Are these known issues? Does anyone have any thoughts on my analysis?
>>>>>
>>>>> It is problematic for us that replacing local memory allocation with MPI memory allocation leads to performance degradation, as we rely on this mechanism in our distributed data structures. While we can ensure proper initialization of the memory to mitigate 1) for performance measurements, I don't see a way to control the NUMA effects. If there is one, I'd be happy about any hints :)
>>>>>
>>>>> I should note that we also tested MPICH-based implementations, which showed similar effects (as they also mmap their window memory). Not surprisingly, using MPI_Alloc_mem and attaching that memory to a dynamic window does not cause these effects, while using shared memory windows does. I ran my experiments using Open MPI 3.1.0 with the following command lines:
>>>>>
>>>>> - 12 cores / 1 socket:
>>>>>   mpirun -n 12 --bind-to socket --map-by ppr:12:socket
>>>>> - 24 cores / 2 sockets:
>>>>>   mpirun -n 24 --bind-to socket
>>>>>
>>>>> and verified the binding using --report-bindings.
>>>>>
>>>>> Any help or comment would be much appreciated.
>>>>>
>>>>> Cheers
>>>>> Joseph
>>>>>
>>>>> --
>>>>> Dipl.-Inf. Joseph Schuchart
>>>>> High Performance Computing Center Stuttgart (HLRS)
>>>>> Nobelstr. 19
>>>>> D-70569 Stuttgart
>>>>>
>>>>> Tel.: +49(0)711-68565890
>>>>> Fax: +49(0)711-6856832
>>>>> E-Mail: schuch...@hlrs.de
>>>>>
>>>>> <mpiwin_vs_malloc.c>
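The attached test case (mpiwin_vs_malloc.c) is not reproduced in this thread. As a rough illustration of the kind of comparison described above, here is a minimal sketch: the per-rank array size matches the description, but the loop structure, timing, and output format are assumptions, not the original program:

/* Sketch only: compare sequential local writes to malloc'ed memory and to
 * memory obtained from MPI_Win_allocate. Not the original mpiwin_vs_malloc.c. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NELEMS 100000000L   /* 100M integers per rank, as described above */

/* sequentially write the local range and return the elapsed time */
static double touch(int *buf, long n)
{
    double t = MPI_Wtime();
    for (long i = 0; i < n; ++i)
        buf[i] = (int)i;
    return MPI_Wtime() - t;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* plain heap allocation */
    int *heap = malloc(NELEMS * sizeof(int));

    /* window allocation, backed by the shared-memory file under discussion */
    int *wbuf;
    MPI_Win win;
    MPI_Win_allocate((MPI_Aint)(NELEMS * sizeof(int)), sizeof(int),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &wbuf, &win);

    /* Uncomment to pre-fault the window pages and exclude the
     * "initialization" factor (page faults) from the timed pass:
     * memset(wbuf, 0, NELEMS * sizeof(int));
     */

    double t_heap = touch(heap, NELEMS);
    double t_win  = touch(wbuf, NELEMS);
    printf("rank %d: malloc %.3f s, MPI_Win_allocate %.3f s\n",
           rank, t_heap, t_win);

    MPI_Win_free(&win);
    free(heap);
    MPI_Finalize();
    return 0;
}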
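Similarly hedged, a minimal sketch of the MPI_Alloc_mem plus dynamic-window variant mentioned in the quoted message as not showing the effect; error handling and remote access are omitted, and only the per-rank size is taken from the description:

/* Sketch only: MPI_Alloc_mem memory attached to a dynamic window. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    const long nelems = 100000000L;   /* 100M ints per rank, as described */
    const MPI_Aint bytes = (MPI_Aint)nelems * (MPI_Aint)sizeof(int);

    /* memory comes from MPI_Alloc_mem rather than an mmap'ed shared segment */
    int *buf;
    MPI_Alloc_mem(bytes, MPI_INFO_NULL, &buf);

    /* expose it through a dynamic window */
    MPI_Win win;
    MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    MPI_Win_attach(win, buf, bytes);

    /* sequential writes to the local range, as in the benchmark */
    for (long i = 0; i < nelems; ++i)
        buf[i] = (int)i;

    MPI_Win_detach(win, buf);
    MPI_Win_free(&win);
    MPI_Free_mem(buf);
    MPI_Finalize();
    return 0;
}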
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users