Joseph
On 11/8/18 1:20 PM, Nathan Hjelm via users wrote:
Quick scan of the program and it looks OK to me. I will dig deeper and see if I can determine the underlying cause. What Open MPI version are you using?

-Nathan

On Nov 08, 2018, at 11:10 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:

While using the MCA parameter in a real application I noticed a strange effect, which took me a while to figure out: it appears that on the Aries network the accumulate operations are not atomic anymore. I am attaching a test program that shows the problem: all processes but one continuously increment a counter, while rank 0 continuously subtracts a large value and adds it back, eventually checking for the correct number of increments (a rough sketch of this access pattern is included further below). Without the MCA parameter the test at the end succeeds, as all increments are accounted for:

```
$ mpirun -n 16 -N 1 ./mpi_fetch_op_local_remote
result:15000
```

When setting the MCA parameter the test fails with garbage in the result:

```
$ mpirun --mca osc_rdma_acc_single_intrinsic true -n 16 -N 1 ./mpi_fetch_op_local_remote
result:25769849013
mpi_fetch_op_local_remote: mpi_fetch_op_local_remote.c:97: main: Assertion `sum == 1000*(comm_size-1)' failed.
```

All processes perform only MPI_Fetch_and_op in combination with MPI_SUM, so I assume that the test in combination with the MCA flag is correct. I cannot reproduce this issue on our IB cluster. Is this an issue in Open MPI, or is there some problem in the test case that I am missing?

Thanks in advance,
Joseph

On 11/6/18 1:15 PM, Joseph Schuchart wrote:

Thanks a lot for the quick reply. Setting osc_rdma_acc_single_intrinsic=true does the trick for both shared and exclusive locks and brings it down to <2us per operation. I hope that the info key will make it into the next version of the standard; I certainly have use for it :)

Cheers,
Joseph

On 11/6/18 12:13 PM, Nathan Hjelm via users wrote:

All of this is completely expected. Due to the requirements of the standard it is difficult to make use of network atomics even for MPI_Compare_and_swap (MPI_Accumulate and MPI_Get_accumulate spoil the party). If you want MPI_Fetch_and_op to be fast, set this MCA parameter:

```
osc_rdma_acc_single_intrinsic=true
```

A shared lock is slower than an exclusive lock because there is an extra lock step as part of the accumulate (it isn't needed if there is an exclusive lock). When setting the above parameter you are telling the implementation that you will only be using a single count, and we can optimize that with the hardware. The RMA working group is working on an info key that will essentially do the same thing.

Note that the above parameter won't help you with IB if you are using UCX, unless you set this (master only right now):

```
btl_uct_transports=dc_mlx5
btl=self,vader,uct
osc=^ucx
```

Though there may be a way to get osc/ucx to enable the same sort of optimization. I don't know.

-Nathan

On Nov 06, 2018, at 09:38 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:

All,

I am currently experimenting with MPI atomic operations and wanted to share some interesting results I am observing. The numbers below are measurements from both an IB-based cluster and our Cray XC40.
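For reference, the access pattern of the mpi_fetch_op_local_remote test described in the 8 November message above might look roughly like the following. This is only a sketch reconstructed from that description and the assertion in its output, not the attached program: the window allocation, the use of MPI_Win_lock_all, the 1000-iteration loops, and the exact value that rank 0 subtracts and re-adds are assumptions.

```
/* Sketch of the described atomicity test (not the original source).
 * Assumed: a 64-bit counter on rank 0, passive-target access via
 * MPI_Win_lock_all, and 1000 iterations per rank. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
  int rank, comm_size;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &comm_size);

  uint64_t *baseptr;
  MPI_Win win;
  MPI_Win_allocate(sizeof(uint64_t), sizeof(uint64_t), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &baseptr, &win);
  if (rank == 0) *baseptr = 0;
  MPI_Win_lock_all(0, win);
  MPI_Barrier(MPI_COMM_WORLD);

  if (rank == 0) {
    /* Rank 0 repeatedly subtracts a large value and adds it back. */
    for (int i = 0; i < 1000; ++i) {
      uint64_t res, val = UINT64_C(1) << 32;
      uint64_t neg = -val;  /* adding the two's complement subtracts val */
      MPI_Fetch_and_op(&neg, &res, MPI_UINT64_T, 0, 0, MPI_SUM, win);
      MPI_Win_flush(0, win);
      MPI_Fetch_and_op(&val, &res, MPI_UINT64_T, 0, 0, MPI_SUM, win);
      MPI_Win_flush(0, win);
    }
  } else {
    /* All other ranks increment the counter. */
    for (int i = 0; i < 1000; ++i) {
      uint64_t res, one = 1;
      MPI_Fetch_and_op(&one, &res, MPI_UINT64_T, 0, 0, MPI_SUM, win);
      MPI_Win_flush(0, win);
    }
  }

  MPI_Barrier(MPI_COMM_WORLD);
  if (rank == 0) {
    /* Read the counter atomically and verify that no increment was lost. */
    uint64_t sum, dummy = 0;
    MPI_Fetch_and_op(&dummy, &sum, MPI_UINT64_T, 0, 0, MPI_NO_OP, win);
    MPI_Win_flush(0, win);
    printf("result:%llu\n", (unsigned long long)sum);
    assert(sum == 1000 * (uint64_t)(comm_size - 1));
  }
  MPI_Win_unlock_all(win);
  MPI_Win_free(&win);
  MPI_Finalize();
  return 0;
}
```

Running such a test with and without --mca osc_rdma_acc_single_intrinsic true corresponds to the two outputs quoted above.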
The benchmarks look like the following snippet:

```
if (rank == 1) {
  uint64_t res, val;
  for (size_t i = 0; i < NUM_REPS; ++i) {
    MPI_Fetch_and_op(&val, &res, MPI_UINT32_T, 0, 0, MPI_SUM, win);
    MPI_Win_flush(target, win);
  }
}
MPI_Barrier(MPI_COMM_WORLD);
```

Only rank 1 performs atomic operations; rank 0 waits in a barrier (I have tried to confirm that the operations are done in hardware by letting rank 0 sleep for a while and ensuring that communication progresses). A self-contained variant of this snippet is sketched after this message. Of particular interest for my use-case is fetch_op, but I am including other operations here nevertheless:

* Linux Cluster, IB QDR *
average of 100000 iterations

```
                               fetch_op     compare_exchange  accumulate   get_accumulate
Exclusive lock, MPI_UINT32_T   4.323384us   2.035905us        4.326358us   4.334831us
Exclusive lock, MPI_UINT64_T   2.438080us   2.398836us        2.435378us   2.448347us
Shared lock,    MPI_UINT32_T   6.819977us   4.551417us        6.807766us   6.817602us
Shared lock,    MPI_UINT64_T   4.954860us   2.399373us        4.965702us   4.977876us
```

There are two interesting observations:
a) operations on 64-bit operands generally seem to have lower latencies than operations on 32-bit operands;
b) using an exclusive lock leads to lower latencies.

Overall, there is a factor of almost 3 between SharedLock+uint32_t and ExclusiveLock+uint64_t for fetch_and_op, accumulate, and get_accumulate (compare_exchange seems to be somewhat of an outlier).

* Cray XC40, Aries *
average of 100000 iterations

```
                               fetch_op     compare_exchange  accumulate   get_accumulate
Exclusive lock, MPI_UINT32_T   2.011794us   1.740825us        1.795500us   1.985409us
Exclusive lock, MPI_UINT64_T   2.017172us   1.846202us        1.812578us   2.005541us
Shared lock,    MPI_UINT32_T   5.380455us   5.164458us        5.230184us   5.399722us
Shared lock,    MPI_UINT64_T   5.415230us   1.855840us        5.212632us   5.396110us
```

The difference between exclusive and shared lock is about the same as with IB, and the latencies for 32-bit vs 64-bit are roughly the same (except for compare_exchange, it seems). So my question is: is this to be expected? Is the higher latency when using a shared lock caused by an internal lock being acquired because the hardware operations are not actually atomic?

I'd be grateful for any insight on this.

Cheers,
Joseph
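A self-contained variant of the snippet above, for the fetch_op case, might look as follows. This is a sketch, not the benchmark that produced the numbers: the window allocation, the explicit lock/unlock of the target, the timing code, and target = 0 are assumptions filled in around the quoted loop.

```
/* Sketch: latency of MPI_Fetch_and_op under an exclusive lock.
 * Window setup, locking, and timing are assumptions; only the inner
 * loop follows the quoted snippet. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <mpi.h>

#define NUM_REPS 100000

int main(int argc, char **argv) {
  int rank;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  const int target = 0;  /* rank 0 hosts the window memory */
  uint64_t *baseptr;
  MPI_Win win;
  MPI_Win_allocate(sizeof(uint64_t), sizeof(uint64_t), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &baseptr, &win);

  if (rank == 1) {
    /* Swap in MPI_LOCK_SHARED / MPI_UINT32_T for the other configurations. */
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win);
    uint64_t res, val = 1;
    double start = MPI_Wtime();
    for (size_t i = 0; i < NUM_REPS; ++i) {
      MPI_Fetch_and_op(&val, &res, MPI_UINT64_T, target, 0, MPI_SUM, win);
      MPI_Win_flush(target, win);
    }
    double avg_us = 1e6 * (MPI_Wtime() - start) / NUM_REPS;
    MPI_Win_unlock(target, win);
    printf("fetch_op: %fus\n", avg_us);
  }

  MPI_Barrier(MPI_COMM_WORLD);
  MPI_Win_free(&win);
  MPI_Finalize();
  return 0;
}
```

The MPI_Win_flush after each call means the measured latency includes completion of every operation at the target, not just local completion.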
--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart
Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users