Sorry, forgot the attachments.

On Thu, Aug 11, 2016 at 5:06 PM, Xiaolong Cui <sunshine...@gmail.com> wrote:
> Thanks! I tried it, but it didn't solve my problem. Maybe the reason is
> not eager/rndv.
>
> The reason why I want to always use eager mode is that I want to avoid a
> sender being slowed down by an unready receiver. I can prevent a sender
> from slowing down by always using eager mode on InfiniBand, just like your
> approach, but I cannot reproduce this on OPA. Based on the experiments
> below, it seems to me that a sender will be delayed to some extent due to
> reasons other than eager/rndv.
>
> I designed a simple test (see hello_world.c in the attachment) with one
> sender rank (r0) and one receiver rank (r1). r0 always runs at full speed,
> but r1 runs at full speed in one case and at half speed in the other. To
> run r1 at half speed, I co-locate a third rank r2 with r1 (see rankfile).
> Then I compare the completion time at r0 to see whether there is a
> slowdown when r1 is "not ready to receive". The result is positive. But it
> is surprising that the delay varies significantly when I change the
> message length; this is different from my previous observation when
> eager/rndv was the cause.
>
> So my question is: do you know of other factors that can delay an
> MPI_Send() when the receiver is not ready to receive?
>
> On Wed, Aug 10, 2016 at 11:48 PM, Cabral, Matias A
> <matias.a.cab...@intel.com> wrote:
>
>> To remain in eager mode you need to increase the value of
>> PSM2_MQ_RNDV_HFI_THRESH.
>>
>> PSM2_MQ_EAGER_SDMA_SZ is the threshold at which PSM2 switches from PIO
>> (which uses the CPU) to the SDMA engines. This summary may help:
>>
>>   PIO eager mode:   0 bytes                 -> PSM2_MQ_EAGER_SDMA_SZ - 1
>>   SDMA eager mode:  PSM2_MQ_EAGER_SDMA_SZ   -> PSM2_MQ_RNDV_HFI_THRESH - 1
>>   RNDV expected:    PSM2_MQ_RNDV_HFI_THRESH -> largest supported value
>>
>> Regards,
>>
>> _MAC
>>
>> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Xiaolong Cui
>> Sent: Wednesday, August 10, 2016 7:19 PM
>> To: Open MPI Users <users@lists.open-mpi.org>
>> Subject: Re: [OMPI users] runtime performance tuning for Intel OMA interconnect
>>
>> Hi Matias,
>>
>> Thanks a lot, that's very helpful!
>>
>> What I need indeed is to always use eager mode. But I didn't find any
>> information about PSM2_MQ_EAGER_SDMA_SZ online. Would you please
>> elaborate on "Just in case, PSM2_MQ_EAGER_SDMA_SZ changes PIO to SDMA,
>> always in eager mode"?
>>
>> Thanks!
>> Michael
>>
>> On Wed, Aug 10, 2016 at 3:59 PM, Cabral, Matias A
>> <matias.a.cab...@intel.com> wrote:
>>
>> Hi Michael,
>>
>> When Open MPI runs on Omni-Path it will choose the PSM2 MTL by default,
>> which uses libpsm2.so. Strictly speaking, it can also run over the openib
>> BTL, but the performance is so significantly impacted that this is not
>> only discouraged, no amount of tuning would make sense. Regarding the
>> PSM2 MTL, it currently supports only two MCA parameters
>> ("mtl_psm2_connect_timeout" and "mtl_psm2_priority"), which are not what
>> you are looking for. Instead, you can set values directly in the PSM2
>> library with environment variables. Further info is in the Programmer's
>> Guide:
>>
>> http://www.intel.com/content/dam/support/us/en/documents/network-and-i-o/fabric-products/Intel_PSM2_PG_H76473_v3_0.pdf
>>
>> More docs:
>>
>> https://www-ssl.intel.com/content/www/us/en/support/network-and-i-o/fabric-products/000016242.html?wapkw=psm2
>>
>> Now, for your parameters:
>>
>>   btl = openib,vader,self -> ignore this one
>>   btl_openib_eager_limit = 160000 -> I don't clearly see the difference
>>     from the parameter below; however, they are set to the same value.
>>     Just in case, PSM2_MQ_EAGER_SDMA_SZ changes PIO to SDMA, always in
>>     eager mode.
>>   btl_openib_rndv_eager_limit = 160000 -> PSM2_MQ_RNDV_HFI_THRESH
>>   btl_openib_max_send_size = 160000 -> does not apply to PSM2
>>   btl_openib_receive_queues =
>>     P,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,160000,1024,512,512
>>     -> does not apply to PSM2
>>
>> Thanks,
>> Regards,
>>
>> _MAC
>>
>> BTW, you should change the subject: OMA -> OPA
>>
>> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Xiaolong Cui
>> Sent: Tuesday, August 09, 2016 2:22 PM
>> To: users@lists.open-mpi.org
>> Subject: [OMPI users] runtime performance tuning for Intel OMA interconnect
>>
>> I used to tune the performance of Open MPI on InfiniBand by changing the
>> MCA parameters of the openib component (see
>> https://www.open-mpi.org/faq/?category=openfabrics). Now I have migrated
>> to a new cluster that deploys Intel's Omni-Path interconnect, and my
>> previous approach no longer works. Does anyone know how to tune the
>> performance for the Omni-Path interconnect (which Open MPI component to
>> change)?
>>
>> The version of Open MPI is openmpi-1.10.2-hfi. I have included the output
>> from ompi_info and the openib parameters that I used to change. Thanks!
>>
>> Sincerely,
>> Michael

--
Xiaolong Cui (Michael)
Department of Computer Science
Dietrich School of Arts & Science
University of Pittsburgh
Pittsburgh, PA 15260
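For reference, a minimal sketch of acting on the summary above: raise the rendezvous threshold so that larger messages stay in eager mode. The two numeric values (4 MB and 64 KB) are arbitrary illustrations rather than recommendations, and the setenv() calls assume that PSM2 reads its environment while it is being initialized inside MPI_Init(); exporting the variables in the launch environment instead (e.g. mpirun -x PSM2_MQ_RNDV_HFI_THRESH=4194304 ...) is the more conventional route.

/*
 * Sketch: keep messages up to ~4 MB in eager mode by raising the PSM2
 * rendezvous threshold, following the PIO/SDMA/RNDV summary above.
 * The values are illustrative only.  This assumes PSM2 picks up the
 * environment during MPI_Init(); setting the variables with
 * "mpirun -x ..." in the launch environment is the usual alternative.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    /* Below this size: eager mode (PIO or SDMA); at or above it: rendezvous. */
    setenv("PSM2_MQ_RNDV_HFI_THRESH", "4194304", 1);
    /* Below this size: PIO eager; at or above it: SDMA eager. */
    setenv("PSM2_MQ_EAGER_SDMA_SZ", "65536", 1);

    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        printf("PSM2_MQ_RNDV_HFI_THRESH = %s\n",
               getenv("PSM2_MQ_RNDV_HFI_THRESH"));
        printf("PSM2_MQ_EAGER_SDMA_SZ   = %s\n",
               getenv("PSM2_MQ_EAGER_SDMA_SZ"));
    }

    MPI_Finalize();
    return 0;
}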
#include "mpi.h" #include <time.h> #include <mpi-ext.h> #include <math.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <pthread.h> #include <signal.h> #define MAT_SIZE 800 #define NITER 20 int matrix_multiply(int size){ int a[size][size]; int b[size][size]; int c[size][size]; int i, j, k; srand(time(NULL)); for(i = 0; i < size; i++) for(j = 0; j < size; j++){ a[i][j] = rand() % 100; b[i][j] = rand() % 100; } for(i = 0; i < size; i++){ for(j = 0; j < size; j++){ int temp = 0; for(k = 0; k < size; k++){ temp += a[i][k]*b[k][j]; } c[i][j] = temp; } } return c[0][0]; } int main(int argc, char** argv){ int size; int rank; int msg_len, iters; int i; int mat_size; struct timeval t1, t2; MPI_Errhandler ls_errh; if(argc != 2){ printf("Wrong arguments! Need to provide message length\n"); exit(0); } msg_len = atoi(argv[1]); char *buf = (char *)malloc(msg_len * sizeof(char)); MPI_Init(&argc, &argv); MPI_Comm_size(MPI_COMM_WORLD, &size); MPI_Comm_rank(MPI_COMM_WORLD, &rank); gettimeofday(&t1, 0x0); /*main loop, simulating HPC workload*/ for(i = 0; i < NITER; i++){ /*local compuatation*/ matrix_multiply(MAT_SIZE); /*communication*/ if(rank == 0){ MPI_Send(buf, msg_len, MPI_CHAR, 1, 0, MPI_COMM_WORLD); } else if(rank == 1){ MPI_Recv(buf, msg_len, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE); } } gettimeofday(&t2, 0x0); double sec = (t2.tv_sec - t1.tv_sec); double usec = (t2.tv_usec - t1.tv_usec) / 1000000.0; double diff = sec + usec; printf("[%d] Total time is %.3f seconds\n", rank, diff); MPI_Finalize(); return 0; }
Attachment: rankfile (binary data)