Sorry, forgot the attachments.

On Thu, Aug 11, 2016 at 5:06 PM, Xiaolong Cui <sunshine...@gmail.com> wrote:
> Thanks! I tried it, but it didn't solve my problem. Maybe the reason is
> not eager/rndv.
>
> The reason why I want to always use eager mode is that I want to avoid a
> sender being slowed down by an unready receiver. I can prevent a sender
> from slowing down by always using eager mode on InfiniBand, just like your
> approach, but I cannot reproduce this on OPA. Based on the experiments
> below, it seems to me that a sender will be delayed to some extent due to
> reasons other than eager/rndv.
>
> I designed a simple test (see hello_world.c in the attachment) with one
> sender rank (r0) and one receiver rank (r1). r0 always runs at full speed,
> but r1 runs at full speed in one case and at half speed in the other. To
> run r1 at half speed, I co-locate a third rank r2 with r1 (see rankfile).
> Then I compare the completion time at r0 to see whether there is a
> slowdown when r1 is "not ready to receive". The result is positive. But it
> is surprising that the delay varies significantly when I change the
> message length; this is different from my previous observation when
> eager/rndv was the cause.
>
> So my question is: do you know of other factors that can delay an
> MPI_Send() when the receiver is not ready to receive?
>
> On Wed, Aug 10, 2016 at 11:48 PM, Cabral, Matias A
> <matias.a.cab...@intel.com> wrote:
>
>> To remain in eager mode you need to increase the value of
>> PSM2_MQ_RNDV_HFI_THRESH.
>>
>> PSM2_MQ_EAGER_SDMA_SZ is the threshold at which PSM2 switches from PIO
>> (which uses the CPU) to the SDMA engines. This summary may help:
>>
>>   PIO eager mode:   0 bytes                 -> PSM2_MQ_EAGER_SDMA_SZ - 1
>>   SDMA eager mode:  PSM2_MQ_EAGER_SDMA_SZ   -> PSM2_MQ_RNDV_HFI_THRESH - 1
>>   RNDV expected:    PSM2_MQ_RNDV_HFI_THRESH -> largest supported value
>>
>> Regards,
>>
>> _MAC
>>
>> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Xiaolong Cui
>> Sent: Wednesday, August 10, 2016 7:19 PM
>> To: Open MPI Users <users@lists.open-mpi.org>
>> Subject: Re: [OMPI users] runtime performance tuning for Intel OMA interconnect
>>
>> Hi Matias,
>>
>> Thanks a lot, that's very helpful!
>>
>> What I need indeed is to always use eager mode. But I didn't find any
>> information about PSM2_MQ_EAGER_SDMA_SZ online. Would you please
>> elaborate on "Just in case, PSM2_MQ_EAGER_SDMA_SZ changes PIO to SDMA,
>> always in eager mode"?
>>
>> Thanks!
>> Michael
>>
>> On Wed, Aug 10, 2016 at 3:59 PM, Cabral, Matias A
>> <matias.a.cab...@intel.com> wrote:
>>
>> Hi Michael,
>>
>> When Open MPI runs on Omni-Path it will choose the PSM2 MTL by default,
>> which uses libpsm2.so. Strictly speaking, it can also run over the openib
>> BTL, but the performance is so significantly impacted that this is not
>> only discouraged, no amount of tuning would make sense. Regarding the
>> PSM2 MTL, it currently supports only two MCA parameters
>> ("mtl_psm2_connect_timeout" and "mtl_psm2_priority"), which are not what
>> you are looking for. Instead, you can set values directly in the PSM2
>> library with environment variables. Further info is in the Programmer's
>> Guide:
>>
>> http://www.intel.com/content/dam/support/us/en/documents/network-and-i-o/fabric-products/Intel_PSM2_PG_H76473_v3_0.pdf
>>
>> More docs:
>>
>> https://www-ssl.intel.com/content/www/us/en/support/network-and-i-o/fabric-products/000016242.html?wapkw=psm2
>>
>> Now, for your parameters:
>>
>>   btl = openib,vader,self -> ignore this one
>>   btl_openib_eager_limit = 160000 -> I don't clearly see the difference
>>     from the parameter below; however, they are set to the same value.
>>     Just in case, PSM2_MQ_EAGER_SDMA_SZ changes PIO to SDMA, always in
>>     eager mode.
>>   btl_openib_rndv_eager_limit = 160000 -> PSM2_MQ_RNDV_HFI_THRESH
>>   btl_openib_max_send_size = 160000 -> does not apply to PSM2
>>   btl_openib_receive_queues =
>>     P,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,160000,1024,512,512
>>     -> does not apply to PSM2
>>
>> Thanks,
>> Regards,
>>
>> _MAC
>>
>> BTW, you should change the subject: OMA -> OPA
>>
>> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Xiaolong Cui
>> Sent: Tuesday, August 09, 2016 2:22 PM
>> To: users@lists.open-mpi.org
>> Subject: [OMPI users] runtime performance tuning for Intel OMA interconnect
>>
>> I used to tune the performance of Open MPI on InfiniBand by changing the
>> MCA parameters of the openib component (see
>> https://www.open-mpi.org/faq/?category=openfabrics). Now I have migrated
>> to a new cluster that deploys Intel's Omni-Path interconnect, and my
>> previous approach no longer works. Does anyone know how to tune the
>> performance for the Omni-Path interconnect (which Open MPI component to
>> change)?
>>
>> The version of Open MPI is openmpi-1.10.2-hfi. I have included the output
>> from ompi_info and the openib parameters that I used to change. Thanks!
>>
>> Sincerely,
>> Michael

--
Xiaolong Cui (Michael)
Department of Computer Science
Dietrich School of Arts & Science
University of Pittsburgh
Pittsburgh, PA 15260
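For reference, a minimal sketch of acting on the summary above: raise the rendezvous threshold so that larger messages stay in eager mode. The two numeric values (4 MB and 64 KB) are arbitrary illustrations rather than recommendations, and the setenv() calls assume that PSM2 reads its environment while it is being initialized inside MPI_Init(); exporting the variables in the launch environment instead (e.g. mpirun -x PSM2_MQ_RNDV_HFI_THRESH=4194304 ...) is the more conventional route.

/*
 * Sketch: keep messages up to ~4 MB in eager mode by raising the PSM2
 * rendezvous threshold, following the PIO/SDMA/RNDV summary above.
 * The values are illustrative only.  This assumes PSM2 picks up the
 * environment during MPI_Init(); setting the variables with
 * "mpirun -x ..." in the launch environment is the usual alternative.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    /* Below this size: eager mode (PIO or SDMA); at or above it: rendezvous. */
    setenv("PSM2_MQ_RNDV_HFI_THRESH", "4194304", 1);
    /* Below this size: PIO eager; at or above it: SDMA eager. */
    setenv("PSM2_MQ_EAGER_SDMA_SZ", "65536", 1);

    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        printf("PSM2_MQ_RNDV_HFI_THRESH = %s\n",
               getenv("PSM2_MQ_RNDV_HFI_THRESH"));
        printf("PSM2_MQ_EAGER_SDMA_SZ   = %s\n",
               getenv("PSM2_MQ_EAGER_SDMA_SZ"));
    }

    MPI_Finalize();
    return 0;
}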
#include "mpi.h" #include <time.h> #include <mpi-ext.h> #include <math.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <pthread.h> #include <signal.h> #define MAT_SIZE 800 #define NITER 20 int matrix_multiply(int size){ int a[size][size]; int b[size][size]; int c[size][size]; int i, j, k; srand(time(NULL)); for(i = 0; i < size; i++) for(j = 0; j < size; j++){ a[i][j] = rand() % 100; b[i][j] = rand() % 100; } for(i = 0; i < size; i++){ for(j = 0; j < size; j++){ int temp = 0; for(k = 0; k < size; k++){ temp += a[i][k]*b[k][j]; } c[i][j] = temp; } } return c[0][0]; } int main(int argc, char** argv){ int size; int rank; int msg_len, iters; int i; int mat_size; struct timeval t1, t2; MPI_Errhandler ls_errh; if(argc != 2){ printf("Wrong arguments! Need to provide message length\n"); exit(0); } msg_len = atoi(argv[1]); char *buf = (char *)malloc(msg_len * sizeof(char)); MPI_Init(&argc, &argv); MPI_Comm_size(MPI_COMM_WORLD, &size); MPI_Comm_rank(MPI_COMM_WORLD, &rank); gettimeofday(&t1, 0x0); /*main loop, simulating HPC workload*/ for(i = 0; i < NITER; i++){ /*local compuatation*/ matrix_multiply(MAT_SIZE); /*communication*/ if(rank == 0){ MPI_Send(buf, msg_len, MPI_CHAR, 1, 0, MPI_COMM_WORLD); } else if(rank == 1){ MPI_Recv(buf, msg_len, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE); } } gettimeofday(&t2, 0x0); double sec = (t2.tv_sec - t1.tv_sec); double usec = (t2.tv_usec - t1.tv_usec) / 1000000.0; double diff = sec + usec; printf("[%d] Total time is %.3f seconds\n", rank, diff); MPI_Finalize(); return 0; }
Attachment: rankfile (binary data)