Hello everybody,
I am measuring timings for MPI_Send/MPI_Recv. I perform a single communication between 2 processes and repeat it several times to get meaningful values. The message size varies from 64 bytes up to 16 MB, doubling each time (64, 128, 256, ..., 8M, 16M).
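In essence, the benchmark loop looks like this (a simplified sketch, not my exact code; error handling omitted):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NREP 1000   /* repetitions per message size */

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* 64 bytes up to 16 MB, doubling each time */
        for (size_t size = 64; size <= 16UL * 1024 * 1024; size *= 2) {
            char *buf = malloc(size);
            double t0 = MPI_Wtime();
            for (int i = 0; i < NREP; i++) {
                if (rank == 0)
                    MPI_Send(buf, (int)size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                else
                    MPI_Recv(buf, (int)size, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
            }
            if (rank == 0)
                printf("%zu bytes: %g s per send\n", size,
                       (MPI_Wtime() - t0) / NREP);
            free(buf);
        }
        MPI_Finalize();
        return 0;
    }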

Some context on how I execute this benchmark. The experiment runs on a multicore architecture, with the 2 processes bound to 2 distinct cores and running on the same node. The underlying CPU is an AMD Istanbul (6 cores) with 64 KB L1 data cache, 64 KB L1 instruction cache, 512 KB L2 cache per core, and a 6 MB shared L3 cache. The node contains 2 sockets, so each CPU gets exactly one of the 2 MPI processes.

I am using OpenMPI version 1.4.4 (compiled by myself with the default configuration; I did not enable any fancy SM options).

To force the SM module I run my code with the MCA parameter "--mca btl sm,self". I am also aware of the *eager_limit* and the various other thresholds present in the OpenMPI library. To avoid confusion, I set both *btl_sm_eager_limit* and *btl_sm_max_send_size* to 16 MB (more than twice the size of the 6 MB L3 cache).
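For reference, the full invocation looks roughly like this (the executable name is a placeholder; --bind-to-core is the binding flag of the 1.4 series):

    mpirun -np 2 --bind-to-core --mca btl sm,self \
           --mca btl_sm_eager_limit 16777216 \
           --mca btl_sm_max_send_size 16777216 \
           ./pingpong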

Besides the time, I measure a couple of HW counters using PAPI. In particular I am interested in total instructions (PAPI_TOT_INS) and branch instructions (PAPI_BR_INS).
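Inside the loop above, the send side is instrumented roughly like this (sketch using PAPI's classic high-level counter API; error checks omitted):

    #include <papi.h>

    int events[2] = { PAPI_TOT_INS, PAPI_BR_INS };
    long long counts[2];

    PAPI_library_init(PAPI_VER_CURRENT);    /* initialize PAPI once */
    PAPI_start_counters(events, 2);         /* start HW counters on this core */
    MPI_Send(buf, (int)size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    PAPI_stop_counters(counts, 2);          /* counts[0]=TOT_INS, counts[1]=BR_INS */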



Enough with the context; this is what I am observing. At 16 MB there is a clear increase in the number of total and branch instructions, which can be explained by my settings of eager_limit and max_send_size.

However, something weird already happens at 32 KB, where I clearly see an increase in the number of branches and total instructions. There are almost no branch instructions up to 32 KB; from 32 KB to 16 MB they increase linearly. At 16 MB there is another jump, followed again by a linear increase. It seems that another threshold is driving this behavior. I tried setting other SM BTL parameters (btl_sm_fifo_size, btl_sm_exclusivity), but nothing changed. My understanding is that this is some kind of pipelining, with the message being transferred in chunks (probably 32 KB each).
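To illustrate what I mean, here is a purely conceptual sketch (this is not OpenMPI's code; the chunk size and the notify function are made up):

    size_t chunk = 32 * 1024;                /* hypothetical fragment size */
    for (size_t off = 0; off < len; off += chunk) {
        size_t n = (len - off < chunk) ? len - off : chunk;
        memcpy(fifo_slot, src + off, n);     /* copy one fragment into shared memory */
        notify_receiver(off, n);             /* hypothetical: post a fragment descriptor */
    }

Per-fragment bookkeeping like this would add a handful of branches per 32 KB, which would match the linear growth I see.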

How can I override this behavior? Is there any parameter I can set?
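(For reference, the full list of SM BTL parameters and their defaults can be dumped with "ompi_info --param btl sm".)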


I also noticed that while this happens for MPI_Send, MPI_Recv behaves differently. For the receive routine there is no bump in branch or total instructions; the increase is linear starting from 64 bytes, although the growth of branch instructions slows down after the 16 MB threshold. My guess is that the receive is busy-waiting for the message, so the number of branches grows proportionally with the time spent waiting for the message to arrive.
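Conceptually, I imagine something like this on the receive side (again made up for illustration, not OpenMPI source):

    while (!fragment_arrived(fifo))          /* hypothetical polling predicate */
        ;                                    /* spin: one branch per iteration */

which would make PAPI_BR_INS grow with the wait time rather than with the message size alone.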

This is my hypothesis, but you probably know better. The graphs are attached. Thanks in advance for your help.

cheers, Simone P.



Attachment: data.pdf