Hi,

I am trying to implement the following collectives over MPI shared memory with zero-copy optimizations: Alltoall, Broadcast, and Reduce.

For Reduce, my compiler allocates all the send buffers in shared memory (anonymous mmap), and allocates a single receive buffer, again in shared memory. All the processes then reduce into the root's buffer in a data-parallel manner. It looks like Open MPI does something similar, except that it must copy from/to the send/receive buffers, so my implementation of Reduce should perform better for large buffer sizes. But that is not the case. Does anybody know why? Any pointers are welcome.

Also, the Open MPI Reduce performance shows large variations. I run Reduce with different array sizes with np = 8, 50 times each, and for a single array size I find a significant number of outliers. Did anybody face similar problems?

Thanks,
Nilesh.