Hi,
I would suggest that (if you haven't done it already) you trace your
program's execution with Vampir or Scalasca. The latter has some pretty nice
analysis capabilities built-in and can detect common patterns that would
keep your code from scaling, no matter how good the MPI library is. Also,
out of curiosity, have you logged the time when the SP called "send" and
compared it with the time when the message was received and the time when it
was picked up in MPI_Test? In other words, have you actually verified that the
delay is in the MPI library as opposed to in your application?
On