If the above doesn't improve anything, the next question is: do you know what the message sizes are? For very small messages I believe Scali shows 2x better performance than Intel and OMPI (I think this is due to a fastpath optimization).
I remember that MVAPICH was faster than Scali for small messages (I'm talking only about IB, no sm). OMPI 1.3 latency is very close to MVAPICH latency, so I do not see how Scali latency could be better than OMPI's.
Pasha