> > No, I assumed it based on comparisons between doing and not doing
> > small msg rdma at various scales, from a paper Galen pointed out to me.
> > http://www.cs.unm.edu/~treport/tr/05-10/Infiniband.pdf
Actually, I wasn't so much concerned with how you jumped to your
conclusion. I just wanted to point out that you did. Most people who
focus on ping-pong latency like you have don't realize that they're
jumping to a conclusion. You suggested that optimizing for a latency
micro-benchmark would benefit small clusters, and that's just not
(uniformly) true.

> Benchmarks are what they are. In the above paper, the tests place the
> cross-over at around 64 nodes and that confirms a number of anecdotal
> reports I got. It may well be that in some situations, small-msg rdma
> is better only for 2 nodes, but that's not such a likely scenario;
> reality is sometimes linear (at least at our scale :-) ) after all.

Indeed. Well, if you didn't like me pointing out that jump, then I'll
try a different one. It's fairly straightforward to correlate the
latency performance of the micro-benchmark directly to RDMA versus
send/recv. You can't really do the same for the NPB results, since
things like collective communication performance can play a big part.
So the assumption that RDMA is the reason MVAPICH wins where it does
may not hold.

I apologize if it seems like I'm picking on you. I'm hypersensitive to
people trying to make judgements based on micro-benchmark performance.
I've been trying to make an argument that two-node ping-pong latency
comparisons really only have meaning in the context of a whole system.
The answer to the question of why the latency performance of my
10,000-node machine is worse than someone else's 128-node cluster has
a lot to do with meeting the scaling requirements of a 10,000-node
machine. (To some extent it has to do with the vendor as well, but
that's a different story...)

-Ron
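P.S. For anyone following along who hasn't run one: the two-node
ping-pong test under discussion is just a timed echo loop, with the
one-way latency conventionally reported as half the average round trip.
Here's a minimal sketch of the idea; it uses loopback TCP between two
threads as a stand-in for MPI over InfiniBand (my substitution, not
what the paper measured), so the absolute numbers are illustrative only.

```python
import socket
import threading
import time

MSG_SIZE = 8        # small message, like the latency tests in question
ITERATIONS = 1000

def pong_side(listener):
    """Accept one connection and echo every message straight back."""
    conn, _ = listener.accept()
    conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    with conn:
        for _ in range(ITERATIONS):
            # 8-byte messages arrive whole on loopback, so no
            # partial-read handling in this sketch
            data = conn.recv(MSG_SIZE)
            conn.sendall(data)

def ping_pong_latency():
    """Return the conventional one-way latency: half the mean round trip."""
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    echo = threading.Thread(target=pong_side, args=(listener,))
    echo.start()

    client = socket.socket()
    # disable Nagle so small sends go out immediately, as a real
    # latency benchmark would
    client.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    client.connect(listener.getsockname())
    msg = b"x" * MSG_SIZE
    start = time.perf_counter()
    for _ in range(ITERATIONS):
        client.sendall(msg)
        client.recv(MSG_SIZE)        # block until the pong comes back
    elapsed = time.perf_counter() - start

    client.close()
    echo.join()
    listener.close()
    return elapsed / ITERATIONS / 2

if __name__ == "__main__":
    print("one-way latency: %.1f us" % (ping_pong_latency() * 1e6))
```

The point the sketch makes concrete is how little this exercises: one
sender, one receiver, one message in flight. Nothing about it predicts
behavior under the congestion, collectives, and connection-state scaling
that a full application on a large machine sees.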