On 22-Sep-11 12:09 AM, Jeff Squyres wrote:
> On Sep 21, 2011, at 4:24 PM, Sébastien Boisvert wrote:
>
>>> What happens if you run 2 ibv_rc_pingpong's on each node? Or N
>>> ibv_rc_pingpongs?
>>
>> With 11 ibv_rc_pingpong's
>>
>> http://pastebin.com/85sPcA47
>>
>> Code to do that => https://gist.github.com/1233173
>>
>> Latencies are around 20 microseconds.
>
> This seems to imply that the network is to blame for the higher latency...?

Interesting... I'm getting the same latency with ibv_rc_pingpong.
I get 8.5 usec for a single ping-pong.

Please run 'ibclearcounters' to reset the fabric counters, then
ibdiagnet to make sure that the fabric is clean.
If you have a 4x QDR cluster, run ibdiagnet as follows:

ibdiagnet --ls 10 --lw 4x

Check that you don't have any errors/warnings.

Then please run your script with ib_write_lat instead of ibv_rc_pingpong.
Just replace the command in the script and the rest would be fine.

If the fabric is clean, you're supposed to get a typical
latency of ~1.4 usec.

-- YK

> I.e., if you run the same pattern with MPI processes and get 20us latency,
> that would tend to imply that the network itself is not performing well with
> that IO pattern.
>
>> My job seems to do well so far with ofud !
>>
>> [sboisver12@colosse2 ray]$ qstat
>> job-ID prior name user state submit/start at queue slots ja-task-ID
>> -----------------------------------------------------------------------------------------------------------------
>> 3047460 0.55384 fish-Assem sboisver12 r 09/21/2011 15:02:25 med@r104-n58 256
>
> I would still be suspicious -- ofud is not well tested, and it can definitely
> hang if there are network drops.