On 22-Sep-11 12:09 AM, Jeff Squyres wrote:
> On Sep 21, 2011, at 4:24 PM, Sébastien Boisvert wrote:
> 
>>> What happens if you run 2 ibv_rc_pingpong's on each node?  Or N 
>>> ibv_rc_pingpongs?
>>
>> With 11 ibv_rc_pingpong's
>>
>> http://pastebin.com/85sPcA47
>>
>> Code to do that =>  https://gist.github.com/1233173
>>
>> Latencies are around 20 microseconds.
> 
> This seems to imply that the network is to blame for the higher latency...?

Interesting... I'm getting the same latency with ibv_rc_pingpong.
I get 8.5 usec for a single ping-pong.

Please run 'ibclearcounters' to reset the fabric counters, then run
ibdiagnet to make sure that the fabric is clean.
If you have a 4x QDR cluster, run ibdiagnet as follows:

ibdiagnet --ls 10 --lw 4x 

Check that you don't have any errors/warnings.
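
If the ibdiagnet output is long, a small filter can help with that check.
This is only a sketch: it assumes you saved the output to a text file and
that your ibdiagnet version marks errors with a leading "-E-" and warnings
with a leading "-W-". The function name is made up for illustration; adjust
the prefixes if your output looks different.

```python
def find_problems(log_text):
    """Split ibdiagnet output into error lines and warning lines.

    Assumption: errors start with "-E-" and warnings start with "-W-".
    """
    errors = []
    warnings = []
    for line in log_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("-E-"):
            errors.append(stripped)
        elif stripped.startswith("-W-"):
            warnings.append(stripped)
    return errors, warnings
```

An empty pair of lists would suggest the fabric is clean, under the
assumptions above.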

Then please run your script with ib_write_lat instead of ibv_rc_pingpong.
Just replace the command in the script; the rest can stay as it is.
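
For reference, here is a rough sketch of the command lines for N concurrent
ib_write_lat pairs, one TCP port per pair. It is not taken from the gist
above. It assumes the perftest ib_write_lat accepts "-p <port>" and that the
client takes the server host name as its last argument; the host name and
the function name are placeholders.

```python
def write_lat_commands(server_host, pairs, base_port=18515):
    """Return a list of (server_cmd, client_cmd) argument lists.

    Assumption: "-p" selects the TCP port used for connection setup,
    so each concurrent pair gets its own port.
    """
    commands = []
    for i in range(pairs):
        port = str(base_port + i)
        server_cmd = ["ib_write_lat", "-p", port]
        client_cmd = ["ib_write_lat", "-p", port, server_host]
        commands.append((server_cmd, client_cmd))
    return commands
```

Start each server command on one node and the matching client command on the
other node, for example with write_lat_commands("node-a", 11) to match the
11 ping-pongs tested earlier.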

If the fabric is clean, you should see a typical
latency of ~1.4 usec.

-- YK


> I.e., if you run the same pattern with MPI processes and get 20us latency, 
> that would tend to imply that the network itself is not performing well with 
> that IO pattern.
> 
>> My job seems to do well so far with ofud !
>>
>> [sboisver12@colosse2 ray]$ qstat
>> job-ID  prior   name       user         state submit/start at     queue                            slots ja-task-ID
>> -----------------------------------------------------------------------------------------------------------------
>> 3047460 0.55384 fish-Assem sboisver12   r     09/21/2011 15:02:25 med@r104-n58                     256
> 
> I would still be suspicious -- ofud is not well tested, and it can definitely 
> hang if there are network drops.
> 
