On Oct 30, 2012, at 9:51 AM, Hodge, Gary C wrote:

> FYI, recently, I was tracking down the source of page faults in our 
> application that has real-time requirements.  I found that disabling the sm 
> component (--mca btl ^sm) eliminated many page faults I was seeing.  

Good point.  This is likely true; the shared memory component will definitely 
cause more page faults, since it works by mapping in (and then touching) pages 
that are shared with on-node peers.  Using huge pages may alleviate this 
(e.g., fewer TLB misses), but we haven't studied it much.
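
If you want to see the effect in numbers, getrusage() reports the fault counts 
directly.  Here's an untested sketch -- the MPI_Bcast loop is just a stand-in 
for your app's communication phase; run it once as-is and once under 
"mpirun --mca btl ^sm" and compare the per-rank counts:

#include <mpi.h>
#include <stdio.h>
#include <sys/resource.h>

int main(int argc, char **argv)
{
    struct rusage before, after;
    int i, rank, buf = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    getrusage(RUSAGE_SELF, &before);

    /* Stand-in for the real communication phase */
    for (i = 0; i < 1000; ++i) {
        MPI_Bcast(&buf, 1, MPI_INT, 0, MPI_COMM_WORLD);
    }

    getrusage(RUSAGE_SELF, &after);

    printf("rank %d: %ld minor / %ld major page faults\n", rank,
           after.ru_minflt - before.ru_minflt,
           after.ru_majflt - before.ru_majflt);

    MPI_Finalize();
    return 0;
}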

> I now have much better deterministic performance in that I no longer see 
> outlier measurements (jobs that usually take 3 ms would sometimes take 15 
> ms).  

I'm not sure I grok that; are you benchmarking an entire *job* (i.e., a single 
"mpirun") that varies between 3 and 15 milliseconds?  If so, I'd say that both 
are pretty darn good, because mpirun incurs a lot of overhead for launching 
and completing jobs.  Furthermore, benchmarking an entire job that lasts 
significantly less than 1 second is probably not the most stable measurement, 
page faults or no -- there are lots of other distributed and OS effects that 
can cause a jump from 3 to 15 milliseconds.
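
A more stable approach is to time just the region you care about *inside* the 
application, over many iterations.  Untested sketch of what I mean -- 
do_one_job() is a dummy placeholder for your real ~3 ms workload, and the 
iteration count is arbitrary:

#include <mpi.h>
#include <stdio.h>

/* Dummy stand-in for the real per-job work */
static void do_one_job(void)
{
    volatile double x = 0.0;
    int j;
    for (j = 0; j < 100000; ++j)
        x += j;
}

int main(int argc, char **argv)
{
    int i, rank;
    double t, min_t = 1e9, max_t = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < 1000; ++i) {
        MPI_Barrier(MPI_COMM_WORLD);  /* start each iteration together */
        t = MPI_Wtime();
        do_one_job();
        t = MPI_Wtime() - t;
        if (t < min_t) min_t = t;
        if (t > max_t) max_t = t;
    }

    printf("rank %d: min %.3f ms / max %.3f ms\n",
           rank, min_t * 1000.0, max_t * 1000.0);

    MPI_Finalize();
    return 0;
}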

> I did not notice a performance penalty using a network stack.

Depends on the app.  Some MPI apps are latency bound; some are not.

Latency-bound applications will definitely benefit from faster point-to-point 
performance, and shared memory will definitely have the lowest point-to-point 
latency of any transport: typically hundreds of nanoseconds, vs. a microsecond 
or more through a network stack.
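
You can measure that difference on your own hardware with a standard ping-pong 
microbenchmark.  Untested sketch; run it with 2 processes on one node, then 
again with "--mca btl ^sm" to force the network stack:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int i, rank;
    char byte = 0;
    const int iters = 100000;
    double t;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    t = MPI_Wtime();
    for (i = 0; i < iters; ++i) {
        if (rank == 0) {
            MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t = MPI_Wtime() - t;

    /* Half the round-trip time = one-way latency */
    if (rank == 0)
        printf("one-way latency: %.3f us\n",
               t / (2.0 * iters) * 1e6);

    MPI_Finalize();
    return 0;
}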

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

