Given the oversubscription on the existing HT links, could contention account 
for the difference?  (I have no idea how HT's contention management works.)  
Meaning: if the stars line up in a given run, you could end up with very 
little or no contention and get good bandwidth.  But if there's a bit of 
jitter, you could end up with quite a bit of contention that cascades 
into a bunch of additional delay.

I fail to see how that could add up to 70-80 (or more) seconds of difference -- 
13 secs vs. 90+ seconds, though...  70-80 seconds sounds like an IO 
delay -- perhaps paging due to the ramdisk or some such...?  That's a SWAG.



On Jul 15, 2010, at 10:40 AM, Jed Brown wrote:

> On Thu, 15 Jul 2010 09:36:18 -0400, Jeff Squyres <jsquy...@cisco.com> wrote:
> > Per my other disclaimer, I'm trolling through my disastrous inbox and
> > finding some orphaned / never-answered emails.  Sorry for the delay!
> 
> No problem, I should have followed up on this with further explanation.
> 
> > Just to be clear -- you're running 8 procs locally on an 8 core node,
> > right?
> 
> These are actually 4-socket quad-core nodes, so there are 16 cores
> available, but we are only running on 8, -npersocket 2 -bind-to-socket.
> This was a greatly simplified case, but is still sufficient to show the
> variability.  It tends to be somewhat worse if we use all cores of a
> node.
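> 
> (For concreteness, the launch line is of the form
> 
>   mpirun -np 8 -npersocket 2 -bind-to-socket ./app
> 
> where ./app stands in for the actual benchmark binary.)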
> 
> > (Cisco is an Intel partner -- I don't follow the AMD line
> > much) So this should all be local communication with no external
> > network involved, right?
> 
> Yes, this was the greatly simplified case, contained entirely within a
> single node.
> 
> > > lsf.o240562 killed       8*a6200
> > > lsf.o240563 9.2110e+01   8*a6200
> > > lsf.o240564 1.5638e+01   8*a6237
> > > lsf.o240565 1.3873e+01   8*a6228
> >
> > Am I reading that right that it's 92 seconds vs. 13 seconds?  Woof!
> 
> Yes, and the "killed" means it wasn't done after 120 seconds.  This
> factor of 10 is about the worst we see, but it is of course very surprising.
> 
> > Nice and consistent, as you mentioned.  And I assume your notation
> > here means that it's across 2 nodes.
> 
> Yes, the Quadrics nodes are 2-socket dual-core, so 8 procs need two
> nodes.
> 
> The rest of your observations are consistent with my understanding.  We
> identified two other issues, neither of which accounts for a factor of
> 10, but which account for at least a factor of 3.
> 
> 1. The administrators mounted a 16 GB ramdisk on /scratch, but did not
>    ensure that it was wiped before the next task ran.  So if you got a
>    node after some job that left stinky feces there, you could
>    effectively only have 16 GB (before the old stuff would be swapped
>    out).  More importantly, the physical pages backing the ramdisk may
>    not be uniformly distributed across the sockets, and rather than
>    preemptively swap out those old ramdisk pages, the kernel would find
>    a page on some other socket instead of locally (this can be
>    confirmed, for example, by watching the numa_foreign and numa_miss
>    counts with numastat).  Then when you went to use that memory
>    (typically in a bandwidth-limited application), it was easy to have 3
>    sockets all waiting on one bus, thus taking a factor of 3+
>    performance hit despite a resident set much less than 50% of the
>    available memory.  I have a rather complete analysis of this in case
>    someone is interested.  Note that this can affect programs with
>    static or dynamic allocation (the kernel looks for local pages when
>    you fault a page, not when you allocate it); the only way I know of
>    to circumvent the problem is to allocate memory with libnuma
>    (e.g. numa_alloc_local), which will fail if local memory isn't
>    available, instead of returning and subsequently faulting remote
>    pages.  A sketch of that allocation is below.
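> 
>    (Names and sizes in this sketch are only for illustration and are
>    untested on this machine; link with -lnuma.)
> 
>      #include <numa.h>
>      #include <stdio.h>
> 
>      /* Return a buffer backed by pages on the calling thread's NUMA
>         node, or NULL if no local memory is available, rather than
>         silently faulting in remote pages later. */
>      static double *alloc_local_array(size_t n)
>      {
>        if (numa_available() < 0) {
>          fprintf(stderr, "libnuma not available\n");
>          return NULL;
>        }
>        return numa_alloc_local(n * sizeof(double));
>      }
> 
>      /* ... use the array, then release it with
>         numa_free(ptr, n * sizeof(double)); ... */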
> 
> 2. The memory bandwidth is 16-18% different between sockets, with
>    sockets 0,3 being slow and sockets 1,2 having much faster available
>    bandwidth.  This is fully reproducible and acknowledged by
>    Sun/Oracle; their response to an early inquiry:
> 
>      http://59A2.org/files/SunBladeX6440STREAM-20100616.pdf
> 
>    I am not completely happy with this explanation because the issue
>    persists even with full software prefetch, packed SSE2, and
>    non-temporal stores, as long as the working set does not fit within
>    the (per-socket) L3.  Note that the software prefetch allows for several
>    hundred cycles of latency, so the extra hop for snooping shouldn't be
>    a problem.  If the working set fits within L3, then all sockets are
>    the same speed (and of course much faster due to improved bandwidth).
>    Some disassembly here:
> 
>      http://gist.github.com/476942
> 
>    The three variants with prefetch and movntpd run within 2% of each
>    other; the other one is much faster within cache and much slower when
>    it breaks out of cache (obviously).  The performance numbers are
>    higher than with the reference implementation (quoted in Sun/Oracle's
>    response), but the socket asymmetry is the same.  Each run below was
>    pinned with taskset to one of the four sockets; columns are rate in
>    MB/s followed by avg/min/max time in seconds:
> 
>      Triad:       5842.5814       0.0329       0.0329       0.0330
>      Triad:       6843.4206       0.0281       0.0281       0.0282
>      Triad:       6827.6390       0.0282       0.0281       0.0283
>      Triad:       5862.0601       0.0329       0.0328       0.0331
> 
>    This is almost exclusively due to the prefetching; the packed
>    arithmetic is almost completely inconsequential when waiting on
>    memory bandwidth.  A rough sketch of such a kernel is below.
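> 
>    (This is not the exact code from the gist; the prefetch distance,
>    16-byte alignment, and even n are assumptions made for illustration.)
> 
>      /* a[i] = b[i] + scalar*c[i] with packed SSE2, software prefetch,
>         and non-temporal (movntpd) stores. */
>      #include <emmintrin.h>   /* SSE2: _mm_load_pd, _mm_stream_pd, ... */
>      #include <xmmintrin.h>   /* _mm_prefetch, _mm_sfence */
> 
>      void triad(double *a, const double *b, const double *c,
>                 double scalar, long n)
>      {
>        __m128d s = _mm_set1_pd(scalar);
>        for (long i = 0; i < n; i += 2) {
>          /* prefetch well ahead; the distance of 64 doubles (512 bytes)
>             is a guess */
>          _mm_prefetch((const char *)&b[i + 64], _MM_HINT_T0);
>          _mm_prefetch((const char *)&c[i + 64], _MM_HINT_T0);
>          __m128d vb = _mm_load_pd(&b[i]);
>          __m128d vc = _mm_load_pd(&c[i]);
>          /* movntpd: the store bypasses the cache */
>          _mm_stream_pd(&a[i], _mm_add_pd(vb, _mm_mul_pd(s, vc)));
>        }
>        _mm_sfence();   /* order the non-temporal stores */
>      }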
> 
> Jed
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

