Given the oversubscription on the existing HT links, could contention account for the difference? (I have no idea how HT's contention management works) Meaning: if the stars line up in a given run, you could end up with very little/no contention and you get good bandwidth. But if there's a bit of jitter, you could end up with quite a bit of contention that ends up cascading into a bunch of additional delay.
I fail to see how that could add up to 70-80 (or more) seconds of difference -- 13 secs vs. 90+ seconds (and more), though...  70-80 seconds sounds like an IO delay -- perhaps paging due to the ramdisk or somesuch...?  That's a SWAG.

On Jul 15, 2010, at 10:40 AM, Jed Brown wrote:

> On Thu, 15 Jul 2010 09:36:18 -0400, Jeff Squyres <jsquy...@cisco.com> wrote:
> > Per my other disclaimer, I'm trolling through my disastrous inbox and
> > finding some orphaned / never-answered emails.  Sorry for the delay!
>
> No problem, I should have followed up on this with further explanation.
>
> > Just to be clear -- you're running 8 procs locally on an 8 core node,
> > right?
>
> These are actually 4-socket quad-core nodes, so there are 16 cores
> available, but we are only running on 8, -npersocket 2 -bind-to-socket.
> This was a greatly simplified case, but is still sufficient to show the
> variability.  It tends to be somewhat worse if we use all cores of a
> node.
>
> > (Cisco is an Intel partner -- I don't follow the AMD line much)  So
> > this should all be local communication with no external network
> > involved, right?
>
> Yes, this was the greatly simplified case, contained entirely within a
> single node.
>
> > > lsf.o240562  killed       8*a6200
> > > lsf.o240563  9.2110e+01   8*a6200
> > > lsf.o240564  1.5638e+01   8*a6237
> > > lsf.o240565  1.3873e+01   8*a6228
> >
> > Am I reading that right that it's 92 seconds vs. 13 seconds?  Woof!
>
> Yes, and the "killed" means it wasn't done after 120 seconds.  This
> factor of 10 is about the worst we see, but of course very surprising.
>
> > Nice and consistent, as you mentioned.  And I assume your notation
> > here means that it's across 2 nodes.
>
> Yes, the Quadrics nodes are 2-socket dual-core, so 8 procs needs two
> nodes.
>
> The rest of your observations are consistent with my understanding.  We
> identified two other issues, neither of which accounts for a factor of
> 10, but which account for at least a factor of 3.
>
> 1. The administrators mounted a 16 GB ramdisk on /scratch, but did not
>    ensure that it was wiped before the next task ran.  So if you got a
>    node after some job that left stinky feces there, you could
>    effectively only have 16 GB (before the old stuff would be swapped
>    out).  More importantly, the physical pages backing the ramdisk may
>    not be uniformly distributed across the sockets, and rather than
>    preemptively swap out those old ramdisk pages, the kernel would find
>    a page on some other socket (instead of locally; this could be
>    confirmed, for example, by watching the numa_foreign and numa_miss
>    counts with numastat).  Then when you went to use that memory
>    (typically in a bandwidth-limited application), it was easy to have
>    3 sockets all waiting on one bus, thus taking a factor of 3+
>    performance hit despite a resident set much less than 50% of the
>    available memory.  I have a rather complete analysis of this in case
>    someone is interested.  Note that this can affect programs with
>    static or dynamic allocation (the kernel looks for local pages when
>    you fault it, not when you allocate it); the only way I know of to
>    circumvent the problem is to allocate memory with libnuma
>    (e.g. numa_alloc_local), which will fail if local memory isn't
>    available (instead of returning and subsequently faulting remote
>    pages).
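(For reference, a minimal sketch of the libnuma workaround described above -- untested, assumes the libnuma headers are installed, the array size is just an example, and you link with -lnuma:)

  /* Allocate the working set with numa_alloc_local() so the pages are
   * guaranteed to come from the local socket (or the allocation fails),
   * instead of relying on first-touch, which can silently hand back
   * remote pages when local memory is full of stale ramdisk pages. */
  #include <numa.h>
  #include <stdio.h>

  int main(void)
  {
      size_t n = 1 << 24;              /* example size: 128 MB of doubles */
      double *x;

      if (numa_available() < 0) {
          fprintf(stderr, "libnuma not available\n");
          return 1;
      }
      x = numa_alloc_local(n * sizeof(*x));
      if (!x) {
          fprintf(stderr, "no local memory available\n");
          return 1;
      }
      for (size_t i = 0; i < n; i++)
          x[i] = 1.0;                  /* pages are already local */
      /* ... bandwidth-limited work ... */
      numa_free(x, n * sizeof(*x));
      return 0;
  }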
> 2. The memory bandwidth is 16-18% different between sockets, with
>    sockets 0,3 being slow and sockets 1,2 having much faster available
>    bandwidth.  This is fully reproducible and acknowledged by
>    Sun/Oracle; their response to an early inquiry:
>
>      http://59A2.org/files/SunBladeX6440STREAM-20100616.pdf
>
>    I am not completely happy with this explanation because the issue
>    persists even with full software prefetch, packed SSE2, and
>    non-temporal stores, as long as the working set does not fit within
>    (per-socket) L3.  Note that the software prefetch allows for several
>    hundred cycles of latency, so the extra hop for snooping shouldn't
>    be a problem.  If the working set fits within L3, then all sockets
>    are the same speed (and of course much faster due to improved
>    bandwidth).  Some disassembly here:
>
>      http://gist.github.com/476942
>
>    The three versions with prefetch and movntpd run within 2% of each
>    other; the other is much faster within cache and much slower when it
>    breaks out of cache (obviously).  The performance numbers are higher
>    than with the reference implementation (quoted in Sun/Oracle's
>    response), but (run with taskset to each of the four sockets):
>
>      Triad:  5842.5814   0.0329   0.0329   0.0330
>      Triad:  6843.4206   0.0281   0.0281   0.0282
>      Triad:  6827.6390   0.0282   0.0281   0.0283
>      Triad:  5862.0601   0.0329   0.0328   0.0331
>
>    This is almost exclusively due to the prefetching; the packed
>    arithmetic is almost completely inconsequential when waiting on
>    memory bandwidth.
>
> Jed

--
Jeff Squyres
jsquy...@cisco.com

For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
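(P.S. The actual code is in the gist linked above; the following is only a rough sketch of the same idea -- packed SSE2, software prefetch, and movntpd stores -- with the prefetch distance illustrative, n assumed even, and arrays assumed 16-byte aligned:)

  #include <emmintrin.h>   /* SSE2 intrinsics; also provides _mm_prefetch */

  /* STREAM triad a[i] = b[i] + scalar*c[i] with software prefetch and
   * non-temporal stores, in the spirit of the disassembly in the gist. */
  void triad_nt(double *restrict a, const double *restrict b,
                const double *restrict c, double scalar, long n)
  {
      const __m128d s = _mm_set1_pd(scalar);
      const long dist = 64;            /* prefetch distance in doubles (assumed) */

      for (long i = 0; i < n; i += 2) {
          _mm_prefetch((const char *)&b[i + dist], _MM_HINT_T0);
          _mm_prefetch((const char *)&c[i + dist], _MM_HINT_T0);
          __m128d bi = _mm_load_pd(&b[i]);   /* assumes 16-byte alignment */
          __m128d ci = _mm_load_pd(&c[i]);
          _mm_stream_pd(&a[i], _mm_add_pd(bi, _mm_mul_pd(s, ci)));  /* movntpd */
      }
      _mm_sfence();                    /* order the streaming stores */
  }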