Hi David,

Since your v1280 system has NUMA characteristics, the bias you see toward one of the boards may be a result of the kernel trying to run your application's threads "close" to where they have allocated their memory. We also generally try to keep threads of the same process together, since they tend to work on the same data. This might explain why one of the boards is so much busier than the others.
So yes, the interesting piece of this seems to be the higher than expected run queue wait time (latency) as seen via prstat -Lm. Even with the thread-to-board/memory affinity I mentioned above, it generally shouldn't be the case that threads are willing to hang out on a run queue waiting for a CPU in their "home" when they *could* actually run immediately on a "remote" (off-board) CPU. Better to run remote than not at all, or so the saying goes :)

In the case where a thread is dispatched remotely because all of its home CPUs are busy, the thread will try to migrate back home the next time it comes through the dispatcher and finds it can run immediately at home (either because there's an idle CPU, or because one of the running threads is lower priority than us and we can preempt it). This migrating around means the thread will tend to spend more time waiting on run queues, since it has to wait either for the idle() thread to switch off, or for the lower priority thread it's preempting to surrender the CPU. Either way, the thread shouldn't have to wait long to get the CPU, but it will have to wait a non-zero amount of time. What does the prstat -Lm output look like exactly? Is it a lot of wait time, or just more than you would expect?

By the way, just to be clear, when I say "board" what I should be saying is lgroup (or locality group). This is the Solaris abstraction for a set of CPU and memory resources that are close to one another. On your system, it turns out that the kernel creates an lgroup for each board, and each thread is given an affinity for one of the lgroups, such that it will try to run on the CPUs (and allocate memory) from that group of resources.

One thing to look at here is whether or not the kernel could be "overloading" a given lgroup. This would result in threads tending to be less successful at getting CPU time (and/or memory) in their home. At least for CPU time, you can see this by looking at the number of migrations and where they are taking place. If a thread isn't having much luck running at home, it (and the other threads sharing its home) will tend to "ping-pong" between CPUs in and out of the home lgroup (we refer to this as the "king of the hill" pathology). In your mpstat output, I see many migrations on one of the boards, and a good many on the other boards as well, so that might well be happening here.

To get some additional observability into this issue, you might want to take a look at the lgroup observability/control tools we posted (available from the performance community page). They let you do things like query/set your application's lgroup affinities, find out about the lgroups in the system and what resources they contain, etc. Using them you might be able to confirm some of the theory above. We would also *very* much like any feedback you (or anyone else) would be willing to provide on the tools.

In the short term, there's a tunable I can suggest you take a look at that controls how hard the kernel tries to keep threads of the same process together in the same lgroup. Tuning it should result in your workload being spread out more effectively than it currently seems to be. I'll post a follow-up message tomorrow morning with the details, if you'd like to try this.

In the medium term, we really need to implement a mechanism to dynamically change a thread's lgroup affinity when its home becomes overloaded.
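In the meantime, a process can nudge its own threads around by hand through the liblgrp(3LIB) affinity interfaces (which is essentially what the posted tools sit on top of). Below is a minimal, hypothetical sketch, not one of our tools, that reports the calling LWP's home lgroup and then expresses a strong affinity for another lgroup; build it with -llgrp and check the man pages for the exact semantics on your release:

/*
 * Hypothetical sketch: query this LWP's home lgroup, then set a strong
 * affinity for another lgroup, which should cause the kernel to rehome
 * the LWP there (and bias where its memory gets allocated).
 * Build with: cc rehome.c -llgrp -o rehome
 */
#include <sys/types.h>
#include <sys/procset.h>
#include <sys/lgrp_user.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char **argv)
{
	lgrp_id_t target;

	if (argc != 2) {
		(void) fprintf(stderr, "usage: %s <target-lgroup-id>\n", argv[0]);
		return (1);
	}
	target = (lgrp_id_t)atoi(argv[1]);

	/* Where does the kernel currently consider this LWP to live? */
	(void) printf("home lgroup before: %d\n", (int)lgrp_home(P_LWPID, P_MYID));

	/*
	 * LGRP_AFF_STRONG should rehome the LWP to the target lgroup;
	 * LGRP_AFF_WEAK just expresses a milder preference.
	 */
	if (lgrp_affinity_set(P_LWPID, P_MYID, target, LGRP_AFF_STRONG) != 0) {
		perror("lgrp_affinity_set");
		return (1);
	}

	(void) printf("home lgroup after:  %d\n", (int)lgrp_home(P_LWPID, P_MYID));
	return (0);
}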
The kernel itself doesn't do this rehoming automatically today, because the mechanism that determines a thread's home lgroup (and does the lgroup load balancing) is static in nature: the decision is made once, at thread creation time. (It's implemented in usr/src/uts/common/os/lgrp.c:lgrp_choose() if you'd like to take a look at the source.) In terms of our NUMA/MPO projects, this one is at the top of the ol' TODO list.
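And in case it helps with confirming the overloading theory, here's a second rough sketch (again hypothetical, assuming the liblgrp(3LIB) interfaces as documented for your release) that walks the lgroup hierarchy and prints each lgroup's CPUs and memory, plus the calling thread's home. The posted tools give you this and quite a bit more:

/*
 * Hypothetical sketch: dump the lgroup hierarchy with each lgroup's
 * directly-contained CPUs and memory.  On a NUMA box like the v1280 the
 * root lgroup typically shows no direct resources; they live in the
 * per-board leaf lgroups.  Build with: cc lgrps.c -llgrp -o lgrps
 */
#include <sys/types.h>
#include <sys/processor.h>
#include <sys/procset.h>
#include <sys/lgrp_user.h>
#include <stdio.h>

static void
print_lgrp(lgrp_cookie_t cookie, lgrp_id_t id)
{
	processorid_t cpus[64];		/* fixed sizes just for brevity */
	lgrp_id_t children[64];
	int ncpus, nchildren, i;

	ncpus = lgrp_cpus(cookie, id, cpus, 64, LGRP_CONTENT_DIRECT);
	(void) printf("lgroup %d: %d CPUs, %lld MB installed, %lld MB free\n",
	    (int)id, ncpus < 0 ? 0 : ncpus,
	    lgrp_mem_size(cookie, id, LGRP_MEM_SZ_INSTALLED,
	    LGRP_CONTENT_DIRECT) >> 20,
	    lgrp_mem_size(cookie, id, LGRP_MEM_SZ_FREE,
	    LGRP_CONTENT_DIRECT) >> 20);

	nchildren = lgrp_children(cookie, id, children, 64);
	for (i = 0; i < nchildren; i++)
		print_lgrp(cookie, children[i]);
}

int
main(void)
{
	/* LGRP_VIEW_OS shows all lgroups, not just the ones we can run in. */
	lgrp_cookie_t cookie = lgrp_init(LGRP_VIEW_OS);

	if (cookie == LGRP_COOKIE_NONE) {
		perror("lgrp_init");
		return (1);
	}

	(void) printf("%d lgroups, this thread's home is lgroup %d\n",
	    lgrp_nlgrps(cookie), (int)lgrp_home(P_LWPID, P_MYID));
	print_lgrp(cookie, lgrp_root(cookie));

	(void) lgrp_fini(cookie);
	return (0);
}

Running something like that alongside prstat -Lm and mpstat should make it fairly obvious whether most of your threads share one home lgroup while the other boards sit comparatively idle.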