Hi David,

Since your v1280 system has NUMA characteristics, the bias you see toward 
one of the boards may be a result of the kernel trying to run your 
application's threads "close" to where they have allocated their memory. We 
also generally try to keep threads in the same process together, since they 
tend to work on the same data. This might explain why one of the boards is 
so much busier than the others. 
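
If you want to double-check where the kernel has homed things, liblgrp(3LIB) 
can tell you. Here's a minimal sketch (my own toy code, nothing from your 
application) that a thread can run to print its own home lgroup:

    /*
     * Print the calling thread's home lgroup.
     * Compile with: cc home.c -o home -llgrp
     */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/procset.h>
    #include <sys/lgrp_user.h>

    int
    main(void)
    {
            /* P_LWPID + P_MYID means "the calling thread" */
            lgrp_id_t home = lgrp_home(P_LWPID, P_MYID);

            if (home == -1) {
                    perror("lgrp_home");
                    return (1);
            }
            (void) printf("home lgroup: %d\n", (int)home);
            return (0);
    }

If most of your threads report the same home lgroup, that would line up with 
the bias you're seeing.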

So yes, the interesting piece of this seems to be the higher than expected 
run queue wait time (latency) seen via prstat -Lm. Even with the 
thread-to-board/memory affinity I mentioned above, it generally shouldn't be 
the case that threads are willing to hang out on a run queue waiting for a 
CPU in their "home" when they *could* run immediately on a "remote" 
(off-board) CPU. Better to run remote than not at all, or so the saying 
goes :)

In the case where a thread is dispatched remotely because all home CPUs are 
busy, the thread will try to migrate back home the next time it comes through 
the dispatcher and finds it can run immediately at home (either because there's 
an idle CPU, or because one of the running threads is lower priority than us, 
and we can preempt it). This migrating around means that the thread will tend 
to spend more time waiting on run queues, since it has to either wait for the 
idle() thread to switch off, or for the lower priority thread it's able to 
preempt to surrender the CPU. Either way, the thread shouldn't have to wait 
long to get the CPU, but it will have to wait a non-zero amount of time.
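
In rough pseudocode, the choice the dispatcher makes is something like the 
toy below. To be clear, this is just an illustration of the logic I 
described -- it is not the real dispatcher code, and the three flags are 
made-up stand-ins for the actual run queue checks:

    #include <stdio.h>

    /* Made-up stand-ins for the real run queue state. */
    static int idle_cpu_at_home = 0;        /* all home CPUs are busy      */
    static int preemptable_at_home = 0;     /* nothing lower priority home */
    static int idle_cpu_remote = 1;         /* a remote CPU is free        */

    int
    main(void)
    {
            if (idle_cpu_at_home || preemptable_at_home) {
                    /* Still waits briefly for the switch-off/preemption. */
                    (void) printf("dispatch at home\n");
            } else if (idle_cpu_remote) {
                    /*
                     * Better to run remote than not at all; try to migrate
                     * back home on the next trip through the dispatcher.
                     */
                    (void) printf("dispatch remotely\n");
            } else {
                    (void) printf("wait on a run queue\n");
            }
            return (0);
    }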

What does the prstat -Lm output look like exactly? Is it a lot of wait time, or 
just more than you would expect?

By the way, just to be clear, when I say "board" what I should really be 
saying is lgroup (or locality group). This is the Solaris abstraction for a 
set of CPU and memory resources that are close to one another. On your 
system, it turns out that the kernel creates an lgroup for each board, and 
each thread is given an affinity for one of the lgroups, such that it will 
try to run on the CPUs in (and allocate memory from) that group of 
resources.
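
You can actually see this hierarchy from userland via liblgrp(3LIB). Here's 
a small sketch (again my own code, not one of the tools mentioned below) 
that walks the lgroup hierarchy and prints the CPUs directly contained in 
each lgroup:

    /*
     * Walk the lgroup hierarchy, printing the CPUs directly
     * contained in each lgroup.
     * Compile with: cc lgrps.c -o lgrps -llgrp
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/lgrp_user.h>

    static void
    walk(lgrp_cookie_t c, lgrp_id_t lgrp)
    {
            int i, ncpus, nkids;

            /* CPUs in this lgroup itself (not in its children) */
            ncpus = lgrp_cpus(c, lgrp, NULL, 0, LGRP_CONTENT_DIRECT);
            if (ncpus > 0) {
                    processorid_t *cpus = malloc(ncpus * sizeof (*cpus));

                    (void) lgrp_cpus(c, lgrp, cpus, ncpus,
                        LGRP_CONTENT_DIRECT);
                    (void) printf("lgroup %d:", (int)lgrp);
                    for (i = 0; i < ncpus; i++)
                            (void) printf(" cpu%d", (int)cpus[i]);
                    (void) printf("\n");
                    free(cpus);
            }

            /* Recurse into any child lgroups */
            nkids = lgrp_children(c, lgrp, NULL, 0);
            if (nkids > 0) {
                    lgrp_id_t *kids = malloc(nkids * sizeof (*kids));

                    (void) lgrp_children(c, lgrp, kids, nkids);
                    for (i = 0; i < nkids; i++)
                            walk(c, kids[i]);
                    free(kids);
            }
    }

    int
    main(void)
    {
            lgrp_cookie_t c = lgrp_init(LGRP_VIEW_OS);

            if (c == LGRP_COOKIE_NONE) {
                    perror("lgrp_init");
                    return (1);
            }
            walk(c, lgrp_root(c));
            lgrp_fini(c);
            return (0);
    }

On your v1280 I would expect this to show a leaf lgroup per board, each 
containing that board's CPUs.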

One thing to look at here is whether or not the kernel could be 
"overloading" a given lgroup. This would result in threads tending to be 
less successful in getting CPU time (and/or memory) in their home. At least 
for CPU time, you can see this by looking at the number of migrations and 
where they are taking place. If a thread isn't having much luck running at 
home, it (and the others sharing its home) will tend to "ping-pong" between 
CPUs in and out of the home lgroup (we refer to this as the "king of the 
hill" pathology). In your mpstat output, I see many migrations on one of 
the boards, and a good many on the other boards as well, so that might well 
be happening here.

To get some additional observability into this issue, you might want to 
take a look at the lgroup observability/control tools we posted (available 
from the performance community page). They allow you to do things like 
query and set your application's lgroup affinities, find out which lgroups 
exist in the system and what resources they contain, and so on. Using them, 
you might be able to confirm some of the theory above. We would also *very* 
much like any feedback you (or anyone else) would be willing to provide on 
the tools.
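
For what it's worth, the affinity piece of those tools sits on top of the 
same liblgrp interfaces. As a sketch (the lgroup id below is made up -- 
you'd pick one that actually exists on your system), a process can give 
itself a strong affinity for a particular lgroup like this:

    /*
     * Set and query this process's affinity for an lgroup.
     * The lgroup id used here (1) is made up for illustration.
     * Compile with: cc aff.c -o aff -llgrp
     */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/procset.h>
    #include <sys/lgrp_user.h>

    int
    main(void)
    {
            lgrp_id_t lgrp = 1;     /* hypothetical lgroup id */

            /* A strong affinity should also make this lgroup our home. */
            if (lgrp_affinity_set(P_PID, P_MYID, lgrp,
                LGRP_AFF_STRONG) != 0) {
                    perror("lgrp_affinity_set");
                    return (1);
            }
            (void) printf("affinity for lgroup %d is now %d\n",
                (int)lgrp, (int)lgrp_affinity_get(P_PID, P_MYID, lgrp));
            (void) printf("home lgroup is now %d\n",
                (int)lgrp_home(P_PID, P_MYID));
            return (0);
    }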

In the short term, there's a tunable I can suggest you take a look at that 
deals with how hard the kernel tries to keep threads of the same process 
together in the same lgroup. Tuning this should result in your workload being 
spread out more effectively than it currently seems to be. I'll post a 
follow-up message tomorrow morning with the details, if you'd like to try 
this.

In the medium term, we really need to implement a mechanism to dynamically 
change a thread's lgroup affinity when its home becomes overloaded. We 
presently don't have this, as the mechanism that determines a thread's home 
lgroup (and does the lgroup load balancing) is static in nature (done at 
thread creation time). (It's implemented in 
usr/src/uts/common/os/lgrp.c:lgrp_choose() if you'd like to take a look at 
the source.) In terms of our NUMA/MPO projects, this one is at the top of 
the ol' TODO list.
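
If it helps to picture that, the static homing amounts to something like 
the toy below at thread creation time. This is a made-up illustration only 
-- the real lgrp_choose() weighs rather more than a single load figure:

    #include <stdio.h>

    #define NLGRPS  4

    int
    main(void)
    {
            /* Pretend per-lgroup load figures at thread creation time. */
            int load[NLGRPS] = { 7, 2, 5, 3 };
            int i, home = 0;

            /* Chosen once here; never revisited for the thread's lifetime. */
            for (i = 1; i < NLGRPS; i++) {
                    if (load[i] < load[home])
                            home = i;
            }
            (void) printf("new thread homed to lgroup %d\n", home);
            return (0);
    }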