Very, very enlightening, Eric. It's really terrific to have this kind
of channel for dialog.
  The "return to home base" behavior you describe is clearly consistent
with what I see and makes perfect sense.
  Let me follow up with a question. In this application, processes
have not only their "own" memory (i.e., heap, stack, program text and
data, etc.), but they also share a moderately large (~2-5GB today)
amount of memory in the form of mmap'd files. From Sherry Moore's
previous posts, I'm assuming that at startup time that would all be
allocated on one board. Since I'm contemplating moving processes onto
psrsets off that board, would it be plausible to assume that I might
get slightly better net throughput if I could somehow spread that
memory across all the boards? I know it's speculation of the highest
order, so maybe my real question is whether that's even worth testing.
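  For what it's worth, here's a minimal sketch of the kind of hint I'm
imagining we could add where those mappings get set up; I'm assuming
madvise(3C) with MADV_ACCESS_MANY is the right way to tell the kernel
"lots of threads will touch this, don't home it all on one board" --
corrections welcome:

      #include <sys/mman.h>
      #include <sys/types.h>
      #include <fcntl.h>
      #include <stdio.h>
      #include <unistd.h>

      /*
       * Map the shared file and hint that many LWPs will access it, so
       * the kernel may spread its pages across lgroups rather than
       * placing them all near whichever thread faults them in first.
       */
      static void *
      map_shared(const char *path, size_t len)
      {
              int fd = open(path, O_RDWR);
              void *p;

              if (fd < 0)
                      return (NULL);
              p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                  fd, 0);
              (void) close(fd);
              if (p == MAP_FAILED)
                      return (NULL);
              if (madvise((caddr_t)p, len, MADV_ACCESS_MANY) != 0)
                      perror("madvise(MADV_ACCESS_MANY)");
              return (p);
      }

  Whether that actually buys anything is, of course, exactly what I'd
be testing.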
  In any case, I'd love to turn the knob you mention and I'll look on
the performance community page and see what kind of trouble I can get
into. If there are any particular items you think I should check out,
guidance is welcome.
 Regards
-d

> -----Original Message-----
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf Of 
> Eric C. Saxe
> Sent: Thursday, September 01, 2005 1:48 AM
> To: perf-discuss@opensolaris.org
> Subject: [perf-discuss] Re: Puzzling scheduler behavior
> 
> Hi David,
> 
> Since your v1280 system has NUMA characteristics, the bias 
> that you see for one of the boards may be a result of the 
> kernel trying to run your application's threads "close" to 
> where they have allocated their memory. We also generally try 
> to keep threads in the same process together, since they 
> tend to work on the same data. This might explain 
> why one of the boards is so much busier than the others. 
> 
> So yes, the interesting piece of this seems to be the higher 
> than expected run queue wait time (latency) as seen via 
> prstat -Lm. Even with the thread-to-board/memory affinity I 
> mentioned above, it generally shouldn't be the case that a 
> thread is willing to hang out on a run queue waiting for a 
> CPU in its "home" when it *could* actually run 
> immediately on a "remote" (off-board) CPU.
> Better to run remote than not at all, or so the saying goes :)
> 
> In the case where a thread is dispatched remotely because all 
> home CPUs are busy, the thread will try to migrate back home 
> the next time it comes through the dispatcher and finds it 
> can run immediately at home (either because there's an idle 
> CPU, or because one of the running threads is lower priority 
> than us, and we can preempt it). This migrating around means 
> that the thread will tend to spend more time waiting on run 
> queues, since it has to either wait for the idle() thread to 
> switch off, or for the lower priority thread it's able to 
> preempt to surrender the CPU. Either way, the thread 
> shouldn't have to wait long to get the CPU, but it will have 
> to wait a non-zero amount of time.
> 
> What does the prstat -Lm output look like exactly? Is it a 
> lot of wait time, or just more than you would expect?
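> If it helps to quantify it outside of prstat, the LAT column comes 
> from the wait-cpu microstate in /proc. Here's a rough sketch (just 
> my assumption of the simplest way to read it, via the process-wide 
> /proc/self/usage file; the per-LWP lwpusage files give the same data 
> per thread):
> 
>     #include <procfs.h>
>     #include <fcntl.h>
>     #include <stdio.h>
>     #include <unistd.h>
> 
>     /*
>      * Print this process's accumulated wait-cpu (run queue latency)
>      * time from microstate accounting; prstat -Lm reports the same
>      * microstate, per LWP, as the LAT column.
>      */
>     int
>     main(void)
>     {
>             prusage_t u;
>             int fd = open("/proc/self/usage", O_RDONLY);
> 
>             if (fd < 0 || read(fd, &u, sizeof (u)) != sizeof (u)) {
>                     perror("/proc/self/usage");
>                     return (1);
>             }
>             (void) close(fd);
>             (void) printf("wait-cpu time: %ld.%09ld sec\n",
>                 (long)u.pr_wtime.tv_sec, u.pr_wtime.tv_nsec);
>             return (0);
>     }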
> 
> By the way, just to be clear, when I say "board" what I 
> should be saying is lgroup (or locality group). This is the 
> Solaris abstraction for a set of CPU and memory resources 
> that are close to one another. On your system, it turns out 
> that the kernel creates an lgroup for each board, and each 
> thread is given an affinity for one of the lgroups, such that 
> it will try to run on the CPUs (and allocate memory) from 
> that group of resources.
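> To make the lgroup idea concrete, here's a rough sketch of how a 
> thread can ask where it's homed, assuming the liblgrp interfaces on 
> your build (compile with -llgrp):
> 
>     #include <sys/lgrp_user.h>
>     #include <sys/procset.h>
>     #include <stdio.h>
> 
>     /*
>      * Show how many lgroups the kernel created on this system and
>      * which one the calling thread is currently homed to.
>      */
>     int
>     main(void)
>     {
>             lgrp_cookie_t c = lgrp_init(LGRP_VIEW_OS);
> 
>             if (c == LGRP_COOKIE_NONE) {
>                     perror("lgrp_init");
>                     return (1);
>             }
>             (void) printf("lgroups in system: %d\n", lgrp_nlgrps(c));
>             (void) printf("my home lgroup: %d\n",
>                 (int)lgrp_home(P_LWPID, P_MYID));
>             (void) lgrp_fini(c);
>             return (0);
>     }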
> 
> One thing to look at here is whether or not the kernel could 
> be "overloading" a given lgroup. This would result in threads 
> tending to be less successful in getting CPU time (and/or 
> memory) in their home. At least for CPU time, you can see 
> this by looking at the number of migrations and where they 
> are taking place. If a thread isn't having much luck 
> running at home, this means that it (and others sharing its 
> home) will tend to "ping-pong" between CPUs in and out of the 
> home lgroup (we refer to this as the "king of the hill" 
> pathology). In your mpstat output, I see many migrations on 
> one of the boards, and a good many on the other boards as 
> well, so that might well be happening here.
> 
> To get some additional observability into this issue, you 
> might want to take a look at the lgroup observability/control 
> tools we posted (available from the performance community 
> page). They allow you to do things like 
> query/set your application's lgroup affinity, find out about 
> the lgroups in the system, and what resources they contain, 
> etc. Using them you might be able to confirm some of my 
> theory above. We would also *very* much like any feedback you 
> (or anyone else) would be willing to provide on the tools.
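> If you want to experiment programmatically as well, I believe the 
> same affinity the tools let you set is available directly via 
> liblgrp. A rough sketch, again assuming lgrp_affinity_set() on your 
> build (-llgrp):
> 
>     #include <sys/lgrp_user.h>
>     #include <sys/procset.h>
>     #include <stdio.h>
> 
>     /*
>      * Give the calling thread a strong affinity for the given
>      * lgroup, which also re-homes it there; spreading a process's
>      * threads across lgroups by hand is one such call per thread.
>      */
>     static int
>     pin_to_lgrp(lgrp_id_t lgrp)
>     {
>             if (lgrp_affinity_set(P_LWPID, P_MYID, lgrp,
>                 LGRP_AFF_STRONG) != 0) {
>                     perror("lgrp_affinity_set");
>                     return (-1);
>             }
>             return (0);
>     }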
> 
> In the short term, there's a tunable I can suggest you take a 
> look at that deals with how hard the kernel tries to keep 
> threads of the same process together in the same lgroup. 
> Tuning this should result in your workload being spread out 
> more effectively than it currently seems to be. I'll post a 
> follow up message tomorrow morning with these details, if 
> you'd like to try this.
> 
> In the medium-short term, we really need to implement a 
> mechanism to dynamically change a thread's lgroup affinity 
> when its home becomes overloaded. We presently don't have 
> this, as the mechanism that determines a thread's home lgroup 
> (and does the lgroup load balancing) is static in nature 
> (done at thread creation time). (Implemented in 
> usr/src/uts/common/os/lgrp.c:lgrp_choose() if you'd like to 
> take a look at the source.) In terms of our NUMA/MPO projects, 
> this one is at the top of the ol' TODO list.