Thanks, Jonathan, for the good insights. I'll be digging into the references you mentioned. Yes, at the end of the day I'm sure binding to processor sets is part of the plan; having already done so in a rather rote way, I can demonstrate a very dramatic reduction in apparent CPU utilization, on the order of 25-30%. But before I commit engineers to casting something in stone, I want to make sure I understand the defaults and the side effects of doing so, since it could end up defeating other improvements that Sun has made or will be making.
Regards
-d
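P.S. For concreteness, here's a minimal sketch of the kind of "rote" binding I mean, using the documented pset_create(2), pset_assign(2), and pset_bind(2) interfaces. The CPU ids are placeholders (psrinfo shows the real ones) and the error handling is only illustrative:

#include <sys/types.h>
#include <sys/procset.h>
#include <sys/pset.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    /* Placeholder CPU ids -- pick real ones from psrinfo(1M). */
    processorid_t cpus[] = { 8, 9, 10, 11 };
    psetid_t pset;
    unsigned int i;

    /* Create an empty processor set (needs sufficient privilege). */
    if (pset_create(&pset) != 0) {
        perror("pset_create");
        return (1);
    }

    /* Move the chosen CPUs into the new set. */
    for (i = 0; i < sizeof (cpus) / sizeof (cpus[0]); i++) {
        if (pset_assign(pset, cpus[i], NULL) != 0)
            perror("pset_assign");
    }

    /* Bind every LWP of this process to the set. */
    if (pset_bind(pset, P_PID, getpid(), NULL) != 0) {
        perror("pset_bind");
        return (1);
    }

    (void) printf("pid %ld bound to pset %d\n", (long)getpid(), (int)pset);
    return (0);
}

psrset(1M) does the same thing from the command line without any code.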
> -----Original Message-----
> From: jonathan chew [mailto:[EMAIL PROTECTED]
> Sent: Thursday, September 01, 2005 11:50 AM
> To: David McDaniel (damcdani)
> Cc: Eric C. Saxe; perf-discuss@opensolaris.org
> Subject: Re: [perf-discuss] Re: Puzzling scheduler behavior
>
> Dave,
>
> It sounds like you have an interesting application. You might want to create a processor set, leave some CPUs outside the psrset for other threads to run on, and run your application in a processor set to minimize interference from other threads. As long as there are enough CPUs for your application in the psrset, you should see the number of migrations go down because there won't be any interference from other threads.
>
> To get a better understanding of the Solaris performance optimizations done for NUMA, you might want to check out the overview of Memory Placement Optimization (MPO) at:
>
> http://opensolaris.org/os/community/performance/mpo_overview.pdf
>
> The stickiness that you observed is because of MPO. Binding to a processor set containing one CPU set the home lgroup of the thread to the lgroup containing that CPU, and destroying the psrset just left the thread homed there.
>
> Your shared memory is probably spread across the system already because the default MPO memory allocation policy for shared memory is to allocate the memory from random lgroups across the system.
>
> We have some prototype observability tools which allow you to examine the lgroup hierarchy and its contents and observe and/or control how the threads and memory are placed among lgroups (see http://opensolaris.org/os/community/performance/numa/observability/). The source, binaries, and man pages are there.
>
> Jonathan
>
> David McDaniel (damcdani) wrote:
>
> > Very, very enlightening, Eric. It's really terrific to have this kind of channel for dialog.
> > The "return to home base" behavior you describe is clearly consistent with what I see and makes perfect sense.
> > Let me follow up with a question. In this application, processes have not only their "own" memory (i.e. heap, stack, program text and data, etc.), but they also share a moderately large (~2-5GB today) amount of memory in the form of mmap'd files. From Sherry Moore's previous posts, I'm assuming that at startup time that would actually be all allocated on one board. Since I'm contemplating moving processes onto psrsets off that board, would it be plausible to assume that I might get slightly better net throughput if I could somehow spread that across all the boards? I know it's speculation of the highest order, so maybe my real question is whether that's even worth testing.
> > In any case, I'd love to turn the knob you mention, and I'll look on the performance community page and see what kind of trouble I can get into. If there are any particular items you think I should check out, guidance is welcome.
> > Regards
> > -d
> >
> >> -----Original Message-----
> >> From: [EMAIL PROTECTED]
> >> [mailto:[EMAIL PROTECTED] On Behalf Of Eric C. Saxe
> >> Sent: Thursday, September 01, 2005 1:48 AM
> >> To: perf-discuss@opensolaris.org
> >> Subject: [perf-discuss] Re: Puzzling scheduler behavior
> >>
> >> Hi David,
> >>
> >> Since your v1280 system has NUMA characteristics, the bias that you see for one of the boards may be a result of the kernel trying to run your application's threads "close" to where they have allocated their memory.
> >> We also generally try to keep threads in the same process together, since they generally tend to work on the same data. This might explain why one of the boards is so much busier than the others.
> >>
> >> So yes, the interesting piece of this seems to be the higher than expected run queue wait time (latency) as seen via prstat -Lm. Even with the thread-to-board/memory affinity I mentioned above, it generally shouldn't be the case that threads are willing to hang out on a run queue waiting for a CPU on their "home" when that thread *could* actually run immediately on a "remote" (off-board) CPU. Better to run remote than not at all, or at least so the saying goes :)
> >>
> >> In the case where a thread is dispatched remotely because all home CPUs are busy, the thread will try to migrate back home the next time it comes through the dispatcher and finds it can run immediately at home (either because there's an idle CPU, or because one of the running threads is lower priority than us, and we can preempt it). This migrating around means that the thread will tend to spend more time waiting on run queues, since it has to either wait for the idle() thread to switch off, or for the lower-priority thread it's able to preempt to surrender the CPU. Either way, the thread shouldn't have to wait long to get the CPU, but it will have to wait a non-zero amount of time.
> >>
> >> What does the prstat -Lm output look like exactly? Is it a lot of wait time, or just more than you would expect?
> >>
> >> By the way, just to be clear, when I say "board" what I should be saying is lgroup (or locality group). This is the Solaris abstraction for a set of CPU and memory resources that are close to one another. On your system, it turns out that the kernel creates an lgroup for each board, and each thread is given an affinity for one of the lgroups, such that it will try to run on the CPUs (and allocate memory) from that group of resources.
> >>
> >> One thing to look at here is whether or not the kernel could be "overloading" a given lgroup. This would result in threads tending to be less successful in getting CPU time (and/or memory) in their home. At least for CPU time, you can see this by looking at the number of migrations and where they are taking place. If a thread isn't having much luck running at home, this means that it (and others sharing its home) will tend to "ping-pong" between CPUs in and out of the home lgroup (we refer to this as the "king of the hill" pathology). In your mpstat output, I see many migrations on one of the boards, and a good many on the other boards as well, so that might well be happening here.
> >>
> >> To get some additional observability into this issue, you might want to take a look at some of the lgroup observability/control tools we posted (available from the performance community page). They allow you to do things like query/set your application's lgroup affinity, find out about the lgroups in the system and what resources they contain, etc. Using them you might be able to confirm some of my theory above. We would also *very* much like any feedback you (or anyone else) would be willing to provide on the tools.
> >> In the short term, there's a tunable I can suggest you take a look at that deals with how hard the kernel tries to keep threads of the same process together in the same lgroup. Tuning this should result in your workload being spread out more effectively than it currently seems to be. I'll post a follow-up message tomorrow morning with these details, if you'd like to try this.
> >>
> >> In the medium-short term, we really need to implement a mechanism to dynamically change a thread's lgroup affinity when its home becomes overloaded. We presently don't have this, as the mechanism that determines a thread's home lgroup (and does the lgroup load balancing) is static in nature (done at thread creation time). (Implemented in usr/src/uts/common/os/lgrp.c:lgrp_choose() if you'd like to take a look at the source.) In terms of our NUMA/MPO projects, this one is at the top of the ol' TODO list.
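P.P.S. For anyone following along later: here is a rough sketch of how the home lgroup and the lgroup hierarchy can be queried with the documented liblgrp(3LIB) interfaces (which I assume the prototype tools mentioned above are built on). Compile with -llgrp; the array sizes and output format are purely illustrative:

#include <sys/types.h>
#include <sys/procset.h>
#include <sys/lgrp_user.h>
#include <stdio.h>

#define MAX_IDS 64  /* illustrative upper bound on CPUs/children per lgroup */

/* Recursively print the CPUs directly contained in each lgroup. */
static void
walk(lgrp_cookie_t c, lgrp_id_t lgrp)
{
    processorid_t cpus[MAX_IDS];
    lgrp_id_t kids[MAX_IDS];
    int ncpus, nkids, i;

    ncpus = lgrp_cpus(c, lgrp, cpus, MAX_IDS, LGRP_CONTENT_DIRECT);
    if (ncpus > 0) {
        (void) printf("lgroup %d:", (int)lgrp);
        for (i = 0; i < ncpus; i++)
            (void) printf(" cpu%d", (int)cpus[i]);
        (void) printf("\n");
    }

    nkids = lgrp_children(c, lgrp, kids, MAX_IDS);
    for (i = 0; i < nkids; i++)
        walk(c, kids[i]);
}

int
main(void)
{
    lgrp_cookie_t c = lgrp_init(LGRP_VIEW_OS);

    if (c == LGRP_COOKIE_NONE) {
        perror("lgrp_init");
        return (1);
    }

    /* Where has the kernel homed this process? */
    (void) printf("home lgroup of this process: %d\n",
        (int)lgrp_home(P_PID, P_MYID));

    walk(c, lgrp_root(c));
    (void) lgrp_fini(c);
    return (0);
}

For the mmap'd files, madvise(3C) with MADV_ACCESS_MANY looks like the knob for asking that pages be placed across lgroups, though per Jonathan's note the default policy for shared memory may already be doing that.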