When using psrsets, the migrations and involuntary context switches go essentially to zero. As far as "other stuff" goes, I'm not quite sure what you mean, but this application runs on a dedicated server, so there is nothing of a casual nature running; there is, however, a fair amount of what I'll glom together into the category of "support" tasks, i.e. ntp daemons, nscd flushing its caches, fsflush running around writing back pages, etc. Was that what you meant?
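For reference, the experiment was roughly the following (the CPU IDs, set id, and pid below are illustrative, not our actual layout):

    # create a processor set from some of the CPUs on one board and
    # bind the application to it (psrset prints the new set id, e.g. 1)
    psrset -c 8 9 10 11
    psrset -b 1 <pid>

    # then watch migrations and involuntary context switches settle down
    prstat -mL -p <pid> 5      # LAT and ICX columns per LWP
    mpstat 5                   # icsw and migr columns per CPU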
> -----Original Message-----
> From: jonathan chew [mailto:[EMAIL PROTECTED]
> Sent: Thursday, September 01, 2005 12:45 PM
> To: David McDaniel (damcdani)
> Cc: Eric C. Saxe; perf-discuss@opensolaris.org
> Subject: Re: [perf-discuss] Re: Puzzling scheduler behavior
>
> David McDaniel (damcdani) wrote:
>
> > Thanks, Jonathan, for the good insights. I'll be digging into the
> > references you mentioned. Yes, at the end of the day I'm sure binding
> > to processor sets is part of the plan; having already done so in a
> > rather rote way, I can demonstrate a very dramatic reduction in
> > apparent CPU utilization, on the order of 25-30%. But before I commit
> > engineers to casting something in stone, I want to make sure I
> > understand the defaults and the side effects of doing so, since it
> > potentially results in defeating other improvements that Sun has done
> > or will be doing.
>
> Sure. No problem. The overview and man pages for our tools are pretty
> short. The tools are very easy to use and kind of fun to play with.
> I'm going to try to post a good example of how to use them later today.
>
> I think that using a psrset is an interesting experiment to see whether
> interference is a big factor in all the migrations. It would be nice
> not to have to do that by default, though.
>
> It sounds like you already tried this experiment and noticed a big
> difference. Did the migrations drop dramatically? What else is running
> on the system when you don't use a psrset?
>
>
> Jonathan
>
> >> -----Original Message-----
> >> From: jonathan chew [mailto:[EMAIL PROTECTED]
> >> Sent: Thursday, September 01, 2005 11:50 AM
> >> To: David McDaniel (damcdani)
> >> Cc: Eric C. Saxe; perf-discuss@opensolaris.org
> >> Subject: Re: [perf-discuss] Re: Puzzling scheduler behavior
> >>
> >> Dave,
> >>
> >> It sounds like you have an interesting application. You might want
> >> to create a processor set, leave some CPUs outside the psrset for
> >> other threads to run on, and run your application in the processor
> >> set to minimize interference from other threads. As long as there
> >> are enough CPUs for your application in the psrset, you should see
> >> the number of migrations go down because there won't be any
> >> interference from other threads.
> >>
> >> To get a better understanding of the Solaris performance
> >> optimizations done for NUMA, you might want to check out the
> >> overview of Memory Placement Optimization (MPO) at:
> >>
> >>   http://opensolaris.org/os/community/performance/mpo_overview.pdf
> >>
> >> The stickiness that you observed is because of MPO. Binding to a
> >> processor set containing one CPU set the home lgroup of the thread
> >> to the lgroup containing that CPU, and destroying the psrset just
> >> left the thread homed there.
> >>
> >> Your shared memory is probably spread across the system already,
> >> because the default MPO memory allocation policy for shared memory
> >> is to allocate the memory from random lgroups across the system.
> >>
> >> We have some prototype observability tools which allow you to
> >> examine the lgroup hierarchy and its contents and observe and/or
> >> control how the threads and memory are placed among lgroups (see
> >> http://opensolaris.org/os/community/performance/numa/observability/).
> >> The source, binaries, and man pages are there.
> >>
> >>
> >> Jonathan
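(For anyone following along: assuming the prototypes at that URL are the lgrpinfo/plgrp/pmap additions, the query side looks roughly like the following. The pid is made up, and the exact option letters may differ slightly in the posted versions.)

    lgrpinfo -a          # lgroup hierarchy: CPUs and memory in each lgroup
    plgrp <pid>          # home lgroup of each thread in the process
    plgrp -a all <pid>   # the threads' affinities for every lgroup
    pmap -L <pid>        # which lgroup the memory behind each mapping lives in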
> >> David McDaniel (damcdani) wrote:
> >>
> >>> Very, very enlightening, Eric. It's really terrific to have this
> >>> kind of channel for dialog.
> >>> The "return to home base" behavior you describe is clearly
> >>> consistent with what I see and makes perfect sense.
> >>> Let me follow up with a question. In this application, processes
> >>> have not only their "own" memory, i.e. heap, stack, program text
> >>> and data, etc., but they also share a moderately large (~2-5GB
> >>> today) amount of memory in the form of mmap'd files. From Sherry
> >>> Moore's previous posts, I'm assuming that at startup time that
> >>> would actually all be allocated on one board. Since I'm
> >>> contemplating moving processes onto psrsets off that board, would
> >>> it be plausible to assume that I might get slightly better net
> >>> throughput if I could somehow spread that across all the boards?
> >>> I know it's speculation of the highest order, so maybe my real
> >>> question is whether that's even worth testing.
> >>> In any case, I'd love to turn the knob you mention, and I'll look
> >>> on the performance community page and see what kind of trouble I
> >>> can get into. If there are any particular items you think I should
> >>> check out, guidance is welcome.
> >>> Regards
> >>> -d
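(One low-effort way to test the "spread it across the boards" idea, assuming the madv.so.1 interposition library is available on this release; the binary name is just a stand-in:)

    # start the server with a hint that its memory will be accessed by
    # many threads/processes, so MPO spreads the pages across lgroups
    # instead of placing them next to whoever touches them first
    LD_PRELOAD=madv.so.1 MADV=access_many ./server_binary

The same hint can presumably be applied from code with madvise(3C) and MADV_ACCESS_MANY on just the mmap'd file segments, rather than the whole address space.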
> >>>> -----Original Message-----
> >>>> From: [EMAIL PROTECTED]
> >>>> [mailto:[EMAIL PROTECTED] On Behalf Of Eric C. Saxe
> >>>> Sent: Thursday, September 01, 2005 1:48 AM
> >>>> To: perf-discuss@opensolaris.org
> >>>> Subject: [perf-discuss] Re: Puzzling scheduler behavior
> >>>>
> >>>> Hi David,
> >>>>
> >>>> Since your v1280 system has NUMA characteristics, the bias that
> >>>> you see for one of the boards may be a result of the kernel trying
> >>>> to run your application's threads "close" to where they have
> >>>> allocated their memory. We also generally try to keep threads in
> >>>> the same process together, since they generally tend to work on
> >>>> the same data. This might explain why one of the boards is so much
> >>>> busier than the others.
> >>>>
> >>>> So yes, the interesting piece of this seems to be the higher than
> >>>> expected run queue wait time (latency) as seen via prstat -Lm.
> >>>> Even with the thread-to-board/memory affinity I mentioned above,
> >>>> it generally shouldn't be the case that threads are willing to
> >>>> hang out on a run queue waiting for a CPU in their "home" when
> >>>> that thread *could* actually run immediately on a "remote"
> >>>> (off-board) CPU. Better to run remote than not at all, or so the
> >>>> saying goes :)
> >>>>
> >>>> In the case where a thread is dispatched remotely because all home
> >>>> CPUs are busy, the thread will try to migrate back home the next
> >>>> time it comes through the dispatcher and finds it can run
> >>>> immediately at home (either because there's an idle CPU, or
> >>>> because one of the running threads is lower priority than us, and
> >>>> we can preempt it).
> >>>> This migrating around means that the thread will tend to spend
> >>>> more time waiting on run queues, since it has to either wait for
> >>>> the idle() thread to switch off, or for the lower priority thread
> >>>> it's able to preempt to surrender the CPU. Either way, the thread
> >>>> shouldn't have to wait long to get the CPU, but it will have to
> >>>> wait a non-zero amount of time.
> >>>>
> >>>> What does the prstat -Lm output look like exactly? Is it a lot of
> >>>> wait time, or just more than you would expect?
> >>>>
> >>>> By the way, just to be clear, when I say "board" what I should be
> >>>> saying is lgroup (or locality group). This is the Solaris
> >>>> abstraction for a set of CPU and memory resources that are close
> >>>> to one another.
> >>>> On your system, it turns out that the kernel creates an lgroup for
> >>>> each board, and each thread is given an affinity for one of the
> >>>> lgroups, such that it will try to run on the CPUs (and allocate
> >>>> memory) from that group of resources.
> >>>>
> >>>> One thing to look at here is whether or not the kernel could be
> >>>> "overloading" a given lgroup. This would result in threads tending
> >>>> to be less successful in getting CPU time (and/or memory) in their
> >>>> home. At least for CPU time, you can see this by looking at the
> >>>> number of migrations and where they are taking place.
> >>>> If a thread isn't having much luck running at home, this means
> >>>> that it (and others sharing its home) will tend to "ping-pong"
> >>>> between CPUs in and out of the home lgroup (we refer to this as
> >>>> the "king of the hill" pathology). In your mpstat output, I see
> >>>> many migrations on one of the boards, and a good many on the other
> >>>> boards as well, so that might well be happening here.
> >>>>
> >>>> To get some additional observability into this issue, you might
> >>>> want to take a look at some of the lgroup observability/control
> >>>> tools we posted (available from the performance community page).
> >>>> They allow you to do things like query/set your application's
> >>>> lgroup affinity, find out about the lgroups in the system and what
> >>>> resources they contain, etc. Using them you might be able to
> >>>> confirm some of my theory above.
> >>>> We would also *very* much like any feedback you (or anyone else)
> >>>> would be willing to provide on the tools.
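(On the "set" side of those tools, the kind of thing one could try looks roughly like the following; the pid, LWP ids, and lgroup ids are made up, and the prototype's option letters may differ:)

    # manually re-home two of the busiest LWPs onto different lgroups (boards)
    plgrp -H 2 <pid>/5
    plgrp -H 3 <pid>/9

    # or give an LWP a strong affinity for a particular lgroup, which
    # should also re-home it there
    plgrp -A 2/strong <pid>/5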
> >>>> In the short term, there's a tunable I can suggest you take a
> >>>> look at that deals with how hard the kernel tries to keep threads
> >>>> of the same process together in the same lgroup.
> >>>> Tuning this should result in your workload being spread out more
> >>>> effectively than it currently seems to be. I'll post a follow-up
> >>>> message tomorrow morning with these details, if you'd like to try
> >>>> this.
> >>>>
> >>>> In the medium-short term, we really need to implement a mechanism
> >>>> to dynamically change a thread's lgroup affinity when its home
> >>>> becomes overloaded. We presently don't have this, as the mechanism
> >>>> that determines a thread's home lgroup (and does the lgroup load
> >>>> balancing) is static in nature (done at thread creation time).
> >>>> (Implemented in usr/src/uts/common/os/lgrp.c:lgrp_choose() if
> >>>> you'd like to take a look at the source.) In terms of our NUMA/MPO
> >>>> projects, this one is at the top of the ol' TODO list.

_______________________________________________
perf-discuss mailing list
perf-discuss@opensolaris.org