Thanks for the feedback, Jonathan. I've got it on my todo list to get
those tools and go spelunking a bit. I can't really say that we have a
performance problem; it's more along the lines of me trying to use the
greatly improved observability tools in Solaris to get a better
understanding of things. In any case, it's pretty much relegated to a
science project right now because we can't ship anything that's not
part of some "official" distribution.
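As a first bit of spelunking, the thing I plan to script is simply
asking where our threads are homed. Below is a minimal sketch against
the liblgrp interfaces from the observability page; I haven't built it
against the prototype bits yet, so treat the details as my reading of
the man pages rather than tested code (compile with -llgrp):

    /*
     * Sketch: print the home lgroup of the calling thread using
     * lgrp_home(3LGRP).  P_LWPID/P_MYID name the calling LWP.
     */
    #include <stdio.h>
    #include <sys/lgrp_user.h>
    #include <sys/procset.h>

    int
    main(void)
    {
            lgrp_id_t home = lgrp_home(P_LWPID, P_MYID);

            if (home == -1) {
                    perror("lgrp_home");
                    return (1);
            }
            (void) printf("home lgroup: %d\n", (int)home);
            return (0);
    }

plgrp(1) from the tools page reports the same thing without writing
any code, of course; this is just for wiring into our own harness.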
> -----Original Message-----
> From: jonathan chew [mailto:[EMAIL PROTECTED]
> Sent: Friday, September 09, 2005 6:08 PM
> To: David McDaniel (damcdani)
> Cc: Eric C. Saxe; perf-discuss@opensolaris.org
> Subject: Re: [perf-discuss] Re: Puzzling scheduler behavior
>
> Dave,
>
> Sorry, I forgot to reply to this sooner. Yes, I was just curious what
> else was running to see whether we would expect your application to
> be perturbed much.
>
> There could be a load imbalance due to the daemons throwing
> everything off once in a while. This could be affecting how the
> threads in your application are distributed across the nodes in your
> NUMA machine.
>
> Each thread is assigned a home locality group (lgroup) upon creation,
> and by default the kernel will tend to run it on CPUs in its home
> lgroup and allocate its memory there to minimize latency and maximize
> performance. There is an lgroup corresponding to each of the nodes
> (boards) in your NUMA machine. The assignment of threads to lgroups
> is based on lgroup load averages, so other threads may cause the
> lgroup load average to go up or down and thus affect how threads are
> placed among lgroups.
>
> You can use plgrp(1), which is available on our NUMA observability
> web page at
> http://opensolaris.org/os/community/performance/numa/observability
> to see where your application's processes/threads are homed. Then we
> can see whether they are distributed very well. You can also use
> plgrp(1) to change the home lgroup of a thread, but you should be
> careful because there can be side effects, as explained in the
> example referred to below.
>
> There are man pages, source, and binaries for our tools on the web
> page. I wrote up a good example of how to use the tools to
> understand, observe, and affect thread and memory placement among
> lgroups on a NUMA machine and posted it on the web page at
> http://opensolaris.org/os/community/performance/example.txt.
>
> You can also try using the lgrp_expand_proc_thresh tunable that Eric
> suggested last week.
>
> Are the migrations that you are seeing when not running a psrset
> causing a performance problem for your application?
>
>
> Jonathan
>
>
> David McDaniel (damcdani) wrote:
>
> > When using psrsets, the migrations and involuntary context switches
> > go essentially to zero. As far as "other stuff", I'm not quite sure
> > what you mean, but this application runs on a dedicated server, so
> > there is nothing of a casual nature; however, there is a lot of
> > what I'll glom into the category of "support" tasks, i.e. ntp
> > daemons, nscd flushing caches, fsflush running around backing up
> > pages, etc. Was that what you meant?
> >
> >> -----Original Message-----
> >> From: jonathan chew [mailto:[EMAIL PROTECTED]
> >> Sent: Thursday, September 01, 2005 12:45 PM
> >> To: David McDaniel (damcdani)
> >> Cc: Eric C. Saxe; perf-discuss@opensolaris.org
> >> Subject: Re: [perf-discuss] Re: Puzzling scheduler behavior
> >>
> >> David McDaniel (damcdani) wrote:
> >>
> >>> Thanks, Jonathan, for the good insights. I'll be digging into the
> >>> references you mentioned. Yes, at the end of the day I'm sure
> >>> binding to processor sets is part of the plan; having already
> >>> done so in a rather rote way, I can demonstrate a very dramatic
> >>> reduction in apparent cpu utilization, on the order of 25-30%.
> >>> But before I commit engineers to casting something in stone, I
> >>> want to make sure I understand the defaults and the side effects
> >>> of doing so, since it potentially results in defeating other
> >>> improvements that Sun has done or will be doing.
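(Jumping in on my own words above: the "rather rote way" was psrset(1M)
by hand, essentially psrset -c followed by psrset -b. If we do commit
to it, the programmatic equivalent is roughly the sketch below. The
CPU ids are placeholders for whatever lives on the target board, it
needs the usual privileges, and error handling is abbreviated:)

    /*
     * Sketch: create a processor set, move two (placeholder) CPUs
     * into it, and bind the calling process to it -- what
     * psrset -c / psrset -b do from the shell.
     */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/processor.h>
    #include <sys/pset.h>
    #include <sys/procset.h>

    int
    main(void)
    {
            psetid_t pset;
            processorid_t cpus[] = { 0, 1 };    /* placeholders */
            int i;

            if (pset_create(&pset) != 0) {
                    perror("pset_create");
                    return (1);
            }
            for (i = 0; i < 2; i++) {
                    if (pset_assign(pset, cpus[i], NULL) != 0)
                            perror("pset_assign");
            }
            /* Bind this process (all of its LWPs) to the new set. */
            if (pset_bind(pset, P_PID, getpid(), NULL) != 0) {
                    perror("pset_bind");
                    return (1);
            }
            (void) printf("pid %ld bound to pset %d\n",
                (long)getpid(), (int)pset);
            return (0);
    }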
> >>
> >> Sure. No problem. The overview and man pages for our tools are
> >> pretty short. The tools are very easy to use and kind of fun to
> >> play with. I'm going to try to post a good example of how to use
> >> them later today.
> >>
> >> I think that using a psrset is an interesting experiment to see
> >> whether interference is a big factor in all the migrations. It
> >> would be nice not to have to do that by default, though.
> >>
> >> It sounds like you already tried this experiment, though, and
> >> noticed a big difference. Did the migrations drop dramatically?
> >> What else is running on the system when you don't use a psrset?
> >>
> >>
> >> Jonathan
> >>
> >>
> >>>> -----Original Message-----
> >>>> From: jonathan chew [mailto:[EMAIL PROTECTED]
> >>>> Sent: Thursday, September 01, 2005 11:50 AM
> >>>> To: David McDaniel (damcdani)
> >>>> Cc: Eric C. Saxe; perf-discuss@opensolaris.org
> >>>> Subject: Re: [perf-discuss] Re: Puzzling scheduler behavior
> >>>>
> >>>> Dave,
> >>>>
> >>>> It sounds like you have an interesting application. You might
> >>>> want to create a processor set, leave some CPUs outside the
> >>>> psrset for other threads to run on, and run your application in
> >>>> the processor set to minimize interference from other threads.
> >>>> As long as there are enough CPUs for your application in the
> >>>> psrset, you should see the number of migrations go down because
> >>>> there won't be any interference from other threads.
> >>>>
> >>>> To get a better understanding of the Solaris performance
> >>>> optimizations done for NUMA, you might want to check out the
> >>>> overview of Memory Placement Optimization (MPO) at:
> >>>> http://opensolaris.org/os/community/performance/mpo_overview.pdf
> >>>>
> >>>> The stickiness that you observed is because of MPO. Binding to a
> >>>> processor set containing one CPU set the home lgroup of the
> >>>> thread to the lgroup containing that CPU, and destroying the
> >>>> psrset just left the thread homed there.
> >>>>
> >>>> Your shared memory is probably spread across the system already,
> >>>> because the default MPO memory allocation policy for shared
> >>>> memory is to allocate the memory from random lgroups across the
> >>>> system.
> >>>>
> >>>> We have some prototype observability tools which allow you to
> >>>> examine the lgroup hierarchy and its contents and observe and/or
> >>>> control how the threads and memory are placed among lgroups (see
> >>>> http://opensolaris.org/os/community/performance/numa/observability/).
> >>>> The source, binaries, and man pages are there.
> >>>>
> >>>>
> >>>> Jonathan
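(The hierarchy examination Jonathan mentions above is the part I most
want to script, so here is my guess at a minimal walker over the
lgroup hierarchy, derived from the liblgrp man pages on the tools page
and untested on my side; compile with -llgrp:)

    /*
     * Sketch: recursively walk the lgroup hierarchy, printing the
     * CPUs directly contained in each lgroup.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/processor.h>
    #include <sys/lgrp_user.h>

    static void
    walk(lgrp_cookie_t c, lgrp_id_t id)
    {
            int n, i;

            /* A call with NULL/0 just returns the count. */
            n = lgrp_cpus(c, id, NULL, 0, LGRP_CONTENT_DIRECT);
            if (n > 0) {
                    processorid_t *cpus = malloc(n * sizeof (*cpus));
                    (void) lgrp_cpus(c, id, cpus, n,
                        LGRP_CONTENT_DIRECT);
                    (void) printf("lgroup %d:", (int)id);
                    for (i = 0; i < n; i++)
                            (void) printf(" cpu%d", (int)cpus[i]);
                    (void) printf("\n");
                    free(cpus);
            }
            n = lgrp_children(c, id, NULL, 0);
            if (n > 0) {
                    lgrp_id_t *kids = malloc(n * sizeof (*kids));
                    (void) lgrp_children(c, id, kids, n);
                    for (i = 0; i < n; i++)
                            walk(c, kids[i]);
                    free(kids);
            }
    }

    int
    main(void)
    {
            lgrp_cookie_t c = lgrp_init(LGRP_VIEW_OS);

            if (c == LGRP_COOKIE_NONE) {
                    perror("lgrp_init");
                    return (1);
            }
            walk(c, lgrp_root(c));
            (void) lgrp_fini(c);
            return (0);
    }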
> >>>>>The "return to home base" behavior you describe is clearly > >>>>> > >>>>> > >>>>> > >>>>> > >>>>consistent > >>>> > >>>> > >>>> > >>>> > >>>>>with what I see and makes perfect sense. > >>>>>Let me followup with a question. In this application, > >>>>> > >>>>> > >>>>> > >>>>> > >>>>processes have > >>>> > >>>> > >>>> > >>>> > >>>>>not only their "own" memory, ie heap, stack program text and > >>>>> > >>>>> > >>>>> > >>>>> > >>>>data, etc, > >>>> > >>>> > >>>> > >>>> > >>>>>but they also share a moderately large (~ 2-5GB today) > >>>>> > >>>>> > >>>>> > >>>>> > >>>>amount of memory > >>>> > >>>> > >>>> > >>>> > >>>>>in the form of mmap'd files. From Sherry Moore's previous > >>>>> > >>>>> > >>posts, I'm > >> > >> > >>>>>assuming that at startup time that would actually be all > >>>>> > >>>>> > >>>>> > >>>>> > >>>>allocated in > >>>> > >>>> > >>>> > >>>> > >>>>>one board. Since I'm contemplating moving processes onto > >>>>> > >>>>> > >>psrsets off > >> > >> > >>>>>that board, would it be plausible to assume that I might get > >>>>> > >>>>> > >>>>> > >>>>> > >>>>slightly > >>>> > >>>> > >>>> > >>>> > >>>>>better net throughput if I could somehow spread that > >>>>> > >>>>> > >>across all the > >> > >> > >>>>>boards? I know its speculation of the highest order, so > >>>>> > >>>>> > >>>>> > >>>>> > >>>>maybe my real > >>>> > >>>> > >>>> > >>>> > >>>>>question is whether that's even worth testing. > >>>>>In any case, I'd love to turn the knob you mention and > >>>>> > >>>>> > >>>>> > >>>>> > >>>>I'll look on > >>>> > >>>> > >>>> > >>>> > >>>>>the performance community page and see what kind of trouble > >>>>> > >>>>> > >>>>> > >>>>> > >>>>I can get > >>>> > >>>> > >>>> > >>>> > >>>>>into. If there are any particular items you think I should > >>>>> > >>>>> > >>>>> > >>>>> > >>>>check out, > >>>> > >>>> > >>>> > >>>> > >>>>>guidance is welcome. > >>>>>Regards > >>>>>-d > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>>>-----Original Message----- > >>>>>>From: [EMAIL PROTECTED] > >>>>>>[mailto:[EMAIL PROTECTED] On Behalf > >>>>>> > >>>>>> > >>Of Eric C. > >> > >> > >>>>>>Saxe > >>>>>>Sent: Thursday, September 01, 2005 1:48 AM > >>>>>>To: perf-discuss@opensolaris.org > >>>>>>Subject: [perf-discuss] Re: Puzzling scheduler behavior > >>>>>> > >>>>>>Hi David, > >>>>>> > >>>>>>Since your v1280 systems has NUMA characteristics, the bias > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>that you > >>>> > >>>> > >>>> > >>>> > >>>>>>see for one of the boards may be a result of the kernel > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>trying to run > >>>> > >>>> > >>>> > >>>> > >>>>>>your application's threads "close" to where they have > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>allocated their > >>>> > >>>> > >>>> > >>>> > >>>>>>memory. We also generally try to keep threads in the > same process > >>>>>>together, since they generally tend to work on the same > >>>>>> > >>>>>> > >>data. This > >> > >> > >>>>>>might explain why one of the boards is so much busier than > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>the others. > >>>> > >>>> > >>>> > >>>> > >>>>>>So yes, the interesting piece of this seems to be the > higher than > >>>>>>expected run queue wait time (latency) as seen via prstat > >>>>>> > >>>>>> > >>-Lm. 
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: [EMAIL PROTECTED]
> >>>>>> [mailto:[EMAIL PROTECTED] On Behalf Of Eric C. Saxe
> >>>>>> Sent: Thursday, September 01, 2005 1:48 AM
> >>>>>> To: perf-discuss@opensolaris.org
> >>>>>> Subject: [perf-discuss] Re: Puzzling scheduler behavior
> >>>>>>
> >>>>>> Hi David,
> >>>>>>
> >>>>>> Since your v1280 system has NUMA characteristics, the bias
> >>>>>> that you see for one of the boards may be a result of the
> >>>>>> kernel trying to run your application's threads "close" to
> >>>>>> where they have allocated their memory. We also generally try
> >>>>>> to keep threads in the same process together, since they
> >>>>>> generally tend to work on the same data. This might explain
> >>>>>> why one of the boards is so much busier than the others.
> >>>>>>
> >>>>>> So yes, the interesting piece of this seems to be the higher
> >>>>>> than expected run queue wait time (latency) as seen via
> >>>>>> prstat -Lm. Even with the thread-to-board/memory affinity I
> >>>>>> mentioned above, it generally shouldn't be the case that
> >>>>>> threads are willing to hang out on a run queue waiting for a
> >>>>>> CPU in their "home" when that thread *could* actually run
> >>>>>> immediately on a "remote" (off-board) CPU. Better to run
> >>>>>> remote than not at all, or at least so the saying goes :)
> >>>>>>
> >>>>>> In the case where a thread is dispatched remotely because all
> >>>>>> home CPUs are busy, the thread will try to migrate back home
> >>>>>> the next time it comes through the dispatcher and finds it can
> >>>>>> run immediately at home (either because there's an idle CPU,
> >>>>>> or because one of the running threads is lower priority than
> >>>>>> us, and we can preempt it). This migrating around means that
> >>>>>> the thread will tend to spend more time waiting on run queues,
> >>>>>> since it has to either wait for the idle() thread to switch
> >>>>>> off, or for the lower priority thread it's able to preempt to
> >>>>>> surrender the CPU. Either way, the thread shouldn't have to
> >>>>>> wait long to get the CPU, but it will have to wait a non-zero
> >>>>>> amount of time.
> >>>>>>
> >>>>>> What does the prstat -Lm output look like exactly? Is it a lot
> >>>>>> of wait time, or just more than you would expect?
> >>>>>>
> >>>>>> By the way, just to be clear, when I say "board" what I should
> >>>>>> be saying is lgroup (or locality group). This is the Solaris
> >>>>>> abstraction for a set of CPU and memory resources that are
> >>>>>> close to one another. On your system, it turns out that the
> >>>>>> kernel creates an lgroup for each board, and each thread is
> >>>>>> given an affinity for one of the lgroups, such that it will
> >>>>>> try to run on the CPUs (and allocate memory) from that group
> >>>>>> of resources.
> >>>>>>
> >>>>>> One thing to look at here is whether or not the kernel could
> >>>>>> be "overloading" a given lgroup. This would result in threads
> >>>>>> tending to be less successful in getting CPU time (and/or
> >>>>>> memory) in their home. At least for CPU time, you can see this
> >>>>>> by looking at the number of migrations and where they are
> >>>>>> taking place. If the thread isn't having much luck running at
> >>>>>> home, this means that it (and others sharing its home) will
> >>>>>> tend to "ping-pong" between CPUs in and out of the home lgroup
> >>>>>> (we refer to this as the "king of the hill" pathology). In
> >>>>>> your mpstat output, I see many migrations on one of the
> >>>>>> boards, and a good many on the other boards as well, so that
> >>>>>> might well be happening here.
> >>>>>>
> >>>>>> To get some additional observability into this issue, you
> >>>>>> might want to take a look at some of our lgroup
> >>>>>> observability/control tools we posted (available from the
> >>>>>> performance community page). They allow you to do things like
> >>>>>> query/set your application's lgroup affinity, find out about
> >>>>>> the lgroups in the system and what resources they contain,
> >>>>>> etc. Using them you might be able to confirm some of my theory
> >>>>>> above. We would also *very* much like any feedback you (or
> >>>>>> anyone else) would be willing to provide on the tools.
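(The "set your application's lgroup affinity" knob is the one I'm most
tempted by. Below is my reading of the call that plgrp(1) presumably
uses when it rehomes a thread; the target lgroup id is a placeholder,
and per Jonathan's caution above, rehoming a thread has side effects:)

    /*
     * Sketch: give the calling LWP a strong affinity for one
     * (placeholder) lgroup, which rehomes it there.  Compile with
     * -llgrp.
     */
    #include <stdio.h>
    #include <sys/lgrp_user.h>
    #include <sys/procset.h>

    int
    main(void)
    {
            lgrp_id_t target = 1;   /* placeholder lgroup id */

            /* LGRP_AFF_WEAK would be the advisory variant. */
            if (lgrp_affinity_set(P_LWPID, P_MYID, target,
                LGRP_AFF_STRONG) != 0) {
                    perror("lgrp_affinity_set");
                    return (1);
            }
            (void) printf("home is now lgroup %d\n",
                (int)lgrp_home(P_LWPID, P_MYID));
            return (0);
    }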
> >>>>>> In the short term, there's a tunable I can suggest you take a
> >>>>>> look at that deals with how hard the kernel tries to keep
> >>>>>> threads of the same process together in the same lgroup.
> >>>>>> Tuning this should result in your workload being spread out
> >>>>>> more effectively than it currently seems to be. I'll post a
> >>>>>> follow-up message tomorrow morning with these details, if
> >>>>>> you'd like to try this.
> >>>>>>
> >>>>>> In the medium-short term, we really need to implement a
> >>>>>> mechanism to dynamically change a thread's lgroup affinity
> >>>>>> when its home becomes overloaded. We presently don't have
> >>>>>> this, as the mechanism that determines a thread's home lgroup
> >>>>>> (and does the lgroup load balancing) is static in nature (done
> >>>>>> at thread creation time). (Implemented in
> >>>>>> usr/src/uts/common/os/lgrp.c:lgrp_choose() if you'd like to
> >>>>>> take a look at the source.) In terms of our NUMA/MPO projects,
> >>>>>> this one is at the top of the ol' TODO list.

_______________________________________________
perf-discuss mailing list
perf-discuss@opensolaris.org