RE:
> When you say that you can't ship anything that's not part of 
> some "official" distribution, are you referring to our tools 
> or your software?
   I was referring to the tools. So, as an in-house effort in our labs
we can use the tools to observe the application and expose
opportunities, etc. Then later on, when the tools become "official", we
can apply anything we've learned in the field.
   As far as changing the default behavior of the OS, there are only a
couple of things I know of right now that are truly problematic. The
first is the performance of POSIX robust (ROBUST_NP) mutexes. Because
of reliability concerns we use these all over the place, and the
performance difference between them and normal, non-robust mutexes is
pretty startling. Phil Harman was looking at that, but I think it
turned out to be more involved than first thought. The other is that we
can't find a way to use large pages with mmap'd files. We spend a
high-single-digit percentage of our time in TLB and TSB miss processing
that could be avoided, IMHO.
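   Just to make both issues concrete, here is roughly their shape (a
minimal sketch, not our actual code; the function names and the 4MB
page size are just illustrations, and whether MC_HAT_ADVISE really
buys large pages on a file-backed mapping is exactly the open
question):

    #include <pthread.h>
    #include <sys/types.h>
    #include <sys/mman.h>

    /* Robust, process-shared mutex setup -- the flavor we lean on. */
    static void
    init_robust_lock(pthread_mutex_t *mp)
    {
        pthread_mutexattr_t attr;

        (void) pthread_mutexattr_init(&attr);
        (void) pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        (void) pthread_mutexattr_setrobust_np(&attr, PTHREAD_MUTEX_ROBUST_NP);
        (void) pthread_mutex_init(mp, &attr);
        (void) pthread_mutexattr_destroy(&attr);
        /*
         * Lockers then watch for EOWNERDEAD and call
         * pthread_mutex_consistent_np() to recover the lock.
         */
    }

    /* Asking for a larger page size on an mmap'd file. */
    static void *
    map_with_large_pages(int fd, size_t len)
    {
        struct memcntl_mha mha;
        void *p;

        p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            return (NULL);
        mha.mha_cmd = MHA_MAPSIZE_VA;
        mha.mha_flags = 0;
        mha.mha_pagesize = 4 * 1024 * 1024;   /* e.g. 4MB on sparc */
        (void) memcntl((caddr_t)p, len, MC_HAT_ADVISE, (caddr_t)&mha, 0, 0);
        return (p);
    }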
   But who knows, using these tools I might have one of those eureka
moments and find either something in our implementation or some little
thing in the OS that we could change to make a world of difference.
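   For instance, the first thing I'll probably do is check where our
threads actually end up homed, either with plgrp(1) from the tools
page or with something like this little liblgrp sketch (illustrative
only; compile with -llgrp):

    #include <sys/lgrp_user.h>
    #include <sys/procset.h>
    #include <stdio.h>

    int
    main(void)
    {
        lgrp_cookie_t cookie = lgrp_init(LGRP_VIEW_OS);

        if (cookie == LGRP_COOKIE_NONE) {
            perror("lgrp_init");
            return (1);
        }
        /* How many lgroups there are, and where this LWP is homed. */
        (void) printf("lgroups in system: %d, my home lgroup: %d\n",
            lgrp_nlgrps(cookie), (int)lgrp_home(P_LWPID, P_MYID));
        (void) lgrp_fini(cookie);
        return (0);
    }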
-d

> -----Original Message-----
> From: jonathan chew [mailto:[EMAIL PROTECTED] 
> Sent: Friday, September 16, 2005 5:40 PM
> To: David McDaniel [EMAIL PROTECTED]
> Cc: Eric C. Saxe; perf-discuss@opensolaris.org
> Subject: Re: [perf-discuss] Re: Puzzling scheduler behavior
> 
> David McDaniel (damcdani) wrote:
> 
> >  Thanks for the feedback, Jonathan. I've got it on my todo list to
> >get those tools and go spelunking a bit. I cant really say that we
> >have a performance problem, its more along the lines of me trying to
> >use the greatly improved observability tools in Solaris to get a
> >better understanding of things. In any case, its pretty much
> >relegated to a science project right now because we cant ship
> >anything that's not part of some "official" distribution?
> >
> 
> Ok.  The tools are pretty easy to use.  If you have any 
> questions, we would be happy to help and welcome any feedback 
> on the tools or documentation.
> 
> When you say that you can't ship anything that's not part of 
> some "official" distribution, are you referring to our tools 
> or your software?
> 
> I am suggesting using our tools to understand the behavior of 
> your application and its interaction with the operating 
> system better and determine whether there is a problem or 
> not.  If there is a problem in the OS, we can try to fix the 
> default behavior.
> 
> As Sasha pointed out, it is our intention to ship our 
> observability tools, but we wanted to let the OpenSolaris 
> community try them first to see whether they are useful.
> 
> Last but not least, we can try running your application if you want.
> 
> 
> 
> Jonathan
> 
> >>-----Original Message-----
> >>From: jonathan chew [mailto:[EMAIL PROTECTED]
> >>Sent: Friday, September 09, 2005 6:08 PM
> >>To: David McDaniel (damcdani)
> >>Cc: Eric C. Saxe; perf-discuss@opensolaris.org
> >>Subject: Re: [perf-discuss] Re: Puzzling scheduler behavior
> >>
> >>Dave,
> >>
> >>Sorry, I forgot to reply to this sooner.  Yes, I was just curious
> >>what else was running to see whether we would expect your
> >>application to be perturbed much.
> >>
> >>There could be a load imbalance due to the daemons throwing
> >>everything off once in awhile.  This could be affecting how the
> >>threads in your application are distributed across the nodes in
> >>your NUMA machine.
> >>
> >>Each thread is assigned a home locality group upon creation and the
> >>kernel will tend to run it on CPUs in its home lgroup and allocate
> >>its memory there to minimize latency and maximize performance by
> >>default.  There is an lgroup corresponding to each of the nodes
> >>(boards) in your NUMA machine.  The assignment of threads to
> >>lgroups is based on lgroup load averages, so other threads may
> >>cause the lgroup load average to go up or down and thus affect how
> >>threads are placed among lgroups.
> >>
> >>You can use plgrp(1) which is available on our NUMA observability
> >>web page at
> >>http://opensolaris.org/os/community/performance/numa/observability
> >>to see where your application processes/threads are homed.  Then we
> >>can see whether they are distributed very well.  You can also use
> >>plgrp(1) to change the home lgroup of a thread, but should be
> >>careful because there can be side effects as explained in the
> >>example referred to below.
> >>
> >>There are man pages, source, and binaries for our tools on the web
> >>page.  I wrote up a good example of how to use the tools to
> >>understand, observe, and affect thread and memory placement among
> >>lgroups on a NUMA machine and posted it on the web page in
> >>http://opensolaris.org/os/community/performance/example.txt.
> >>
> >>You can also try using the lgrp_expand_proc_thresh tunable that
> >>Eric suggested last week.
> >>
> >>Are the migrations that you are seeing when not running a psrset
> >>causing a performance problem for your application?
> >>
> >>
> >>
> >>Jonathan
> >>
> >>
> >>David McDaniel (damcdani) wrote:
> >>
> >>> When using prsets, the migrations and involuntary context
> >>>switches go essentially to zero. As far as "other stuff", not
> >>>quite sure what you mean, but this application runs on a
> >>>dedicated server so there is no stuff of a casueal nature,
> >>>however there is a lot of what I'll glom into the category of
> >>>"support" tasks, ie ntp daemons, nscd flushing caches, fsflush
> >>>running around backing up pages, etc. Was that what you meant?
> >>>
> >>>
> >>>>-----Original Message-----
> >>>>From: jonathan chew [mailto:[EMAIL PROTECTED]
> >>>>Sent: Thursday, September 01, 2005 12:45 PM
> >>>>To: David McDaniel (damcdani)
> >>>>Cc: Eric C. Saxe; perf-discuss@opensolaris.org
> >>>>Subject: Re: [perf-discuss] Re: Puzzling scheduler behavior
> >>>>
> >>>>David McDaniel (damcdani) wrote:
> >>>>
> >>>>>Thanks, Jonathon for the good insights. I'll be digging into
> >>>>>the references you mentioned. Yes, at the end of the day I'm
> >>>>>sure binding to processor sets is part of the plan; having
> >>>>>already done so in a rather rote way I can demonstrate a very
> >>>>>dramatic reduction in apparent cpu utilzation, on the order of
> >>>>>25-30%. But before I commit engineers to casting something in
> >>>>>stone I want to make sure I understand the defaults and the
> >>>>>side effects of doing so since it potentially results in
> >>>>>defeating other improvements that Sun has done or will be
> >>>>>doing.
> >>>>>
> >>>>Sure.  No problem.  The overview and man pages for our tools are
> >>>>pretty short.  The tools are very easy to use and kind of fun to
> >>>>play with.  I'm going to try to post a good example of how to
> >>>>use them later today.
> >>>>
> >>>>I think that using a psrset is an interesting experiment to see
> >>>>whether interference is a big factor in all the migrations.  It
> >>>>would be nice not to have to do that by default though.
> >>>>
> >>>>It sounds like you already tried this experiment though and
> >>>>noticed a big difference.  Did the migrations drop dramatically?
> >>>>What else is running on the system when you don't use a psrset?
> >>>>
> >>>>
> >>>>Jonathan
> >>>>
> >>>>>>-----Original Message-----
> >>>>>>From: jonathan chew [mailto:[EMAIL PROTECTED]
> >>>>>>Sent: Thursday, September 01, 2005 11:50 AM
> >>>>>>To: David McDaniel (damcdani)
> >>>>>>Cc: Eric C. Saxe; perf-discuss@opensolaris.org
> >>>>>>Subject: Re: [perf-discuss] Re: Puzzling scheduler behavior
> >>>>>>
> >>>>>>Dave,
> >>>>>>
> >>>>>>It sounds like you have an interesting application.  You might
> >>>>>>want to create a processor set, leave some CPUs outside the
> >>>>>>psrset for other threads to run on, and run your application
> >>>>>>in a processor set to minimize interference from other
> >>>>>>threads.  As long as there are enough CPUs for your
> >>>>>>application in the psrset, you should see the number of
> >>>>>>migrations go down because there won't be any interference
> >>>>>>from other threads.
> >>>>>>
> >>>>>>To get a better understanding of the Solaris performance
> >>>>>>optimizations done for NUMA, you might want to check out the
> >>>>>>overview of Memory Placement Optimization (MPO) at:
> >>>>>>
> >>>>>>http://opensolaris.org/os/community/performance/mpo_overview.pdf
> >>>>>>
> >>>>>>The stickiness that you observed is because of MPO.  Binding
> >>>>>>to a processor set containing one CPU set the home lgroup of
> >>>>>>the thread to the lgroup containing that CPU and destroying
> >>>>>>the psrset just left the thread homed there.
> >>>>>>
> >>>>>>Your shared memory is probably spread across the system
> >>>>>>already because the default MPO memory allocation policy for
> >>>>>>shared memory is to allocate the memory from random lgroups
> >>>>>>across the system.
> >>>>>>
> >>>>>>We have some prototype observability tools which allow you to
> >>>>>>examine the lgroup hierarchy and it contents and observe
> >>>>>>and/or control how the threads and memory are placed among
> >>>>>>lgroups (see
> >>>>>>http://opensolaris.org/os/community/performance/numa/observability/).
> >>>>>>The source, binaries, and man pages are there.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>Jonathan
> >>>>>>
> >>>>>>
> >>>>>>David McDaniel (damcdani) wrote:
> >>>>>>
> >>>>>>>Very, very enlightening, Eric. Its really terrific to have
> >>>>>>>this kind of channel for dialog.
> >>>>>>>The "return to home base" behavior you describe is clearly
> >>>>>>>consistent with what I see and makes perfect sense.
> >>>>>>>Let me followup with a question. In this application,
> >>>>>>>processes have not only their "own" memory, ie heap, stack
> >>>>>>>program text and data, etc, but they also share a moderately
> >>>>>>>large (~ 2-5GB today) amount of memory in the form of mmap'd
> >>>>>>>files. From Sherry Moore's previous posts, I'm assuming that
> >>>>>>>at startup time that would actually be all allocated in one
> >>>>>>>board. Since I'm contemplating moving processes onto psrsets
> >>>>>>>off that board, would it be plausible to assume that I might
> >>>>>>>get slightly better net throughput if I could somehow spread
> >>>>>>>that across all the boards? I know its speculation of the
> >>>>>>>highest order, so maybe my real question is whether that's
> >>>>>>>even worth testing.
> >>>>>>>In any case, I'd love to turn the knob you mention and I'll
> >>>>>>>look on the performance community page and see what kind of
> >>>>>>>trouble I can get into. If there are any particular items you
> >>>>>>>think I should check out, guidance is welcome.
> >>>>>>>Regards
> >>>>>>>-d
> >>>>>>>
> >>>>>>>
> >>>>>>>>-----Original Message-----
> >>>>>>>>From: [EMAIL PROTECTED]
> >>>>>>>>[mailto:[EMAIL PROTECTED] On Behalf Of Eric C. Saxe
> >>>>>>>>Sent: Thursday, September 01, 2005 1:48 AM
> >>>>>>>>To: perf-discuss@opensolaris.org
> >>>>>>>>Subject: [perf-discuss] Re: Puzzling scheduler behavior
> >>>>>>>>
> >>>>>>>>Hi David,
> >>>>>>>>
> >>>>>>>>Since your v1280 systems has NUMA characteristics, the bias
> >>>>>>>>that you see for one of the boards may be a result of the
> >>>>>>>>kernel trying to run your application's threads "close" to
> >>>>>>>>where they have allocated their memory. We also generally
> >>>>>>>>try to keep threads in the same process together, since they
> >>>>>>>>generally tend to work on the same data. This might explain
> >>>>>>>>why one of the boards is so much busier than the others.
> >>>>>>>>
> >>>>>>>>So yes, the interesting piece of this seems to be the higher
> >>>>>>>>than expected run queue wait time (latency) as seen via
> >>>>>>>>prstat -Lm. Even with the thread-to-board/memory affinity I
> >>>>>>>>mentioned above, it generally shouldn't be the case that
> >>>>>>>>threads are willing to hang out on a run queue waiting for a
> >>>>>>>>CPU on their "home" when that thread *could* actually run
> >>>>>>>>immediately on a "remote" (off-board) CPU.
> >>>>>>>>Better to run remote, than not at all, or at least the
> >>>>>>>>saying goes :)
> >>>>>>>>
> >>>>>>>>In the case where a thread is dispatched remotely because
> >>>>>>>>all home CPUs are busy, the thread will try to migrate back
> >>>>>>>>home the next time it comes through the dispatcher and finds
> >>>>>>>>it can run immediately at home (either because there's an
> >>>>>>>>idle CPU, or because one of the running threads is lower
> >>>>>>>>priority than us, and we can preempt it).
> >>>>>>>>This migrating around means that the thread will tend to
> >>>>>>>>spend more time waiting on run queues, since it has to
> >>>>>>>>either wait for the idle() thread to switch off, or for the
> >>>>>>>>lower priority thread it's able to preempt to surrender the
> >>>>>>>>CPU. Either way, the thread shouldn't have to wait long to
> >>>>>>>>get the CPU, but it will have to wait a non-zero amount of
> >>>>>>>>time.
> >>>>>>>>
> >>>>>>>>What does the prstat -Lm output look like exactly? Is it a
> >>>>>>>>lot of wait time, or just more than you would expect?
> >>>>>>>>
> >>>>>>>>By the way, just to be clear, when I say "board" what I
> >>>>>>>>should be saying is lgroup (or locality group). This is the
> >>>>>>>>Solaris abstraction for a set of CPU and memory resources
> >>>>>>>>that are close to one another.
> >>>>>>>>On your system, it turns out that kernel creates an lgroup
> >>>>>>>>for each board, and each thread is given an affinity for one
> >>>>>>>>of the lgroups, such that it will try to run on the CPUs
> >>>>>>>>(and allocate memory from that group of resources.
> >>>>>>>>
> >>>>>>>>One thing to look at here is whether or not the kernel could
> >>>>>>>>be "overloading" a given lgroup. This would result in
> >>>>>>>>threads tending to be less sucessful in getting CPU time
> >>>>>>>>(and/or memory) in their home. At least for CPU time, you
> >>>>>>>>can see this by looking at the number of migrations and
> >>>>>>>>where they are taking place.
> >>>>>>>>If the thread isn't having much luck running at home, this
> >>>>>>>>means that it (and others sharing it's home) will tend to
> >>>>>>>>"ping-pong" between CPU in and out of the home lgroup (we
> >>>>>>>>refer to this as the "king of the hill" pathology). In your
> >>>>>>>>mpstat output, I see many migrations on one of the boards,
> >>>>>>>>and a good many on the other boards as well, so that might
> >>>>>>>>well be happening here.
> >>>>>>>>
> >>>>>>>>To get some additional observability into this issue, you
> >>>>>>>>might want to take a look at some of our lgroup
> >>>>>>>>observability/control tools we posted (available from the
> >>>>>>>>performance community page). They allow you to do things
> >>>>>>>>like query/set your application's lgroup affinity, find out
> >>>>>>>>about the lgroups in the system, and what resources they
> >>>>>>>>contain, etc. Using them you might be able to confirm some
> >>>>>>>>of my theory above.
> >>>>>>>>We would also *very* much like any feedback you (or anyone
> >>>>>>>>else) would be willing to provide on the tools.
> >>>>>>>>
> >>>>>>>>In the short term, there's a tunable I can suggest you take
> >>>>>>>>a look at that deals with how hard the kernel tries to keep
> >>>>>>>>threads of the same process together in the same lgroup.
> >>>>>>>>Tuning this should result in your workload being spread out
> >>>>>>>>more effectively than it currently seems to be. I'll post a
> >>>>>>>>follow up message tomorrow morning with these details, if
> >>>>>>>>you'd like to try this.
> >>>>>>>>
> >>>>>>>>In the medium-short term, we really need to implement a
> >>>>>>>>mechanism to dynamically change a thread's lgroup affinity
> >>>>>>>>when it's home becomes overloaded. We presently don't have
> >>>>>>>>this, as the mechanism that determines a thread's home
> >>>>>>>>lgroup (and does the lgroup load balancing) is static in
> >>>>>>>>nature (done at thread creation time). (Implemented in
> >>>>>>>>usr/src/uts/common/os/lgrp.c:lgrp_choose() if you'd like to
> >>>>>>>>take a look a the source.) In terms of our NUMA/MPO
> >>>>>>>>projects, this one is at the top of the 'ol TODO list.
> >>>>>>>>This message posted from opensolaris.org
> >>>>>>>>_______________________________________________
> >>>>>>>>perf-discuss mailing list
> >>>>>>>>perf-discuss@opensolaris.org
> >>>>>>>>
> >>>>>>>>
> >>>>>>>_______________________________________________
> >>>>>>>perf-discuss mailing list
> >>>>>>>perf-discuss@opensolaris.org
> >>>>>>>
> >>>>>>>
> >
> >  
> >
> 
_______________________________________________
perf-discuss mailing list
perf-discuss@opensolaris.org
