When using psrsets, the migrations and involuntary context switches go essentially to zero. As far as "other stuff" goes, I'm not quite sure what you mean, but this application runs on a dedicated server, so there is nothing of a casual nature running; there is, however, a fair amount of what I'll glom together into the category of "support" tasks, i.e. ntp daemons, nscd flushing its caches, fsflush running around writing back pages, etc. Was that what you meant?
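For reference, the experiment was roughly the following (the CPU IDs, set id, and pid below are illustrative, not our actual layout):

    # create a processor set from some of the CPUs on one board and
    # bind the application to it (psrset prints the new set id, e.g. 1)
    psrset -c 8 9 10 11
    psrset -b 1 <pid>

    # then watch migrations and involuntary context switches settle down
    prstat -mL -p <pid> 5      # LAT and ICX columns per LWP
    mpstat 5                   # icsw and migr columns per CPU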
> -----Original Message-----
> From: jonathan chew [mailto:[EMAIL PROTECTED]
> Sent: Thursday, September 01, 2005 12:45 PM
> To: David McDaniel (damcdani)
> Cc: Eric C. Saxe; perf-discuss@opensolaris.org
> Subject: Re: [perf-discuss] Re: Puzzling scheduler behavior
>
> David McDaniel (damcdani) wrote:
>
> > Thanks, Jonathan, for the good insights. I'll be digging into the
> > references you mentioned. Yes, at the end of the day I'm sure binding
> > to processor sets is part of the plan; having already done so in a
> > rather rote way, I can demonstrate a very dramatic reduction in
> > apparent CPU utilization, on the order of 25-30%. But before I commit
> > engineers to casting something in stone, I want to make sure I
> > understand the defaults and the side effects of doing so, since it
> > potentially results in defeating other improvements that Sun has done
> > or will be doing.
>
> Sure. No problem. The overview and man pages for our tools are pretty
> short. The tools are very easy to use and kind of fun to play with.
> I'm going to try to post a good example of how to use them later today.
>
> I think that using a psrset is an interesting experiment to see whether
> interference is a big factor in all the migrations. It would be nice
> not to have to do that by default, though.
>
> It sounds like you already tried this experiment and noticed a big
> difference. Did the migrations drop dramatically? What else is running
> on the system when you don't use a psrset?
>
>
> Jonathan
>
> >> -----Original Message-----
> >> From: jonathan chew [mailto:[EMAIL PROTECTED]
> >> Sent: Thursday, September 01, 2005 11:50 AM
> >> To: David McDaniel (damcdani)
> >> Cc: Eric C. Saxe; perf-discuss@opensolaris.org
> >> Subject: Re: [perf-discuss] Re: Puzzling scheduler behavior
> >>
> >> Dave,
> >>
> >> It sounds like you have an interesting application. You might want
> >> to create a processor set, leave some CPUs outside the psrset for
> >> other threads to run on, and run your application in the processor
> >> set to minimize interference from other threads. As long as there
> >> are enough CPUs for your application in the psrset, you should see
> >> the number of migrations go down because there won't be any
> >> interference from other threads.
> >>
> >> To get a better understanding of the Solaris performance
> >> optimizations done for NUMA, you might want to check out the
> >> overview of Memory Placement Optimization (MPO) at:
> >>
> >>   http://opensolaris.org/os/community/performance/mpo_overview.pdf
> >>
> >> The stickiness that you observed is because of MPO. Binding to a
> >> processor set containing one CPU set the home lgroup of the thread
> >> to the lgroup containing that CPU, and destroying the psrset just
> >> left the thread homed there.
> >>
> >> Your shared memory is probably spread across the system already,
> >> because the default MPO memory allocation policy for shared memory
> >> is to allocate the memory from random lgroups across the system.
> >>
> >> We have some prototype observability tools which allow you to
> >> examine the lgroup hierarchy and its contents and observe and/or
> >> control how the threads and memory are placed among lgroups (see
> >> http://opensolaris.org/os/community/performance/numa/observability/).
> >> The source, binaries, and man pages are there.
> >>
> >>
> >> Jonathan
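(For anyone following along: assuming the prototypes at that URL are the lgrpinfo/plgrp/pmap additions, the query side looks roughly like the following. The pid is made up, and the exact option letters may differ slightly in the posted versions.)

    lgrpinfo -a          # lgroup hierarchy: CPUs and memory in each lgroup
    plgrp <pid>          # home lgroup of each thread in the process
    plgrp -a all <pid>   # the threads' affinities for every lgroup
    pmap -L <pid>        # which lgroup the memory behind each mapping lives in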
> >> David McDaniel (damcdani) wrote:
> >>
> >>> Very, very enlightening, Eric. It's really terrific to have this
> >>> kind of channel for dialog.
> >>> The "return to home base" behavior you describe is clearly
> >>> consistent with what I see and makes perfect sense.
> >>> Let me follow up with a question. In this application, processes
> >>> have not only their "own" memory, i.e. heap, stack, program text
> >>> and data, etc., but they also share a moderately large (~2-5GB
> >>> today) amount of memory in the form of mmap'd files. From Sherry
> >>> Moore's previous posts, I'm assuming that at startup time that
> >>> would actually all be allocated on one board. Since I'm
> >>> contemplating moving processes onto psrsets off that board, would
> >>> it be plausible to assume that I might get slightly better net
> >>> throughput if I could somehow spread that across all the boards?
> >>> I know it's speculation of the highest order, so maybe my real
> >>> question is whether that's even worth testing.
> >>> In any case, I'd love to turn the knob you mention, and I'll look
> >>> on the performance community page and see what kind of trouble I
> >>> can get into. If there are any particular items you think I should
> >>> check out, guidance is welcome.
> >>> Regards
> >>> -d
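(One low-effort way to test the "spread it across the boards" idea, assuming the madv.so.1 interposition library is available on this release; the binary name is just a stand-in:)

    # start the server with a hint that its memory will be accessed by
    # many threads/processes, so MPO spreads the pages across lgroups
    # instead of placing them next to whoever touches them first
    LD_PRELOAD=madv.so.1 MADV=access_many ./server_binary

The same hint can presumably be applied from code with madvise(3C) and MADV_ACCESS_MANY on just the mmap'd file segments, rather than the whole address space.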
> >>>> -----Original Message-----
> >>>> From: [EMAIL PROTECTED]
> >>>> [mailto:[EMAIL PROTECTED] On Behalf Of Eric C. Saxe
> >>>> Sent: Thursday, September 01, 2005 1:48 AM
> >>>> To: perf-discuss@opensolaris.org
> >>>> Subject: [perf-discuss] Re: Puzzling scheduler behavior
> >>>>
> >>>> Hi David,
> >>>>
> >>>> Since your v1280 system has NUMA characteristics, the bias that
> >>>> you see for one of the boards may be a result of the kernel trying
> >>>> to run your application's threads "close" to where they have
> >>>> allocated their memory. We also generally try to keep threads in
> >>>> the same process together, since they generally tend to work on
> >>>> the same data. This might explain why one of the boards is so much
> >>>> busier than the others.
> >>>>
> >>>> So yes, the interesting piece of this seems to be the higher than
> >>>> expected run queue wait time (latency) as seen via prstat -Lm.
> >>>> Even with the thread-to-board/memory affinity I mentioned above,
> >>>> it generally shouldn't be the case that threads are willing to
> >>>> hang out on a run queue waiting for a CPU in their "home" when
> >>>> that thread *could* actually run immediately on a "remote"
> >>>> (off-board) CPU. Better to run remote than not at all, or so the
> >>>> saying goes :)
> >>>>
> >>>> In the case where a thread is dispatched remotely because all home
> >>>> CPUs are busy, the thread will try to migrate back home the next
> >>>> time it comes through the dispatcher and finds it can run
> >>>> immediately at home (either because there's an idle CPU, or
> >>>> because one of the running threads is lower priority than us, and
> >>>> we can preempt it).
> >>>> This migrating around means that the thread will tend to spend
> >>>> more time waiting on run queues, since it has to either wait for
> >>>> the idle() thread to switch off, or for the lower priority thread
> >>>> it's able to preempt to surrender the CPU. Either way, the thread
> >>>> shouldn't have to wait long to get the CPU, but it will have to
> >>>> wait a non-zero amount of time.
> >>>>
> >>>> What does the prstat -Lm output look like exactly? Is it a lot of
> >>>> wait time, or just more than you would expect?
> >>>>
> >>>> By the way, just to be clear, when I say "board" what I should be
> >>>> saying is lgroup (or locality group). This is the Solaris
> >>>> abstraction for a set of CPU and memory resources that are close
> >>>> to one another.
> >>>> On your system, it turns out that the kernel creates an lgroup for
> >>>> each board, and each thread is given an affinity for one of the
> >>>> lgroups, such that it will try to run on the CPUs (and allocate
> >>>> memory) from that group of resources.
> >>>>
> >>>> One thing to look at here is whether or not the kernel could be
> >>>> "overloading" a given lgroup. This would result in threads tending
> >>>> to be less successful in getting CPU time (and/or memory) in their
> >>>> home. At least for CPU time, you can see this by looking at the
> >>>> number of migrations and where they are taking place.
> >>>> If a thread isn't having much luck running at home, this means
> >>>> that it (and others sharing its home) will tend to "ping-pong"
> >>>> between CPUs in and out of the home lgroup (we refer to this as
> >>>> the "king of the hill" pathology). In your mpstat output, I see
> >>>> many migrations on one of the boards, and a good many on the other
> >>>> boards as well, so that might well be happening here.
> >>>>
> >>>> To get some additional observability into this issue, you might
> >>>> want to take a look at some of the lgroup observability/control
> >>>> tools we posted (available from the performance community page).
> >>>> They allow you to do things like query/set your application's
> >>>> lgroup affinity, find out about the lgroups in the system and what
> >>>> resources they contain, etc. Using them you might be able to
> >>>> confirm some of my theory above.
> >>>> We would also *very* much like any feedback you (or anyone else)
> >>>> would be willing to provide on the tools.
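(On the "set" side of those tools, the kind of thing one could try looks roughly like the following; the pid, LWP ids, and lgroup ids are made up, and the prototype's option letters may differ:)

    # manually re-home two of the busiest LWPs onto different lgroups (boards)
    plgrp -H 2 <pid>/5
    plgrp -H 3 <pid>/9

    # or give an LWP a strong affinity for a particular lgroup, which
    # should also re-home it there
    plgrp -A 2/strong <pid>/5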
> >>>> In the short term, there's a tunable I can suggest you take a
> >>>> look at that deals with how hard the kernel tries to keep threads
> >>>> of the same process together in the same lgroup.
> >>>> Tuning this should result in your workload being spread out more
> >>>> effectively than it currently seems to be. I'll post a follow-up
> >>>> message tomorrow morning with these details, if you'd like to try
> >>>> this.
> >>>>
> >>>> In the medium-short term, we really need to implement a mechanism
> >>>> to dynamically change a thread's lgroup affinity when its home
> >>>> becomes overloaded. We presently don't have this, as the mechanism
> >>>> that determines a thread's home lgroup (and does the lgroup load
> >>>> balancing) is static in nature (done at thread creation time).
> >>>> (Implemented in usr/src/uts/common/os/lgrp.c:lgrp_choose() if
> >>>> you'd like to take a look at the source.) In terms of our NUMA/MPO
> >>>> projects, this one is at the top of the ol' TODO list.

_______________________________________________
perf-discuss mailing list
perf-discuss@opensolaris.org