Thanks for the feedback, Jonathan. I've got it on my todo list to get
those tools and go spelunking a bit. I can't really say that we have a
performance problem; it's more along the lines of me trying to use the
greatly improved observability tools in Solaris to get a better
understanding of things. In any case, it's pretty much relegated to a
science project right now because we can't ship anything that's not
part of some "official" distribution.
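As a first bit of spelunking, the thing I plan to script is simply
asking where our threads are homed. Below is a minimal sketch against
the liblgrp interfaces from the observability page; I haven't built it
against the prototype bits yet, so treat the details as my reading of
the man pages rather than tested code (compile with -llgrp):

    /*
     * Sketch: print the home lgroup of the calling thread using
     * lgrp_home(3LGRP).  P_LWPID/P_MYID name the calling LWP.
     */
    #include <stdio.h>
    #include <sys/lgrp_user.h>
    #include <sys/procset.h>

    int
    main(void)
    {
            lgrp_id_t home = lgrp_home(P_LWPID, P_MYID);

            if (home == -1) {
                    perror("lgrp_home");
                    return (1);
            }
            (void) printf("home lgroup: %d\n", (int)home);
            return (0);
    }

plgrp(1) from the tools page reports the same thing without writing
any code, of course; this is just for wiring into our own harness.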
> -----Original Message-----
> From: jonathan chew [mailto:[EMAIL PROTECTED]
> Sent: Friday, September 09, 2005 6:08 PM
> To: David McDaniel (damcdani)
> Cc: Eric C. Saxe; perf-discuss@opensolaris.org
> Subject: Re: [perf-discuss] Re: Puzzling scheduler behavior
>
> Dave,
>
> Sorry, I forgot to reply to this sooner. Yes, I was just curious what
> else was running to see whether we would expect your application to
> be perturbed much.
>
> There could be a load imbalance due to the daemons throwing
> everything off once in a while. This could be affecting how the
> threads in your application are distributed across the nodes in your
> NUMA machine.
>
> Each thread is assigned a home locality group (lgroup) upon creation,
> and by default the kernel will tend to run it on CPUs in its home
> lgroup and allocate its memory there to minimize latency and maximize
> performance. There is an lgroup corresponding to each of the nodes
> (boards) in your NUMA machine. The assignment of threads to lgroups
> is based on lgroup load averages, so other threads may cause the
> lgroup load average to go up or down and thus affect how threads are
> placed among lgroups.
>
> You can use plgrp(1), which is available on our NUMA observability
> web page at
> http://opensolaris.org/os/community/performance/numa/observability
> to see where your application's processes/threads are homed. Then we
> can see whether they are distributed very well. You can also use
> plgrp(1) to change the home lgroup of a thread, but you should be
> careful because there can be side effects, as explained in the
> example referred to below.
>
> There are man pages, source, and binaries for our tools on the web
> page. I wrote up a good example of how to use the tools to
> understand, observe, and affect thread and memory placement among
> lgroups on a NUMA machine and posted it on the web page at
> http://opensolaris.org/os/community/performance/example.txt.
>
> You can also try using the lgrp_expand_proc_thresh tunable that Eric
> suggested last week.
>
> Are the migrations that you are seeing when not running a psrset
> causing a performance problem for your application?
>
>
> Jonathan
>
>
> David McDaniel (damcdani) wrote:
>
> > When using psrsets, the migrations and involuntary context switches
> > go essentially to zero. As far as "other stuff", I'm not quite sure
> > what you mean, but this application runs on a dedicated server, so
> > there is nothing of a casual nature; however, there is a lot of
> > what I'll glom into the category of "support" tasks, i.e. ntp
> > daemons, nscd flushing caches, fsflush running around backing up
> > pages, etc. Was that what you meant?
> >
> >> -----Original Message-----
> >> From: jonathan chew [mailto:[EMAIL PROTECTED]
> >> Sent: Thursday, September 01, 2005 12:45 PM
> >> To: David McDaniel (damcdani)
> >> Cc: Eric C. Saxe; perf-discuss@opensolaris.org
> >> Subject: Re: [perf-discuss] Re: Puzzling scheduler behavior
> >>
> >> David McDaniel (damcdani) wrote:
> >>
> >>> Thanks, Jonathan, for the good insights. I'll be digging into the
> >>> references you mentioned. Yes, at the end of the day I'm sure
> >>> binding to processor sets is part of the plan; having already
> >>> done so in a rather rote way, I can demonstrate a very dramatic
> >>> reduction in apparent cpu utilization, on the order of 25-30%.
> >>> But before I commit engineers to casting something in stone, I
> >>> want to make sure I understand the defaults and the side effects
> >>> of doing so, since it potentially results in defeating other
> >>> improvements that Sun has done or will be doing.
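(Jumping in on my own words above: the "rather rote way" was psrset(1M)
by hand, essentially psrset -c followed by psrset -b. If we do commit
to it, the programmatic equivalent is roughly the sketch below. The
CPU ids are placeholders for whatever lives on the target board, it
needs the usual privileges, and error handling is abbreviated:)

    /*
     * Sketch: create a processor set, move two (placeholder) CPUs
     * into it, and bind the calling process to it -- what
     * psrset -c / psrset -b do from the shell.
     */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/processor.h>
    #include <sys/pset.h>
    #include <sys/procset.h>

    int
    main(void)
    {
            psetid_t pset;
            processorid_t cpus[] = { 0, 1 };    /* placeholders */
            int i;

            if (pset_create(&pset) != 0) {
                    perror("pset_create");
                    return (1);
            }
            for (i = 0; i < 2; i++) {
                    if (pset_assign(pset, cpus[i], NULL) != 0)
                            perror("pset_assign");
            }
            /* Bind this process (all of its LWPs) to the new set. */
            if (pset_bind(pset, P_PID, getpid(), NULL) != 0) {
                    perror("pset_bind");
                    return (1);
            }
            (void) printf("pid %ld bound to pset %d\n",
                (long)getpid(), (int)pset);
            return (0);
    }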
> >>
> >> Sure. No problem. The overview and man pages for our tools are
> >> pretty short. The tools are very easy to use and kind of fun to
> >> play with. I'm going to try to post a good example of how to use
> >> them later today.
> >>
> >> I think that using a psrset is an interesting experiment to see
> >> whether interference is a big factor in all the migrations. It
> >> would be nice not to have to do that by default, though.
> >>
> >> It sounds like you already tried this experiment, though, and
> >> noticed a big difference. Did the migrations drop dramatically?
> >> What else is running on the system when you don't use a psrset?
> >>
> >>
> >> Jonathan
> >>
> >>
> >>>> -----Original Message-----
> >>>> From: jonathan chew [mailto:[EMAIL PROTECTED]
> >>>> Sent: Thursday, September 01, 2005 11:50 AM
> >>>> To: David McDaniel (damcdani)
> >>>> Cc: Eric C. Saxe; perf-discuss@opensolaris.org
> >>>> Subject: Re: [perf-discuss] Re: Puzzling scheduler behavior
> >>>>
> >>>> Dave,
> >>>>
> >>>> It sounds like you have an interesting application. You might
> >>>> want to create a processor set, leave some CPUs outside the
> >>>> psrset for other threads to run on, and run your application in
> >>>> the processor set to minimize interference from other threads.
> >>>> As long as there are enough CPUs for your application in the
> >>>> psrset, you should see the number of migrations go down because
> >>>> there won't be any interference from other threads.
> >>>>
> >>>> To get a better understanding of the Solaris performance
> >>>> optimizations done for NUMA, you might want to check out the
> >>>> overview of Memory Placement Optimization (MPO) at:
> >>>> http://opensolaris.org/os/community/performance/mpo_overview.pdf
> >>>>
> >>>> The stickiness that you observed is because of MPO. Binding to a
> >>>> processor set containing one CPU set the home lgroup of the
> >>>> thread to the lgroup containing that CPU, and destroying the
> >>>> psrset just left the thread homed there.
> >>>>
> >>>> Your shared memory is probably spread across the system already,
> >>>> because the default MPO memory allocation policy for shared
> >>>> memory is to allocate the memory from random lgroups across the
> >>>> system.
> >>>>
> >>>> We have some prototype observability tools which allow you to
> >>>> examine the lgroup hierarchy and its contents and observe and/or
> >>>> control how the threads and memory are placed among lgroups (see
> >>>> http://opensolaris.org/os/community/performance/numa/observability/).
> >>>> The source, binaries, and man pages are there.
> >>>>
> >>>>
> >>>> Jonathan
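(The hierarchy examination Jonathan mentions above is the part I most
want to script, so here is my guess at a minimal walker over the
lgroup hierarchy, derived from the liblgrp man pages on the tools page
and untested on my side; compile with -llgrp:)

    /*
     * Sketch: recursively walk the lgroup hierarchy, printing the
     * CPUs directly contained in each lgroup.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/processor.h>
    #include <sys/lgrp_user.h>

    static void
    walk(lgrp_cookie_t c, lgrp_id_t id)
    {
            int n, i;

            /* A call with NULL/0 just returns the count. */
            n = lgrp_cpus(c, id, NULL, 0, LGRP_CONTENT_DIRECT);
            if (n > 0) {
                    processorid_t *cpus = malloc(n * sizeof (*cpus));
                    (void) lgrp_cpus(c, id, cpus, n,
                        LGRP_CONTENT_DIRECT);
                    (void) printf("lgroup %d:", (int)id);
                    for (i = 0; i < n; i++)
                            (void) printf(" cpu%d", (int)cpus[i]);
                    (void) printf("\n");
                    free(cpus);
            }
            n = lgrp_children(c, id, NULL, 0);
            if (n > 0) {
                    lgrp_id_t *kids = malloc(n * sizeof (*kids));
                    (void) lgrp_children(c, id, kids, n);
                    for (i = 0; i < n; i++)
                            walk(c, kids[i]);
                    free(kids);
            }
    }

    int
    main(void)
    {
            lgrp_cookie_t c = lgrp_init(LGRP_VIEW_OS);

            if (c == LGRP_COOKIE_NONE) {
                    perror("lgrp_init");
                    return (1);
            }
            walk(c, lgrp_root(c));
            (void) lgrp_fini(c);
            return (0);
    }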
> >>>>>The "return to home base" behavior you describe is clearly > >>>>> > >>>>> > >>>>> > >>>>> > >>>>consistent > >>>> > >>>> > >>>> > >>>> > >>>>>with what I see and makes perfect sense. > >>>>>Let me followup with a question. In this application, > >>>>> > >>>>> > >>>>> > >>>>> > >>>>processes have > >>>> > >>>> > >>>> > >>>> > >>>>>not only their "own" memory, ie heap, stack program text and > >>>>> > >>>>> > >>>>> > >>>>> > >>>>data, etc, > >>>> > >>>> > >>>> > >>>> > >>>>>but they also share a moderately large (~ 2-5GB today) > >>>>> > >>>>> > >>>>> > >>>>> > >>>>amount of memory > >>>> > >>>> > >>>> > >>>> > >>>>>in the form of mmap'd files. From Sherry Moore's previous > >>>>> > >>>>> > >>posts, I'm > >> > >> > >>>>>assuming that at startup time that would actually be all > >>>>> > >>>>> > >>>>> > >>>>> > >>>>allocated in > >>>> > >>>> > >>>> > >>>> > >>>>>one board. Since I'm contemplating moving processes onto > >>>>> > >>>>> > >>psrsets off > >> > >> > >>>>>that board, would it be plausible to assume that I might get > >>>>> > >>>>> > >>>>> > >>>>> > >>>>slightly > >>>> > >>>> > >>>> > >>>> > >>>>>better net throughput if I could somehow spread that > >>>>> > >>>>> > >>across all the > >> > >> > >>>>>boards? I know its speculation of the highest order, so > >>>>> > >>>>> > >>>>> > >>>>> > >>>>maybe my real > >>>> > >>>> > >>>> > >>>> > >>>>>question is whether that's even worth testing. > >>>>>In any case, I'd love to turn the knob you mention and > >>>>> > >>>>> > >>>>> > >>>>> > >>>>I'll look on > >>>> > >>>> > >>>> > >>>> > >>>>>the performance community page and see what kind of trouble > >>>>> > >>>>> > >>>>> > >>>>> > >>>>I can get > >>>> > >>>> > >>>> > >>>> > >>>>>into. If there are any particular items you think I should > >>>>> > >>>>> > >>>>> > >>>>> > >>>>check out, > >>>> > >>>> > >>>> > >>>> > >>>>>guidance is welcome. > >>>>>Regards > >>>>>-d > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>>>-----Original Message----- > >>>>>>From: [EMAIL PROTECTED] > >>>>>>[mailto:[EMAIL PROTECTED] On Behalf > >>>>>> > >>>>>> > >>Of Eric C. > >> > >> > >>>>>>Saxe > >>>>>>Sent: Thursday, September 01, 2005 1:48 AM > >>>>>>To: perf-discuss@opensolaris.org > >>>>>>Subject: [perf-discuss] Re: Puzzling scheduler behavior > >>>>>> > >>>>>>Hi David, > >>>>>> > >>>>>>Since your v1280 systems has NUMA characteristics, the bias > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>that you > >>>> > >>>> > >>>> > >>>> > >>>>>>see for one of the boards may be a result of the kernel > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>trying to run > >>>> > >>>> > >>>> > >>>> > >>>>>>your application's threads "close" to where they have > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>allocated their > >>>> > >>>> > >>>> > >>>> > >>>>>>memory. We also generally try to keep threads in the > same process > >>>>>>together, since they generally tend to work on the same > >>>>>> > >>>>>> > >>data. This > >> > >> > >>>>>>might explain why one of the boards is so much busier than > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>the others. > >>>> > >>>> > >>>> > >>>> > >>>>>>So yes, the interesting piece of this seems to be the > higher than > >>>>>>expected run queue wait time (latency) as seen via prstat > >>>>>> > >>>>>> > >>-Lm. 
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: [EMAIL PROTECTED]
> >>>>>> [mailto:[EMAIL PROTECTED] On Behalf Of Eric C. Saxe
> >>>>>> Sent: Thursday, September 01, 2005 1:48 AM
> >>>>>> To: perf-discuss@opensolaris.org
> >>>>>> Subject: [perf-discuss] Re: Puzzling scheduler behavior
> >>>>>>
> >>>>>> Hi David,
> >>>>>>
> >>>>>> Since your v1280 system has NUMA characteristics, the bias
> >>>>>> that you see for one of the boards may be a result of the
> >>>>>> kernel trying to run your application's threads "close" to
> >>>>>> where they have allocated their memory. We also generally try
> >>>>>> to keep threads in the same process together, since they
> >>>>>> generally tend to work on the same data. This might explain
> >>>>>> why one of the boards is so much busier than the others.
> >>>>>>
> >>>>>> So yes, the interesting piece of this seems to be the higher
> >>>>>> than expected run queue wait time (latency) as seen via
> >>>>>> prstat -Lm. Even with the thread-to-board/memory affinity I
> >>>>>> mentioned above, it generally shouldn't be the case that
> >>>>>> threads are willing to hang out on a run queue waiting for a
> >>>>>> CPU in their "home" when that thread *could* actually run
> >>>>>> immediately on a "remote" (off-board) CPU. Better to run
> >>>>>> remote than not at all, or at least so the saying goes :)
> >>>>>>
> >>>>>> In the case where a thread is dispatched remotely because all
> >>>>>> home CPUs are busy, the thread will try to migrate back home
> >>>>>> the next time it comes through the dispatcher and finds it can
> >>>>>> run immediately at home (either because there's an idle CPU,
> >>>>>> or because one of the running threads is lower priority than
> >>>>>> us, and we can preempt it). This migrating around means that
> >>>>>> the thread will tend to spend more time waiting on run queues,
> >>>>>> since it has to either wait for the idle() thread to switch
> >>>>>> off, or for the lower priority thread it's able to preempt to
> >>>>>> surrender the CPU. Either way, the thread shouldn't have to
> >>>>>> wait long to get the CPU, but it will have to wait a non-zero
> >>>>>> amount of time.
> >>>>>>
> >>>>>> What does the prstat -Lm output look like exactly? Is it a lot
> >>>>>> of wait time, or just more than you would expect?
> >>>>>>
> >>>>>> By the way, just to be clear, when I say "board" what I should
> >>>>>> be saying is lgroup (or locality group). This is the Solaris
> >>>>>> abstraction for a set of CPU and memory resources that are
> >>>>>> close to one another. On your system, it turns out that the
> >>>>>> kernel creates an lgroup for each board, and each thread is
> >>>>>> given an affinity for one of the lgroups, such that it will
> >>>>>> try to run on the CPUs (and allocate memory) from that group
> >>>>>> of resources.
> >>>>>>
> >>>>>> One thing to look at here is whether or not the kernel could
> >>>>>> be "overloading" a given lgroup. This would result in threads
> >>>>>> tending to be less successful in getting CPU time (and/or
> >>>>>> memory) in their home. At least for CPU time, you can see this
> >>>>>> by looking at the number of migrations and where they are
> >>>>>> taking place. If the thread isn't having much luck running at
> >>>>>> home, this means that it (and others sharing its home) will
> >>>>>> tend to "ping-pong" between CPUs in and out of the home lgroup
> >>>>>> (we refer to this as the "king of the hill" pathology). In
> >>>>>> your mpstat output, I see many migrations on one of the
> >>>>>> boards, and a good many on the other boards as well, so that
> >>>>>> might well be happening here.
> >>>>>>
> >>>>>> To get some additional observability into this issue, you
> >>>>>> might want to take a look at some of our lgroup
> >>>>>> observability/control tools we posted (available from the
> >>>>>> performance community page). They allow you to do things like
> >>>>>> query/set your application's lgroup affinity, find out about
> >>>>>> the lgroups in the system and what resources they contain,
> >>>>>> etc. Using them you might be able to confirm some of my theory
> >>>>>> above. We would also *very* much like any feedback you (or
> >>>>>> anyone else) would be willing to provide on the tools.
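(The "set your application's lgroup affinity" knob is the one I'm most
tempted by. Below is my reading of the call that plgrp(1) presumably
uses when it rehomes a thread; the target lgroup id is a placeholder,
and per Jonathan's caution above, rehoming a thread has side effects:)

    /*
     * Sketch: give the calling LWP a strong affinity for one
     * (placeholder) lgroup, which rehomes it there.  Compile with
     * -llgrp.
     */
    #include <stdio.h>
    #include <sys/lgrp_user.h>
    #include <sys/procset.h>

    int
    main(void)
    {
            lgrp_id_t target = 1;   /* placeholder lgroup id */

            /* LGRP_AFF_WEAK would be the advisory variant. */
            if (lgrp_affinity_set(P_LWPID, P_MYID, target,
                LGRP_AFF_STRONG) != 0) {
                    perror("lgrp_affinity_set");
                    return (1);
            }
            (void) printf("home is now lgroup %d\n",
                (int)lgrp_home(P_LWPID, P_MYID));
            return (0);
    }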
> >>>>>> In the short term, there's a tunable I can suggest you take a
> >>>>>> look at that deals with how hard the kernel tries to keep
> >>>>>> threads of the same process together in the same lgroup.
> >>>>>> Tuning this should result in your workload being spread out
> >>>>>> more effectively than it currently seems to be. I'll post a
> >>>>>> follow-up message tomorrow morning with these details, if
> >>>>>> you'd like to try this.
> >>>>>>
> >>>>>> In the medium-short term, we really need to implement a
> >>>>>> mechanism to dynamically change a thread's lgroup affinity
> >>>>>> when its home becomes overloaded. We presently don't have
> >>>>>> this, as the mechanism that determines a thread's home lgroup
> >>>>>> (and does the lgroup load balancing) is static in nature (done
> >>>>>> at thread creation time). (Implemented in
> >>>>>> usr/src/uts/common/os/lgrp.c:lgrp_choose() if you'd like to
> >>>>>> take a look at the source.) In terms of our NUMA/MPO projects,
> >>>>>> this one is at the top of the ol' TODO list.

_______________________________________________
perf-discuss mailing list
perf-discuss@opensolaris.org