Dave,
It sounds like you have an interesting application. You might want to
create a processor set, leave some CPUs outside the psrset for other
threads to run on, and run your application inside the set to minimize
interference. As long as the psrset has enough CPUs for your
application, you should see the number of migrations go down, because
other threads won't be competing for those CPUs.
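If you would rather set the psrset up from inside the application
instead of with psrset(1M), something along these lines should work.
This is just a sketch; error handling is minimal and the CPU IDs are
placeholders for whatever psrinfo shows on your box:

    /* Sketch: create a processor set, assign some CPUs to it, and bind
     * this process (and its LWPs) to the set.  CPU IDs are placeholders.
     */
    #include <sys/types.h>
    #include <sys/processor.h>
    #include <sys/procset.h>
    #include <sys/pset.h>
    #include <unistd.h>
    #include <stdio.h>

    int
    main(void)
    {
        psetid_t pset;
        processorid_t cpus[] = { 4, 5, 6, 7 };  /* placeholder CPU IDs */
        int i;

        if (pset_create(&pset) != 0) {
            perror("pset_create");
            return (1);
        }
        for (i = 0; i < 4; i++) {
            if (pset_assign(pset, cpus[i], NULL) != 0)
                perror("pset_assign");
        }
        /* Bind the calling process (and all of its LWPs) to the set. */
        if (pset_bind(pset, P_PID, getpid(), NULL) != 0) {
            perror("pset_bind");
            return (1);
        }
        (void) printf("bound pid %ld to pset %d\n",
            (long)getpid(), (int)pset);
        return (0);
    }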
To get a better understanding of the Solaris performance optimizations
done for NUMA, you might want to check out the overview of Memory
Placement Optimization (MPO) at:
http://opensolaris.org/os/community/performance/mpo_overview.pdf
The stickiness that you observed is due to MPO. Binding to a
processor set containing one CPU set the thread's home lgroup to
the lgroup containing that CPU, and destroying the psrset just left the
thread homed there.
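If you want to see this for yourself, liblgrp can report a process's
(or LWP's) home lgroup. A minimal sketch (link with -llgrp):

    /* Sketch: print the home lgroup of the calling process. */
    #include <sys/lgrp_user.h>
    #include <sys/procset.h>
    #include <unistd.h>
    #include <stdio.h>

    int
    main(void)
    {
        lgrp_id_t home = lgrp_home(P_PID, getpid());

        if (home == -1) {
            perror("lgrp_home");
            return (1);
        }
        (void) printf("pid %ld home lgroup: %d\n",
            (long)getpid(), (int)home);
        return (0);
    }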
Your shared memory is probably spread across the system already,
because the default MPO memory allocation policy for shared memory is
to allocate it from random lgroups across the system.
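If you ever want to be explicit about the policy for a given mapping,
madvise(3C) accepts the MADV_ACCESS_* advice values. A rough sketch
(the file name is made up, and most error handling is omitted):

    /* Sketch: map a shared file and advise the kernel that it will be
     * accessed by many LWPs, so backing memory should be spread
     * across lgroups.  The path is a placeholder.
     */
    #include <sys/types.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <stdio.h>

    int
    main(void)
    {
        int fd = open("/data/shared.dat", O_RDWR);  /* placeholder path */
        struct stat st;
        void *addr;

        if (fd == -1 || fstat(fd, &st) == -1) {
            perror("open/fstat");
            return (1);
        }
        addr = mmap(NULL, (size_t)st.st_size, PROT_READ | PROT_WRITE,
            MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED) {
            perror("mmap");
            return (1);
        }
        /* Ask for the "accessed by many threads" placement policy. */
        if (madvise((caddr_t)addr, (size_t)st.st_size,
            MADV_ACCESS_MANY) != 0)
            perror("madvise(MADV_ACCESS_MANY)");
        /* ... use the mapping ... */
        return (0);
    }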
We have some prototype observability tools which allow you to examine
the lgroup hierarchy and its contents and to observe and/or control how
threads and memory are placed among lgroups (see
http://opensolaris.org/os/community/performance/numa/observability/).
The source, binaries, and man pages are there.
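If you would rather check placement programmatically, meminfo(2) can
report which lgroup backs each page of a mapping. A small sketch (the
anonymous segment here just stands in for your mmap'd files):

    /* Sketch: for each page of a mapping, print the lgroup that holds
     * the backing physical memory (MEMINFO_VLGRP).
     */
    #include <sys/types.h>
    #include <sys/mman.h>
    #include <inttypes.h>
    #include <string.h>
    #include <unistd.h>
    #include <stdio.h>

    int
    main(void)
    {
        size_t pagesize = (size_t)sysconf(_SC_PAGESIZE);
        size_t len = 8 * pagesize;
        uint_t info = MEMINFO_VLGRP;
        char *addr;
        size_t off;

        /* Stand-in for your mmap'd file: a small anonymous segment. */
        addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
            MAP_SHARED | MAP_ANON, -1, 0);
        if (addr == MAP_FAILED) {
            perror("mmap");
            return (1);
        }
        (void) memset(addr, 0, len);  /* touch pages so they get allocated */

        for (off = 0; off < len; off += pagesize) {
            uint64_t in = (uint64_t)(uintptr_t)(addr + off);
            uint64_t out;
            uint_t valid;

            if (meminfo(&in, 1, &info, 1, &out, &valid) != 0) {
                perror("meminfo");
                return (1);
            }
            if (valid & 0x2)  /* bit 1 => info_req[0] was satisfied */
                (void) printf("page +%lu -> lgroup %llu\n",
                    (unsigned long)off, (unsigned long long)out);
        }
        return (0);
    }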
Jonathan
David McDaniel (damcdani) wrote:
Very, very enlightening, Eric. It's really terrific to have this kind
of channel for dialog.
The "return to home base" behavior you describe is clearly consistent
with what I see and makes perfect sense.
Let me follow up with a question. In this application, processes have
not only their "own" memory, i.e., heap, stack, program text and data,
but they also share a moderately large (~2-5GB today) amount of memory
in the form of mmap'd files. From Sherry Moore's previous posts, I'm
assuming that at startup time that would all be allocated on one board.
Since I'm contemplating moving processes onto psrsets off that board,
would it be plausible to assume that I might get slightly better net
throughput if I could somehow spread that memory across all the boards?
I know it's speculation of the highest order, so maybe my real question
is whether that's even worth testing.
In any case, I'd love to turn the knob you mention and I'll look on
the performance community page and see what kind of trouble I can get
into. If there are any particular items you think I should check out,
guidance is welcome.
Regards
-d
-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of
Eric C. Saxe
Sent: Thursday, September 01, 2005 1:48 AM
To: perf-discuss@opensolaris.org
Subject: [perf-discuss] Re: Puzzling scheduler behavior
Hi David,
Since your v1280 system has NUMA characteristics, the bias
that you see toward one of the boards may be a result of the
kernel trying to run your application's threads "close" to
where they have allocated their memory. We also try to keep
threads in the same process together, since they generally
tend to work on the same data. This might explain why one of
the boards is so much busier than the others.
So yes, the interesting piece of this seems to be the higher
than expected run queue wait time (latency) seen via
prstat -Lm. Even with the thread-to-board/memory affinity I
mentioned above, a thread generally shouldn't hang out on a
run queue waiting for a CPU in its "home" lgroup when it
*could* run immediately on a "remote" (off-board) CPU.
Better to run remote than not at all, or so the saying goes :)
In the case where a thread is dispatched remotely because all
home CPUs are busy, the thread will try to migrate back home
the next time it comes through the dispatcher and finds it
can run immediately at home (either because there's an idle
CPU, or because one of the running threads is lower priority
than us, and we can preempt it). This migrating around means
that the thread will tend to spend more time waiting on run
queues, since it has to either wait for the idle() thread to
switch off, or for the lower priority thread it's able to
preempt to surrender the CPU. Either way, the thread
shouldn't have to wait long to get the CPU, but it will have
to wait a non-zero amount of time.
What does the prstat -Lm output look like exactly? Is it a
lot of wait time, or just more than you would expect?
By the way, just to be clear, when I say "board" what I
should be saying is lgroup (or locality group). This is the
Solaris abstraction for a set of CPU and memory resources
that are close to one another. On your system, it turns out
that the kernel creates an lgroup for each board, and each
thread is given an affinity for one of the lgroups, such that
it will try to run on the CPUs of (and allocate memory from)
that group of resources.
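If you're curious what the hierarchy looks like on your v1280,
liblgrp will show you each lgroup's CPUs and memory. A rough sketch
(link with -llgrp; the array sizes are just guesses big enough for
your box):

    /* Sketch: walk the lgroup hierarchy, printing the CPUs and
     * installed memory directly contained in each lgroup.
     */
    #include <sys/lgrp_user.h>
    #include <sys/types.h>
    #include <stdio.h>

    static void
    walk(lgrp_cookie_t c, lgrp_id_t lgrp, int depth)
    {
        processorid_t cpus[64];  /* plenty for a v1280 */
        lgrp_id_t kids[64];
        lgrp_mem_size_t sz;
        int ncpus, nkids, i;

        ncpus = lgrp_cpus(c, lgrp, cpus, 64, LGRP_CONTENT_DIRECT);
        sz = lgrp_mem_size(c, lgrp, LGRP_MEM_SZ_INSTALLED,
            LGRP_CONTENT_DIRECT);

        (void) printf("%*slgroup %d: %d CPU(s), %lld MB installed\n",
            depth * 2, "", (int)lgrp, ncpus > 0 ? ncpus : 0,
            (long long)(sz / (1024 * 1024)));

        nkids = lgrp_children(c, lgrp, kids, 64);
        for (i = 0; i < nkids; i++)
            walk(c, kids[i], depth + 1);
    }

    int
    main(void)
    {
        lgrp_cookie_t c = lgrp_init(LGRP_VIEW_OS);

        if (c == LGRP_COOKIE_NONE) {
            perror("lgrp_init");
            return (1);
        }
        walk(c, lgrp_root(c), 0);
        (void) lgrp_fini(c);
        return (0);
    }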
One thing to look at here is whether or not the kernel could
be "overloading" a given lgroup. This would result in threads
tending to be less successful in getting CPU time (and/or
memory) in their home. At least for CPU time, you can see
this by looking at the number of migrations and where they
are taking place. If a thread isn't having much luck running
at home, it (and others sharing its home) will tend to
"ping-pong" between CPUs in and out of the home lgroup (we
refer to this as the "king of the hill" pathology). In your
mpstat output, I see many migrations on one of the boards,
and a good many on the other boards as well, so that might
well be happening here.
To get some additional observability into this issue, you
might want to take a look at the lgroup observability/control
tools we posted (available from the performance community
page). They allow you to do things like query/set your
application's lgroup affinities, find out about the lgroups
in the system and what resources they contain, etc. Using
them you might be able to confirm some of my theory above. We
would also *very* much like any feedback you (or anyone else)
would be willing to provide on the tools.
In the short term, there's a tunable I can suggest you look
at that controls how hard the kernel tries to keep threads of
the same process together in the same lgroup. Tuning it
should result in your workload being spread out more
effectively than it currently seems to be. I'll post a
follow-up message tomorrow morning with the details, if you'd
like to try this.
In the medium-short term, we really need to implement a
mechanism to dynamically change a thread's lgroup affinity
when its home becomes overloaded. We presently don't have
this, as the mechanism that determines a thread's home lgroup
(and does the lgroup load balancing) is static in nature
(done at thread creation time). (It's implemented in
usr/src/uts/common/os/lgrp.c:lgrp_choose() if you'd like to
take a look at the source.) In terms of our NUMA/MPO projects,
this one is at the top of the ol' TODO list.
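In the meantime, if you want to experiment with re-homing a process
yourself, liblgrp lets you set a strong affinity for another lgroup
from userland, which should make that lgroup the home. A rough sketch
(the target lgroup ID is just a placeholder; link with -llgrp):

    /* Sketch: give the calling process (all of its LWPs) a strong
     * affinity for a chosen lgroup, then print its new home lgroup.
     */
    #include <sys/lgrp_user.h>
    #include <sys/procset.h>
    #include <unistd.h>
    #include <stdio.h>

    int
    main(void)
    {
        lgrp_id_t target = 2;  /* placeholder: pick a less loaded lgroup */

        if (lgrp_affinity_set(P_PID, getpid(), target,
            LGRP_AFF_STRONG) != 0) {
            perror("lgrp_affinity_set");
            return (1);
        }
        (void) printf("home lgroup is now %d\n",
            (int)lgrp_home(P_PID, getpid()));
        return (0);
    }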
This message posted from opensolaris.org
_______________________________________________
perf-discuss mailing list
perf-discuss@opensolaris.org