David McDaniel (damcdani) wrote:

 Thanks, Jonathan, for the good insights. I'll be digging into the
references you mentioned. Yes, at the end of the day I'm sure binding to
processor sets is part of the plan; having already done so in a rather
rote way, I can demonstrate a very dramatic reduction in apparent CPU
utilization, on the order of 25-30%. But before I commit engineers to
casting something in stone, I want to make sure I understand the defaults
and the side effects of doing so, since it could potentially defeat
other improvements that Sun has made or will be making.

Sure. No problem. The overview and man pages for our tools are pretty short. The tools are very easy to use and kind of fun to play with. I'm going to try to post a good example of how to use them later today.

I think that using a psrset is an interesting experiment to see whether interference is a big factor in all the migrations. It would be nice not to have to do that by default though.

It sounds like you already tried this experiment though and noticed a big difference. Did the migrations drop dramatically? What else is running on the system when you don't use a psrset?


Jonathan

-----Original Message-----
From: jonathan chew [mailto:[EMAIL PROTECTED]]
Sent: Thursday, September 01, 2005 11:50 AM
To: David McDaniel (damcdani)
Cc: Eric C. Saxe; perf-discuss@opensolaris.org
Subject: Re: [perf-discuss] Re: Puzzling scheduler behavior

Dave,

It sounds like you have an interesting application. You might want to create a processor set, leave some CPUs outside the psrset for other threads to run on, and run your application in a processor set to minimize interference from other threads. As long as there are enough CPUs for your application in the psrset, you should see the number of migrations go down because there won't be any interference from other threads.
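
In case it helps, here is roughly what that looks like if you want to do it programmatically rather than with psrset(1M). This is only a sketch: the CPU IDs are placeholders, and the calls need sufficient privileges.

    /*
     * Rough sketch: carve four CPUs into a processor set and bind a
     * process to it.  The CPU IDs (8-11) are placeholders -- pick CPUs
     * that all live on one board/lgroup.  Compile: cc -o mkpset mkpset.c
     */
    #include <sys/pset.h>
    #include <sys/processor.h>
    #include <sys/procset.h>
    #include <stdio.h>
    #include <stdlib.h>

    int
    main(int argc, char **argv)
    {
        psetid_t pset;
        processorid_t cpus[] = { 8, 9, 10, 11 };   /* placeholder CPU IDs */
        int i;

        if (argc != 2) {
            (void) fprintf(stderr, "usage: mkpset pid\n");
            return (1);
        }
        if (pset_create(&pset) != 0) {             /* needs privileges */
            perror("pset_create");
            return (1);
        }
        for (i = 0; i < 4; i++) {
            if (pset_assign(pset, cpus[i], NULL) != 0)
                perror("pset_assign");
        }
        /* Bind the whole process (all of its LWPs) to the new set. */
        if (pset_bind(pset, P_PID, (id_t)atoi(argv[1]), NULL) != 0) {
            perror("pset_bind");
            return (1);
        }
        (void) printf("bound pid %s to pset %d\n", argv[1], (int)pset);
        return (0);
    }

From the shell, psrset -c, psrset -a, and psrset -b do the same create/assign/bind steps.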

To get a better understanding of the Solaris performance optimizations done for NUMA, you might want to check out the overview of Memory Placement Optimization (MPO) at:

   http://opensolaris.org/os/community/performance/mpo_overview.pdf

The stickiness that you observed is because of MPO. Binding to a processor set containing one CPU set the thread's home lgroup to the lgroup containing that CPU, and destroying the psrset just left the thread homed there.
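
If you want to watch that happen, a thread can query its own home lgroup through liblgrp. A minimal sketch (assuming liblgrp is installed; link with -llgrp):

    /*
     * Sketch: print the calling thread's home lgroup.
     * Compile with: cc -o home home.c -llgrp
     */
    #include <sys/lgrp_user.h>
    #include <sys/procset.h>
    #include <stdio.h>

    int
    main(void)
    {
        /* P_LWPID + P_MYID means "the thread making this call". */
        lgrp_id_t home = lgrp_home(P_LWPID, P_MYID);

        if (home == -1) {
            perror("lgrp_home");
            return (1);
        }
        (void) printf("this thread is homed in lgroup %d\n", (int)home);
        return (0);
    }

Running that kind of query (or the equivalent from our tools) before binding, while bound, and after destroying the psrset should show the home move to the bound CPU's lgroup and then stay there.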

Your shared memory is probably spread across the system already because the default MPO memory allocation policy for shared memory is to allocate the memory from random lgroups across the system.
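
If you do end up wanting to influence that placement explicitly, madvise(3C) with the MADV_ACCESS_* flags is the interface MPO honors. Here is a rough sketch; the path and mapping details are made up, and the advice mainly affects pages that haven't been allocated/touched yet:

    /*
     * Sketch: map a shared file and advise the kernel that many threads
     * all over the machine will touch it, so its pages should not all be
     * placed near one lgroup.  "/data/shared.dat" is a placeholder path.
     */
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <stdio.h>

    int
    main(void)
    {
        int fd = open("/data/shared.dat", O_RDWR);   /* placeholder path */
        struct stat st;
        caddr_t addr;

        if (fd == -1 || fstat(fd, &st) != 0) {
            perror("open/fstat");
            return (1);
        }
        addr = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
            MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED) {
            perror("mmap");
            return (1);
        }
        /* Ask for the "spread it around" placement policy. */
        if (madvise(addr, st.st_size, MADV_ACCESS_MANY) != 0)
            perror("madvise");
        /* ... use the mapping ... */
        return (0);
    }

MADV_ACCESS_LWP is the opposite hint (next LWP to touch it will use it heavily), and MADV_ACCESS_DEFAULT resets to the default policy.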

We have some prototype observability tools which allow you to examine the lgroup hierarchy and its contents, and to observe and/or control how threads and memory are placed among lgroups (see http://opensolaris.org/os/community/performance/numa/observability/). The source, binaries, and man pages are there.
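
For a flavor of the liblgrp API, which exposes the same hierarchy information programmatically, here is a small sketch that walks one level of the hierarchy and prints the CPUs each child lgroup directly contains (illustrative only; the array sizes are arbitrary placeholders):

    /*
     * Sketch: list the CPUs directly contained in each child of the
     * root lgroup.  Compile with: cc -o lgwalk lgwalk.c -llgrp
     */
    #include <sys/lgrp_user.h>
    #include <stdio.h>

    int
    main(void)
    {
        lgrp_cookie_t cookie = lgrp_init(LGRP_VIEW_OS);
        lgrp_id_t root, children[64];
        processorid_t cpus[512];
        int nchild, ncpu, i, j;

        if (cookie == LGRP_COOKIE_NONE) {
            perror("lgrp_init");
            return (1);
        }
        root = lgrp_root(cookie);
        nchild = lgrp_children(cookie, root, children, 64);
        if (nchild > 64)
            nchild = 64;            /* clamp to placeholder array size */
        for (i = 0; i < nchild; i++) {
            ncpu = lgrp_cpus(cookie, children[i], cpus, 512,
                LGRP_CONTENT_DIRECT);
            if (ncpu > 512)
                ncpu = 512;         /* clamp to placeholder array size */
            (void) printf("lgroup %d:", (int)children[i]);
            for (j = 0; j < ncpu; j++)
                (void) printf(" cpu %d", (int)cpus[j]);
            (void) printf("\n");
        }
        (void) lgrp_fini(cookie);
        return (0);
    }

On a machine with a single lgroup the root has no children, so the sketch prints nothing; the posted tools give a much friendlier view.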



Jonathan


David McDaniel (damcdani) wrote:

Very, very enlightening, Eric. It's really terrific to have this kind of channel for dialog.

The "return to home base" behavior you describe is clearly consistent with what I see and makes perfect sense.

Let me follow up with a question. In this application, processes have not only their "own" memory, i.e. heap, stack, program text and data, etc., but they also share a moderately large (~2-5GB today) amount of memory in the form of mmap'd files. From Sherry Moore's previous posts, I'm assuming that at startup time that would actually all be allocated on one board. Since I'm contemplating moving processes onto psrsets off that board, would it be plausible to assume that I might get slightly better net throughput if I could somehow spread that across all the boards? I know it's speculation of the highest order, so maybe my real question is whether that's even worth testing.

In any case, I'd love to turn the knob you mention, and I'll look on the performance community page and see what kind of trouble I can get into. If there are any particular items you think I should check out, guidance is welcome.

Regards
-d



-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]] On Behalf Of Eric C. Saxe
Sent: Thursday, September 01, 2005 1:48 AM
To: perf-discuss@opensolaris.org
Subject: [perf-discuss] Re: Puzzling scheduler behavior

Hi David,

Since your v1280 system has NUMA characteristics, the bias that you see for one of the boards may be a result of the kernel trying to run your application's threads "close" to where they have allocated their memory. We also generally try to keep threads in the same process together, since they tend to work on the same data. This might explain why one of the boards is so much busier than the others.

So yes, the interesting piece of this seems to be the higher than expected run queue wait time (latency) as seen via prstat -Lm. Even with the thread-to-board/memory affinity I mentioned above, it generally shouldn't be the case that threads are willing to hang out on a run queue waiting for a CPU in their "home" when they *could* actually run immediately on a "remote" (off-board) CPU. Better to run remote than not at all, or so the saying goes :)
In the case where a thread is dispatched remotely because all home CPUs are busy, the thread will try to migrate back home the next time it comes through the dispatcher and finds it can run immediately at home (either because there's an idle CPU, or because one of the running threads is lower priority than us, and we can preempt it). This migrating around means that the thread will tend to spend more time waiting on run queues, since it has to either wait for the idle() thread to switch off, or for the lower priority thread it's able to preempt to surrender the CPU. Either way, the thread shouldn't have to wait long to get the CPU, but it will have to wait a non-zero amount of time.

What does the prstat -Lm output look like exactly? Is it a lot of wait time, or just more than you would expect?

By the way, just to be clear, when I say "board" what I should be saying is lgroup (or locality group). This is the Solaris abstraction for a set of CPU and memory resources that are close to one another. On your system, it turns out that the kernel creates an lgroup for each board, and each thread is given an affinity for one of the lgroups, such that it will try to run on the CPUs (and allocate memory) from that group of resources.

One thing to look at here is whether or not the kernel could be "overloading" a given lgroup. This would result in threads tending to be less successful in getting CPU time (and/or memory) in their home. At least for CPU time, you can see this by looking at the number of migrations and where they are taking place. If the thread isn't having much luck running at home, this means that it (and others sharing its home) will tend to "ping-pong" between CPUs, in and out of the home lgroup (we refer to this as the "king of the hill" pathology). In your mpstat output, I see many migrations on one of the boards, and a good many on the other boards as well, so that might well be happening here.

To get some additional observability into this issue, you might want to take a look at the lgroup observability/control tools we posted (available from the performance community page). They allow you to do things like query/set your application's lgroup affinity, find out about the lgroups in the system and what resources they contain, etc. Using them, you might be able to confirm some of my theory above. We would also *very* much like any feedback you (or anyone else) would be willing to provide on the tools.
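
For the query/set affinity piece specifically, the programmatic interfaces underneath are lgrp_affinity_get(3LGRP) and lgrp_affinity_set(3LGRP). A small sketch (the lgroup id is a placeholder, and this affects only the calling thread):

    /*
     * Sketch: give the calling thread a strong affinity for one lgroup,
     * which should also re-home it there.  lgroup id 1 is a placeholder;
     * use the tools (or lgrp_children()) to find real ids.
     * Compile with: cc -o aff aff.c -llgrp
     */
    #include <sys/lgrp_user.h>
    #include <sys/procset.h>
    #include <stdio.h>

    int
    main(void)
    {
        lgrp_id_t target = 1;   /* placeholder lgroup id */

        if (lgrp_affinity_set(P_LWPID, P_MYID, target,
            LGRP_AFF_STRONG) != 0) {
            perror("lgrp_affinity_set");
            return (1);
        }
        (void) printf("affinity for lgroup %d is now %d, home is %d\n",
            (int)target,
            (int)lgrp_affinity_get(P_LWPID, P_MYID, target),
            (int)lgrp_home(P_LWPID, P_MYID));
        return (0);
    }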

In the short term, there's a tunable I can suggest you take a look at that deals with how hard the kernel tries to keep threads of the same process together in the same lgroup. Tuning this should result in your workload being spread out more effectively than it currently seems to be. I'll post a follow-up message tomorrow morning with these details, if you'd like to try this.

In the medium-short term, we really need to implement a mechanism to dynamically change a thread's lgroup affinity when its home becomes overloaded. We presently don't have this, as the mechanism that determines a thread's home lgroup (and does the lgroup load balancing) is static in nature (done at thread creation time). (It's implemented in usr/src/uts/common/os/lgrp.c:lgrp_choose() if you'd like to take a look at the source.) In terms of our NUMA/MPO projects, this one is at the top of the ol' TODO list.
_______________________________________________
perf-discuss mailing list
perf-discuss@opensolaris.org
