David McDaniel (damcdani) wrote:
Thanks, Jonathan, for the good insights. I'll be digging into the
references you mentioned. Yes, at the end of the day I'm sure binding
to processor sets is part of the plan; having already done so in a
rather rote way, I can demonstrate a very dramatic reduction in
apparent CPU utilization, on the order of 25-30%. But before I commit
engineers to casting something in stone, I want to make sure I
understand the defaults and the side effects of doing so, since it
potentially defeats other improvements that Sun has done or will be
doing.
Sure. No problem. The overview and man pages for our tools are pretty
short. The tools are very easy to use and kind of fun to play with.
I'm going to try to post a good example of how to use them later today.
I think that using a psrset is an interesting experiment to see whether
interference is a big factor in all the migrations. It would be nice
not to have to do that by default though.
It sounds like you already tried this experiment though and noticed a
big difference. Did the migrations drop dramatically? What else is
running on the system when you don't use a psrset?
Jonathan
-----Original Message-----
From: jonathan chew [mailto:[EMAIL PROTECTED]
Sent: Thursday, September 01, 2005 11:50 AM
To: David McDaniel (damcdani)
Cc: Eric C. Saxe; perf-discuss@opensolaris.org
Subject: Re: [perf-discuss] Re: Puzzling scheduler behavior
Dave,
It sounds like you have an interesting application. You
might want to create a processor set, leave some CPUs outside
the psrset for other threads to run on, and run your
application in a processor set to minimize interference from
other threads. As long as there are enough CPUs for your
application in the psrset, you should see the number of
migrations go down because there won't be any interference
from other threads.
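Just to give a rough idea, here is a minimal, untested sketch of what
that binding looks like programmatically; psrset(1M) does the same
thing from the command line and is probably easier for experimenting.
The CPU IDs below are placeholders (you would pick CPUs from the
board you care about), and creating processor sets needs the same
privileges psrset itself needs:

    /*
     * Minimal sketch: create a processor set, assign some CPUs to it,
     * and bind this process to the set.  CPU IDs 0-3 are placeholders.
     */
    #include <sys/types.h>
    #include <sys/processor.h>
    #include <sys/procset.h>
    #include <sys/pset.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
        processorid_t cpus[] = { 0, 1, 2, 3 };  /* placeholder CPU IDs */
        psetid_t pset, old;
        int i;

        if (pset_create(&pset) != 0) {
            perror("pset_create");
            return (1);
        }
        for (i = 0; i < 4; i++) {
            if (pset_assign(pset, cpus[i], &old) != 0)
                perror("pset_assign");
        }
        /* Bind this process (all of its LWPs) to the new set. */
        if (pset_bind(pset, P_PID, getpid(), &old) != 0) {
            perror("pset_bind");
            return (1);
        }
        return (0);
    }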
To get a better understanding of the Solaris performance
optimizations done for NUMA, you might want to check out the
overview of Memory Placement Optimization (MPO) at:
http://opensolaris.org/os/community/performance/mpo_overview.pdf
The stickiness that you observed is because of MPO. Binding to a
processor set containing one CPU set the thread's home lgroup to the
lgroup containing that CPU, and destroying the psrset just left the
thread homed there.
Your shared memory is probably spread across the system
already because the default MPO memory allocation policy for
shared memory is to allocate the memory from random lgroups
across the system.
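If you ever want to ask for that kind of placement explicitly on a
particular mapping, madvise(3C) takes MADV_ACCESS_MANY (many threads
will touch this range, so spread it out) and MADV_ACCESS_LWP (place
it near the next thread to touch it). A rough sketch, with a made-up
file name standing in for one of your mmap'd files:

    /*
     * Rough sketch: map a file shared and advise the kernel that many
     * threads will access it, so its pages should be spread across
     * lgroups.  "/data/shared.db" is a made-up stand-in for one of
     * the real mmap'd files.
     */
    #include <sys/types.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <stdio.h>

    int
    main(void)
    {
        struct stat st;
        void *p;
        int fd;

        if ((fd = open("/data/shared.db", O_RDWR)) < 0 ||
            fstat(fd, &st) != 0) {
            perror("open/fstat");
            return (1);
        }
        p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED,
            fd, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return (1);
        }
        if (madvise((caddr_t)p, st.st_size, MADV_ACCESS_MANY) != 0)
            perror("madvise");
        /* ... use the mapping ... */
        return (0);
    }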
We have some prototype observability tools which allow you to
examine the lgroup hierarchy and its contents and observe and/or
control how the threads and memory are placed among lgroups (see
http://opensolaris.org/os/community/performance/numa/observability/).
The source, binaries, and man pages are there.
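The tools are built on liblgrp, so if you would rather poke at things
from your own code, something like this rough sketch (compile with
-llgrp) walks the hierarchy, prints each lgroup's CPUs and installed
memory, and shows the calling thread's home lgroup; the fixed-size
arrays are just to keep the example short:

    /*
     * Rough sketch using liblgrp: walk the lgroup hierarchy from the
     * root, printing each lgroup's directly contained CPUs and
     * installed memory, then show the calling thread's home lgroup.
     */
    #include <sys/types.h>
    #include <sys/processor.h>
    #include <sys/procset.h>
    #include <sys/lgrp_user.h>
    #include <stdio.h>

    static void
    print_lgrp(lgrp_cookie_t c, lgrp_id_t id, int depth)
    {
        processorid_t cpus[64];
        lgrp_id_t kids[64];
        int ncpus, nkids, i;

        ncpus = lgrp_cpus(c, id, cpus, 64, LGRP_CONTENT_DIRECT);
        (void) printf("%*slgroup %d: %d CPUs, %lld MB installed\n",
            depth * 2, "", (int)id, ncpus < 0 ? 0 : ncpus,
            (long long)(lgrp_mem_size(c, id, LGRP_MEM_SZ_INSTALLED,
            LGRP_CONTENT_DIRECT) >> 20));

        nkids = lgrp_children(c, id, kids, 64);
        for (i = 0; i < nkids; i++)
            print_lgrp(c, kids[i], depth + 1);
    }

    int
    main(void)
    {
        lgrp_cookie_t c = lgrp_init(LGRP_VIEW_OS);

        if (c == LGRP_COOKIE_NONE) {
            perror("lgrp_init");
            return (1);
        }
        print_lgrp(c, lgrp_root(c), 0);
        (void) printf("my home lgroup: %d\n",
            (int)lgrp_home(P_LWPID, P_MYID));
        (void) lgrp_fini(c);
        return (0);
    }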
Jonathan
David McDaniel (damcdani) wrote:
Very, very enlightening, Eric. It's really terrific to have this
kind of channel for dialog.
The "return to home base" behavior you describe is clearly
consistent
with what I see and makes perfect sense.
Let me follow up with a question. In this application, processes
have not only their "own" memory, i.e., heap, stack, program text and
data, etc., but they also share a moderately large (~2-5GB today)
amount of memory in the form of mmap'd files. From Sherry Moore's
previous posts, I'm assuming that at startup time that would all be
allocated on one board. Since I'm contemplating moving processes onto
psrsets off that board, would it be plausible to assume that I might
get slightly better net throughput if I could somehow spread that
memory across all the boards? I know it's speculation of the highest
order, so maybe my real question is whether that's even worth
testing.
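For what it's worth, I think I could at least check where those pages
actually land today before trying anything. If I'm reading
meminfo(2) right, it reports the lgroup backing each virtual address,
so something like this rough, untested sketch (addr and len standing
in for one of the mappings) should show the current placement:

    /*
     * Rough sketch: use meminfo(2) to report which lgroup backs each
     * page of a mapped region.  addr/len are stand-ins for the
     * address and size of one of the mmap'd files.
     */
    #include <sys/types.h>
    #include <sys/mman.h>
    #include <inttypes.h>
    #include <unistd.h>
    #include <stdio.h>

    static void
    show_page_lgrps(char *addr, size_t len)
    {
        size_t pgsz = (size_t)sysconf(_SC_PAGESIZE);
        uint_t req = MEMINFO_VLGRP;
        size_t off;

        for (off = 0; off < len; off += pgsz) {
            uint64_t in = (uint64_t)(uintptr_t)(addr + off);
            uint64_t out;
            uint_t valid;

            /* Bit 0: address valid; bit 1: requested item valid. */
            if (meminfo(&in, 1, &req, 1, &out, &valid) == 0 &&
                (valid & 0x3) == 0x3)
                (void) printf("%p -> lgroup %llu\n",
                    (void *)(addr + off), (unsigned long long)out);
        }
    }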
In any case, I'd love to turn the knob you mention and
I'll look on
the performance community page and see what kind of trouble
I can get
into. If there are any particular items you think I should
check out,
guidance is welcome.
Regards
-d
-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Eric C.
Saxe
Sent: Thursday, September 01, 2005 1:48 AM
To: perf-discuss@opensolaris.org
Subject: [perf-discuss] Re: Puzzling scheduler behavior
Hi David,
Since your v1280 system has NUMA characteristics, the bias
that you
see for one of the boards may be a result of the kernel
trying to run
your application's threads "close" to where they have
allocated their
memory. We also generally try to keep threads in the same process
together, since they generally tend to work on the same data. This
might explain why one of the boards is so much busier than
the others.
So yes, the interesting piece of this seems to be the higher than
expected run queue wait time (latency) as seen via prstat -Lm. Even
with the thread-to-board/memory affinity I mentioned above, it
generally shouldn't be the case that threads are willing to hang out
on a run queue waiting for a CPU in their "home" lgroup when they
*could* actually run immediately on a "remote" (off-board) CPU.
Better to run remote than not at all, or so the saying goes :)
In the case where a thread is dispatched remotely because all home
CPUs are busy, the thread will try to migrate back home the
next time
it comes through the dispatcher and finds it can run immediately at
home (either because there's an idle CPU, or because one of the
running threads is lower priority than us, and we can preempt it).
This migrating around means that the thread will tend to spend more
time waiting on run queues, since it has to either wait for
the idle()
thread to switch off, or for the lower priority thread it's able to
preempt to surrender the CPU. Either way, the thread
shouldn't have to
wait long to get the CPU, but it will have to wait a
non-zero amount
of time.
What does the prstat -Lm output look like exactly? Is it a
lot of wait
time, or just more than you would expect?
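If you want to track that latency from inside the application rather
than eyeballing prstat, the same per-LWP microstate data is available
from /proc; the LAT column is derived from the pr_wtime field. A
rough sketch, with placeholder pid/lwpid values:

    /*
     * Rough sketch: read the per-LWP microstate accounting data that
     * prstat -Lm uses, straight from /proc.  pr_wtime is the time the
     * LWP spent waiting on a run queue for a CPU (the LAT column).
     * The pid and lwpid below are placeholders.
     */
    #include <sys/types.h>
    #include <procfs.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
        char path[64];
        prusage_t pu;
        int fd;

        (void) snprintf(path, sizeof (path),
            "/proc/%d/lwp/%d/lwpusage", 12345, 1);  /* placeholders */
        if ((fd = open(path, O_RDONLY)) < 0) {
            perror(path);
            return (1);
        }
        if (read(fd, &pu, sizeof (pu)) == (ssize_t)sizeof (pu))
            (void) printf("wait-cpu time: %ld.%09ld secs\n",
                (long)pu.pr_wtime.tv_sec, (long)pu.pr_wtime.tv_nsec);
        (void) close(fd);
        return (0);
    }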
By the way, just to be clear, when I say "board" what I should be
saying is lgroup (or locality group). This is the Solaris
abstraction
for a set of CPU and memory resources that are close to one
another.
On your system, it turns out that the kernel creates an lgroup for
each board, and each thread is given an affinity for one of the
lgroups, such that it will try to run on the CPUs in (and allocate
memory from) that group of resources.
One thing to look at here is whether or not the kernel could be
"overloading" a given lgroup. This would result in threads tending to
be less successful in getting CPU time (and/or memory) in their home.
At least for CPU time, you can see this by looking at the number of
migrations and where they are taking place. If a thread isn't having
much luck running at home, it (and others sharing its home) will tend
to "ping-pong" between CPUs in and out of the home lgroup (we refer
to this as the "king of the hill" pathology). In your mpstat output,
I see many migrations on one of the boards, and a good many on the
other boards as well, so that might well be happening here.
To get some additional observability into this issue, you
might want
to take a look at some of our lgroup observability/control tools we
posted (available from the performance community page).
They allow you
to do things like query/set your application's lgroup
affinity, find
out about the lgroups in the system, and what resources
they contain,
etc. Using them you might be able to confirm some of my
theory above.
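For example, the affinity side of that interface looks roughly like
the sketch below (liblgrp again, so link with -llgrp; the target
lgroup id is just a placeholder). Setting a strong affinity should
re-home the calling thread to that lgroup, and LGRP_AFF_NONE removes
the affinity:

    /*
     * Rough sketch of the lgroup affinity interface in liblgrp.
     * A strong affinity re-homes the calling thread to the target
     * lgroup; LGRP_AFF_NONE removes the affinity.  The target lgroup
     * id is a placeholder.
     */
    #include <sys/types.h>
    #include <sys/procset.h>
    #include <sys/lgrp_user.h>
    #include <stdio.h>

    int
    main(void)
    {
        lgrp_id_t target = 1;   /* placeholder lgroup id */

        (void) printf("home before: %d\n",
            (int)lgrp_home(P_LWPID, P_MYID));

        /* Ask the kernel to home this thread on the target lgroup. */
        if (lgrp_affinity_set(P_LWPID, P_MYID, target,
            LGRP_AFF_STRONG) != 0)
            perror("lgrp_affinity_set");

        (void) printf("home after:  %d\n",
            (int)lgrp_home(P_LWPID, P_MYID));
        (void) printf("affinity:    %d\n",
            (int)lgrp_affinity_get(P_LWPID, P_MYID, target));
        return (0);
    }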
We would also *very* much like any feedback you (or anyone
else) would
be willing to provide on the tools.
In the short term, there's a tunable I can suggest you take
a look at
that deals with how hard the kernel tries to keep threads
of the same
process together in the same lgroup.
Tuning this should result in your workload being spread out more
effectively than it currently seems to be. I'll post a follow up
message tomorrow morning with these details, if you'd like to try
this.
In the medium-short term, we really need to implement a
mechanism to
dynamically change a thread's lgroup affinity when its home becomes
overloaded. We presently don't have this, as the mechanism that
determines a thread's home lgroup (and does the lgroup load
balancing)
is static in nature (done at thread creation time). (It is
implemented in usr/src/uts/common/os/lgrp.c:lgrp_choose() if you'd
like to take a look at the source.) In terms of our NUMA/MPO
projects, this one is at the top of the ol' TODO list.
_______________________________________________
perf-discuss mailing list
perf-discuss@opensolaris.org