David McDaniel (damcdani) wrote:

 Thanks, Jonathan, for the good insights. I'll be digging into the
references you mentioned. Yes, at the end of the day I'm sure binding to
processor sets is part of the plan; having already done so in a rather
rote way, I can demonstrate a very dramatic reduction in apparent CPU
utilization, on the order of 25-30%. But before I commit engineers to
casting something in stone, I want to make sure I understand the defaults
and the side effects of doing so, since it could potentially defeat
other improvements that Sun has made or will be making.

Sure. No problem. The overview and man pages for our tools are pretty short. The tools are very easy to use and kind of fun to play with. I'm going to try to post a good example of how to use them later today.

I think that using a psrset is an interesting experiment to see whether interference is a big factor in all the migrations. It would be nice not to have to do that by default though.

It sounds like you already tried this experiment though and noticed a big difference. Did the migrations drop dramatically? What else is running on the system when you don't use a psrset?


Jonathan

-----Original Message-----
From: jonathan chew [mailto:[EMAIL PROTECTED]]
Sent: Thursday, September 01, 2005 11:50 AM
To: David McDaniel (damcdani)
Cc: Eric C. Saxe; perf-discuss@opensolaris.org
Subject: Re: [perf-discuss] Re: Puzzling scheduler behavior

Dave,

It sounds like you have an interesting application. You might want to create a processor set, leave some CPUs outside the psrset for other threads to run on, and run your application in a processor set to minimize interference from other threads. As long as there are enough CPUs for your application in the psrset, you should see the number of migrations go down because there won't be any interference from other threads.
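
In case it helps, here is roughly what that looks like if you want to do it programmatically rather than with psrset(1M). This is only a sketch: the CPU IDs are placeholders, and the calls need sufficient privileges.

    /*
     * Rough sketch: carve four CPUs into a processor set and bind a
     * process to it.  The CPU IDs (8-11) are placeholders -- pick CPUs
     * that all live on one board/lgroup.  Compile: cc -o mkpset mkpset.c
     */
    #include <sys/pset.h>
    #include <sys/processor.h>
    #include <sys/procset.h>
    #include <stdio.h>
    #include <stdlib.h>

    int
    main(int argc, char **argv)
    {
        psetid_t pset;
        processorid_t cpus[] = { 8, 9, 10, 11 };   /* placeholder CPU IDs */
        int i;

        if (argc != 2) {
            (void) fprintf(stderr, "usage: mkpset pid\n");
            return (1);
        }
        if (pset_create(&pset) != 0) {             /* needs privileges */
            perror("pset_create");
            return (1);
        }
        for (i = 0; i < 4; i++) {
            if (pset_assign(pset, cpus[i], NULL) != 0)
                perror("pset_assign");
        }
        /* Bind the whole process (all of its LWPs) to the new set. */
        if (pset_bind(pset, P_PID, (id_t)atoi(argv[1]), NULL) != 0) {
            perror("pset_bind");
            return (1);
        }
        (void) printf("bound pid %s to pset %d\n", argv[1], (int)pset);
        return (0);
    }

From the shell, psrset -c, psrset -a, and psrset -b do the same create/assign/bind steps.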

To get a better understanding of the Solaris performance optimizations done for NUMA, you might want to check out the overview of Memory Placement Optimization (MPO) at:

   http://opensolaris.org/os/community/performance/mpo_overview.pdf

The stickiness that you observed is because of MPO. Binding to a processor set containing one CPU set the thread's home lgroup to the lgroup containing that CPU, and destroying the psrset just left the thread homed there.
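
If you want to watch that happen, a thread can query its own home lgroup through liblgrp. A minimal sketch (assuming liblgrp is installed; link with -llgrp):

    /*
     * Sketch: print the calling thread's home lgroup.
     * Compile with: cc -o home home.c -llgrp
     */
    #include <sys/lgrp_user.h>
    #include <sys/procset.h>
    #include <stdio.h>

    int
    main(void)
    {
        /* P_LWPID + P_MYID means "the thread making this call". */
        lgrp_id_t home = lgrp_home(P_LWPID, P_MYID);

        if (home == -1) {
            perror("lgrp_home");
            return (1);
        }
        (void) printf("this thread is homed in lgroup %d\n", (int)home);
        return (0);
    }

Running that kind of query (or the equivalent from our tools) before binding, while bound, and after destroying the psrset should show the home move to the bound CPU's lgroup and then stay there.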

Your shared memory is probably spread across the system already because the default MPO memory allocation policy for shared memory is to allocate the memory from random lgroups across the system.
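
If you do end up wanting to influence that placement explicitly, madvise(3C) with the MADV_ACCESS_* flags is the interface MPO honors. Here is a rough sketch; the path and mapping details are made up, and the advice mainly affects pages that haven't been allocated/touched yet:

    /*
     * Sketch: map a shared file and advise the kernel that many threads
     * all over the machine will touch it, so its pages should not all be
     * placed near one lgroup.  "/data/shared.dat" is a placeholder path.
     */
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <stdio.h>

    int
    main(void)
    {
        int fd = open("/data/shared.dat", O_RDWR);   /* placeholder path */
        struct stat st;
        caddr_t addr;

        if (fd == -1 || fstat(fd, &st) != 0) {
            perror("open/fstat");
            return (1);
        }
        addr = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
            MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED) {
            perror("mmap");
            return (1);
        }
        /* Ask for the "spread it around" placement policy. */
        if (madvise(addr, st.st_size, MADV_ACCESS_MANY) != 0)
            perror("madvise");
        /* ... use the mapping ... */
        return (0);
    }

MADV_ACCESS_LWP is the opposite hint (next LWP to touch it will use it heavily), and MADV_ACCESS_DEFAULT resets to the default policy.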

We have some prototype observability tools which allow you to examine the lgroup hierarchy and its contents, and to observe and/or control how threads and memory are placed among lgroups (see http://opensolaris.org/os/community/performance/numa/observability/). The source, binaries, and man pages are there.
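
For a flavor of the liblgrp API, which exposes the same hierarchy information programmatically, here is a small sketch that walks one level of the hierarchy and prints the CPUs each child lgroup directly contains (illustrative only; the array sizes are arbitrary placeholders):

    /*
     * Sketch: list the CPUs directly contained in each child of the
     * root lgroup.  Compile with: cc -o lgwalk lgwalk.c -llgrp
     */
    #include <sys/lgrp_user.h>
    #include <stdio.h>

    int
    main(void)
    {
        lgrp_cookie_t cookie = lgrp_init(LGRP_VIEW_OS);
        lgrp_id_t root, children[64];
        processorid_t cpus[512];
        int nchild, ncpu, i, j;

        if (cookie == LGRP_COOKIE_NONE) {
            perror("lgrp_init");
            return (1);
        }
        root = lgrp_root(cookie);
        nchild = lgrp_children(cookie, root, children, 64);
        if (nchild > 64)
            nchild = 64;            /* clamp to placeholder array size */
        for (i = 0; i < nchild; i++) {
            ncpu = lgrp_cpus(cookie, children[i], cpus, 512,
                LGRP_CONTENT_DIRECT);
            if (ncpu > 512)
                ncpu = 512;         /* clamp to placeholder array size */
            (void) printf("lgroup %d:", (int)children[i]);
            for (j = 0; j < ncpu; j++)
                (void) printf(" cpu %d", (int)cpus[j]);
            (void) printf("\n");
        }
        (void) lgrp_fini(cookie);
        return (0);
    }

On a machine with a single lgroup the root has no children, so the sketch prints nothing; the posted tools give a much friendlier view.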



Jonathan


David McDaniel (damcdani) wrote:

Very, very enlightening, Eric. It's really terrific to have this kind of channel for dialog.

The "return to home base" behavior you describe is clearly consistent with what I see and makes perfect sense.

Let me follow up with a question. In this application, processes have not only their "own" memory, i.e. heap, stack, program text and data, etc., but they also share a moderately large (~2-5GB today) amount of memory in the form of mmap'd files. From Sherry Moore's previous posts, I'm assuming that at startup time that would actually all be allocated on one board. Since I'm contemplating moving processes onto psrsets off that board, would it be plausible to assume that I might get slightly better net throughput if I could somehow spread that across all the boards? I know it's speculation of the highest order, so maybe my real question is whether that's even worth testing.

In any case, I'd love to turn the knob you mention, and I'll look on the performance community page and see what kind of trouble I can get into. If there are any particular items you think I should check out, guidance is welcome.

Regards
-d



-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]] On Behalf Of Eric C. Saxe
Sent: Thursday, September 01, 2005 1:48 AM
To: perf-discuss@opensolaris.org
Subject: [perf-discuss] Re: Puzzling scheduler behavior

Hi David,

Since your v1280 system has NUMA characteristics, the bias that you see for one of the boards may be a result of the kernel trying to run your application's threads "close" to where they have allocated their memory. We also generally try to keep threads in the same process together, since they tend to work on the same data. This might explain why one of the boards is so much busier than the others.

So yes, the interesting piece of this seems to be the higher than expected run queue wait time (latency) as seen via prstat -Lm. Even with the thread-to-board/memory affinity I mentioned above, it generally shouldn't be the case that threads are willing to hang out on a run queue waiting for a CPU in their "home" when they *could* actually run immediately on a "remote" (off-board) CPU. Better to run remote than not at all, or so the saying goes :)
In the case where a thread is dispatched remotely because all home CPUs are busy, the thread will try to migrate back home the next time it comes through the dispatcher and finds it can run immediately at home (either because there's an idle CPU, or because one of the running threads is lower priority than us, and we can preempt it). This migrating around means that the thread will tend to spend more time waiting on run queues, since it has to either wait for the idle() thread to switch off, or for the lower priority thread it's able to preempt to surrender the CPU. Either way, the thread shouldn't have to wait long to get the CPU, but it will have to wait a non-zero amount of time.

What does the prstat -Lm output look like exactly? Is it a lot of wait time, or just more than you would expect?

By the way, just to be clear, when I say "board" what I should be saying is lgroup (or locality group). This is the Solaris abstraction for a set of CPU and memory resources that are close to one another. On your system, it turns out that the kernel creates an lgroup for each board, and each thread is given an affinity for one of the lgroups, such that it will try to run on the CPUs (and allocate memory) from that group of resources.

One thing to look at here is whether or not the kernel could be "overloading" a given lgroup. This would result in threads tending to be less successful in getting CPU time (and/or memory) in their home. At least for CPU time, you can see this by looking at the number of migrations and where they are taking place. If the thread isn't having much luck running at home, this means that it (and others sharing its home) will tend to "ping-pong" between CPUs, in and out of the home lgroup (we refer to this as the "king of the hill" pathology). In your mpstat output, I see many migrations on one of the boards, and a good many on the other boards as well, so that might well be happening here.

To get some additional observability into this issue, you might want to take a look at the lgroup observability/control tools we posted (available from the performance community page). They allow you to do things like query/set your application's lgroup affinity, find out about the lgroups in the system and what resources they contain, etc. Using them, you might be able to confirm some of my theory above. We would also *very* much like any feedback you (or anyone else) would be willing to provide on the tools.
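
For the query/set affinity piece specifically, the programmatic interfaces underneath are lgrp_affinity_get(3LGRP) and lgrp_affinity_set(3LGRP). A small sketch (the lgroup id is a placeholder, and this affects only the calling thread):

    /*
     * Sketch: give the calling thread a strong affinity for one lgroup,
     * which should also re-home it there.  lgroup id 1 is a placeholder;
     * use the tools (or lgrp_children()) to find real ids.
     * Compile with: cc -o aff aff.c -llgrp
     */
    #include <sys/lgrp_user.h>
    #include <sys/procset.h>
    #include <stdio.h>

    int
    main(void)
    {
        lgrp_id_t target = 1;   /* placeholder lgroup id */

        if (lgrp_affinity_set(P_LWPID, P_MYID, target,
            LGRP_AFF_STRONG) != 0) {
            perror("lgrp_affinity_set");
            return (1);
        }
        (void) printf("affinity for lgroup %d is now %d, home is %d\n",
            (int)target,
            (int)lgrp_affinity_get(P_LWPID, P_MYID, target),
            (int)lgrp_home(P_LWPID, P_MYID));
        return (0);
    }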

In the short term, there's a tunable I can suggest you take a look at that deals with how hard the kernel tries to keep threads of the same process together in the same lgroup. Tuning this should result in your workload being spread out more effectively than it currently seems to be. I'll post a follow-up message tomorrow morning with these details, if you'd like to try this.

In the medium-short term, we really need to implement a mechanism to dynamically change a thread's lgroup affinity when its home becomes overloaded. We presently don't have this, as the mechanism that determines a thread's home lgroup (and does the lgroup load balancing) is static in nature (done at thread creation time). (It's implemented in usr/src/uts/common/os/lgrp.c:lgrp_choose() if you'd like to take a look at the source.) In terms of our NUMA/MPO projects, this one is at the top of the ol' TODO list.
_______________________________________________
perf-discuss mailing list
perf-discuss@opensolaris.org
