Dave,
It sounds like you have an interesting application. You might want to
create a processor set, leave some CPUs outside the psrset for other
threads to run on, and run your application inside the set to minimize
interference. As long as the psrset has enough CPUs for your
application, you should see the number of migrations go down, because
other threads won't be competing for those CPUs.
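If you would rather set the psrset up from inside the application
instead of with psrset(1M), something along these lines should work.
This is just a sketch; error handling is minimal and the CPU IDs are
placeholders for whatever psrinfo shows on your box:

    /* Sketch: create a processor set, assign some CPUs to it, and bind
     * this process (and its LWPs) to the set.  CPU IDs are placeholders.
     */
    #include <sys/types.h>
    #include <sys/processor.h>
    #include <sys/procset.h>
    #include <sys/pset.h>
    #include <unistd.h>
    #include <stdio.h>

    int
    main(void)
    {
        psetid_t pset;
        processorid_t cpus[] = { 4, 5, 6, 7 };  /* placeholder CPU IDs */
        int i;

        if (pset_create(&pset) != 0) {
            perror("pset_create");
            return (1);
        }
        for (i = 0; i < 4; i++) {
            if (pset_assign(pset, cpus[i], NULL) != 0)
                perror("pset_assign");
        }
        /* Bind the calling process (and all of its LWPs) to the set. */
        if (pset_bind(pset, P_PID, getpid(), NULL) != 0) {
            perror("pset_bind");
            return (1);
        }
        (void) printf("bound pid %ld to pset %d\n",
            (long)getpid(), (int)pset);
        return (0);
    }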
To get a better understanding of the Solaris performance optimizations
done for NUMA, you might want to check out the overview of Memory
Placement Optimization (MPO) at:
http://opensolaris.org/os/community/performance/mpo_overview.pdf
The stickiness that you observed is due to MPO. Binding to a
processor set containing one CPU set the thread's home lgroup to
the lgroup containing that CPU, and destroying the psrset just left the
thread homed there.
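If you want to see this for yourself, liblgrp can report a process's
(or LWP's) home lgroup. A minimal sketch (link with -llgrp):

    /* Sketch: print the home lgroup of the calling process. */
    #include <sys/lgrp_user.h>
    #include <sys/procset.h>
    #include <unistd.h>
    #include <stdio.h>

    int
    main(void)
    {
        lgrp_id_t home = lgrp_home(P_PID, getpid());

        if (home == -1) {
            perror("lgrp_home");
            return (1);
        }
        (void) printf("pid %ld home lgroup: %d\n",
            (long)getpid(), (int)home);
        return (0);
    }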
Your shared memory is probably spread across the system already,
because the default MPO memory allocation policy for shared memory is
to allocate it from random lgroups across the system.
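If you ever want to be explicit about the policy for a given mapping,
madvise(3C) accepts the MADV_ACCESS_* advice values. A rough sketch
(the file name is made up, and most error handling is omitted):

    /* Sketch: map a shared file and advise the kernel that it will be
     * accessed by many LWPs, so backing memory should be spread
     * across lgroups.  The path is a placeholder.
     */
    #include <sys/types.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <stdio.h>

    int
    main(void)
    {
        int fd = open("/data/shared.dat", O_RDWR);  /* placeholder path */
        struct stat st;
        void *addr;

        if (fd == -1 || fstat(fd, &st) == -1) {
            perror("open/fstat");
            return (1);
        }
        addr = mmap(NULL, (size_t)st.st_size, PROT_READ | PROT_WRITE,
            MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED) {
            perror("mmap");
            return (1);
        }
        /* Ask for the "accessed by many threads" placement policy. */
        if (madvise((caddr_t)addr, (size_t)st.st_size,
            MADV_ACCESS_MANY) != 0)
            perror("madvise(MADV_ACCESS_MANY)");
        /* ... use the mapping ... */
        return (0);
    }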
We have some prototype observability tools which allow you to examine
the lgroup hierarchy and its contents and to observe and/or control how
threads and memory are placed among lgroups (see
http://opensolaris.org/os/community/performance/numa/observability/).
The source, binaries, and man pages are there.
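If you would rather check placement programmatically, meminfo(2) can
report which lgroup backs each page of a mapping. A small sketch (the
anonymous segment here just stands in for your mmap'd files):

    /* Sketch: for each page of a mapping, print the lgroup that holds
     * the backing physical memory (MEMINFO_VLGRP).
     */
    #include <sys/types.h>
    #include <sys/mman.h>
    #include <inttypes.h>
    #include <string.h>
    #include <unistd.h>
    #include <stdio.h>

    int
    main(void)
    {
        size_t pagesize = (size_t)sysconf(_SC_PAGESIZE);
        size_t len = 8 * pagesize;
        uint_t info = MEMINFO_VLGRP;
        char *addr;
        size_t off;

        /* Stand-in for your mmap'd file: a small anonymous segment. */
        addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
            MAP_SHARED | MAP_ANON, -1, 0);
        if (addr == MAP_FAILED) {
            perror("mmap");
            return (1);
        }
        (void) memset(addr, 0, len);  /* touch pages so they get allocated */

        for (off = 0; off < len; off += pagesize) {
            uint64_t in = (uint64_t)(uintptr_t)(addr + off);
            uint64_t out;
            uint_t valid;

            if (meminfo(&in, 1, &info, 1, &out, &valid) != 0) {
                perror("meminfo");
                return (1);
            }
            if (valid & 0x2)  /* bit 1 => info_req[0] was satisfied */
                (void) printf("page +%lu -> lgroup %llu\n",
                    (unsigned long)off, (unsigned long long)out);
        }
        return (0);
    }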
Jonathan
David McDaniel (damcdani) wrote:
Very, very enlightening, Eric. It's really terrific to have this kind
of channel for dialog.
The "return to home base" behavior you describe is clearly consistent
with what I see and makes perfect sense.
Let me follow up with a question. In this application, processes have
not only their "own" memory, i.e., heap, stack, program text and data,
but they also share a moderately large (~2-5GB today) amount of memory
in the form of mmap'd files. From Sherry Moore's previous posts, I'm
assuming that at startup time that would all be allocated on one board.
Since I'm contemplating moving processes onto psrsets off that board,
would it be plausible to assume that I might get slightly better net
throughput if I could somehow spread that memory across all the boards?
I know it's speculation of the highest order, so maybe my real question
is whether that's even worth testing.
In any case, I'd love to turn the knob you mention and I'll look on
the performance community page and see what kind of trouble I can get
into. If there are any particular items you think I should check out,
guidance is welcome.
Regards
-d
-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of
Eric C. Saxe
Sent: Thursday, September 01, 2005 1:48 AM
To: perf-discuss@opensolaris.org
Subject: [perf-discuss] Re: Puzzling scheduler behavior
Hi David,
Since your v1280 system has NUMA characteristics, the bias
that you see toward one of the boards may be a result of the
kernel trying to run your application's threads "close" to
where they have allocated their memory. We also try to keep
threads in the same process together, since they generally
tend to work on the same data. This might explain why one of
the boards is so much busier than the others.
So yes, the interesting piece of this seems to be the higher
than expected run queue wait time (latency) seen via
prstat -Lm. Even with the thread-to-board/memory affinity I
mentioned above, a thread generally shouldn't hang out on a
run queue waiting for a CPU in its "home" lgroup when it
*could* run immediately on a "remote" (off-board) CPU.
Better to run remote than not at all, or so the saying goes :)
In the case where a thread is dispatched remotely because all
home CPUs are busy, the thread will try to migrate back home
the next time it comes through the dispatcher and finds it
can run immediately at home (either because there's an idle
CPU, or because one of the running threads is lower priority
than us, and we can preempt it). This migrating around means
that the thread will tend to spend more time waiting on run
queues, since it has to either wait for the idle() thread to
switch off, or for the lower priority thread it's able to
preempt to surrender the CPU. Either way, the thread
shouldn't have to wait long to get the CPU, but it will have
to wait a non-zero amount of time.
What does the prstat -Lm output look like exactly? Is it a
lot of wait time, or just more than you would expect?
By the way, just to be clear, when I say "board" what I
should be saying is lgroup (or locality group). This is the
Solaris abstraction for a set of CPU and memory resources
that are close to one another. On your system, it turns out
that the kernel creates an lgroup for each board, and each
thread is given an affinity for one of the lgroups, such that
it will try to run on the CPUs of (and allocate memory from)
that group of resources.
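If you're curious what the hierarchy looks like on your v1280,
liblgrp will show you each lgroup's CPUs and memory. A rough sketch
(link with -llgrp; the array sizes are just guesses big enough for
your box):

    /* Sketch: walk the lgroup hierarchy, printing the CPUs and
     * installed memory directly contained in each lgroup.
     */
    #include <sys/lgrp_user.h>
    #include <sys/types.h>
    #include <stdio.h>

    static void
    walk(lgrp_cookie_t c, lgrp_id_t lgrp, int depth)
    {
        processorid_t cpus[64];  /* plenty for a v1280 */
        lgrp_id_t kids[64];
        lgrp_mem_size_t sz;
        int ncpus, nkids, i;

        ncpus = lgrp_cpus(c, lgrp, cpus, 64, LGRP_CONTENT_DIRECT);
        sz = lgrp_mem_size(c, lgrp, LGRP_MEM_SZ_INSTALLED,
            LGRP_CONTENT_DIRECT);

        (void) printf("%*slgroup %d: %d CPU(s), %lld MB installed\n",
            depth * 2, "", (int)lgrp, ncpus > 0 ? ncpus : 0,
            (long long)(sz / (1024 * 1024)));

        nkids = lgrp_children(c, lgrp, kids, 64);
        for (i = 0; i < nkids; i++)
            walk(c, kids[i], depth + 1);
    }

    int
    main(void)
    {
        lgrp_cookie_t c = lgrp_init(LGRP_VIEW_OS);

        if (c == LGRP_COOKIE_NONE) {
            perror("lgrp_init");
            return (1);
        }
        walk(c, lgrp_root(c), 0);
        (void) lgrp_fini(c);
        return (0);
    }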
One thing to look at here is whether or not the kernel could
be "overloading" a given lgroup. This would result in threads
tending to be less successful in getting CPU time (and/or
memory) in their home. At least for CPU time, you can see
this by looking at the number of migrations and where they
are taking place. If a thread isn't having much luck running
at home, it (and others sharing its home) will tend to
"ping-pong" between CPUs in and out of the home lgroup (we
refer to this as the "king of the hill" pathology). In your
mpstat output, I see many migrations on one of the boards,
and a good many on the other boards as well, so that might
well be happening here.
To get some additional observability into this issue, you
might want to take a look at the lgroup observability/control
tools we posted (available from the performance community
page). They allow you to do things like query/set your
application's lgroup affinities, find out about the lgroups
in the system and what resources they contain, etc. Using
them you might be able to confirm some of my theory above. We
would also *very* much like any feedback you (or anyone else)
would be willing to provide on the tools.
In the short term, there's a tunable I can suggest you look
at that controls how hard the kernel tries to keep threads of
the same process together in the same lgroup. Tuning it
should result in your workload being spread out more
effectively than it currently seems to be. I'll post a
follow-up message tomorrow morning with the details, if you'd
like to try this.
In the medium-short term, we really need to implement a
mechanism to dynamically change a thread's lgroup affinity
when its home becomes overloaded. We presently don't have
this, as the mechanism that determines a thread's home lgroup
(and does the lgroup load balancing) is static in nature
(done at thread creation time). (It's implemented in
usr/src/uts/common/os/lgrp.c:lgrp_choose() if you'd like to
take a look at the source.) In terms of our NUMA/MPO projects,
this one is at the top of the ol' TODO list.
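In the meantime, if you want to experiment with re-homing a process
yourself, liblgrp lets you set a strong affinity for another lgroup
from userland, which should make that lgroup the home. A rough sketch
(the target lgroup ID is just a placeholder; link with -llgrp):

    /* Sketch: give the calling process (all of its LWPs) a strong
     * affinity for a chosen lgroup, then print its new home lgroup.
     */
    #include <sys/lgrp_user.h>
    #include <sys/procset.h>
    #include <unistd.h>
    #include <stdio.h>

    int
    main(void)
    {
        lgrp_id_t target = 2;  /* placeholder: pick a less loaded lgroup */

        if (lgrp_affinity_set(P_PID, getpid(), target,
            LGRP_AFF_STRONG) != 0) {
            perror("lgrp_affinity_set");
            return (1);
        }
        (void) printf("home lgroup is now %d\n",
            (int)lgrp_home(P_PID, getpid()));
        return (0);
    }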
This message posted from opensolaris.org
_______________________________________________
perf-discuss mailing list
perf-discuss@opensolaris.org