David McDaniel (damcdani) wrote:

 Thanks for the feedback, Jonathan. I've got it on my todo list to get those tools and go spelunking a bit. I can't really say that we have a performance problem; it's more along the lines of me trying to use the greatly improved observability tools in Solaris to get a better understanding of things. In any case, it's pretty much relegated to a science project right now because we can't ship anything that's not part of some "official" distribution.

Ok. The tools are pretty easy to use. If you have any questions, we would be happy to help and welcome any feedback on the tools or documentation.

When you say that you can't ship anything that's not part of some "official" distribution, are you referring to our tools or your software?

I am suggesting using our tools to understand the behavior of your application and its interaction with the operating system better and determine whether there is a problem or not. If there is a problem in the OS, we can try to fix the default behavior.

As Sasha pointed out, it is our intention to ship our observability tools, but we wanted to let the OpenSolaris community try them first to see whether they are useful.

Last but not least, we can try running your application if you want.



Jonathan

-----Original Message-----
From: jonathan chew [mailto:[EMAIL PROTECTED]
Sent: Friday, September 09, 2005 6:08 PM
To: David McDaniel (damcdani)
Cc: Eric C. Saxe; perf-discuss@opensolaris.org
Subject: Re: [perf-discuss] Re: Puzzling scheduler behavior

Dave,

Sorry, I forgot to reply to this sooner. Yes, I was just curious what else was running to see whether we would expect your application to be perturbed much.

There could be a load imbalance due to the daemons throwing everything off once in a while. This could be affecting how the threads in your application are distributed across the nodes in your NUMA machine.

Each thread is assigned a home locality group upon creation and the kernel will tend to run it on CPUs in its home lgroup and allocate its memory there to minimize latency and maximize performance by default. There is an lgroup corresponding to each of the nodes (boards) in your NUMA machine. The assignment of threads to lgroups is based on lgroup load averages, so other threads may cause the lgroup load average to go up or down and thus affect how threads are placed among lgroups.

You can use plgrp(1), which is available on our NUMA observability web page at http://opensolaris.org/os/community/performance/numa/observability/, to see where your application processes/threads are homed. Then we can see whether they are distributed very well. You can also use plgrp(1) to change the home lgroup of a thread, but you should be careful because there can be side effects, as explained in the example referred to below.
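
For example, checking and changing thread placement might look something like this (just a sketch; <pid> is a placeholder for your process ID, and the plgrp(1) man page on that web page has the full syntax):

    # Show the home lgroup of each thread (LWP) in the process
    plgrp <pid>

    # Rehome LWP 2 of the process to lgroup 3 (use with care; see the example below)
    plgrp -H 3 <pid>/2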

There are man pages, source, and binaries for our tools on the web page. I wrote up a good example of how to use the tools to understand, observe, and affect thread and memory placement among lgroups on a NUMA machine and posted it on the web page in http://opensolaris.org/os/community/performance/example.txt.

You can also try using the lgrp_expand_proc_thresh tunable that Eric suggested last week.
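
If you want to poke at that tunable before Eric posts the details, one rough sketch (my assumption about the procedure, not official guidance) is to read the current value with mdb and, if you decide to change it, set it in /etc/system and reboot:

    # Read the current value from the running kernel
    echo lgrp_expand_proc_thresh/D | mdb -k

    # To change it persistently, add a line like this to /etc/system and reboot
    set lgrp_expand_proc_thresh=<new value>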

Are the migrations that you are seeing when not running a psrset causing a performance problem for your application?



Jonathan


David McDaniel (damcdani) wrote:

When using psrsets, the migrations and involuntary context switches go essentially to zero. As far as "other stuff", I'm not quite sure what you mean, but this application runs on a dedicated server, so there is nothing of a casual nature; however, there is a lot of what I'll glom into the category of "support" tasks, i.e. ntp daemons, nscd flushing caches, fsflush running around backing up pages, etc. Was that what you meant?


-----Original Message-----
From: jonathan chew [mailto:[EMAIL PROTECTED]
Sent: Thursday, September 01, 2005 12:45 PM
To: David McDaniel (damcdani)
Cc: Eric C. Saxe; perf-discuss@opensolaris.org
Subject: Re: [perf-discuss] Re: Puzzling scheduler behavior

David McDaniel (damcdani) wrote:

Thanks, Jonathan, for the good insights. I'll be digging into the references you mentioned. Yes, at the end of the day I'm sure binding to processor sets is part of the plan; having already done so in a rather rote way, I can demonstrate a very dramatic reduction in apparent cpu utilization, on the order of 25-30%. But before I commit engineers to casting something in stone, I want to make sure I understand the defaults and the side effects of doing so, since it potentially results in defeating other improvements that Sun has done or will be doing.


Sure. No problem. The overview and man pages for our tools are pretty short. The tools are very easy to use and kind of fun to play with. I'm going to try to post a good example of how to use them later today.

I think that using a psrset is an interesting experiment to see whether interference is a big factor in all the migrations. It would be nice not to have to do that by default though.

It sounds like you already tried this experiment though and noticed a big difference. Did the migrations drop dramatically? What else is running on the system when you don't use a psrset?


Jonathan

-----Original Message-----
From: jonathan chew [mailto:[EMAIL PROTECTED]
Sent: Thursday, September 01, 2005 11:50 AM
To: David McDaniel (damcdani)
Cc: Eric C. Saxe; perf-discuss@opensolaris.org
Subject: Re: [perf-discuss] Re: Puzzling scheduler behavior

Dave,

It sounds like you have an interesting application. You might want to create a processor set, leave some CPUs outside the psrset for other threads to run on, and run your application in the processor set to minimize interference from other threads. As long as there are enough CPUs for your application in the psrset, you should see the number of migrations go down because there won't be any interference from other threads.
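
Concretely, the experiment might look something like this (a sketch with made-up CPU IDs and placeholders; psrset(1M) has the details):

    # Create a processor set from CPUs 4-11, leaving CPUs 0-3 for everything else
    # (psrset prints the new set's ID)
    psrset -c 4 5 6 7 8 9 10 11

    # Bind your application's process to the new set
    psrset -b <setid> <pid>

    # Tear the set down when the experiment is over
    psrset -d <setid>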

To get a better understanding of the Solaris performance optimizations done for NUMA, you might want to check out the overview of Memory Placement Optimization (MPO) at:

http://opensolaris.org/os/community/performance/mpo_overview.pdf

The stickiness that you observed is because of MPO. Binding to a processor set containing one CPU set the home lgroup of the thread to the lgroup containing that CPU, and destroying the psrset just left the thread homed there.

Your shared memory is probably spread across the system already because the default MPO memory allocation policy for shared memory is to allocate the memory from random lgroups across the system.
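
If you want to verify that or experiment with different placement, the pmadvise(1) prototype on the same tools page can apply madvise(3C) advice such as MADV_ACCESS_MANY (many threads will touch this memory, so spread it out) to parts of a running process's address space. Very roughly, and treating the exact option syntax as an assumption to be checked against its man page:

    # Advise that the segment at <address> will be accessed by many threads,
    # so its memory should be spread across lgroups
    pmadvise -o <address>=access_many <pid>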

We have some prototype observability tools which allow you to examine the lgroup hierarchy and its contents and to observe and/or control how the threads and memory are placed among lgroups (see http://opensolaris.org/os/community/performance/numa/observability/). The source, binaries, and man pages are there.
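
For instance, lgrpinfo(1) from that page prints the lgroup hierarchy along with the CPU and memory resources each lgroup contains, which is a quick way to see what your boards look like to the kernel (a sketch; see its man page for the options):

    # Print the lgroup hierarchy and the CPU/memory resources in each lgroup
    lgrpinfo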



Jonathan


David McDaniel (damcdani) wrote:

Very, very enlightening, Eric. It's really terrific to have this kind of channel for dialog.

The "return to home base" behavior you describe is clearly consistent with what I see and makes perfect sense.

Let me follow up with a question. In this application, processes have not only their "own" memory, i.e. heap, stack, program text and data, etc., but they also share a moderately large (~2-5GB today) amount of memory in the form of mmap'd files. From Sherry Moore's previous posts, I'm assuming that at startup time that would actually all be allocated on one board. Since I'm contemplating moving processes onto psrsets off that board, would it be plausible to assume that I might get slightly better net throughput if I could somehow spread that across all the boards? I know it's speculation of the highest order, so maybe my real question is whether that's even worth testing.

In any case, I'd love to turn the knob you mention, and I'll look on the performance community page and see what kind of trouble I can get into. If there are any particular items you think I should check out, guidance is welcome.

Regards
-d



-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Eric C. Saxe
Sent: Thursday, September 01, 2005 1:48 AM
To: perf-discuss@opensolaris.org
Subject: [perf-discuss] Re: Puzzling scheduler behavior

Hi David,

Since your v1280 system has NUMA characteristics, the bias that you see for one of the boards may be a result of the kernel trying to run your application's threads "close" to where they have allocated their memory. We also generally try to keep threads in the same process together, since they generally tend to work on the same data. This might explain why one of the boards is so much busier than the others.

So yes, the interesting piece of this seems to be the higher than expected run queue wait time (latency) as seen via prstat -Lm. Even with the thread-to-board/memory affinity I mentioned above, it generally shouldn't be the case that threads are willing to hang out on a run queue waiting for a CPU in their "home" when that thread *could* actually run immediately on a "remote" (off-board) CPU. Better to run remote than not at all, or at least so the saying goes :)

In the case where a thread is dispatched remotely because all home CPUs are busy, the thread will try to migrate back home the next time it comes through the dispatcher and finds it can run immediately at home (either because there's an idle CPU, or because one of the running threads is lower priority than us, and we can preempt it). This migrating around means that the thread will tend to spend more time waiting on run queues, since it has to either wait for the idle() thread to switch off, or for the lower priority thread it's able to preempt to surrender the CPU. Either way, the thread shouldn't have to wait long to get the CPU, but it will have to wait a non-zero amount of time.

What does the prstat -Lm output look like exactly? Is it a lot of wait time, or just more than you would expect?
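
For reference, something like the following is what I have in mind (a sketch; <pid> is a placeholder), where the LAT column is the percentage of time each thread spent waiting on a run queue:

    # Per-thread microstate accounting for the process, sampled every 5 seconds
    prstat -mL -p <pid> 5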

By the way, just to be clear, when I say "board" what I should be saying is lgroup (or locality group). This is the Solaris abstraction for a set of CPU and memory resources that are close to one another. On your system, it turns out that the kernel creates an lgroup for each board, and each thread is given an affinity for one of the lgroups, such that it will try to run on the CPUs in (and allocate memory from) that group of resources.

One thing to look at here is whether or not the kernel could be "overloading" a given lgroup. This would result in threads tending to be less successful in getting CPU time (and/or memory) in their home. At least for CPU time, you can see this by looking at the number of migrations and where they are taking place. If the thread isn't having much luck running at home, this means that it (and others sharing its home) will tend to "ping-pong" between CPUs in and out of the home lgroup (we refer to this as the "king of the hill" pathology). In your mpstat output, I see many migrations on one of the boards, and a good many on the other boards as well, so that might well be happening here.

To get some additional observability into this issue, you might want to take a look at some of the lgroup observability/control tools we posted (available from the performance community page). They allow you to do things like query/set your application's lgroup affinity, find out about the lgroups in the system and what resources they contain, etc. Using them you might be able to confirm some of my theory above. We would also *very* much like any feedback you (or anyone else) would be willing to provide on the tools.

In the short term, there's a tunable I can suggest you take a look at that deals with how hard the kernel tries to keep threads of the same process together in the same lgroup. Tuning this should result in your workload being spread out more effectively than it currently seems to be. I'll post a follow-up message tomorrow morning with these details, if you'd like to try this.

In the medium-short term, we really need to implement a mechanism to dynamically change a thread's lgroup affinity when its home becomes overloaded. We presently don't have this, as the mechanism that determines a thread's home lgroup (and does the lgroup load balancing) is static in nature (done at thread creation time). (Implemented in usr/src/uts/common/os/lgrp.c:lgrp_choose() if you'd like to take a look at the source.) In terms of our NUMA/MPO projects, this one is at the top of the ol' TODO list.