David,
I suspect your doors arrangement is a losing proposition. A doors call is
cheap relative to many IPC methods, but expensive relative to the TLB miss
you are trying to avoid: a doors call costs a handful of microseconds or more,
while a TLB miss takes approximately 30 cycles for a TSB hit when the TTE is in
cache, and a few hundred cycles if the TTE must be read from memory - still
less than a microsecond.
Your best approach is to explicitly map the arrays with large pages by
calling memcntl(MC_HAT_ADVISE) after mmap. This support was recently added
as part of "6219317 Large page support is needed for mapping executables,
libraries and files". However, the out-of-the-box (OOB) policies do not
automatically assign large pages to mmap'd data files, so the explicit memcntl
is required. Let the app run a while, and use "pmap -s" to verify that large
pages were mapped.
Beware that each mmap'd file needs to be larger than the large page size,
and ideally should have a starting address that is large-page aligned.
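In code, the call looks roughly like this (the names here are made up, and the
4MB page size is just an illustration - use getpagesizes(3C) to see what the
platform supports, and check return values in real code):

  #include <sys/types.h>
  #include <sys/mman.h>
  #include <fcntl.h>
  #include <unistd.h>

  /* Map an array file and advise the HAT to back it with large pages. */
  static void *
  map_array(const char *path, size_t len)
  {
      int fd = open(path, O_RDWR);
      void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
          MAP_SHARED, fd, 0);
      struct memcntl_mha mha;

      mha.mha_cmd = MHA_MAPSIZE_VA;         /* apply to this mapping's range */
      mha.mha_flags = 0;
      mha.mha_pagesize = 4 * 1024 * 1024;   /* e.g. 4MB pages on USIII */
      (void) memcntl((caddr_t)addr, len, MC_HAT_ADVISE, (caddr_t)&mha, 0, 0);
      (void) close(fd);
      return (addr);
  }

MHA_MAPSIZE_VA sets the preferred page size for that address range; the kernel
can still only use large pages where the alignment and size allow, which is why
the file size and starting address matter.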
BTW, DISM will not help. It causes the kernel data structures storing
translations to be shared amongst processes, but it does not enable processes
to share the same translation in the TLB.
- Steve Sistare
David McDaniel wrote on 06/29/06 15:59:
Wow, thanks for the fast feedback, guys. Sorry about the insufficient detail...
I've worked with Sun stuff for so long that I keep forgetting about the AMD stuff.
The hardware this stuff runs on is mostly USIII and IIIi, basically Netra/Sunfire 1280
and 440 boxes. The app is currently 32 bits. We're playing with a couple of AMD boxes but
won't actually get serious until the new "big" Galaxy boxes materialize. It
currently runs on S10-03/05, but we plan to upgrade to 06/06 pretty quickly because the
tests I've done already show a little improvement due to the LPOOB work (more on that
later).
In any case, the "app" consists of several sets of cooperating primary and support
processes, each of which is multithreaded to a greater or lesser degree. In most cases the
threading follows the simpler, easier-to-deal-with parallel model; in a few cases it is more
of a pipeline model, with some degree of parallelism in one or more of the pipeline stages.
The single hottest thread soaks up about 10% of the total CPU cycles when the app is running
at its engineered limit, so today we scale nicely to above 8 cores. In testing (not in
production) I observe that the simple step of taking interrupts off one of the CPUs, putting
it into a psrset, and binding the process with the single hot thread to that psrset reduces
its reported CPU consumption by ~30%... a pretty dramatic reduction (a rough C equivalent of
that experiment is sketched after this paragraph). And this thread can be easily
parallelized, so we should be able to use something like a maxed-out 2900/1290 pretty
handily. As to Niagara, I'm still in "show me" mode on that one, but we've ordered a
couple to test with.
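For the record, that experiment amounts to roughly the following (the function
name and CPU id are made up, it needs the appropriate privileges, and the same
thing can be done from the shell with psradm and psrset):

  #include <sys/types.h>
  #include <sys/processor.h>
  #include <sys/procset.h>
  #include <sys/pset.h>

  /*
   * Fence off one CPU: stop it taking interrupts, give it its own
   * processor set, and bind the calling process to that set.
   */
  static int
  fence_cpu(processorid_t cpu)
  {
      psetid_t pset;

      if (p_online(cpu, P_NOINTR) == -1)          /* no interrupts here */
          return (-1);
      if (pset_create(&pset) == -1)
          return (-1);
      if (pset_assign(pset, cpu, NULL) == -1)     /* move CPU into the set */
          return (-1);
      return (pset_bind(pset, P_PID, P_MYID, NULL));  /* bind this process */
  }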
As to the dataset, the easiest way to describe it is as a couple hundred C arrays, each
persisted within a memory-mapped file (see the sketch below). The important processes mmap
these files, at different virtual addresses of course :-( so there is a high degree of
"aliasing". And they are coupled through that data with shared mutexes. The system is purely
reactive to the outside world, so the access pattern is pretty much random. So my thought
was to consider creating some per-array door servers; that way, instead of a dozen processes
each accessing the same physical page through their own non-shared TLB entries, only a
single door server would consume that slot. My understanding is that DISM would help, but
for several reasons having this data visible as files is probably immutable.
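To make that concrete, each array file looks something like the sketch below;
the names and types are made up purely to show the shape of it (a header with a
process-shared mutex, followed by the array itself):

  #include <sys/mman.h>
  #include <fcntl.h>
  #include <pthread.h>
  #include <unistd.h>

  /* Illustrative layout only -- the real record types are app-specific. */
  typedef struct array_file {
      pthread_mutex_t lock;       /* process-shared, lives in the file */
      size_t          nelems;
      int             elems[1];   /* really nelems entries */
  } array_file_t;

  /* Each interested process maps the same file, at whatever VA it gets. */
  static array_file_t *
  attach_array(const char *path, size_t filesize)
  {
      int fd = open(path, O_RDWR);
      array_file_t *af = mmap(NULL, filesize, PROT_READ | PROT_WRITE,
          MAP_SHARED, fd, 0);

      (void) close(fd);           /* error checks omitted for brevity */
      return (af);
  }

  /* Done once, by whichever process creates the file. */
  static void
  init_lock(array_file_t *af)
  {
      pthread_mutexattr_t attr;

      pthread_mutexattr_init(&attr);
      pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
      pthread_mutex_init(&af->lock, &attr);
      pthread_mutexattr_destroy(&attr);
  }

Each process ends up with its own TLB entries for the same physical pages,
which is the duplication I was hoping the door servers would collapse.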
Thanks again for the comments.
-d
This message posted from opensolaris.org
_______________________________________________
perf-discuss mailing list
perf-discuss@opensolaris.org