Wow, thanks for the fast feedback, guys. Sorry about the insufficient detail... 
I've worked with Sun stuff for so long I keep forgetting about the AMD stuff.
  The hardware this stuff runs on is mostly USIII and IIIi, basically 
Netra/Sunfire 1280 and 440 boxes. The app is currently 32 bits. We're playing 
with a couple of AMD boxes but won't actually get serious until the new "big" 
Galaxy box materializes. It currently runs on S10-03/05, but we plan to upgrade 
to 06/06 pretty quickly because the tests I've done already show a little 
improvement due to the LPOOB work (more on that later).

  In any case, the "app" consists of several sets of cooperating primary and 
support processes, each of which is multithreaded to a greater or lesser 
degree. In most cases the thread model is the easier-to-deal-with parallel 
model; in a few cases the threading is more of a pipeline model, with some 
degree of parallelism in one or more pipeline stages. The single hottest 
thread soaks up about 10% of the total cpu cycles when the app is running at 
its engineered limit, so today we scale nicely to above 8 cores. In testing 
(not in production) I observe that the simple step of taking interrupts off one 
of the cpus, putting it into a psrset and binding the process with the single 
hot thread to that psrset reduces its reported cpu consumption by ~30%... a 
pretty dramatic reduction. And this thread can be easily parallelized so we 
should be able to use something like a maxed out 2900/1290 pretty handily. As to a 
niagara, I'm still in "show me" mode on that one, but we've ordered a couple to 
test with.
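
  For anyone who wants to repeat the psrset experiment from code rather than 
from the command line (psradm -i, psrset -c, psrset -b), here is a rough C 
sketch built on the Solaris pset_create/pset_assign/p_online/pset_bind calls. 
The helper name isolate_and_bind is made up for illustration, it needs 
sufficient privileges to run, and on anything other than Solaris it just 
fails with ENOSYS. It is not necessarily how the test above was done.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>
#ifdef __sun
#include <sys/processor.h>
#include <sys/procset.h>
#include <sys/pset.h>
#endif

/*
 * Hypothetical helper (not from the original post): put `cpu` into a new
 * processor set, stop it from servicing interrupts, and bind every thread
 * of process `pid` to that set.
 * Returns 0 on success, -1 with errno set on failure.
 */
int isolate_and_bind(int cpu, pid_t pid)
{
#ifdef __sun
    psetid_t pset;

    /* Create an empty processor set. */
    if (pset_create(&pset) != 0)
        return -1;

    /* Move the cpu into the set, take interrupts off it, bind the process. */
    if (pset_assign(pset, (processorid_t)cpu, NULL) != 0 ||
        p_online((processorid_t)cpu, P_NOINTR) == -1 ||
        pset_bind(pset, P_PID, (id_t)pid, NULL) != 0) {
        int saved = errno;
        (void) pset_destroy(pset);   /* best-effort cleanup */
        errno = saved;
        return -1;
    }
    return 0;
#else
    /* Processor sets as used here are Solaris-specific. */
    (void)cpu;
    (void)pid;
    errno = ENOSYS;
    return -1;
#endif
}
```

  Calling isolate_and_bind(cpu, getpid()) early in the process that owns the 
hot thread would roughly mirror the manual steps; the cpu id to use is site 
specific.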

  As to the dataset, the easiest way to describe it is as a couple hundred C 
arrays, each persisted within a memory mapped file. The important processes 
mmap these files, at different virtual addresses of course :-( so there is a 
high degree of "aliasing". And they are coupled through that data with shared 
mutexes. The system is purely reactive to the outside world so the access 
pattern is pretty much random. So, my thought was to consider creating some 
per-array door servers. That way, instead of a dozen processes accessing the 
same physical page through non-shared TLBs, only a single door server would 
consume that slot. My understanding is that DISM would help, but for several 
reasons the requirement that this data be visible as files probably can't change.
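
  To make the current arrangement concrete, here is a minimal C sketch of one 
such array: a file-backed MAP_SHARED mapping guarded by a process-shared 
mutex. The struct layout, the array size, the function names, and the choice 
to keep the mutex inside the mapped file are all assumptions for illustration; 
the real layout isn't described above.

```c
#include <fcntl.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical layout: a process-shared mutex followed by the array. */
typedef struct {
    pthread_mutex_t lock;
    long            values[1024];
} shared_array_t;

/*
 * Map the file at `path`. If `create` is nonzero the file is created,
 * sized, and its mutex initialized as process-shared.
 * Returns NULL on failure.
 */
shared_array_t *shared_array_open(const char *path, int create)
{
    int fd = open(path, create ? (O_RDWR | O_CREAT) : O_RDWR, 0600);
    if (fd < 0)
        return NULL;
    if (create && ftruncate(fd, (off_t)sizeof(shared_array_t)) != 0) {
        close(fd);
        return NULL;
    }

    /* addr == NULL: the kernel picks the virtual address, so each mapping
     * of the same pages can land somewhere different (the "aliasing"). */
    void *p = mmap(NULL, sizeof(shared_array_t), PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    close(fd);                       /* the mapping outlives the descriptor */
    if (p == MAP_FAILED)
        return NULL;

    shared_array_t *a = p;
    if (create) {
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        pthread_mutex_init(&a->lock, &attr);
        pthread_mutexattr_destroy(&attr);
    }
    return a;
}

/* Add `delta` to one element under the shared mutex; returns the new value. */
long shared_array_add(shared_array_t *a, size_t idx, long delta)
{
    pthread_mutex_lock(&a->lock);
    long v = (a->values[idx] += delta);
    pthread_mutex_unlock(&a->lock);
    return v;
}
```

  Every process that calls shared_array_open on the same file gets its own 
virtual address and its own TLB entries for the same physical pages. A 
per-array door server would presumably keep a mapping like this in just one 
process and have the others call into it.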

  Thanks again for the comments.
-d
 
 
This message posted from opensolaris.org
_______________________________________________
perf-discuss mailing list
perf-discuss@opensolaris.org
