Wow, thanks for the fast feedback, guys. Sorry about the insufficient detail... I've worked with Sun stuff for so long I keep forgetting about the AMD stuff. The hardware this runs on is mostly USIII and IIIi, basically Netra/Sunfire 1280 and 440 boxes. The app is currently 32-bit. We're playing with a couple of AMD boxes but won't actually get serious until the new "big" Galaxy boxes materialize. It currently runs on S10-03/05, but we plan to upgrade to 06/06 pretty quickly because the tests I've done already show a little improvement due to the LPOOB work (more on that later).
In any case, the "app" consists of several sets of cooperating primary and support processes, each of which is multithreaded to a greater or lesser degree. In most cases the thread model is the easier-to-deal-with parallel model; in a few cases the threading is more of a pipeline model, with some degree of parallelism in one or more pipeline stages. The single hottest thread soaks up about 10% of the total cpu cycles when the app is running at its engineered limit, so today we scale nicely to above 8 cores.
In testing (not in production) I observe that the simple step of taking interrupts off one of the cpus, putting it into a psrset, and binding the process with the single hot thread to that psrset reduces its reported cpu consumption by ~30%... a pretty dramatic reduction. And this thread can be easily parallelized, so we should be able to use something like a maxed out 2900/1290 pretty handily. As to a Niagara, I'm still in "show me" mode on that one, but we've ordered a couple to test with.
As to the dataset, the easiest way to describe it is as a couple hundred C arrays, each persisted within a memory mapped file. The important processes mmap these files, at different virtual addresses of course :-( so there is a high degree of "aliasing". And they are coupled through that data with shared mutexes. The system is purely reactive to the outside world, so the access pattern is pretty much random.
So, my thought was to consider creating some per-array door servers. In this way, instead of a dozen processes accessing the same physical page through non-shared TLB entries, only a single door server would consume that slot. My understanding is that DISM would help, but for several reasons the requirement that this data be visible as files is probably immutable.
Thanks again for the comments.
-d
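For anyone who wants a concrete picture of the dataset layout described above, here is a minimal sketch in portable POSIX C, not the actual application code. The type and function names (shared_array_t, shared_array_map, shared_array_add, shared_array_unmap), the int64_t element type, and the header-plus-array file layout are all assumptions made for illustration. Mapping the same file twice yields two different virtual addresses backed by the same physical pages, which is the "aliasing" in question, and the mutex stored inside the mapping is marked process-shared so every mapper can lock it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical on-file layout: a process-shared mutex, an element
 * count, then the array itself. The real layout is not known. */
typedef struct {
    pthread_mutex_t lock;
    size_t nelems;
    int64_t data[];            /* flexible array member */
} shared_array_t;

static size_t shared_array_bytes(size_t nelems)
{
    return sizeof(shared_array_t) + nelems * sizeof(int64_t);
}

/* Map the file at whatever address the kernel picks. When create is
 * nonzero, also size the file and initialize the shared mutex.
 * Returns NULL on failure. */
shared_array_t *shared_array_map(const char *path, size_t nelems, int create)
{
    assert(nelems > 0);
    int fd = open(path, create ? (O_RDWR | O_CREAT) : O_RDWR, 0600);
    if (fd < 0)
        return NULL;

    size_t len = shared_array_bytes(nelems);
    if (create && ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                 /* the mapping survives the close */
    if (p == MAP_FAILED)
        return NULL;

    shared_array_t *a = p;
    if (create) {
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        /* Allow the mutex to be locked through any mapping of the file. */
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        pthread_mutex_init(&a->lock, &attr);
        pthread_mutexattr_destroy(&attr);
        a->nelems = nelems;
    }
    return a;
}

/* Add delta to one element under the shared mutex; -1 if out of range. */
int shared_array_add(shared_array_t *a, size_t idx, int64_t delta)
{
    if (idx >= a->nelems)
        return -1;
    pthread_mutex_lock(&a->lock);
    a->data[idx] += delta;
    pthread_mutex_unlock(&a->lock);
    return 0;
}

void shared_array_unmap(shared_array_t *a)
{
    munmap(a, shared_array_bytes(a->nelems));
}
```

Each process that maps the file this way needs its own translation entries for the same physical pages, which is the cost the per-array door server idea tries to avoid by leaving a single process as the only mapper. The sketch leaves out error handling on the mutex calls and any recovery for a process that dies while holding the lock.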