Glen Gunselman wrote:
Bob,
 
Thanks for the info.  Sorry about taking so long to reply.
 
I think I answered most (all?) of your questions in my reply to Jim Mauro.
 
I do have a question for you.  You said, "Note that your prstat data excerpt is not looking at per-thread statistics, but only per-process."  The prstat output is from a 'prstat -amL'.  The top eight lines have only two PIDs - 5617 and 6084.
 
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
  5617 cognos8   53 0.5 0.0 0.0 0.0 2.1 0.0  45  1K 200  3K   0 BIBusTKServe/18
  5617 cognos8   51 0.5 0.0 0.0 0.0 3.6 0.0  45  1K 274  3K   0 BIBusTKServe/17
  6084 cognos8   43 0.6 0.0 0.0 0.0 1.9 0.0  54  2K 222  5K   0 BIBusTKServe/20
  6084 cognos8   43 0.6 0.0 0.0 0.0 1.1 0.0  55  1K 244  4K   0 BIBusTKServe/15
  6084 cognos8   39 0.6 0.0 0.0 0.0 1.8 0.0  59  2K 212  4K   0 BIBusTKServe/22
  5617 cognos8   39 0.4 0.0 0.0 0.0 1.4 0.0  59  1K 223  3K   0 BIBusTKServe/22
  6084 cognos8   35 0.4 0.0 0.0 0.0 1.1 0.0  64  1K 262  2K   0 BIBusTKServe/19
  5617 cognos8   34 0.4 0.0 0.0 0.0 2.2 0.0  64  1K 465  2K   0 BIBusTKServe/23

What does "per-thread" output look like?
It looks like you have eight CPU-bound threads that would probably run roughly twice as fast with twice as many hardware threads.

As food for thought, your ICX rate is probably largely due to these threads competing with each other.  Putting these under FX scheduling (for example, FX 40) might improve your results measurably (but probably not dramatically) by making them round-robin rather than preempt each other.
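
If you want to experiment with that, here is a rough sketch (using the two PIDs from your excerpt; the priority and limit values are just examples, and I'd check the result with 'priocntl -d' before drawing conclusions):

   # move both BIBusTKServe processes (and all their LWPs) into the FX class at priority 40
   priocntl -s -c FX -m 40 -p 40 -i pid 5617 6084

   # verify the scheduling class and parameters afterward
   priocntl -d -i pid 5617 6084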

I'm ignorant of this particular app, so there may be many other opportunities for improvement.  I'd be curious to know what those syscalls are, for example.  Maybe there are opportunities for leveraging MPSS?  One can never tell from pure CPU-second metrics how much is business logic versus polling, spin-wait, or other overhead cycles.  I'm always curious about compile/link options used for CPU-bound binaries and how that can affect the *quality* of those CPU-seconds being used.  Perhaps this application is throttled by malloc() operations in such a way that an alternate library like libmtmalloc might be very helpful?  My point here is that CPU-seconds are only a first-cut metric, and that satisfying apparent CPU demand is not necessarily the same as optimizing throughput.
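
A few hedged starting points along those lines (the PID is just one from your excerpt, and the "<BIBusTKServe startup command>" bits are placeholders for however the app is actually launched):

   # count system calls by name for one of the busy processes; Ctrl-C to stop
   dtrace -n 'syscall:::entry /pid == 5617/ { @[probefunc] = count(); }'

   # see which page sizes the hardware supports, then try large heap pages via MPSS
   pagesize -a
   LD_PRELOAD=mpss.so.1 MPSSHEAP=4M <BIBusTKServe startup command>

   # try the multi-threaded malloc library without relinking
   LD_PRELOAD=libmtmalloc.so.1 <BIBusTKServe startup command>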

So - upgrading your V490 to eight cores would help with respect to BIBusTKServe performance, but it's not fully clear what your business metrics are or whether a 2x speedup in BIBusTKServe would cause great joy.  For all I know, your users' experiences might be dominated by something in the Oracle part of your workload.  I'd be happy to hear from you off-alias about how this goes forward.

Best regards,
-- Bob
 
 
Here are the remaining lines from the prstat -amL.
 

  NLWP USERNAME  SIZE   RSS MEMORY      TIME  CPU
   282 cognos8   171G  134G    18%   3:14:22  74
  1316 oracle    721G  592G    82%  34:49:50  13
    97 root      356M  198M   0.0%   1:24:18 0.8
     6 sysnav   8576K 5448K   0.0%   0:00:03 0.1
     7 gunselmg   15M   12M   0.0%   0:00:00 0.0
     1 smmsp    4432K 1240K   0.0%   0:00:00 0.0
     1 daemon   2512K 1080K   0.0%   0:00:00 0.0
 

Total: 278 processes, 1710 lwps, load averages: 20.72, 13.21, 6.74


Thanks again for the help,
 
 
Glen Gunselman
Systems Software Specialist
TCS
Emporia State University
 
>>> "Bob Sneed, SMI PAE" <[EMAIL PROTECTED]> 12/5/2006 3:12 AM >>>
Glen Gunselman wrote:
 
 

Glen:

Some comments offered inline below ...
 
We have an overloaded server (V490 with one CPU board) - CPU bound.  Here is a sample prstat -mL taken during a time of high load (uptime: Total: 278 processes, 1710 lwps, load averages: 20.72, 13.21, 6.74):
 
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
  5617 cognos8   53 0.5 0.0 0.0 0.0 2.1 0.0  45  1K 200  3K   0 BIBusTKServe/18
  5617 cognos8   51 0.5 0.0 0.0 0.0 3.6 0.0  45  1K 274  3K   0 BIBusTKServe/17
  6084 cognos8   43 0.6 0.0 0.0 0.0 1.9 0.0  54  2K 222  5K   0 BIBusTKServe/20
  6084 cognos8   43 0.6 0.0 0.0 0.0 1.1 0.0  55  1K 244  4K   0 BIBusTKServe/15
  6084 cognos8   39 0.6 0.0 0.0 0.0 1.8 0.0  59  2K 212  4K   0 BIBusTKServe/22
  5617 cognos8   39 0.4 0.0 0.0 0.0 1.4 0.0  59  1K 223  3K   0 BIBusTKServe/22
  6084 cognos8   35 0.4 0.0 0.0 0.0 1.1 0.0  64  1K 262  2K   0 BIBusTKServe/19
  5617 cognos8   34 0.4 0.0 0.0 0.0 2.2 0.0  64  1K 465  2K   0 BIBusTKServe/23
 29514 oracle    28 1.2 0.1 0.0 0.0 0.0 8.6  62 217 990 899   0 oracle/1
 29948 root     2.4 0.4 0.0 0.0 0.0 0.0  77  20 109 561 961   0 cfagent/1
  5610 oracle   1.5 0.5 0.0 0.0 0.0 0.0  98 0.1   3   8 871   0 oracle/1
   942 oracle   1.2 0.6 0.0 0.0 0.0 0.0  98 0.0  15  50 506   0 oracle/1
  9378 root     0.4 1.1 0.1 0.0 0.0 0.0  98 0.9  40   9 994   0 prstat/1
  1475 oracle   1.1 0.2 0.4 0.0 0.0 0.0  98 0.2 111  55 945   0 emagent/3047304
 11646 oracle   0.8 0.0 0.0 0.0 0.0 0.0  91 8.7   1  45  80   0 java/56
 11479 oracle   0.6 0.1 0.0 0.0 0.0 0.0  98 1.0   4   4 615   0 oracle/1
 10520 oracle   0.6 0.0 0.0 0.0 0.0 0.0  98 1.4   5   0  45   5 nmccollector/1
   835 sysnav   0.1 0.2 0.1 0.0 0.0 0.0  57  42  19 240 471   0 bb-local.sh/1
  7375 oracle   0.2 0.0 0.0 0.0 0.0 0.0 100 0.0   9   3 192   0 oracle/1
 11712 oracle   0.2 0.0 0.0 0.0 0.0 0.0 100 0.0   8   2 178   0 oracle/1
 11815 oracle   0.2 0.0 0.0 0.0 0.0 100 0.0 0.2   1   3  18   0 java/37
   576 root     0.1 0.1 0.0 0.0 0.0 0.0 100 0.1 331   1  1K   0 nscd/11
 17855 oracle   0.1 0.0 0.0 0.0 0.0 100 0.0 0.1   5   0   5   0 java/2
 11805 oracle   0.1 0.1 0.0 0.0 0.0 0.0  96 3.8   4   7  62   2 perl/1
 11649 oracle   0.1 0.0 0.0 0.0 0.0 0.0 100 0.0   9   0 118   0 oracle/1
 11780 oracle   0.0 0.1 0.0 0.0 0.0 0.0  92 8.3  52   0 354  47 webcached/1
     1 root     0.0 0.1 0.0 0.0 0.0 0.0 100 0.2  13   0 361  14 init/1
  4987 cognos8  0.0 0.1 0.0 0.0 0.0 0.0  57  43 338   4 232   0 java/5
  4972 cognos8  0.1 0.0 0.0 0.0 0.0 0.0  91 8.5  68   0  77   0 cogbootstrap/3
 17855 oracle   0.0 0.1 0.0 0.0 0.0 0.0  51  49 312   2 209   0 java/5
 
From looking at the LAT column, how do I compute the CPU resources needed to reduce LAT to more "normal" levels?
First, I should say that tuning LAT should not be a performance tuning objective, and that there is no such thing as a generic "normal" value for it.  Your goals should be measured in workload performance terms, and absent that - tuning any other observed metric could be a waste of time and money.  Could you give some insight into the business problem you are trying to solve in some quantitative terms, and make some assessment of where you are relative to that goal?

Note that your prstat data excerpt is not looking at per-thread statistics, but only per-process.  Therefore, we really do not know how many compute-bound threads you actually have.  Since this process-level LAT data is in terms of "percent of elapsed time", it is not really useful for estimating latent demand for CPU.  We would gain more insight from 'prstat -mL' data.  In turn, that data is best interpreted in the light of matching mpstat and ps data, and we will often capture other data to complete the picture.  Adding CPUs will not benefit much past the point where each CPU-hog thread essentially has a dedicated core and the remaining miscellaneous demand is well-served.  On the other hand, your business needs might well be met using fewer worker threads to begin with - and fewer threads might exhibit less contention.  One would need to know more about the function and design of BIBusTKServe.
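
As a crude first cut (and only that - it says nothing about the *quality* of those CPU-seconds), one could sum USR + SYS + LAT across the per-thread lines of a 'prstat -mL' sample to get a feel for how many CPUs' worth of work is being asked for.  A sketch, assuming the default column layout ('-n 100' just widens the number of lines reported):

   # approximate CPU appetite, in CPUs, over one 10-second sample
   prstat -mL -n 100 10 1 | awk 'NR > 1 && $3 + 0 > 0 { want += ($3 + $4 + $10) / 100 }
        END { printf "approx CPUs wanted: %.1f\n", want }'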

There's a lot here that a performance analyst would like to know in a case like this, such as whether or not your Oracle is configured ideally, where it fits in the overall workload, and what function that high-CPU oracle process is performing.  I'm always curious in a general way to know how much of the aggregate CPU usage is going to spin-locks and other synchronization activities, both at the OS and application levels.
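
For a rough read on the application side, something like this might be a starting point (again just a sketch, reusing a PID from your excerpt):

   # user-level lock contention and hold times for one busy process, 10-second sample
   plockstat -A -e 10 -p 5617

   # kernel lock contention while an arbitrary command runs (here, a 10-second sleep)
   lockstat sleep 10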

I'll echo Jim Mauro's sentiment to follow up with more answers.

Best regards,
-- Bob

 
Page 24 of Solaris Performance and Tools includes the following statement referring to LAT:
 
"This is an extremely useful metric--we can use it to estimate the potential speedup for a thread if more CPU resources are added ..."
 
I have been unable to find any information on how to turn LAT into CPU resources.  I'm reluctant to simply use (USR + SYS + LAT) / 100 (USR + SYS is 370.5 and LAT is 507 for the top 9 processes).  This seems way too simple.
 
Thanks
gleng
 
Glen Gunselman
Systems Software Specialist
TCS
Emporia State University




_______________________________________________
perf-discuss mailing list
perf-discuss@opensolaris.org
