On 11/3/05, Brendan Gregg <[EMAIL PROTECTED]> wrote:
> G'Day Mike,
>
> On Wed, 2 Nov 2005, Mike Gerdts wrote:
>
> > One of the more difficult performance monitoring problems that I have
> > come across is determining the impact of multiple workloads running on
> > a server. Consider a server that has about 1000 database processes
> > that are long running - many minutes to many months - mixed with batch
> > jobs written in Bourne shell. Largely due to the batch jobs, it is
> > not uncommon for sar to report hundreds of forks and execs per second.
>
> procfs based tools miss out on short lived processes (as a separate
> process entry anyway) due to sampling. "prstat -m" can do something
> spooky with short lived processes, for example,
>
>    PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
>   6394 root      14  71 0.1 0.0 0.0 0.0 0.0  15   0 593 35K   0 /0
>    937 root     2.2 8.4 0.0 0.0 0.0 0.0 4.2  85  26  24 11K 428 bash/1
>
> PID 6394 has no name and no LWPs, a ghost process. (And a fairly busy
> one too, 35000 syscalls ... Actually, I shouldn't criticise this as it
> may well be a deliberate aggregation of short lived processes, and I've
> found it to be quite handy. Thank you ghost process!)

Could this just be one that is not yet reaped?
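For what it's worth, the sampling gap is easy to see if you look at what
a procfs tool actually reads. Below is a minimal, untested sketch of
pulling the microstate counters behind those prstat -m columns for one
process (field names as I read them in <sys/procfs.h>); anything that
forks, runs, and exits between two reads of /proc/<pid>/usage simply
never shows up:

/*
 * usage.c -- minimal, untested sketch: read the microstate counters
 * that "prstat -m" displays for one process straight from procfs.
 * Build: cc -o usage usage.c; run: ./usage <pid>
 */
#include <sys/procfs.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
        char path[64];
        prusage_t pru;
        int fd;

        if (argc != 2) {
                (void) fprintf(stderr, "usage: %s pid\n", argv[0]);
                return (1);
        }
        (void) snprintf(path, sizeof (path), "/proc/%s/usage", argv[1]);
        if ((fd = open(path, O_RDONLY)) < 0 ||
            read(fd, &pru, sizeof (pru)) != sizeof (pru)) {
                perror(path);
                return (1);
        }
        (void) close(fd);

        /* pr_utime/pr_stime are timestruc_t (seconds + nanoseconds) */
        (void) printf("usr %ld.%09lds sys %ld.%09lds vctx %lu ictx %lu "
            "sysc %lu\n",
            pru.pr_utime.tv_sec, pru.pr_utime.tv_nsec,
            pru.pr_stime.tv_sec, pru.pr_stime.tv_nsec,
            pru.pr_vctx, pru.pr_ictx, pru.pr_sysc);
        return (0);
}

Sample that in a loop and you have most of prstat -m; you also have the
same blind spot.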
> Now, the by-child usr/sys times from procfs should give us a crack at
> solving this, they can be fetched using the -b option of,
>
>    http://www.brendangregg.com/Solaris/prusage
>
> however I suspect they undercount usr/sys time. (Now that opensolaris
> code is public I ought to go and read how they are incremented)...

I hadn't noticed the child-related fields exported through procfs yet.
(A quick sketch of reading them is further down in this message.) I
have done some digging through the source and found the following:

http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/os/exit.c#698

   343 int
   344 proc_exit(int why, int what)
   345 {
   ...
   698         p->p_utime = (clock_t)NSEC_TO_TICK(hrutime) + p->p_cutime;
   699         p->p_stime = (clock_t)NSEC_TO_TICK(hrstime) + p->p_cstime;

- As the process is exiting (wow, exiting is a lot of work!) it adds its
  children's usr and sys time to its own usr and sys time.

http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/os/exit.c#1172

  1124 void
  1125 freeproc(proc_t *p)
  1126 {
   ...
  1172         p->p_nextofkin->p_cutime += p->p_utime;
  1173         p->p_nextofkin->p_cstime += p->p_stime;

- The sums calculated above are propagated to the parent process. It is
  kind of sad that the parent process is the "next of kin".

- There may be a race condition here. I am not sure that the operations
  on lines 1172 and 1173 are protected by a lock. It seems as though
  p->p_nextofkin->p_lock should be held while incrementing these values,
  in case multiple children are being reaped at the same time.

> > The best solution that I have come up with is to write extended
> > accounting records (task) every few minutes, then to process the
> > exacct file afterwards. Writing the code to write exacct records
> > periodically and make sense of them later is far from trivial. It is
> > also impractical for multiple users (monitoring frameworks,
> > administrators, etc.) to use this approach on the same machine at the
> > same time, because the exacct records need to be written, and that is
> > presumably too expensive an operation to do very often.
>
> Wow, are you baiting me to talk about DTrace? ;-)

I keep thinking about the real world, where I run Solaris 9 and will
probably be doing so for a while yet (ISV adoption, letting others flush
out the bugs, no time to do the testing/upgrade, yadda, yadda, yadda).
If it is something that is *almost* there, it could turn into an RFE for
that old proprietary code. But, yeah, baiting a free DTrace script out
of you works too. :)

> Actually, nice idea with exacct; there are a few other things to try on
> previous Solaris releases to shed light on this problem (TNF tracing,
> BSM auditing...)

I had first implemented it on my own, then refined the mechanism a bit
after spending some time on Adrian Cockcroft's blog
(http://perfcap.blogspot.com/2005/04/writing-accounting-records-at-time.html).
The ugly part was getting timestamps for when the sample was taken into
the logs. It wasn't really that bad, just annoying that I discovered it
after collecting many days of logs. Now I timestamp intervals in the log
using the Perl equivalent of "newtask -p timestampproject /bin/true".
In any case, Adrian's advice suggested that the exacct approach may not
be such a bad way to go in the long term.

> > It seems as though it should be possible for the kernel to maintain
> > per-user, per-project, and per-zone statistics. Perhaps collecting
> > them all the time is not desirable, but it seems as though updating
> > the three sets of statistics for each context switch would be lighter
> > weight than writing accounting records and post-processing them. The
> > side effect of having this data available would be that tools like
> > prstat could report accurate data. Other tools could likely get this
> > data through kstat or a similar interface.
>
> Kstat currently provides a number of CPU related goodies. See
> /usr/include/sys/sysinfo.h for the cpu_* structs.
>
> Alan Hargreaves (from memory) posted the following,
>
>    http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6199092
>
> which suggests separating the cpu_* structs from the CPUs, so that they
> can be used to track many other categories, such as by zone.

Very interesting... I will have to open a case to see the comments
section and possibly add my contract to that bug. That could be a fix
for Solaris 9 (by project) too.
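To make the kstat angle concrete, here is a minimal, untested sketch of
pulling one CPU's cpu_stat counters through libkstat. It prints, among
other things, the sysfork/sysexec counters behind sar's forks-and-execs
numbers; these per-CPU structs are what that bug proposes decoupling
from the CPUs so they could also be kept per zone (or per project):

/*
 * cpustat0.c -- minimal, untested sketch: read CPU 0's cpu_stat
 * kstat (the cpu_* structs from <sys/sysinfo.h>) via libkstat.
 * Build: cc -o cpustat0 cpustat0.c -lkstat
 */
#include <kstat.h>
#include <stdio.h>
#include <sys/sysinfo.h>

int
main(void)
{
        kstat_ctl_t *kc;
        kstat_t *ksp;
        cpu_stat_t cs;

        if ((kc = kstat_open()) == NULL) {
                perror("kstat_open");
                return (1);
        }
        /* module "cpu_stat", instance 0 (CPU 0), any name */
        if ((ksp = kstat_lookup(kc, "cpu_stat", 0, NULL)) == NULL ||
            kstat_read(kc, ksp, &cs) == -1) {
                perror("cpu_stat:0");
                return (1);
        }
        (void) printf("usr %u sys %u idle %u wait %u\n",
            cs.cpu_sysinfo.cpu[CPU_USER], cs.cpu_sysinfo.cpu[CPU_KERNEL],
            cs.cpu_sysinfo.cpu[CPU_IDLE], cs.cpu_sysinfo.cpu[CPU_WAIT]);
        (void) printf("forks %u execs %u syscalls %u cswitches %u\n",
            cs.cpu_sysinfo.sysfork + cs.cpu_sysinfo.sysvfork,
            cs.cpu_sysinfo.sysexec, cs.cpu_sysinfo.syscall,
            cs.cpu_sysinfo.pswitch);
        (void) kstat_close(kc);
        return (0);
}

Note that these counters exist per CPU instance, not per workload, which
is the whole point of the bug.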
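And here is the by-child sketch I mentioned earlier: an untested reader
for pr_cutime/pr_cstime from /proc/<pid>/status, which as far as I can
tell is where the p_cutime/p_cstime sums above surface (field names as I
read them in <sys/procfs.h>):

/*
 * cstatus.c -- minimal, untested sketch: read a process's own and
 * by-child CPU times from /proc/<pid>/status (pstatus_t).
 * Build: cc -o cstatus cstatus.c; run: ./cstatus <pid>
 */
#include <sys/procfs.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
        char path[64];
        pstatus_t ps;
        int fd;

        if (argc != 2) {
                (void) fprintf(stderr, "usage: %s pid\n", argv[0]);
                return (1);
        }
        (void) snprintf(path, sizeof (path), "/proc/%s/status", argv[1]);
        if ((fd = open(path, O_RDONLY)) < 0 ||
            read(fd, &ps, sizeof (ps)) != sizeof (ps)) {
                perror(path);
                return (1);
        }
        (void) close(fd);

        (void) printf("self:     usr %ld.%09ld sys %ld.%09ld\n",
            ps.pr_utime.tv_sec, ps.pr_utime.tv_nsec,
            ps.pr_stime.tv_sec, ps.pr_stime.tv_nsec);
        (void) printf("children: usr %ld.%09ld sys %ld.%09ld\n",
            ps.pr_cutime.tv_sec, ps.pr_cutime.tv_nsec,
            ps.pr_cstime.tv_sec, ps.pr_cstime.tv_nsec);
        return (0);
}

Running this against a shell that drives batch jobs, and comparing the
children line against exacct records, would be one way to test the
undercounting suspicion: given the NSEC_TO_TICK conversion above, the
child times only have clock-tick resolution, and they only roll up when
a child is actually reaped.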
> A number of scripts from the DTraceToolkit already provide per zone
> statistics, since it's trivial to retrieve. Eg, zvmstat.

I'll have to take a look at those to see whether they address the
problems with short-lived processes.

> I think the bottom line is that it depends on what details you are
> interested in. procfs already has project and zone info and a swag of
> resource counters, so an ordinary procfs tool may be a solution
> (an enhancement to prstat/ps).
>
> cheers,
>
> Brendan

Good stuff... thanks for the pointers to several details that I had
missed in the past.

Mike
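P.S. On the "ordinary procfs tool" idea: psinfo already carries the
project (and, on Solaris 10, zone) IDs needed for that kind of
aggregation. Here is an untested sketch that sums CPU seconds by project
across /proc; being a sampler, it of course still misses short-lived
processes:

/*
 * projtime.c -- minimal, untested sketch: walk /proc, read each
 * psinfo, and aggregate CPU time by project ID.  pr_zoneid would
 * work the same way on Solaris 10.
 * Build: cc -o projtime projtime.c
 */
#include <sys/procfs.h>
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define MAXPROJ 1024            /* arbitrary cap for this sketch */

int
main(void)
{
        id_t proj[MAXPROJ];
        double secs[MAXPROJ] = { 0.0 };
        int nproj = 0, i;
        char path[64];
        psinfo_t pi;
        struct dirent *d;
        DIR *dp;
        int fd;

        if ((dp = opendir("/proc")) == NULL) {
                perror("/proc");
                return (1);
        }
        while ((d = readdir(dp)) != NULL) {
                if (d->d_name[0] == '.')
                        continue;
                (void) snprintf(path, sizeof (path), "/proc/%s/psinfo",
                    d->d_name);
                if ((fd = open(path, O_RDONLY)) < 0)
                        continue;       /* it may have exited already */
                if (read(fd, &pi, sizeof (pi)) == sizeof (pi)) {
                        for (i = 0; i < nproj; i++)
                                if (proj[i] == pi.pr_projid)
                                        break;
                        if (i == nproj && nproj < MAXPROJ)
                                proj[nproj++] = pi.pr_projid;
                        if (i < MAXPROJ)        /* pr_time: usr+sys */
                                secs[i] += pi.pr_time.tv_sec +
                                    pi.pr_time.tv_nsec / 1e9;
                }
                (void) close(fd);
        }
        (void) closedir(dp);
        for (i = 0; i < nproj; i++)
                (void) printf("project %ld: %.2f CPU seconds\n",
                    (long)proj[i], secs[i]);
        return (0);
}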