On 11/3/05, Brendan Gregg <[EMAIL PROTECTED]> wrote:
> G'Day Mike,
>
> On Wed, 2 Nov 2005, Mike Gerdts wrote:
>
> > One of the more difficult performance monitoring problems that I have
> > come across is determining the impact of multiple workloads running on
> > a server. Consider a server that has about 1000 database processes
> > that are long running - many minutes to many months - mixed with batch
> > jobs written in Bourne shell. Largely due to the batch jobs, it is
> > not uncommon for sar to report hundreds of forks and execs per second.
>
> procfs based tools miss out on short lived processes (as a separate
> process entry anyway) due to sampling. "prstat -m" can do something
> spooky with short lived processes, for example,
>
>    PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
>   6394 root      14  71 0.1 0.0 0.0 0.0 0.0  15   0 593 35K   0 /0
>    937 root     2.2 8.4 0.0 0.0 0.0 0.0 4.2  85  26  24 11K 428 bash/1
>
> PID 6394 has no name and no LWPs, a ghost process. (And a fairly busy
> one too, 35000 syscalls ... Actually, I shouldn't criticise this as it
> may well be a deliberate aggregation of short lived processes, and I've
> found it to be quite handy. Thank you ghost process!)

Could this just be one that is not yet reaped?
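For what it's worth, the sampling gap is easy to see if you look at what
a procfs tool actually reads. Below is a minimal, untested sketch of
pulling the microstate counters behind those prstat -m columns for one
process (field names as I read them in <sys/procfs.h>); anything that
forks, runs, and exits between two reads of /proc/<pid>/usage simply
never shows up:

/*
 * usage.c -- minimal, untested sketch: read the microstate counters
 * that "prstat -m" displays for one process straight from procfs.
 * Build: cc -o usage usage.c; run: ./usage <pid>
 */
#include <sys/procfs.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
        char path[64];
        prusage_t pru;
        int fd;

        if (argc != 2) {
                (void) fprintf(stderr, "usage: %s pid\n", argv[0]);
                return (1);
        }
        (void) snprintf(path, sizeof (path), "/proc/%s/usage", argv[1]);
        if ((fd = open(path, O_RDONLY)) < 0 ||
            read(fd, &pru, sizeof (pru)) != sizeof (pru)) {
                perror(path);
                return (1);
        }
        (void) close(fd);

        /* pr_utime/pr_stime are timestruc_t (seconds + nanoseconds) */
        (void) printf("usr %ld.%09lds sys %ld.%09lds vctx %lu ictx %lu "
            "sysc %lu\n",
            pru.pr_utime.tv_sec, pru.pr_utime.tv_nsec,
            pru.pr_stime.tv_sec, pru.pr_stime.tv_nsec,
            pru.pr_vctx, pru.pr_ictx, pru.pr_sysc);
        return (0);
}

Sample that in a loop and you have most of prstat -m; you also have the
same blind spot.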
> Now, the by-child usr/sys times from procfs should give us a crack at
> solving this, they can be fetched using the -b option of,
>
>    http://www.brendangregg.com/Solaris/prusage
>
> however I suspect they undercount usr/sys time. (Now that opensolaris
> code is public I ought to go and read how they are incremented)...

I hadn't noticed the child-related fields exported through procfs yet.
(A quick sketch of reading them is further down in this message.) I
have done some digging through the source and found the following:

http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/os/exit.c#698

   343 int
   344 proc_exit(int why, int what)
   345 {
   ...
   698         p->p_utime = (clock_t)NSEC_TO_TICK(hrutime) + p->p_cutime;
   699         p->p_stime = (clock_t)NSEC_TO_TICK(hrstime) + p->p_cstime;

- As the process is exiting (wow, exiting is a lot of work!) it adds its
  children's usr and sys time to its own usr and sys time.

http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/os/exit.c#1172

  1124 void
  1125 freeproc(proc_t *p)
  1126 {
   ...
  1172         p->p_nextofkin->p_cutime += p->p_utime;
  1173         p->p_nextofkin->p_cstime += p->p_stime;

- The sums calculated above are propagated to the parent process. It is
  kind of sad that the parent process is the "next of kin".

- There may be a race condition here. I am not sure that the operations
  on lines 1172 and 1173 are protected by a lock. It seems as though
  p->p_nextofkin->p_lock should be held while incrementing these values,
  in case multiple children are being reaped at the same time.

> > The best solution that I have come up with is to write extended
> > accounting records (task) every few minutes, then to process the
> > exacct file afterwards. Writing the code to write exacct records
> > periodically and make sense of them later is far from trivial. It is
> > also impractical for multiple users (monitoring frameworks,
> > administrators, etc.) to use this approach on the same machine at the
> > same time, because the exacct records need to be written, and that is
> > presumably too expensive an operation to do very often.
>
> Wow, are you baiting me to talk about DTrace? ;-)

I keep thinking about the real world, where I run Solaris 9 and will
probably be doing so for a while yet (ISV adoption, letting others flush
out the bugs, no time to do the testing/upgrade, yadda, yadda, yadda).
If it is something that is *almost* there, it could turn into an RFE for
that old proprietary code. But, yeah, baiting a free DTrace script out
of you works too. :)

> Actually, nice idea with exacct; there are a few other things to try on
> previous Solaris releases to shed light on this problem (TNF tracing,
> BSM auditing...)

I had first implemented it on my own, then refined the mechanism a bit
after spending some time on Adrian Cockcroft's blog
(http://perfcap.blogspot.com/2005/04/writing-accounting-records-at-time.html).
The ugly part was getting timestamps for when the sample was taken into
the logs. It wasn't really that bad, just annoying that I discovered it
after collecting many days of logs. Now I timestamp intervals in the log
using the Perl equivalent of "newtask -p timestampproject /bin/true".
In any case, Adrian's advice suggested that the exacct approach may not
be such a bad way to go in the long term.

> > It seems as though it should be possible for the kernel to maintain
> > per-user, per-project, and per-zone statistics. Perhaps collecting
> > them all the time is not desirable, but it seems as though updating
> > the three sets of statistics for each context switch would be lighter
> > weight than writing accounting records and post-processing them. The
> > side effect of having this data available would be that tools like
> > prstat could report accurate data. Other tools could likely get this
> > data through kstat or a similar interface.
>
> Kstat currently provides a number of CPU related goodies. See
> /usr/include/sys/sysinfo.h for the cpu_* structs.
>
> Alan Hargreaves (from memory) posted the following,
>
>    http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6199092
>
> which suggests separating the cpu_* structs from the CPUs, so that they
> can be used to track many other categories, such as by zone.

Very interesting... I will have to open a case to see the comments
section and possibly add my contract to that bug. That could be a fix
for Solaris 9 (by project) too.
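To make the kstat angle concrete, here is a minimal, untested sketch of
pulling one CPU's cpu_stat counters through libkstat. It prints, among
other things, the sysfork/sysexec counters behind sar's forks-and-execs
numbers; these per-CPU structs are what that bug proposes decoupling
from the CPUs so they could also be kept per zone (or per project):

/*
 * cpustat0.c -- minimal, untested sketch: read CPU 0's cpu_stat
 * kstat (the cpu_* structs from <sys/sysinfo.h>) via libkstat.
 * Build: cc -o cpustat0 cpustat0.c -lkstat
 */
#include <kstat.h>
#include <stdio.h>
#include <sys/sysinfo.h>

int
main(void)
{
        kstat_ctl_t *kc;
        kstat_t *ksp;
        cpu_stat_t cs;

        if ((kc = kstat_open()) == NULL) {
                perror("kstat_open");
                return (1);
        }
        /* module "cpu_stat", instance 0 (CPU 0), any name */
        if ((ksp = kstat_lookup(kc, "cpu_stat", 0, NULL)) == NULL ||
            kstat_read(kc, ksp, &cs) == -1) {
                perror("cpu_stat:0");
                return (1);
        }
        (void) printf("usr %u sys %u idle %u wait %u\n",
            cs.cpu_sysinfo.cpu[CPU_USER], cs.cpu_sysinfo.cpu[CPU_KERNEL],
            cs.cpu_sysinfo.cpu[CPU_IDLE], cs.cpu_sysinfo.cpu[CPU_WAIT]);
        (void) printf("forks %u execs %u syscalls %u cswitches %u\n",
            cs.cpu_sysinfo.sysfork + cs.cpu_sysinfo.sysvfork,
            cs.cpu_sysinfo.sysexec, cs.cpu_sysinfo.syscall,
            cs.cpu_sysinfo.pswitch);
        (void) kstat_close(kc);
        return (0);
}

Note that these counters exist per CPU instance, not per workload, which
is the whole point of the bug.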
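And here is the by-child sketch I mentioned earlier: an untested reader
for pr_cutime/pr_cstime from /proc/<pid>/status, which as far as I can
tell is where the p_cutime/p_cstime sums above surface (field names as I
read them in <sys/procfs.h>):

/*
 * cstatus.c -- minimal, untested sketch: read a process's own and
 * by-child CPU times from /proc/<pid>/status (pstatus_t).
 * Build: cc -o cstatus cstatus.c; run: ./cstatus <pid>
 */
#include <sys/procfs.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
        char path[64];
        pstatus_t ps;
        int fd;

        if (argc != 2) {
                (void) fprintf(stderr, "usage: %s pid\n", argv[0]);
                return (1);
        }
        (void) snprintf(path, sizeof (path), "/proc/%s/status", argv[1]);
        if ((fd = open(path, O_RDONLY)) < 0 ||
            read(fd, &ps, sizeof (ps)) != sizeof (ps)) {
                perror(path);
                return (1);
        }
        (void) close(fd);

        (void) printf("self:     usr %ld.%09ld sys %ld.%09ld\n",
            ps.pr_utime.tv_sec, ps.pr_utime.tv_nsec,
            ps.pr_stime.tv_sec, ps.pr_stime.tv_nsec);
        (void) printf("children: usr %ld.%09ld sys %ld.%09ld\n",
            ps.pr_cutime.tv_sec, ps.pr_cutime.tv_nsec,
            ps.pr_cstime.tv_sec, ps.pr_cstime.tv_nsec);
        return (0);
}

Running this against a shell that drives batch jobs, and comparing the
children line against exacct records, would be one way to test the
undercounting suspicion: given the NSEC_TO_TICK conversion above, the
child times only have clock-tick resolution, and they only roll up when
a child is actually reaped.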
> A number of scripts from the DTraceToolkit already provide per zone
> statistics, since it's trivial to retrieve. Eg, zvmstat.

I'll have to take a look at those to see whether they address the
problems with short-lived processes.

> I think the bottom line is that it depends on what details you are
> interested in. procfs already has project and zone info and a swag of
> resource counters, so an ordinary procfs tool may be a solution
> (an enhancement to prstat/ps).
>
> cheers,
>
> Brendan

Good stuff... thanks for the pointers to several details that I had
missed in the past.

Mike
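P.S. On the "ordinary procfs tool" idea: psinfo already carries the
project (and, on Solaris 10, zone) IDs needed for that kind of
aggregation. Here is an untested sketch that sums CPU seconds by project
across /proc; being a sampler, it of course still misses short-lived
processes:

/*
 * projtime.c -- minimal, untested sketch: walk /proc, read each
 * psinfo, and aggregate CPU time by project ID.  pr_zoneid would
 * work the same way on Solaris 10.
 * Build: cc -o projtime projtime.c
 */
#include <sys/procfs.h>
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define MAXPROJ 1024            /* arbitrary cap for this sketch */

int
main(void)
{
        id_t proj[MAXPROJ];
        double secs[MAXPROJ] = { 0.0 };
        int nproj = 0, i;
        char path[64];
        psinfo_t pi;
        struct dirent *d;
        DIR *dp;
        int fd;

        if ((dp = opendir("/proc")) == NULL) {
                perror("/proc");
                return (1);
        }
        while ((d = readdir(dp)) != NULL) {
                if (d->d_name[0] == '.')
                        continue;
                (void) snprintf(path, sizeof (path), "/proc/%s/psinfo",
                    d->d_name);
                if ((fd = open(path, O_RDONLY)) < 0)
                        continue;       /* it may have exited already */
                if (read(fd, &pi, sizeof (pi)) == sizeof (pi)) {
                        for (i = 0; i < nproj; i++)
                                if (proj[i] == pi.pr_projid)
                                        break;
                        if (i == nproj && nproj < MAXPROJ)
                                proj[nproj++] = pi.pr_projid;
                        if (i < MAXPROJ)        /* pr_time: usr+sys */
                                secs[i] += pi.pr_time.tv_sec +
                                    pi.pr_time.tv_nsec / 1e9;
                }
                (void) close(fd);
        }
        (void) closedir(dp);
        for (i = 0; i < nproj; i++)
                (void) printf("project %ld: %.2f CPU seconds\n",
                    (long)proj[i], secs[i]);
        return (0);
}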