Hi all,

Here's the design document for LatencyTOP. Comments are welcome. Thanks.
LatencyTOP High Level Design Document
Zhu, Lejun (lejun....@intel.com)
Version 0.4, 2009.1.5

1. Overview

1.1 Purpose and Scope

Latency causes applications to run slowly even on a high-end, fast system. It is easy to notice when a problem caused by latency happens, but it is usually difficult to identify what causes the latency. LatencyTOP (www.latencytop.org) is a tool recently developed for the Linux OS to identify outstanding latency problems in a running system. Statistics are collected from the OS scheduling subsystem, on both a system-wide and a per-process basis. Application developers can look at the statistics and try to avoid latency problems. A similar tool will also benefit developers writing applications under Solaris.

LatencyTOP monitors the system for latencies, i.e., periods in which a process does nothing but wait for some condition to happen. LatencyTOP detects latencies that result in a process sleeping or blocking on some condition. It also detects some known busy loops in the Solaris kernel, e.g. lock spinning. LatencyTOP does not detect busy loops inside user applications. Neither does it detect delay that is not caused by waiting, e.g. a process not running because a higher-priority process takes a lot of CPU time.

DTrace is the advanced tracing mechanism in Solaris for troubleshooting the kernel as well as applications. With DTrace, it is very easy to gather statistics from Solaris without having to patch the kernel the way LatencyTOP for Linux does.

1.2 Design Overview

The goal of this utility is to analyze the cause of latency in a running system using DTrace probes. To do so, every time a process goes into and out of the SLEEP state, the timing is recorded and the kernel call stack is captured. The utility periodically collects data from DTrace, searches call stacks for symbols that are known to cause latency, and updates the statistics.

The symbols used to scan call stacks are hand-picked from the OpenSolaris kernel source code, e.g. "Category: Caused by read - ssize_t read(int fdes, void *cbuf, size_t count)". If such a symbol is found in a call stack, the latency is categorized as "caused by read". Symbols are stored in a separate text-based configuration file, so that new symbols can easily be added to analyze different latency problems.

LatencyTOP uses a console-based UI to display the statistics. The windowing is based on the curses library.

1.3 Key Challenges

The actual pattern of what causes latency varies between systems. It is very difficult to properly define and categorize all significant causes of latency once and for all. The design of the tool itself can be decided before implementation, but a configuration that picks only the causes that really matter, and organizes them in the clearest way, will require testing after the development work is done, as well as feedback from the community.

1.4 Dependencies

LatencyTOP uses the DTrace framework to collect data. The implementation depends on the undocumented interface of libdtrace(3LIB).

2. Software Design

2.1 DTrace Usage in LatencyTOP

LatencyTOP uses the DTrace probes "sched:::off-cpu" and "sched:::on-cpu" to capture when a process goes into and out of SLEEP. Although there may be a gap between the moment a process goes from SLEEP to RUN and the moment it is actually on a CPU after swtch(), it is necessary to use "on-cpu" in order to capture the kernel call stack. The difference in timestamps between off-cpu and on-cpu is used to calculate how long the latency is.

LatencyTOP uses DTrace aggregations to store statistics. Every distinct pid/stack() pair has a separate entry. Values of count(), sum() and max() are also recorded separately.
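The following D program is a minimal sketch of the collection just described. The probe names are real; the aggregation names are illustrative, and a production script would additionally need to distinguish sleeps from involuntary context switches, which this sketch omits:

    #!/usr/sbin/dtrace -s

    sched:::off-cpu
    {
            /* remember when this thread left the CPU */
            self->ts = timestamp;
    }

    sched:::on-cpu
    /self->ts/
    {
            /* latency = time between going off-CPU and running again */
            this->delta = timestamp - self->ts;

            /* one entry per distinct pid/stack() pair */
            @lt_count[pid, execname, stack()] = count();
            @lt_sum[pid, execname, stack()] = sum(this->delta);
            @lt_max[pid, execname, stack()] = max(this->delta);
            self->ts = 0;
    }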
LatencyTOP uses the probes "lockstat:::adaptive-spin" and "lockstat:::spin-spin" to detect lock spinning. These two are special categories, and stack() is not used as a key for their aggregations. Instead, each process has two special aggregation entries that track all adaptive-lock spinning and all spinlock spinning in general. These values are usually insignificant latency causes; in case they get higher than expected, lockstat(1M) can be used to drill the problem down further. A suggestion to use that tool is shown on screen while LatencyTOP is running.
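A sketch of the corresponding probe clauses follows. The aggregation names are illustrative, and only occurrences are counted here, which is enough for the coarse per-process view the document describes; lockstat(1M) remains the tool for deeper analysis:

    /* per-process lock-spin tracking, keyed on pid only, no stack() */
    lockstat:::adaptive-spin
    {
            @spin_adaptive[pid, execname] = count();
    }

    lockstat:::spin-spin
    {
            @spin_spin[pid, execname] = count();
    }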
2.2 Configuration

The configuration is LatencyTOP's knowledge of the Solaris kernel code. It is internal and considered part of LatencyTOP. An example configuration looks like:

    # UFS
    50   ufs`ufs_sync     UFS sync
    50   ufs`ufs_fsync    UFS sync
    50   ufs`ufs_remove   UFS remove a file
    50   ufs`ufs_create   UFS create a file

Each line defines one symbol match rule, except for lines starting with "#". A line has the following format:

    [:blank:]* <priority> [:blank:]+ [<module>]`<function> [:blank:]+ <Category>

These rules are used in the data analysis procedure, but do not affect what data is collected from DTrace. For every stack trace in the aggregation snapshot, each symbol is compared against all symbols ("<module>`<function name>") known in the configuration. If a symbol is found, the stack is assigned to the corresponding category. Rules that contain the same category string form one category group, and their statistics are counted together. In the example above, the statistics value of "UFS sync" will contain values from both "ufs`ufs_sync" and "ufs`ufs_fsync". In case more than one rule matches a stack trace, <priority> controls which one is used; a higher value means higher priority. For example, if a stack contains both ufs`ufs_sync (priority 50) and genunix`write (priority 10), the latency is attributed to "UFS sync".

Lines starting with "#" are comments and are ignored during parsing. ";" is used for special commands. Only one such command is defined, "disable_category". When a category is disabled, latency is still matched to that category, but is ignored in LatencyTOP's statistics. For example:

    ; disable_category FSFlush Daemon

It is also possible to categorize latency causes based on syscalls. A configuration fragment for this would look like:

    # Syscalls
    #
    # Syscalls have priority 10, the lowest priority defined by default.
    # This ensures a latency is traced to one of the syscalls if nothing
    # else matches.
    #
    10   genunix`indir     Syscall: indir
    10   genunix`rexit     Syscall: exit
    10   genunix`forkall   Syscall: forkall
    10   genunix`read      Syscall: read
    10   genunix`read32    Syscall: read
    10   genunix`write     Syscall: write
    10   genunix`write32   Syscall: write
    # and more...

Such a fragment will be in the configuration to ensure that every latency falls into at least one syscall category if it is not captured somewhere else.

2.3 Data Analysis and UI

LatencyTOP periodically creates a snapshot of the DTrace data and walks through it. After the data is collected, the UI is updated and key presses are processed, so there is no concurrency problem between updating the data and displaying it. LatencyTOP has a similar look and feel to LatencyTOP for Linux version 0.4. The screen layout looks as follows:

==================================Begin=======================================
LatencyTOP version x.y                              (C) 2008 Intel Corporation

Cause                                Average      Maximum      Percentage
<System-wide Cause 1>                <Avg> msec   <Max> msec   <Percent> %
<System-wide Cause 2>                <Avg> msec   <Max> msec   <Percent> %
<System-wide Cause 3>                <Avg> msec   <Max> msec   <Percent> %

Process <exec name> (<pid>)          Total: <total latency> msec
<Per-Process Cause 1>                <Avg> msec   <Max> msec   <Percent> %
<Per-Process Cause 2>                <Avg> msec   <Max> msec   <Percent> %

<ProcessName1>  <ProcessName2>  <ProcessName3>  <ProcessName4>  <ProcessName5>
===================================End========================================

The top half of the screen displays system-wide statistics. <Percent> is calculated within a given period, e.g. 5 seconds: percent = sum(causeA) / sum(all) * 100%. <Avg> is the average latency: average = sum(causeA) / count(causeA). <Max> represents the longest single period the process was blocked; for example, if for the same cause cv_wait() is called several times within a given period and in the worst case the process is blocked for X ms, then max = X. Latencies are sorted in descending order by percentage. Up to 10 causes are displayed.

The bottom half of the screen displays per-process statistics. Latencies are ordered the same way as the system-wide statistics. Up to 6 causes are displayed. The user can choose a different process to analyze by pressing "<" or ">" to cycle through the process list at the very bottom of the screen.

LatencyTOP collects data periodically and updates the UI. Between collections, the data on the UI remains the same. The data refresh period equals the period over which the statistics are calculated.
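LatencyTOP itself snapshots the aggregations through libdtrace(3LIB) rather than printing from D, but the same 5-second windowing can be sketched in pure D with a tick probe, reusing the illustrative aggregation names from the section 2.1 sketch:

    /* emit and reset the statistics every 5 seconds (illustrative) */
    profile:::tick-5sec
    {
            printa("pid %d (%s): count %@u, total %@u ns, max %@u ns\n%k",
                @lt_count, @lt_sum, @lt_max);

            /* start a fresh measurement window */
            trunc(@lt_count);
            trunc(@lt_sum);
            trunc(@lt_max);
    }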
2.4 Logging Stack Traces That Are Not Categorized

If none of the symbols defined in the configuration file is found in a stack trace, the latency is considered "not categorized" by the active configuration file. A command-line option is available in LatencyTOP to dump all such causes, one entry for each stack trace. After LatencyTOP has run for a while, development/test engineers can look into the OpenSolaris kernel source code for the logged stacks and update the configuration file.

3. Implementation and Testing Considerations

The performance impact on a heavily loaded system needs to be determined for when the DTrace probes are enabled and the data-polling thread is running. The initial release should include internal experience. To help define the initial symbols/stack traces of interest, LatencyTOP needs to be tested with several Perf PIT workloads before release.

4. Future Improvements

The initial release of LatencyTOP aims to provide similar features and look-and-feel to LatencyTOP for Linux version 0.4. However, with the advanced DTrace infrastructure in the Solaris OS, it is possible to add more enhancements to statistics gathering and analysis.

For the initial release, only a limited set of latency causes is defined in the configuration file. More will be added after internal and external tests. It is also possible to create multiple sets that group latency causes in different ways, e.g. categorized by syscall type versus by resource type, to suit the needs of different environments. Some level of protection/checking may be added to configuration files, in order to distinguish "official" configurations released by Intel from those modified by users, if any.

LatencyTOP for Linux generates statistics on both a system-wide and a per-process basis. With DTrace, it is also possible to collect per-thread statistics for any given process. This could be useful for some multi-threaded applications in which threads have long lifespans and specialized workloads.

Because DTrace can also capture the user call stack, it would be useful to add a log feature for when a latency meets some condition (e.g. goes above a threshold). It is also possible to count how much latency is generated by each function in a user application. This would immediately pinpoint the cause in the user application.
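As a sketch of the threshold idea, the clauses from section 2.1 could be extended to record the user stack only when the wait exceeds some limit. The aggregation name and the 100 ms threshold here are purely illustrative:

    sched:::off-cpu
    {
            self->ts = timestamp;
    }

    /* log latencies above 100 ms (100000000 ns) with the user stack */
    sched:::on-cpu
    /self->ts && timestamp - self->ts > 100000000/
    {
            @long_waits[pid, execname, ustack()] = count();
            self->ts = 0;
    }

    /* below the threshold: just discard the timestamp */
    sched:::on-cpu
    /self->ts/
    {
            self->ts = 0;
    }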