Hi all,

Here's the design document for LatencyTop. Comments are welcome. Thanks.
                       LatencyTOP High Level Design Document

                                               Zhu, Lejun(lejun....@intel.com)
                                                         Version 0.4, 2009.1.5

1. Overview
1.1 Purpose and Scope

Latency causes applications to run slowly even on a high-end, fast system. It
is easy to notice when a latency problem occurs, but it is usually difficult
to identify what causes the latency.

LatencyTOP (www.latencytop.org) is a tool recently developed for the Linux OS
to identify outstanding latency problems in a running system. Statistics are
collected from the OS scheduling subsystem on both a system-wide and a
per-process basis. Application developers can look at the statistics and try
to avoid latency problems. A similar tool will also benefit developers writing
applications on Solaris.

LatencyTOP monitors the system for latencies, i.e., periods during which a
process does nothing but wait for some condition to happen. LatencyTOP detects
latencies that result in a process sleeping or blocking on some condition. It
also detects some known busy loops in the Solaris kernel, e.g. lock spinning.
LatencyTOP does not detect busy loops inside user applications. Neither does
it detect delays that are not caused by waiting, e.g. a process not running
because a higher-priority process takes a lot of CPU time.

DTrace is the advanced tracing mechanism in Solaris for troubleshooting the
kernel as well as applications. With DTrace, it is very easy to gather
statistics from Solaris without having to patch the kernel the way LatencyTOP
for Linux does.

1.2 Design Overview

The goal of this utility is to analyze the causes of latency in a running
system using DTrace probes. To do so, every time a process goes into or out of
the SLEEP state, the timing is recorded and the kernel call stack is captured.
The utility periodically collects data from DTrace, searches call stacks for
symbols that are known to cause latency, and updates the statistics.

The symbols used to scan call stacks are hand-picked from the OpenSolaris
kernel source code, e.g. "Category: Caused by read - ssize_t read(int fdes,
void *cbuf, size_t count)". If such a symbol is found in a call stack, the
latency is categorized as "caused by read". Symbols are stored in a separate
text-based configuration file, so that new symbols can easily be added to
analyze different latency problems.

LatencyTOP uses a console-based UI to display the statistics. The windowing is
based on the curses library.

1.3 Key Challenges

The actual pattern of what causes latency varies between systems. It is very
difficult to define and categorize all significant causes of latency once and
for all. The design of the tool itself can be decided before implementation,
but a configuration that picks only the causes that really matter, and
organizes them in the clearest way, will require testing after the development
work is done, as well as feedback from the community.

1.4 Dependencies

LatencyTOP uses the DTrace framework to collect data. The implementation
depends on the undocumented interface of libdtrace(3LIB).

2. Software Design
2.1 DTrace usage in LatencyTOP

LatencyTOP uses the DTrace probes "sched:::off-cpu" and "sched:::on-cpu" to
capture when a process goes into and comes out of SLEEP. Although there may be
a gap between the moment a process goes from SLEEP to RUN and the moment it is
actually on a CPU after swtch(), it is necessary to use "on-cpu" in order to
capture the kernel call stack. The difference in timestamp between off-cpu and
on-cpu is used to calculate how long the latency is.

LatencyTOP uses DTrace aggregations to store statistics. Every distinct
pid/stack() pair has a separate entry. The values of count(), sum() and max()
are recorded separately for each entry.
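The aggregation semantics described above can be sketched in Python. This is
only an illustrative model of what DTrace maintains in-kernel via count(),
sum() and max(); the class and method names are invented for this sketch and
are not part of LatencyTOP:

```python
from collections import defaultdict

class LatencyAgg:
    """Illustrative model of the per-(pid, stack) latency aggregation."""

    def __init__(self):
        # key: (pid, stack tuple) -> [count, total_ns, max_ns]
        self.entries = defaultdict(lambda: [0, 0, 0])

    def record(self, pid, stack, off_cpu_ts, on_cpu_ts):
        """Record one sleep interval bounded by off-cpu/on-cpu timestamps."""
        latency = on_cpu_ts - off_cpu_ts
        e = self.entries[(pid, tuple(stack))]
        e[0] += 1                  # count()
        e[1] += latency            # sum()
        e[2] = max(e[2], latency)  # max()

agg = LatencyAgg()
agg.record(100, ["ufs`ufs_fsync", "genunix`fsync"], 1000, 6000)
agg.record(100, ["ufs`ufs_fsync", "genunix`fsync"], 9000, 11000)
```

Two sleeps of the same process on the same stack thus share one entry, whose
count, total and maximum grow independently.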

LatencyTOP uses the probes "lockstat:::adaptive-spin" and "lockstat:::spin-spin"
to detect lock spinning. These two are special categories, and the stack is not
used to key the aggregations. Instead, each process has two special
aggregations that track all adaptive-mutex spinning and spinlock spinning in
general. These values are usually insignificant latency causes. In case they
get higher than expected, lockstat(1M) can be used to drill further into the
problem; a suggestion to use that tool is shown on screen while LatencyTOP is
running.

2.2 Configuration

The configuration encodes LatencyTOP's knowledge of the Solaris kernel code.
It is internal and considered part of LatencyTOP. An example configuration
looks like:

        # UFS
        50      ufs`ufs_sync                    UFS sync
        50      ufs`ufs_fsync                   UFS sync
        50      ufs`ufs_remove                  UFS remove a file
        50      ufs`ufs_create                  UFS create a file

Each line defines one symbol match rule, except for lines starting with "#".
The line has the following format:

    [:blank:]* <priority> [:blank:]+ [<module>]`<function> [:blank:]+
    <Category>
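As a sketch, a parser for this rule format might look like the following
Python. The function name, regular expression and returned tuple are
illustrative assumptions, not the actual LatencyTOP implementation:

```python
import re

# <priority> [<module>]`<function> <Category>, per the format above.
RULE_RE = re.compile(r'^\s*(\d+)\s+(\S*)`(\S+)\s+(.+?)\s*$')

def parse_rule(line):
    """Parse one rule line; return None for comments and blank lines."""
    if not line.strip() or line.lstrip().startswith('#'):
        return None
    m = RULE_RE.match(line)
    if not m:
        raise ValueError('bad rule line: %r' % line)
    priority, module, func, category = m.groups()
    return int(priority), module, func, category
```

For example, parsing the "ufs`ufs_fsync" line from the fragment above would
yield the tuple (50, 'ufs', 'ufs_fsync', 'UFS sync').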


These rules are used in the data analysis procedure, but do not affect what
data is collected from DTrace. For every stack trace in the aggregation
snapshot, each symbol is compared against all symbols ("<module>`<function
name>") known in the configuration. If a match is found, the stack is assigned
to the corresponding category.

Rules that contain the same category string are considered in the same category
group, and their statistics will be counted together. In the example above, the
statistics value of "UFS sync" will contain values from both "ufs`ufs_sync" and
"ufs`ufs_fsync".

If more than one rule matches a stack trace, <priority> is used to decide
which one applies; a higher value means higher priority.
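The matching-plus-priority behavior can be sketched as follows. This is a
simplification with invented names; the real implementation may organize the
lookup differently:

```python
def categorize(stack, rules):
    """Return the category of the highest-priority rule whose symbol
    appears in the stack, or None if nothing matches.
    `rules` maps "module`function" -> (priority, category)."""
    best = None
    for frame in stack:
        hit = rules.get(frame)
        if hit and (best is None or hit[0] > best[0]):
            best = hit
    return best[1] if best else None

rules = {
    "ufs`ufs_fsync": (50, "UFS sync"),
    "genunix`write": (10, "Syscall: write"),
}
stack = ["genunix`cv_wait", "ufs`ufs_fsync", "genunix`write"]
```

Here both rules match the stack, but "UFS sync" wins because priority 50
outranks the syscall fallback at priority 10.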

Lines starting with "#" are comments and will be ignored during parsing.

Lines starting with ";" contain special commands. Only one such command is
defined: "disable_category". When a category is disabled, latency is still
matched to that category, but is ignored in LatencyTOP's statistics. For
example:

; disable_category  FSFlush Daemon

It is also possible to categorize latency causes based on syscalls. An example
of such configuration fragment would look like:

        # Syscalls
        #
        # Syscalls have priority 10, the lowest priority defined by default.
        # This is to ensure a latency is traced to one of the syscalls if
        # nothing else matches.
        #
        10      genunix`indir                   Syscall: indir
        10      genunix`rexit                   Syscall: exit
        10      genunix`forkall                 Syscall: forkall
        10      genunix`read                    Syscall: read
        10      genunix`read32                  Syscall: read
        10      genunix`write                   Syscall: write
        10      genunix`write32                 Syscall: write
        # and more...

This fragment is included in the configuration to ensure that every latency is
attributed to at least a syscall cause if it is not captured by any other
rule.

2.3 Data Analysis and UI

LatencyTOP periodically creates a snapshot of the DTrace data and walks
through it. After the data is collected, the UI is updated and key presses are
processed, so there is no concurrency problem between updating the data and
displaying it.

LatencyTOP has a similar look and feel to LatencyTOP for Linux version 0.4.
The screen layout looks as follows:

==================================Begin=======================================
            LatencyTOP version x.y       (C) 2008 Intel Corporation

Cause                                     Average      Maximum      Percentage
<System-wide Cause 1>                   <Avg> msec   <Max> msec    <Percent> %
<System-wide Cause 2>                   <Avg> msec   <Max> msec    <Percent> %
<System-wide Cause 3>                   <Avg> msec   <Max> msec    <Percent> %


Process <exec name> (<pid>)                      Total:   <total latency> msec
<Per-Process Cause 1>                   <Avg> msec   <Max> msec    <Percent> %
<Per-Process Cause 2>                   <Avg> msec   <Max> msec    <Percent> %


<ProcessName1>  <ProcessName2>  <ProcessName3>  <ProcessName4>  <ProcessName5>
===================================End========================================

The top half of the screen displays system-wide statistics. <Percent> is
calculated over a given period, e.g. 5 seconds: percent =
sum(causeA)/sum(all)*100%. <Avg> is the average latency: average =
sum(causeA)/count(causeA). <Max> is the longest single time a process was
blocked. For example, if for the same cause cv_wait() is called several times
within the period, and in the worst case the process is blocked for X ms, then
max = X. Latencies are sorted in descending order by percentage. Up to 10
causes are displayed.
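The screen-row computation above can be sketched in Python. The function name
and data layout are illustrative assumptions; the point is only the arithmetic
from the (count, sum, max) aggregates:

```python
def screen_rows(causes, top=10):
    """Compute (category, avg_ms, max_ms, percent) rows from per-cause
    (count, sum_ms, max_ms) aggregates, sorted as on screen."""
    total = sum(s for _, s, _ in causes.values()) or 1
    rows = []
    for cat, (count, total_ms, max_ms) in causes.items():
        rows.append((cat, total_ms / count, max_ms, 100.0 * total_ms / total))
    rows.sort(key=lambda r: r[3], reverse=True)  # descending by percentage
    return rows[:top]

causes = {
    "UFS sync":       (4, 60.0, 30.0),   # count, sum(ms), max(ms)
    "Syscall: write": (10, 40.0, 8.0),
}
rows = screen_rows(causes)
```

With these made-up numbers, "UFS sync" accounts for 60% of the observed
latency at an average of 15 ms per sleep, so it sorts first.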

The bottom half of the screen displays per-process statistics. Latencies are
ordered in the same way as the system-wide statistics. Up to 6 causes are
displayed. The user can choose a different process to analyze by pressing "<"
or ">" to cycle through the process list at the very bottom of the screen.

LatencyTOP collects data periodically and updates the UI. Between collections,
the data on the UI remains the same. The data refresh period equals the period
over which statistics are calculated.

2.4 Log Stack Traces That Are Not Categorized

If none of the symbols defined in the configuration file is found in a stack
trace, the latency is considered "not categorized" by the active configuration
file. A command-line option is available in LatencyTOP to dump all such
causes, one entry per stack trace. After LatencyTOP has run for a while,
development/test engineers can look into the OpenSolaris kernel source code
for the logged stacks and update the configuration file.
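This dump step can be sketched as follows; the function name, snapshot layout
and output format are assumptions made for illustration only:

```python
import io

def dump_uncategorized(snapshot, rules, out):
    """Write each distinct stack trace that matches no configured symbol,
    one entry per stack, so engineers can extend the configuration.
    `snapshot` maps (pid, stack tuple) -> (count, sum, max);
    `rules` is the set of known "module`function" symbols."""
    seen = set()
    for (pid, stack), _values in snapshot.items():
        if any(frame in rules for frame in stack):
            continue  # categorized; nothing to log
        if stack in seen:
            continue  # one entry per distinct stack
        seen.add(stack)
        out.write("\n".join(stack) + "\n\n")

snapshot = {
    (1, ("foo`bar", "genunix`cv_wait")): (3, 120, 80),
    (2, ("ufs`ufs_fsync",)):             (2, 5, 3),
}
rules = {"ufs`ufs_fsync"}
buf = io.StringIO()
dump_uncategorized(snapshot, rules, buf)
```

After the run, the buffer holds only the unmatched "foo`bar" stack; the
categorized UFS stack is skipped.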

3. Implementation and Testing Consideration

The performance impact needs to be determined for a heavily loaded system when
the DTrace probes are enabled and the data polling thread is running.

The initial release should incorporate internal experience. To help define the
initial symbols/stack traces of interest, LatencyTOP needs to be tested with
several Perf PIT workloads before release.

4. Future Improvements

The initial release of LatencyTOP aims to provide similar features and
look-and-feel to LatencyTOP for Linux version 0.4. However, with the advanced
DTrace infrastructure in the Solaris OS, it is possible to add more
enhancements in statistics gathering and analysis.

For the initial release, only a limited set of latency causes is defined in
the configuration file. More will be added after internal and external
testing. It is also possible to create multiple sets that group latency causes
in different ways, e.g. categorized by syscall type versus resource type, to
suit the needs of different environments.

Some level of protection/checking may be added to configuration files, in
order to distinguish "official" configurations released by Intel from those
modified by users, if any.

LatencyTOP for Linux generates statistics on both a system-wide and a
per-process basis. With DTrace, it is also possible to collect per-thread
statistics for any given process. This could be useful for multi-threaded
applications in which threads have long lifespans and specialized workloads.

Because DTrace can also capture the user call stack, it would be useful to add
a logging feature for latencies that meet some condition (e.g. exceed a
threshold). It is also possible to count how much latency is generated by each
function in the user application. This would immediately pinpoint the cause
inside the application.