[HACKERS] Dynamic instrumentation of lwlock wait times (lwlock flamegraphs)

Andres Freund Thu, 22 Jun 2017 14:09:29 -0700

Hi,

At pgcon some people were talking about the difficulty of instrumenting
the time actually spent waiting for lwlocks and related measurements.
I'd mentioned that linux these days provides infrastructure to measure
such things in unmodified binaries.


Attached is a prototype of a script that measures the time spent inside
PGSemaphoreLock(), aggregates that inside the kernel, grouped by pid and
stacktrace.  That allows one to generate nice flame graphs showing which
part of the code waits how long for lwlocks.

The attached script clearly needs improvements, but I thought I'd show
some of the results it can get.  To run it you need the the python
library of the 'bcc' project [1], and a sufficiently new kernel.  Some
distributions, e.g. newer debian versions, package this as python-bpfcc
and similar.

The output of the script can be turned into a flamegraph with
https://github.com/brendangregg/FlameGraph 's flamegraph.pl.

Here's a few examples of a pgbench run. The number is the number of
clients, sync/async indicates synchronous_commit on/off.  All the
numbers here were generated with the libpq & pgbench batch patches
applied and in use, but that's just because that was the state of my
working tree.

http://anarazel.de/t/2017-06-22/pgsemwait_8_sync.svg
http://anarazel.de/t/2017-06-22/pgsemwait_8_async.svg
http://anarazel.de/t/2017-06-22/pgsemwait_64_sync.svg
http://anarazel.de/t/2017-06-22/pgsemwait_64_async.svg

A bonus, not that representative one is the wait for a pgbench readonly
run after the above, with autovacuum previously disabled:
http://anarazel.de/t/2017-06-22/pgsemwait_64_select.svg
interesting to see how the backends themselves never end up having to
flush WAL themselves, or at least not in a manner triggering contention.

I plan to write a few more of these, because they're hugely useful to
understand actual locking behaviour. Among them:
- time beteen Acquire/Release of lwlocks, grouped by lwlock
- time beteen Acquire/Release of lwlocks, grouped by stack
- callstack of acquirer and waker of lwlocks, grouped by caller stack, waiter 
stack
- ...

I think it might be interesting to collect a few of these somewhere
centrally once halfway mature.  Maybe in src/tools or such.

Greetings,

Andres Freund

[1] https://github.com/iovisor/bcc

#!/usr/bin/python

from __future__ import print_function
from bcc import BPF
from time import sleep

bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>

struct stats_key_t {
    u32 pid;
    int user_stack_id;
};

struct stats_value_t {
    u64 total_time;
};

struct start_key_t {
    u32 pid;
};

struct start_value_t {
    u64 last_start;
};

// map_type, key_type, leaf_type, table_name, num_entry
BPF_HASH(stats, struct stats_key_t, struct stats_value_t);
BPF_HASH(start, struct start_key_t, struct start_value_t);

BPF_STACK_TRACE(stack_traces, STACK_STORAGE_SIZE)

int trace_sem_entry(struct pt_regs *ctx)
{
    struct start_key_t start_key = {};
    struct start_value_t start_value = {};

    start_key.pid = bpf_get_current_pid_tgid();
    start_value.last_start = bpf_ktime_get_ns();

    start.update(&start_key, &start_value);

    return 0;
}

int trace_sem_return(struct pt_regs *ctx)
{
    struct stats_key_t stats_key = {};
    struct start_key_t start_key = {};
    struct stats_value_t zero = {};
    struct stats_value_t *stats_value;
    struct start_value_t *start_value;
    u64 delta;

    start_key.pid = bpf_get_current_pid_tgid();
    start_value = start.lookup(&start_key);

    if (!start_value)
        return 0; /* missed start */;

    delta = bpf_ktime_get_ns() - start_value->last_start;
    start.delete(&start_key);

    stats_key.pid = bpf_get_current_pid_tgid();
    stats_key.user_stack_id = stack_traces.get_stackid(ctx, BPF_F_REUSE_STACKID 
| BPF_F_USER_STACK);

    stats_value = stats.lookup_or_init(&stats_key, &zero);
    stats_value->total_time += delta;

    return 0;
}

"""

# set stack storage size
bpf_text = bpf_text.replace('STACK_STORAGE_SIZE', str(1000))

b = BPF(text=bpf_text)


libpath = 
BPF.find_exe('/home/andres/build/postgres/dev-optimize/vpath/src/backend/postgres')
if not libpath:
    bail("can't resolve library %s" % library)
library = libpath

b.attach_uprobe(name=library, sym_re='PGSemaphoreLock',
    fn_name="trace_sem_entry",
    pid = -1)

b.attach_uretprobe(name=library, sym_re='PGSemaphoreLock',
    fn_name="trace_sem_return",
    pid = -1)

matched = b.num_open_uprobes()
if matched == 0:
    print("error: 0 functions traced. Exiting.", file=stderr)
    exit(1)

sleep(3)

stats = b.get_table("stats")
stack_traces = b.get_table("stack_traces")
folded = True #False

for k, v in stats.items(): #, key=lambda v: v.total_time):
    #print(dir(k))
    #print(dir(v))
    user_stack = [] if k.user_stack_id < 0 else \
        stack_traces.walk(k.user_stack_id)
    name = 'postgres'
    if v.total_time == 0:
        continue
    if folded:
        # print folded stack output
        user_stack = list(user_stack)
        line = [name] + \
            [b.sym(addr, k.pid) for addr in reversed(user_stack)]
        print("%s %d" % (";".join(line), v.total_time))
    else:
        # print default multi-line stack output
        for addr in user_stack:
            print("    %s" % b.sym(addr, k.pid))
        print("    %-16s %s (%d)" % ("-", name, k.pid))
        print("        %d\n" % v.total_time)

-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Dynamic instrumentation of lwlock wait times (lwlock flamegraphs)

Reply via email to