On Tue, 13 Apr 2021 at 02:23, Andres Freund <and...@anarazel.de> wrote:
[I've changed the order of the quoted sections a little to prioritize the
key stuff]

> On 2021-04-12 14:31:32 +0800, Craig Ringer wrote:
> > It's annoying that we have to pay the cost of computing the tranche name
> > though. It never used to matter, but now that T_NAME() expands to
> > GetLWTrancheName() calls as of 29c3e2dd5a6 it's going to cost a little
> > more on such a hot path. I might see if I can do a little comparison and
> > see how much. I could add TRACE_POSTGRESQL_<<tracepointname>>_ENABLED()
> > guards since we do in fact build with SDT semaphore support. That adds a
> > branch for each tracepoint, but they're already marshalling arguments
> > and making a function call that does lots more than a single branch, so
> > that seems pretty sensible.
>
> I am against adding any overhead for this feature. I honestly think the
> probes we have right now in postgres do not provide a commensurate
> benefit.

I agree that the probes we have now are nearly useless, if not entirely
useless. The transaction management ones are misplaced and utterly
worthless. The LWLock ones don't carry enough info to be much use and are
incomplete. I doubt anybody uses any of them at all, or would even notice
their absence.

In terms of overhead, what is in place right now is not free. It used to be
very cheap, but since 29c3e2dd5a6 it's not. I'd like to reduce the current
cost and improve functionality at the same time, so it's actually useful.

> > * There is no easy way to look up the tranche name by ID from outside
> >   the backend
>
> But it's near trivial to add that.

Really?

We can expose a pg_catalog.lwlock_tranches view that lets you observe the
current mappings for any given user backend, I guess. But if I'm looking for
performance issues caused by excessive LWLock contention or waits, LWLocks
held too long, LWLock lock-ordering deadlocks, or the like, it's something I
want to capture across the whole postgres instance. Each backend can have
different tranche IDs (right?) and there's no way to know what a given
non-built-in tranche ID means for any given backend without accessing
backend-specific in-memory state. Including for non-user-accessible backends
like bgworkers and auxprocs, where it's not possible to just query the state
from a view directly.

So we'd be looking at some kind of shm based monstrosity. That doesn't sound
appealing. Worse, there's no way to solve races with it - is a given tranche
ID already allocated when you see it? If not, can you look it up from the
backend before the backend exits/dies? For that matter, how do you do that,
since the connection to the backend is likely under the control of an
application, not your monitoring and diagnostic tooling. Some trace tools
can poke backend memory directly, but it generally requires debuginfo, is
fragile and Pg version specific, slow, and a real pain to use.

If we don't attach the LWLock names to the tracepoints in some way, they're
pretty worthless.

Again, I don't plan to add new costs here. I'm actually proposing to reduce
an existing cost. And you can always build without `--enable-dtrace` and ...
just not care.

Anyway - I'll do some `perf` runs shortly to quantify this:

* With/without tracepoints at all
* With/without names in tracepoints
* With/without tracepoint refcounting (_ENABLED() semaphores)

so as to rely less on handwaving.
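
To be concrete about the _ENABLED() guards mentioned above, the shape I have
in mind is roughly the following. This is a sketch, not a patch - the
_ENABLED() macro is what the generated probes.h provides when we build with
SDT semaphore support:

    /*
     * Sketch: skip the T_NAME() -> GetLWTrancheName() work entirely unless
     * a tracer is actually attached to this marker.
     */
    if (TRACE_POSTGRESQL_LWLOCK_ACQUIRE_ENABLED())
        TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(lock), mode);

The _ENABLED() test should amount to a load and branch on a semaphore
counter that stays zero until a tracer attaches, which is the overhead I
want to compare against the current unconditional tranche name lookup.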
> > (Those can also be used with systemtap guru mode scripts to do things
> > like turn a particular elog(DEBUG) into a PANIC at runtime for
> > diagnostic purposes).
>
> Yikes.

Well, it's not like it can happen by accident. You have to deliberately
write a script that twiddles process memory, using a tool that requires
special privileges.

I recently had to prepare a custom build for a customer that converted an
elog(DEBUG) into an elog(PANIC) in order to capture a core with much better
diagnostic info for a complex, hard to reproduce and intermittent memory
management issue. It would've been rather nice to be able to do so with a
trace marker instead of a custom build.

> > There are a TON of probes I want to add, and I have a tree full of them
> > waiting to submit progressively. Yes, ability to probe all GUCs is in
> > there. So is detail on walsender, reorder buffer, and snapshot builder
> > activity. Heavyweight lock SDTs. A probe that identifies the backend
> > type at startup. SDT probe events emitted for every wait-event. Probes
> > in elog.c to let probes observe error unwinding, capture error messages,
> > etc. [...] A probe that fires whenever debug_query_string changes. Lots.
> > But I can't submit them all at once, especially without some supporting
> > use cases and scripts that other people can use so they can understand
> > why these probes are useful.
>
> -1. This is not scalable. Adding static probes all over has both a
> runtime (L1I, branches, code optimization) and maintenance overhead.

Take a look at "sudo perf list":

    sched:sched_kthread_work_execute_end              [Tracepoint event]
    sched:sched_kthread_work_execute_start            [Tracepoint event]
    ...
    sched:sched_migrate_task                          [Tracepoint event]
    ...
    sched:sched_process_exec                          [Tracepoint event]
    ...
    sched:sched_process_fork                          [Tracepoint event]
    ...
    sched:sched_stat_iowait                           [Tracepoint event]
    ...
    sched:sched_stat_sleep                            [Tracepoint event]
    sched:sched_stat_wait                             [Tracepoint event]
    ...
    sched:sched_switch                                [Tracepoint event]
    ...
    sched:sched_wakeup                                [Tracepoint event]

The kernel is packed with extremely useful trace events, and for very good
reasons. Some are on very hot paths.

I do _not_ want to randomly add probes everywhere. I propose that they be
added:

* Where they will meaningfully aid production diagnosis, complex testing,
  and/or development activity. Expose high level activity of key subsystems
  via trace markers, especially at the boundaries of IPCs or wherever
  control otherwise passes between processes.

* Where it's not feasible to instead adjust code structure to make DWARF
  debuginfo based probing sufficient.

* Where there's no other sensible way to get useful information without
  excessive complexity and/or runtime cost, but it could be very important
  for understanding intermittent production issues or performance problems
  at scale in live systems.

* Where the execution path is not extremely hot - e.g. no static tracepoints
  in spinlocks or atomics.

* Where a DWARF debuginfo based probe cannot easily replace them, i.e.
  generally not placed on entry and exit of stable and well-known functions.

Re the code structure point above: we have lots of places where we return in
multiple places, or where a single function can do many different things
with different effects on system state. For example, right now it's quite
complex to place probes to definitively confirm the outcome of a given
transaction and capture its commit record lsn. Functions with many branches
that each fiddle with system state are hard, as are functions that test for
the validity of some global and short-circuit return if invalid, etc.
Functions that do long loops over big chunks of logic are hard too, e.g.
ReorderBufferCommit.
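
As a toy illustration of that last point (nothing below is existing postgres
code - the type, helpers, and marker name are all made up): with several
early returns, a DWARF entry/return probe only tells you the function ran,
not which path it took or why, while a single marker placed where the
outcome is actually decided carries exactly the state you care about:

    static void
    apply_change(ApplyState *state)        /* hypothetical type */
    {
        if (!state->valid)
            return;     /* short-circuit: a return probe fires, but nothing happened */

        if (perform_fast_path(state))      /* hypothetical helper */
        {
            /* hypothetical marker: fires only when the fast path really ran */
            TRACE_POSTGRESQL_APPLY_CHANGE_DONE(state->xid, true);
            return;
        }

        perform_slow_path(state);          /* hypothetical helper */
        TRACE_POSTGRESQL_APPLY_CHANGE_DONE(state->xid, false);
    }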
I want to place probes where they will greatly simplify observation of
important global system state that's not easily observed using traditional
tools like gdb or logging.

When applied sensibly and moderately, trace markers are absolutely amazing
for diagnostic and performance work. You can attach to them in production
builds even without debuginfo and observe behaviour that would otherwise be
impossible to see without complex fiddling around with multi-process gdb.
This sort of capability is going to become more and more important as we
become more parallel and can rely less on single-process gdb-style tracing.
Diagnostic logging is a blunt hammer that does not scale and is rarely
viable for intermittent or hard to reproduce production issues.

I will always favour "native postgres" solutions where feasible - for
example, I want to add some basic reorder buffer state to struct WalSnd and
the pg_stat_replication views, and I want to expose some means to get a
walsender to report details of its ReorderBuffer state. But some things are
not very amenable to that: either the runtime costs of having the facility
available are too high (we're never going to have a pg_catalog.pg_lwlocks,
for good reasons) or it's too complicated to write and maintain, especially
where info is needed from many processes. That's where trace markers become
valuable.

But right now what we have in Pg is worthless, and it seems almost nobody
knows how to use the tools. I want to change that, but it's a bit of a
catch-22. Making the tooling easy to use benefits enormously from some more
stable interfaces that don't break so much version-to-version, don't require
deep code knowledge to understand, and work without debuginfo on production
builds. But without some "oh, wow" tools, it's hard to convince anyone we
should invest any effort in improving the infrastructure...

It's possible I'm beating a dead horse here. I find these tools amazingly
useful, but they're currently made 10x harder than they need to be by the
complexities of directly poking at postgres's complex and version-specific
internal structure using debuginfo based probing. Things that should be
simple, like determining the timings of a txn from xid assignment -> 2pc
prepare -> 2pc commit prepared, really aren't. Markers that report xid
assignment, commit, rollback, etc, with the associated topxid would help
immensely.
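
To make that concrete, the sort of marker set I have in mind would look
something like the following. This is purely illustrative - none of these
markers exist today and the names are invented - but each one carries the
top-level xid so the whole timeline can be stitched together from outside
the backend just by matching on that argument:

    /* fired where the top-level xid is assigned */
    TRACE_POSTGRESQL_XACT_ASSIGN_XID(topxid);

    /* fired once the PREPARE TRANSACTION record has been written */
    TRACE_POSTGRESQL_XACT_PREPARE(topxid);

    /* fired at commit / COMMIT PREPARED, carrying the commit record lsn */
    TRACE_POSTGRESQL_XACT_COMMIT(topxid, commit_lsn);
    TRACE_POSTGRESQL_XACT_COMMIT_PREPARED(topxid, commit_lsn);

    /* fired on rollback / ROLLBACK PREPARED */
    TRACE_POSTGRESQL_XACT_ABORT(topxid);

With markers along those lines, the xid assignment -> prepare -> commit
prepared timings fall out of a trivial trace script, with no debuginfo and
no version-specific knowledge of xact.c needed.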