On Tue, 13 Apr 2021 at 02:23, Andres Freund <and...@anarazel.de> wrote:
[I've changed the order of the quoted sections a little to prioritize the
key stuff]

> On 2021-04-12 14:31:32 +0800, Craig Ringer wrote:
> > It's annoying that we have to pay the cost of computing the tranche name
> > though. It never used to matter, but now that T_NAME() expands to
> > GetLWTrancheName() calls as of 29c3e2dd5a6 it's going to cost a little
> > more on such a hot path. I might see if I can do a little comparison and
> > see how much. I could add TRACE_POSTGRESQL_<<tracepointname>>_ENABLED()
> > guards since we do in fact build with SDT semaphore support. That adds a
> > branch for each tracepoint, but they're already marshalling arguments
> > and making a function call that does lots more than a single branch, so
> > that seems pretty sensible.
>
> I am against adding any overhead for this feature. I honestly think the
> probes we have right now in postgres do not provide a commensurate
> benefit.

I agree that the probes we have now are nearly useless, if not entirely
useless. The transaction management ones are misplaced and utterly
worthless. The LWLock ones don't carry enough info to be much use and are
incomplete. I doubt anybody uses any of them at all, or would even notice
their absence.

In terms of overhead, what is in place right now is not free. It used to be
very cheap, but since 29c3e2dd5a6 it's not. I'd like to reduce the current
cost and improve functionality at the same time, so it's actually useful.

> > * There is no easy way to look up the tranche name by ID from outside
> >   the backend
>
> But it's near trivial to add that.

Really?

We can expose a pg_catalog.lwlock_tranches view that lets you observe the
current mappings for any given user backend, I guess. But if I'm looking for
performance issues caused by excessive LWLock contention or waits, LWLocks
held too long, LWLock lock-ordering deadlocks, or the like, it's something I
want to capture across the whole postgres instance. Each backend can have
different tranche IDs (right?) and there's no way to know what a given
non-built-in tranche ID means for any given backend without accessing
backend-specific in-memory state. Including for non-user-accessible backends
like bgworkers and auxprocs, where it's not possible to just query the state
from a view directly.

So we'd be looking at some kind of shm based monstrosity. That doesn't sound
appealing. Worse, there's no way to solve races with it - is a given tranche
ID already allocated when you see it? If not, can you look it up from the
backend before the backend exits/dies? For that matter, how do you do that,
since the connection to the backend is likely under the control of an
application, not your monitoring and diagnostic tooling. Some trace tools
can poke backend memory directly, but it generally requires debuginfo, is
fragile and Pg version specific, slow, and a real pain to use.

If we don't attach the LWLock names to the tracepoints in some way, they're
pretty worthless.

Again, I don't plan to add new costs here. I'm actually proposing to reduce
an existing cost. And you can always build without `--enable-dtrace` and ...
just not care.

Anyway - I'll do some `perf` runs shortly to quantify this:

* With/without tracepoints at all
* With/without names in tracepoints
* With/without tracepoint refcounting (_ENABLED() semaphores)

so as to rely less on handwaving.
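
To be concrete about the _ENABLED() guards mentioned above, the shape I have
in mind is roughly the following. This is a sketch, not a patch - the
_ENABLED() macro is what the generated probes.h provides when we build with
SDT semaphore support:

    /*
     * Sketch: skip the T_NAME() -> GetLWTrancheName() work entirely unless
     * a tracer is actually attached to this marker.
     */
    if (TRACE_POSTGRESQL_LWLOCK_ACQUIRE_ENABLED())
        TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(lock), mode);

The _ENABLED() test should amount to a load and branch on a semaphore
counter that stays zero until a tracer attaches, which is the overhead I
want to compare against the current unconditional tranche name lookup.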
> > (Those can also be used with systemtap guru mode scripts to do things
> > like turn a particular elog(DEBUG) into a PANIC at runtime for
> > diagnostic purposes).
>
> Yikes.

Well, it's not like it can happen by accident. You have to deliberately
write a script that twiddles process memory, using a tool that requires
special privileges.

I recently had to prepare a custom build for a customer that converted an
elog(DEBUG) into an elog(PANIC) in order to capture a core with much better
diagnostic info for a complex, hard to reproduce and intermittent memory
management issue. It would've been rather nice to be able to do so with a
trace marker instead of a custom build.

> > There are a TON of probes I want to add, and I have a tree full of them
> > waiting to submit progressively. Yes, ability to probe all GUCs is in
> > there. So is detail on walsender, reorder buffer, and snapshot builder
> > activity. Heavyweight lock SDTs. A probe that identifies the backend
> > type at startup. SDT probe events emitted for every wait-event. Probes
> > in elog.c to let probes observe error unwinding, capture error messages,
> > etc. [...] A probe that fires whenever debug_query_string changes. Lots.
> > But I can't submit them all at once, especially without some supporting
> > use cases and scripts that other people can use so they can understand
> > why these probes are useful.
>
> -1. This is not scalable. Adding static probes all over has both a
> runtime (L1I, branches, code optimization) and maintenance overhead.

Take a look at "sudo perf list":

    sched:sched_kthread_work_execute_end              [Tracepoint event]
    sched:sched_kthread_work_execute_start            [Tracepoint event]
    ...
    sched:sched_migrate_task                          [Tracepoint event]
    ...
    sched:sched_process_exec                          [Tracepoint event]
    ...
    sched:sched_process_fork                          [Tracepoint event]
    ...
    sched:sched_stat_iowait                           [Tracepoint event]
    ...
    sched:sched_stat_sleep                            [Tracepoint event]
    sched:sched_stat_wait                             [Tracepoint event]
    ...
    sched:sched_switch                                [Tracepoint event]
    ...
    sched:sched_wakeup                                [Tracepoint event]

The kernel is packed with extremely useful trace events, and for very good
reasons. Some are on very hot paths.

I do _not_ want to randomly add probes everywhere. I propose that they be
added:

* Where they will meaningfully aid production diagnosis, complex testing,
  and/or development activity. Expose high level activity of key subsystems
  via trace markers, especially at the boundaries of IPCs or wherever
  control otherwise passes between processes.

* Where it's not feasible to instead adjust code structure to make DWARF
  debuginfo based probing sufficient.

* Where there's no other sensible way to get useful information without
  excessive complexity and/or runtime cost, but it could be very important
  for understanding intermittent production issues or performance problems
  at scale in live systems.

* Where the execution path is not extremely hot - e.g. no static tracepoints
  in spinlocks or atomics.

* Where a DWARF debuginfo based probe cannot easily replace them, i.e.
  generally not placed on entry and exit of stable and well-known functions.

Re the code structure point above: we have lots of places where we return in
multiple places, or where a single function can do many different things
with different effects on system state. For example, right now it's quite
complex to place probes to definitively confirm the outcome of a given
transaction and capture its commit record lsn. Functions with many branches
that each fiddle with system state are hard, as are functions that test for
the validity of some global and short-circuit return if invalid, etc.
Functions that do long loops over big chunks of logic are hard too, e.g.
ReorderBufferCommit.
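
As a toy illustration of that last point (nothing below is existing postgres
code - the type, helpers, and marker name are all made up): with several
early returns, a DWARF entry/return probe only tells you the function ran,
not which path it took or why, while a single marker placed where the
outcome is actually decided carries exactly the state you care about:

    static void
    apply_change(ApplyState *state)        /* hypothetical type */
    {
        if (!state->valid)
            return;     /* short-circuit: a return probe fires, but nothing happened */

        if (perform_fast_path(state))      /* hypothetical helper */
        {
            /* hypothetical marker: fires only when the fast path really ran */
            TRACE_POSTGRESQL_APPLY_CHANGE_DONE(state->xid, true);
            return;
        }

        perform_slow_path(state);          /* hypothetical helper */
        TRACE_POSTGRESQL_APPLY_CHANGE_DONE(state->xid, false);
    }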
I want to place probes where they will greatly simplify observation of
important global system state that's not easily observed using traditional
tools like gdb or logging.

When applied sensibly and moderately, trace markers are absolutely amazing
for diagnostic and performance work. You can attach to them in production
builds even without debuginfo and observe behaviour that would otherwise be
impossible to see without complex fiddling around with multi-process gdb.
This sort of capability is going to become more and more important as we
become more parallel and can rely less on single-process gdb-style tracing.
Diagnostic logging is a blunt hammer that does not scale and is rarely
viable for intermittent or hard to reproduce production issues.

I will always favour "native postgres" solutions where feasible - for
example, I want to add some basic reorder buffer state to struct WalSnd and
the pg_stat_replication views, and I want to expose some means to get a
walsender to report details of its ReorderBuffer state. But some things are
not very amenable to that: either the runtime costs of having the facility
available are too high (we're never going to have a pg_catalog.pg_lwlocks,
for good reasons) or it's too complicated to write and maintain, especially
where info is needed from many processes. That's where trace markers become
valuable.

But right now what we have in Pg is worthless, and it seems almost nobody
knows how to use the tools. I want to change that, but it's a bit of a
catch-22. Making the tooling easy to use benefits enormously from some more
stable interfaces that don't break so much version-to-version, don't require
deep code knowledge to understand, and work without debuginfo on production
builds. But without some "oh, wow" tools, it's hard to convince anyone we
should invest any effort in improving the infrastructure...

It's possible I'm beating a dead horse here. I find these tools amazingly
useful, but they're currently made 10x harder than they need to be by the
complexities of directly poking at postgres's complex and version-specific
internal structure using debuginfo based probing. Things that should be
simple, like determining the timings of a txn from xid assignment -> 2pc
prepare -> 2pc commit prepared, really aren't. Markers that report xid
assignment, commit, rollback, etc, with the associated topxid would help
immensely.
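
To make that concrete, the sort of marker set I have in mind would look
something like the following. This is purely illustrative - none of these
markers exist today and the names are invented - but each one carries the
top-level xid so the whole timeline can be stitched together from outside
the backend just by matching on that argument:

    /* fired where the top-level xid is assigned */
    TRACE_POSTGRESQL_XACT_ASSIGN_XID(topxid);

    /* fired once the PREPARE TRANSACTION record has been written */
    TRACE_POSTGRESQL_XACT_PREPARE(topxid);

    /* fired at commit / COMMIT PREPARED, carrying the commit record lsn */
    TRACE_POSTGRESQL_XACT_COMMIT(topxid, commit_lsn);
    TRACE_POSTGRESQL_XACT_COMMIT_PREPARED(topxid, commit_lsn);

    /* fired on rollback / ROLLBACK PREPARED */
    TRACE_POSTGRESQL_XACT_ABORT(topxid);

With markers along those lines, the xid assignment -> prepare -> commit
prepared timings fall out of a trivial trace script, with no debuginfo and
no version-specific knowledge of xact.c needed.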