Hello hackers, # MOTIVATION
My recent experiences with problematic queries in customers motivated me to write this patch proposing a new feature to enhance visibility on what active queries are doing. PostgreSQL already offers 2 very powerful tools for query troubleshooting: - EXPLAIN: gives us hints on potential bottlenecks in an execution plan. - EXPLAIN ANALYZE: shows precisely where bottlenecks are, but the query must finish. In my humble opinion we are missing something in the middle. Having visibility over in-flight queries would provide more insights than a plain EXPLAIN and would allow us to analyze super problematic queries that never finish a EXPLAIN ANALYZE execution. Considering that every active query has an execution plan, the new feature can target not only controlled EXPLAIN statements but also any query in progress. This allows us to identify if a slow active query is using a different plan and why (for example, custom settings set a session level that are currently only visible to the backend). # PROPOSAL The feature works similarly to the recently introduced pg_log_backend_memory_contexts(). The patch adds function pg_log_backend_explain_plan(PID) to be executed as superuser in a second backend to signal the target backend to print execution plan details in the log. For regular queries (called without instrumentation) PG will log the plain explain along with useful details like custom settings. When targeting a query with instrumentation enabled PG will log the complete EXPLAIN ANALYZE plan with current row count and, if enabled, timing for each node. Considering that the query is in progress the output will include the following per node: - (never executed) for nodes that weren't touched yet (or may never be). - (in progress) for nodes currently being executed, ie, InstrStartNode was called and clock is ticking there. Parallel workers can be targeted too, where PG will log only the relevant part of the complete execution plan. # DEMONSTRATION a) Targeting a not valid PG process: postgres=# select pg_log_backend_explain_plan(1); WARNING: PID 1 is not a PostgreSQL server process pg_log_backend_explain_plan ----------------------------- f (1 row) b) Targeting a PG process not running a query: postgres=# select pg_log_backend_explain_plan(24103); pg_log_backend_explain_plan ----------------------------- t (1 row) 2023-12-02 16:30:19.979 UTC [24103] LOG: PID 24103 not executing a statement with in-flight explain logging enabled c) Targeting an active query without any instrumentation: postgres=# select pg_log_backend_explain_plan(24103); pg_log_backend_explain_plan ----------------------------- t (1 row) 2023-12-02 16:33:10.968 UTC [24103] LOG: logging explain plan of PID 24103 Query Text: select * from t2 a inner join t1 b on a.c1=b.c1 inner join t1 c on a.c1=c.c1 inner join t1 d on a.c1=d.c1 inner join t1 e on a.c1=e.c1; Gather (cost=70894.63..202643.27 rows=1000000 width=20) Workers Planned: 2 -> Parallel Hash Join (cost=69894.63..101643.27 rows=416667 width=20) Hash Cond: (a.c1 = e.c1) -> Parallel Hash Join (cost=54466.62..77218.65 rows=416667 width=16) Hash Cond: (a.c1 = c.c1) -> Parallel Hash Join (cost=15428.00..29997.42 rows=416667 width=8) Hash Cond: (b.c1 = a.c1) -> Parallel Seq Scan on t1 b (cost=0.00..8591.67 rows=416667 width=4) -> Parallel Hash (cost=8591.67..8591.67 rows=416667 width=4) -> Parallel Seq Scan on t2 a (cost=0.00..8591.67 rows=416667 width=4) -> Parallel Hash (cost=32202.28..32202.28 rows=416667 width=8) -> Parallel Hash Join (cost=15428.00..32202.28 rows=416667 width=8) Hash Cond: (c.c1 = d.c1) -> Parallel Seq Scan on t1 c (cost=0.00..8591.67 rows=416667 width=4) -> Parallel Hash (cost=8591.67..8591.67 rows=416667 width=4) -> Parallel Seq Scan on t1 d (cost=0.00..8591.67 rows=416667 width=4) -> Parallel Hash (cost=8591.67..8591.67 rows=416667 width=4) -> Parallel Seq Scan on t1 e (cost=0.00..8591.67 rows=416667 width=4) Settings: max_parallel_workers_per_gather = '4' d) Targeting a parallel query (and its parallel workers) with instrumentation: postgres=# select pid, backend_type,pg_log_backend_explain_plan(pid) postgres=# from pg_stat_activity postgres=# where (backend_type = 'client backend' and pid != pg_backend_pid()) postgres=# or backend_type = 'parallel worker'; pid | backend_type | pg_log_backend_explain_plan -------+-----------------+----------------------------- 24103 | client backend | t 24389 | parallel worker | t 24390 | parallel worker | t (3 rows) 2023-12-02 16:36:34.840 UTC [24103] LOG: logging explain plan of PID 24103 Query Text: explain (analyze, buffers) select * from t2 a inner join t1 b on a.c1=b.c1 inner join t1 c on a.c1=c.c1 inner join t1 d on a.c1=d.c1 inner join t1 e on a.c1=e.c1; Gather (cost=70894.63..202643.27 rows=1000000 width=20) (never executed) Workers Planned: 2 Workers Launched: 2 -> Parallel Hash Join (cost=69894.63..101643.27 rows=416667 width=20) (never executed) Hash Cond: (a.c1 = e.c1) -> Parallel Hash Join (cost=54466.62..77218.65 rows=416667 width=16) (never executed) Hash Cond: (a.c1 = c.c1) -> Parallel Hash Join (cost=15428.00..29997.42 rows=416667 width=8) (never executed) Hash Cond: (b.c1 = a.c1) -> Parallel Seq Scan on t1 b (cost=0.00..8591.67 rows=416667 width=4) (never executed) -> Parallel Hash (cost=8591.67..8591.67 rows=416667 width=4) (never executed) -> Parallel Seq Scan on t2 a (cost=0.00..8591.67 rows=416667 width=4) (never executed) -> Parallel Hash (cost=32202.28..32202.28 rows=416667 width=8) (never executed) -> Parallel Hash Join (cost=15428.00..32202.28 rows=416667 width=8) (never executed) Hash Cond: (c.c1 = d.c1) -> Parallel Seq Scan on t1 c (cost=0.00..8591.67 rows=416667 width=4) (never executed) -> Parallel Hash (cost=8591.67..8591.67 rows=416667 width=4) (never executed) -> Parallel Seq Scan on t1 d (cost=0.00..8591.67 rows=416667 width=4) (actual time=0.023..8.688 rows=107903 loops=1) (in progress) Buffers: shared hit=46 read=432 -> Parallel Hash (cost=8591.67..8591.67 rows=416667 width=4) (actual time=607.171..607.171 rows=341486 loops=1) Buffers: shared hit=717 read=794, temp written=896 -> Parallel Seq Scan on t1 e (cost=0.00..8591.67 rows=416667 width=4) (actual time=0.009..20.413 rows=341486 loops=1) Buffers: shared hit=717 read=794 Settings: max_parallel_workers_per_gather = '4' 2023-12-02 16:36:34.841 UTC [24389] LOG: logging explain plan of PID 24389 Query Text: explain (analyze, buffers) select * from t2 a inner join t1 b on a.c1=b.c1 inner join t1 c on a.c1=c.c1 inner join t1 d on a.c1=d.c1 inner join t1 e on a.c1=e.c1; Parallel Hash Join (cost=69894.63..101643.27 rows=416667 width=20) (never executed) Hash Cond: (a.c1 = e.c1) -> Parallel Hash Join (cost=54466.62..77218.65 rows=416667 width=16) (never executed) Hash Cond: (a.c1 = c.c1) -> Parallel Hash Join (cost=15428.00..29997.42 rows=416667 width=8) (never executed) Hash Cond: (b.c1 = a.c1) -> Parallel Seq Scan on t1 b (cost=0.00..8591.67 rows=416667 width=4) (never executed) -> Parallel Hash (cost=8591.67..8591.67 rows=416667 width=4) (never executed) -> Parallel Seq Scan on t2 a (cost=0.00..8591.67 rows=416667 width=4) (never executed) -> Parallel Hash (cost=32202.28..32202.28 rows=416667 width=8) (never executed) -> Parallel Hash Join (cost=15428.00..32202.28 rows=416667 width=8) (never executed) Hash Cond: (c.c1 = d.c1) -> Parallel Seq Scan on t1 c (cost=0.00..8591.67 rows=416667 width=4) (never executed) -> Parallel Hash (cost=8591.67..8591.67 rows=416667 width=4) (never executed) -> Parallel Seq Scan on t1 d (cost=0.00..8591.67 rows=416667 width=4) (actual time=0.024..7.486 rows=99146 loops=1) (in progress) Buffers: shared hit=43 read=396 -> Parallel Hash (cost=8591.67..8591.67 rows=416667 width=4) (actual time=595.768..595.768 rows=329056 loops=1) Buffers: shared hit=752 read=704, temp written=868 -> Parallel Seq Scan on t1 e (cost=0.00..8591.67 rows=416667 width=4) (actual time=0.003..20.849 rows=329056 loops=1) Buffers: shared hit=752 read=704 Settings: max_parallel_workers_per_gather = '4' 2023-12-02 16:36:34.844 UTC [24390] LOG: logging explain plan of PID 24390 Query Text: explain (analyze, buffers) select * from t2 a inner join t1 b on a.c1=b.c1 inner join t1 c on a.c1=c.c1 inner join t1 d on a.c1=d.c1 inner join t1 e on a.c1=e.c1; Parallel Hash Join (cost=69894.63..101643.27 rows=416667 width=20) (never executed) Hash Cond: (a.c1 = e.c1) -> Parallel Hash Join (cost=54466.62..77218.65 rows=416667 width=16) (never executed) Hash Cond: (a.c1 = c.c1) -> Parallel Hash Join (cost=15428.00..29997.42 rows=416667 width=8) (never executed) Hash Cond: (b.c1 = a.c1) -> Parallel Seq Scan on t1 b (cost=0.00..8591.67 rows=416667 width=4) (never executed) -> Parallel Hash (cost=8591.67..8591.67 rows=416667 width=4) (never executed) -> Parallel Seq Scan on t2 a (cost=0.00..8591.67 rows=416667 width=4) (never executed) -> Parallel Hash (cost=32202.28..32202.28 rows=416667 width=8) (never executed) -> Parallel Hash Join (cost=15428.00..32202.28 rows=416667 width=8) (never executed) Hash Cond: (c.c1 = d.c1) -> Parallel Seq Scan on t1 c (cost=0.00..8591.67 rows=416667 width=4) (never executed) -> Parallel Hash (cost=8591.67..8591.67 rows=416667 width=4) (never executed) -> Parallel Seq Scan on t1 d (cost=0.00..8591.67 rows=416667 width=4) (actual time=0.005..7.186 rows=98901 loops=1) (in progress) Buffers: shared hit=11 read=427 -> Parallel Hash (cost=8591.67..8591.67 rows=416667 width=4) (actual time=594.224..594.224 rows=329458 loops=1) Buffers: shared hit=708 read=750, temp written=864 -> Parallel Seq Scan on t1 e (cost=0.00..8591.67 rows=416667 width=4) (actual time=0.955..21.233 rows=329458 loops=1) Buffers: shared hit=708 read=750 Settings: max_parallel_workers_per_gather = '4' # IMPLEMENTATION DETAILS - Process signaling The whole process signaling implementation is identical to the logic done for pg_log_backend_memory_contexts(). After signaling a process, the ultimate function called to perform the plan logging is ProcessLogExplainPlanInterrupt() in explain.c. - How to track a currently running query? Explain plans are printed via a QueryDesc structure so we need to be able to access that object for the currently running query. For a simple select query, where the QueryDesc is created here ( https://github.com/postgres/postgres/blob/REL_16_STABLE/src/backend/tcop/pquery.c#L495 ) the QueryDesc is accessible via the global ActivePortal pointer as the objects is stored here ( https://github.com/postgres/postgres/blob/REL_16_STABLE/src/backend/tcop/pquery.c#L522 ). The problem is that for EXPLAIN commands the QueryDesc created here ( https://github.com/postgres/postgres/blob/REL_16_STABLE/src/backend/commands/explain.c#L575 ) isn't accessible externally. It exists only in that code context. So my solution was to have a global pointer in explain.c that is either NULL or is pointed to the currently active QueryDesc. At the end of standard_ExecutorStart() in execMain.c I call new function ExplainTrackQuery(QueryDesc) in explain.c that will take care of pointing the global pointer to the QueryDesc instance. This is an important part of the code. The overhead of the implementation is that every query will do the new logic of assigning the global pointer and making sure pointer is always valid (see next section). - How to make sure the new global pointer is always valid? The global pointer starts as NULL, gets assigned via ExplainTrackQuery(QueryDesc) and gets cleared with the help of a MemoryContextCallback. The strategy there is that a MemoryContextCallback will be assigned in the same MemoryContext where the tracked QueryDesc was created. When the MemoryContext is gone (executor is complete) the QueryDesc instance will be destroyed and function QueryDescReleaseFunc() in explain.c will be called to clear the global pointer. With that we can make sure that the pointer always get cleared, even if the query gets cancelled. - Safely printing in-flight execution plans A plan string is built in function ExplainNode here ( https://github.com/postgres/postgres/blob/REL_16_STABLE/src/backend/commands/explain.c#L1178 ) which is called at the end of a query execution when EXPLAIN is used. That function performs logic using a PlanState (part of QueryDesc) of the running query and a ExplainState. The main challenge there is that ExplainNode calls InstrEndLoop which changes values in Instrumentation. This is ok for a regular EXPLAIN where the query is already complete but not ok for the new feature with in-flight explains. So the new code has custom logic to clone Instrumentation instance of the current node. The cloned object can be safely written. Function InstrEndLoop has a safety rule here ( https://github.com/postgres/postgres/blob/REL_16_STABLE/src/backend/executor/instrument.c#L148 ) that prevents adjusting instrumentation details in a running node. This never happens in the current code logic but with the new in-flight explain it will happen very often. I didn't want to remove this safety rule as InstrEndLoop gets called in other places too (even in auto_explain) so the solution was to keep InstrEndLoop and have a new InstrEndLoopForce for the in-flight explain. Both InstrEndLoop and InstrEndLoopForce call a new internal InstrEndLoopInternal to avoid duplicating the code. - Memory management Considering that pg_log_backend_explain_plan() can be called indefinite times in the same query execution, all allocated objects in the new implementation (via palloc) are manually deallocated. This avoids private memory to keep growing until MemoryContext is released. - ExplainState customization A ExplainState is allocated and customized for the in-flight logging. Instrumentation related settings are enabled based on how the target query started, which is usually via EXPLAIN ANALYZE or with auto_explain. es = NewExplainState(); es->in_flight = true; es->analyze = currentQueryDesc->instrument_options; es->buffers = (currentQueryDesc->instrument_options & INSTRUMENT_BUFFERS) != 0; es->wal = (currentQueryDesc->instrument_options & INSTRUMENT_WAL) != 0; es->timing = (currentQueryDesc->instrument_options & INSTRUMENT_TIMER) != 0; There are other settings that I currently selected some static values for testing: es->summary = (es->analyze); es->format = EXPLAIN_FORMAT_TEXT; es->verbose = false; es->settings = true; For those we can think about customizations like global settings or passing through attributes in pg_log_backend_explain_plan(). There is definitely room for improvement here. - Implementation overhead As mentioned earlier, the new feature adds overhead by having to adjust the query desc global pointer in every QueryDesc that passes through standard_ExecutorStart(). If we think this is not a good idea we can think about moving ExplainTrackQuery(QueryDesc) to specific QueryDesc allocations, like having the feature just for EXPLAIN commands. But that would limit what we can inspect. # FINAL CONSIDERATIONS This should be enough for an initial proposal. Apologies for the huge mail. This is my first patch so I am probably missing a lot of standards and good practices the community is already familiar with. Plus, I haven't implemented any tests yet. If you think it is worth considering this new feature I can work on them. Kind Regards, Rafael Castro.
v1-0001-Proposal-for-in-flight-explain-logging.patch
Description: Binary data