Proposal: In-flight explain logging

Rafael Thofehrn Castro Sat, 02 Dec 2023 11:31:47 -0800

Hello hackers,

# MOTIVATION


My recent experiences with problematic queries in customers motivated
me to write this patch proposing a new feature to enhance visibility
on what active queries are doing.

PostgreSQL already offers 2 very powerful tools for query troubleshooting:

- EXPLAIN: gives us hints on potential bottlenecks in an execution plan.

- EXPLAIN ANALYZE: shows precisely where bottlenecks are, but the query
must finish.

In my humble opinion we are missing something in the middle. Having
visibility
over in-flight queries would provide more insights than a plain EXPLAIN
and would allow us to analyze super problematic queries that never finish
a EXPLAIN ANALYZE execution.

Considering that every active query has an execution plan, the new feature
can target not only controlled EXPLAIN statements but also any query in
progress. This allows us to identify if a slow active query is using a
different plan and why (for example, custom settings set a session level
that are currently only visible to the backend).

# PROPOSAL

The feature works similarly to the recently introduced
pg_log_backend_memory_contexts().

The patch adds function pg_log_backend_explain_plan(PID) to be executed as
superuser in a second backend to signal the target backend to print
execution
plan details in the log.

For regular queries (called without instrumentation) PG will log the plain
explain along with useful details like custom settings.

When targeting a query with instrumentation enabled PG will log the complete
EXPLAIN ANALYZE plan with current row count and, if enabled, timing for each
node. Considering that the query is in progress the output will include the
following per node:

- (never executed) for nodes that weren't touched yet (or
 may never be).
- (in progress) for nodes currently being executed, ie,
 InstrStartNode was called and clock is ticking there.

Parallel workers can be targeted too, where PG will log only the relevant
part
of the complete execution plan.

# DEMONSTRATION

a) Targeting a not valid PG process:

postgres=# select pg_log_backend_explain_plan(1);
WARNING:  PID 1 is not a PostgreSQL server process
 pg_log_backend_explain_plan
-----------------------------
 f
(1 row)

b) Targeting a PG process not running a query:

postgres=# select pg_log_backend_explain_plan(24103);
 pg_log_backend_explain_plan
-----------------------------
 t
(1 row)

2023-12-02 16:30:19.979 UTC [24103] LOG:  PID 24103 not executing a
statement with in-flight explain logging enabled

c) Targeting an active query without any instrumentation:

postgres=# select pg_log_backend_explain_plan(24103);
 pg_log_backend_explain_plan
-----------------------------
 t
(1 row)

2023-12-02 16:33:10.968 UTC [24103] LOG:  logging explain plan of PID 24103
Query Text: select *
from t2 a
inner join t1 b on a.c1=b.c1
inner join t1 c on a.c1=c.c1
inner join t1 d on a.c1=d.c1
inner join t1 e on a.c1=e.c1;
Gather  (cost=70894.63..202643.27 rows=1000000 width=20)
 Workers Planned: 2
 ->  Parallel Hash Join  (cost=69894.63..101643.27 rows=416667 width=20)
       Hash Cond: (a.c1 = e.c1)
       ->  Parallel Hash Join  (cost=54466.62..77218.65 rows=416667
width=16)
             Hash Cond: (a.c1 = c.c1)
             ->  Parallel Hash Join  (cost=15428.00..29997.42 rows=416667
width=8)
                   Hash Cond: (b.c1 = a.c1)
                   ->  Parallel Seq Scan on t1 b  (cost=0.00..8591.67
rows=416667 width=4)
                   ->  Parallel Hash  (cost=8591.67..8591.67 rows=416667
width=4)
                         ->  Parallel Seq Scan on t2 a  (cost=0.00..8591.67
rows=416667 width=4)
             ->  Parallel Hash  (cost=32202.28..32202.28 rows=416667
width=8)
                   ->  Parallel Hash Join  (cost=15428.00..32202.28
rows=416667 width=8)
                         Hash Cond: (c.c1 = d.c1)
                         ->  Parallel Seq Scan on t1 c  (cost=0.00..8591.67
rows=416667 width=4)
                         ->  Parallel Hash  (cost=8591.67..8591.67
rows=416667 width=4)
                               ->  Parallel Seq Scan on t1 d
 (cost=0.00..8591.67 rows=416667 width=4)
       ->  Parallel Hash  (cost=8591.67..8591.67 rows=416667 width=4)
             ->  Parallel Seq Scan on t1 e  (cost=0.00..8591.67 rows=416667
width=4)
Settings: max_parallel_workers_per_gather = '4'

d) Targeting a parallel query (and its parallel workers) with
instrumentation:

postgres=# select pid, backend_type,pg_log_backend_explain_plan(pid)
postgres=# from pg_stat_activity
postgres=# where (backend_type = 'client backend' and pid !=
pg_backend_pid())
postgres=#    or backend_type = 'parallel worker';
  pid  |  backend_type   | pg_log_backend_explain_plan
-------+-----------------+-----------------------------
 24103 | client backend  | t
 24389 | parallel worker | t
 24390 | parallel worker | t
(3 rows)

2023-12-02 16:36:34.840 UTC [24103] LOG:  logging explain plan of PID 24103
Query Text: explain (analyze, buffers)
select *
from t2 a
inner join t1 b on a.c1=b.c1
inner join t1 c on a.c1=c.c1
inner join t1 d on a.c1=d.c1
inner join t1 e on a.c1=e.c1;
Gather  (cost=70894.63..202643.27 rows=1000000 width=20) (never executed)
 Workers Planned: 2
 Workers Launched: 2
 ->  Parallel Hash Join  (cost=69894.63..101643.27 rows=416667 width=20)
(never executed)
       Hash Cond: (a.c1 = e.c1)
       ->  Parallel Hash Join  (cost=54466.62..77218.65 rows=416667
width=16) (never executed)
             Hash Cond: (a.c1 = c.c1)
             ->  Parallel Hash Join  (cost=15428.00..29997.42 rows=416667
width=8) (never executed)
                   Hash Cond: (b.c1 = a.c1)
                   ->  Parallel Seq Scan on t1 b  (cost=0.00..8591.67
rows=416667 width=4) (never executed)
                   ->  Parallel Hash  (cost=8591.67..8591.67 rows=416667
width=4) (never executed)
                         ->  Parallel Seq Scan on t2 a  (cost=0.00..8591.67
rows=416667 width=4) (never executed)
             ->  Parallel Hash  (cost=32202.28..32202.28 rows=416667
width=8) (never executed)
                   ->  Parallel Hash Join  (cost=15428.00..32202.28
rows=416667 width=8) (never executed)
                         Hash Cond: (c.c1 = d.c1)
                         ->  Parallel Seq Scan on t1 c  (cost=0.00..8591.67
rows=416667 width=4) (never executed)
                         ->  Parallel Hash  (cost=8591.67..8591.67
rows=416667 width=4) (never executed)
                               ->  Parallel Seq Scan on t1 d
 (cost=0.00..8591.67 rows=416667 width=4) (actual time=0.023..8.688
rows=107903 loops=1) (in progress)
                                     Buffers: shared hit=46 read=432
       ->  Parallel Hash  (cost=8591.67..8591.67 rows=416667 width=4)
(actual time=607.171..607.171 rows=341486 loops=1)
             Buffers: shared hit=717 read=794, temp written=896
             ->  Parallel Seq Scan on t1 e  (cost=0.00..8591.67 rows=416667
width=4) (actual time=0.009..20.413 rows=341486 loops=1)
                   Buffers: shared hit=717 read=794
Settings: max_parallel_workers_per_gather = '4'

2023-12-02 16:36:34.841 UTC [24389] LOG:  logging explain plan of PID 24389
Query Text: explain (analyze, buffers)
select *
from t2 a
inner join t1 b on a.c1=b.c1
inner join t1 c on a.c1=c.c1
inner join t1 d on a.c1=d.c1
inner join t1 e on a.c1=e.c1;
Parallel Hash Join  (cost=69894.63..101643.27 rows=416667 width=20) (never
executed)
 Hash Cond: (a.c1 = e.c1)
 ->  Parallel Hash Join  (cost=54466.62..77218.65 rows=416667 width=16)
(never executed)
       Hash Cond: (a.c1 = c.c1)
       ->  Parallel Hash Join  (cost=15428.00..29997.42 rows=416667
width=8) (never executed)
             Hash Cond: (b.c1 = a.c1)
             ->  Parallel Seq Scan on t1 b  (cost=0.00..8591.67 rows=416667
width=4) (never executed)
             ->  Parallel Hash  (cost=8591.67..8591.67 rows=416667 width=4)
(never executed)
                   ->  Parallel Seq Scan on t2 a  (cost=0.00..8591.67
rows=416667 width=4) (never executed)
       ->  Parallel Hash  (cost=32202.28..32202.28 rows=416667 width=8)
(never executed)
             ->  Parallel Hash Join  (cost=15428.00..32202.28 rows=416667
width=8) (never executed)
                   Hash Cond: (c.c1 = d.c1)
                   ->  Parallel Seq Scan on t1 c  (cost=0.00..8591.67
rows=416667 width=4) (never executed)
                   ->  Parallel Hash  (cost=8591.67..8591.67 rows=416667
width=4) (never executed)
                         ->  Parallel Seq Scan on t1 d  (cost=0.00..8591.67
rows=416667 width=4) (actual time=0.024..7.486 rows=99146 loops=1) (in
progress)
                               Buffers: shared hit=43 read=396
 ->  Parallel Hash  (cost=8591.67..8591.67 rows=416667 width=4) (actual
time=595.768..595.768 rows=329056 loops=1)
       Buffers: shared hit=752 read=704, temp written=868
       ->  Parallel Seq Scan on t1 e  (cost=0.00..8591.67 rows=416667
width=4) (actual time=0.003..20.849 rows=329056 loops=1)
             Buffers: shared hit=752 read=704
Settings: max_parallel_workers_per_gather = '4'

2023-12-02 16:36:34.844 UTC [24390] LOG:  logging explain plan of PID 24390
Query Text: explain (analyze, buffers)
select *
from t2 a
inner join t1 b on a.c1=b.c1
inner join t1 c on a.c1=c.c1
inner join t1 d on a.c1=d.c1
inner join t1 e on a.c1=e.c1;
Parallel Hash Join  (cost=69894.63..101643.27 rows=416667 width=20) (never
executed)
 Hash Cond: (a.c1 = e.c1)
 ->  Parallel Hash Join  (cost=54466.62..77218.65 rows=416667 width=16)
(never executed)
       Hash Cond: (a.c1 = c.c1)
       ->  Parallel Hash Join  (cost=15428.00..29997.42 rows=416667
width=8) (never executed)
             Hash Cond: (b.c1 = a.c1)
             ->  Parallel Seq Scan on t1 b  (cost=0.00..8591.67 rows=416667
width=4) (never executed)
             ->  Parallel Hash  (cost=8591.67..8591.67 rows=416667 width=4)
(never executed)
                   ->  Parallel Seq Scan on t2 a  (cost=0.00..8591.67
rows=416667 width=4) (never executed)
       ->  Parallel Hash  (cost=32202.28..32202.28 rows=416667 width=8)
(never executed)
             ->  Parallel Hash Join  (cost=15428.00..32202.28 rows=416667
width=8) (never executed)
                   Hash Cond: (c.c1 = d.c1)
                   ->  Parallel Seq Scan on t1 c  (cost=0.00..8591.67
rows=416667 width=4) (never executed)
                   ->  Parallel Hash  (cost=8591.67..8591.67 rows=416667
width=4) (never executed)
                         ->  Parallel Seq Scan on t1 d  (cost=0.00..8591.67
rows=416667 width=4) (actual time=0.005..7.186 rows=98901 loops=1) (in
progress)
                               Buffers: shared hit=11 read=427
 ->  Parallel Hash  (cost=8591.67..8591.67 rows=416667 width=4) (actual
time=594.224..594.224 rows=329458 loops=1)
       Buffers: shared hit=708 read=750, temp written=864
       ->  Parallel Seq Scan on t1 e  (cost=0.00..8591.67 rows=416667
width=4) (actual time=0.955..21.233 rows=329458 loops=1)
             Buffers: shared hit=708 read=750
Settings: max_parallel_workers_per_gather = '4'

# IMPLEMENTATION DETAILS

- Process signaling

The whole process signaling implementation is identical to the logic done
for pg_log_backend_memory_contexts(). After signaling a process, the
ultimate
function called to perform the plan logging is
ProcessLogExplainPlanInterrupt()
in explain.c.

- How to track a currently running query?

Explain plans are printed via a QueryDesc structure so we need to be able
to access that object for the currently running query.

For a simple select query, where the QueryDesc is created here (
https://github.com/postgres/postgres/blob/REL_16_STABLE/src/backend/tcop/pquery.c#L495
)
the QueryDesc is accessible via the global ActivePortal pointer as the
objects is stored here (
https://github.com/postgres/postgres/blob/REL_16_STABLE/src/backend/tcop/pquery.c#L522
).

The problem is that for EXPLAIN commands the QueryDesc created here (
https://github.com/postgres/postgres/blob/REL_16_STABLE/src/backend/commands/explain.c#L575
)
isn't accessible externally. It exists only in that code context.

So my solution was to have a global pointer in explain.c that is either
NULL or is pointed to the currently active QueryDesc. At the end of
standard_ExecutorStart() in execMain.c I call new function
ExplainTrackQuery(QueryDesc)
in explain.c that will take care of pointing the global pointer to the
QueryDesc instance.

This is an important part of the code. The overhead of the implementation
is that every query will do the new logic of assigning the global pointer
and making sure pointer is always valid (see next section).

- How to make sure the new global pointer is always valid?

The global pointer starts as NULL, gets assigned via
ExplainTrackQuery(QueryDesc)
and gets cleared with the help of a MemoryContextCallback.

The strategy there is that a MemoryContextCallback will be assigned
in the same MemoryContext where the tracked QueryDesc was created. When
the MemoryContext is gone (executor is complete) the QueryDesc instance
will be destroyed and function QueryDescReleaseFunc() in explain.c will
be called to clear the global pointer. With that we can make sure that
the pointer always get cleared, even if the query gets cancelled.

- Safely printing in-flight execution plans

A plan string is built in function ExplainNode here (
https://github.com/postgres/postgres/blob/REL_16_STABLE/src/backend/commands/explain.c#L1178
)
which is called at the end of a query execution when EXPLAIN is used.
That function performs logic using a PlanState (part of QueryDesc) of
the running query and a ExplainState.

The main challenge there is that ExplainNode calls InstrEndLoop which
changes values in Instrumentation. This is ok for a regular EXPLAIN
where the query is already complete but not ok for the new feature with
in-flight explains.

So the new code has custom logic to clone Instrumentation instance of
the current node. The cloned object can be safely written.

Function InstrEndLoop has a safety rule here (
https://github.com/postgres/postgres/blob/REL_16_STABLE/src/backend/executor/instrument.c#L148
)
that prevents adjusting instrumentation details in a running node. This
never happens in the current code logic but with the new in-flight
explain it will happen very often.

I didn't want to remove this safety rule as InstrEndLoop gets called in
other places too (even in auto_explain) so the solution was to keep
InstrEndLoop and have a new InstrEndLoopForce for the in-flight explain.
Both InstrEndLoop and InstrEndLoopForce call a new internal
InstrEndLoopInternal to avoid duplicating the code.

- Memory management

Considering that pg_log_backend_explain_plan() can be called indefinite
times in the same query execution, all allocated objects in the new
implementation (via palloc) are manually deallocated. This avoids private
memory to keep growing until MemoryContext is released.

- ExplainState customization

A ExplainState is allocated and customized for the in-flight logging.
Instrumentation related settings are enabled based on how the target
query started, which is usually via EXPLAIN ANALYZE or with auto_explain.

es = NewExplainState();
es->in_flight = true;
es->analyze = currentQueryDesc->instrument_options;
es->buffers = (currentQueryDesc->instrument_options &
    INSTRUMENT_BUFFERS) != 0;
es->wal = (currentQueryDesc->instrument_options &
    INSTRUMENT_WAL) != 0;
es->timing = (currentQueryDesc->instrument_options &
    INSTRUMENT_TIMER) != 0;

There are other settings that I currently selected some static values
for testing:

es->summary = (es->analyze);
es->format = EXPLAIN_FORMAT_TEXT;
es->verbose = false;
es->settings = true;

For those we can think about customizations like global settings or
passing through attributes in pg_log_backend_explain_plan(). There
is definitely room for improvement here.

- Implementation overhead

As mentioned earlier, the new feature adds overhead by having
to adjust the query desc global pointer in every QueryDesc that
passes through standard_ExecutorStart(). If we think this is not
a good idea we can think about moving ExplainTrackQuery(QueryDesc)
to specific QueryDesc allocations, like having the feature just
for EXPLAIN commands. But that would limit what we can inspect.

# FINAL CONSIDERATIONS

This should be enough for an initial proposal. Apologies for the huge
mail. This is my first patch so I am probably missing a lot of standards
and good practices the community is already familiar with.

Plus, I haven't implemented any tests yet. If you think it is worth
considering this new feature I can work on them.

Kind Regards,

Rafael Castro.

v1-0001-Proposal-for-in-flight-explain-logging.patch
Description: Binary data

Proposal: In-flight explain logging

Reply via email to