On Mon, Mar 2, 2026 at 4:06 PM Andres Freund <[email protected]> wrote:

Hi Andres,

> On 2026-03-02 09:01:05 +0100, Jakub Wartak wrote:
> > On Thu, Feb 26, 2026 at 5:13 PM Andres Freund <[email protected]> wrote:
> > > > > > but I think having it in PgStat_BktypeIO is not great. This makes
> > > > > > PgStat_IO 30k*BACKEND_NUM_TYPES bigger, or ~ 0.5MB. Having a stats 
> > > > > > snapshot
> > > > > > be half a megabyte bigger for no reason seems too wasteful.
> > > > >
> > > > > Yea, that's not awesome.
> > > >
> > > > Guys, question, could You please explain me what are the drawbacks of 
> > > > having
> > > > this semi-big (internal-only) stat snapshot of 0.5MB? I'm struggling to
> > > > understand two things:
> > > > a) 0.5MB is not a lot those days (ok my 286 had 1MB in the day ;))
> > >
> > > I don't really agree with that, I guess. And even if I did, it's one 
> > > thing to
> > > use 0.5MB when you actually use it, it's quite another when most of that
> > > memory is never used.
> > >
> > >
> > > With the patch, *every* backend ends up with a substantially larger
> > > pgStatLocal. Before:
> > >
> > > nm -t d --size-sort -r -S src/backend/postgres|head -n20|less
> > > (the second column is the decimal size, third the type of the symbol)
> > >
> > > 0000000004131808 0000000000297456 r yy_transition
> > > ...
> > > 0000000003916352 0000000000054744 r UnicodeDecompMain
> > > 0000000021004896 0000000000052824 B pgStatLocal
> > > 0000000003850592 0000000000040416 r unicode_categories
> > > ...
> > >
> > > after:
> > > 0000000023220512 0000000000329304 B pgStatLocal
> > > 0000000018531648 0000000000297456 r yy_transition
> > > ...
> > >
> > > And because pgStatLocal is zero initialized data, it'll be 
> > > on-demand-allocated
> > > in every single backend (whereas e.g. yy_transition is read-only shared). 
> > >  So
> > > you're not talking a single time increase, you're multiplying it by the 
> > > numer
> > > of active connections
> > >
> > > Now, it's true that most backend won't ever touch pgStatLocal.  However, 
> > > most
> > > backends will touch Pending[Backend]IOStats, which also increased 
> > > noticably:
> > >
> > > before:
> > > 0000000021060960 0000000000002880 b PendingIOStats
> > > 0000000021057792 0000000000002880 b PendingBackendStats
> > >
> > > after:
> > > 0000000023568416 0000000000018240 b PendingIOStats
> > > 0000000023549888 0000000000018240 b PendingBackendStats
> > >
> > >
> > > Again, I think some increase here doesn't have to be fatal, but increasing
> > > with mainly impossible-to-use memory seems just too much waste to mee.
> > >
> > >
> > > This also increases the shared-memory usage of pgstats: Before it used 
> > > ~300kB
> > > on a small system. That nearly doubles with this patch. But that's perhaps
> > > less concerning, given it's per-system, rather than per-backend memory 
> > > usage.
> > >
> > >
> > >
> > > > b) how does it affect anything, because testing show it's not?
> > >
> > > Which of your testing would conceivably show the effect?  The concern here
> > > isn't really performance, it's that it increases our memory usage, which 
> > > you'd
> > > only see having an effect if you are tight on memory or have a workload 
> > > that
> > > is cache sensitive.
> > >
> >
> > Oh ok, now I get understand the problem about pgStatLocal properly,
> > thanks for detailed
> > explanation! (but I'm somewhat I'm still lost a little in the woods of
> > pgstat infra). Anyway, I
> > agree that PgStat_IO started to be way too big especially when the
> > pg_stat_io(_histogram)
> > views wouldn't be really accessed.
> >
> > How about the attached v6-0002? It now dynamically allocates PgStat_IO
> > memory to avoid
> > the memory cost (only allocated if pgstat_io_snapshot_cb() is used).Is
> > that the right path? And
> > if so, perhaps it should allocate it from mxct
> > pgStatLocal.snapshot.context instead?
>
> I think even the per-backend pending IO stats are too big. And for both
> pending stats, stored stats and snapshots, I still don't think I am OK with
> storing so many histograms that are not possible to use.  I think that needs
> to be fixed first.

v7-0001: no changes for quite some time.

Memory reduction patches (I didn't want to squash them yet, so for now they
are separate):

v7-0002:
   As PendingBackendStats (per-backend IO stats) was not collecting latency
   buckets at all (but was sharing the same struct/typedef), I cloned the
   struct without those latency buckets. This shrinks the struct from 18240
   bytes back to 2880 bytes per backend (BSS), as on master.

v7-0003:
   Sadly, I couldn't easily make the backend-local recording in
   PendingIOStats allocate dynamically from within pgstat_count_io_op_time()
   on first use of a specific IO traffic type (i.e. per
   [IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES] entry), because
   that function can run inside critical sections, where any
   MemoryContextAlloc() blows up.

   It's just +15kB per backend, so I hope that is OK, given we only allocate
   it when there is a real desire to use it (track_io/wal_io_timing on) --
   nm(1) then reports just 2888 bytes (so just +8 bytes for a pointer). The
   drawback is that setting those GUCs locally won't enable histogram
   collection immediately, but only for newly spawned backends. This also
   means I had to switch to a TAP test so it can be tested. I don't have a
   strong opinion on whether saving +15kB is worth it for users not running
   with track_[io/wal_io]_timing.

v7-0004:
   (This was already sent with the previous message.) With the original v5,
   every backend had a big pgStatLocal (0000000000329304 B pgStatLocal) that
   was present but never used unless the pg_stat_io(_histogram) views were
   actually accessed. Now it is (0000000000000984 B pgStatLocal) and the
   PgStat_Snapshot.PgStat_IO part is allocated only when querying those views.

So with all three of the above combined we are back to:
0000000011573376 0000000000002888 B PendingIOStats
0000000011570304 0000000000002880 b PendingBackendStats
0000000011569184 0000000000000984 B pgStatLocal

That's an actual saving over master itself:
0000000011577344 0000000000052824 B pgStatLocal
0000000011633408 0000000000002880 b PendingIOStats
0000000011630304 0000000000002880 b PendingBackendStats

> This also increases the shared-memory usage of pgstats: Before it used ~300kB
> on a small system. That nearly doubles with this patch. But that's perhaps
> less concerning, given it's per-system, rather than per-backend memory usage.

v7-0005:
   Skipping 4 backend types out of 17 means ignoring ~23% of backend types,
   and with a simple mapping array I can get the _total_ memory allocated for
   the 'Shared Memory Stats' shm segment down from ~592384 to ~519424 bytes
   (this one was sent earlier).

v7-0006:
   We could reduce total pgstats shm usage further, down to ~482944 bytes, if
   we eliminated tracking of two more (IMHO useless) types:
   autovacuum_launcher and standalone_backend. Master is at 315904 bytes (so
   that's just ~163kB more according to pg_shm_allocations).

The patches probably still need some squashing, pgindent, etc.

-J.
From 41510e5b8da6e8c84b01249fe227f57927941f9c Mon Sep 17 00:00:00 2001
From: Jakub Wartak <[email protected]>
Date: Fri, 23 Jan 2026 08:10:09 +0100
Subject: [PATCH v7 1/6] Add pg_stat_io_histogram view to provide more detailed
 insight into IO profile

pg_stat_io_histogram displays a histogram of I/O latencies for each
backend_type, object, context and io_type. The histogram buckets allow
faster identification of I/O latency outliers caused by faulty hardware
and/or a misbehaving I/O stack. Such outliers, e.g. slow fsyncs, can
cause intermittent issues, e.g. for COMMIT, or affect the performance of
synchronous standbys.

Author: Jakub Wartak <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Ants Aasma <[email protected]>
Discussion: https://postgr.es/m/CAKZiRmwvE4uJLKTgPXeBA4m%2Bd4tTghayoefcaM9%3Dz3_S7i72GA%40mail.gmail.com
---
 configure                              |  38 ++++
 configure.ac                           |   1 +
 doc/src/sgml/config.sgml               |  12 +-
 doc/src/sgml/monitoring.sgml           | 293 ++++++++++++++++++++++++-
 doc/src/sgml/wal.sgml                  |   5 +-
 meson.build                            |   1 +
 src/backend/catalog/system_views.sql   |  11 +
 src/backend/utils/activity/pgstat_io.c |  63 ++++++
 src/backend/utils/adt/pgstatfuncs.c    | 145 ++++++++++++
 src/include/catalog/pg_proc.dat        |   9 +
 src/include/pgstat.h                   |  14 ++
 src/include/port/pg_bitutils.h         |  31 ++-
 src/test/regress/expected/rules.out    |   8 +
 src/test/regress/expected/stats.out    |  23 ++
 src/test/regress/sql/stats.sql         |  15 ++
 src/tools/pgindent/typedefs.list       |   1 +
 16 files changed, 662 insertions(+), 8 deletions(-)

diff --git a/configure b/configure
index 4aaaf92ba0a..a78ca8b99d9 100755
--- a/configure
+++ b/configure
@@ -15931,6 +15931,44 @@ cat >>confdefs.h <<_ACEOF
 #define HAVE__BUILTIN_CLZ 1
 _ACEOF
 
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_clzl" >&5
+$as_echo_n "checking for __builtin_clzl... " >&6; }
+if ${pgac_cv__builtin_clzl+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+call__builtin_clzl(unsigned long x)
+{
+    return __builtin_clzl(x);
+}
+int
+main ()
+{
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv__builtin_clzl=yes
+else
+  pgac_cv__builtin_clzl=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_clzl" >&5
+$as_echo "$pgac_cv__builtin_clzl" >&6; }
+if test x"${pgac_cv__builtin_clzl}" = xyes ; then
+
+cat >>confdefs.h <<_ACEOF
+#define HAVE__BUILTIN_CLZL 1
+_ACEOF
+
 fi
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_ctz" >&5
 $as_echo_n "checking for __builtin_ctz... " >&6; }
diff --git a/configure.ac b/configure.ac
index 9bc457bac87..fdde65205e2 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1868,6 +1868,7 @@ PGAC_CHECK_BUILTIN_FUNC([__builtin_bswap32], [int x])
 PGAC_CHECK_BUILTIN_FUNC([__builtin_bswap64], [long int x])
 # We assume that we needn't test all widths of these explicitly:
 PGAC_CHECK_BUILTIN_FUNC([__builtin_clz], [unsigned int x])
+PGAC_CHECK_BUILTIN_FUNC([__builtin_clzl], [unsigned long x])
 PGAC_CHECK_BUILTIN_FUNC([__builtin_ctz], [unsigned int x])
 # __builtin_frame_address may draw a diagnostic for non-constant argument,
 # so it needs a different test function.
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8cdd826fbd3..c06c0874fce 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8840,9 +8840,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         displayed in <link linkend="monitoring-pg-stat-database-view">
         <structname>pg_stat_database</structname></link>,
         <link linkend="monitoring-pg-stat-io-view">
-        <structname>pg_stat_io</structname></link> (if <varname>object</varname>
-        is not <literal>wal</literal>), in the output of the
-        <link linkend="pg-stat-get-backend-io">
+        <structname>pg_stat_io</structname></link> and
+        <link linkend="monitoring-pg-stat-io-histogram-view">
+        <structname>pg_stat_io_histogram</structname></link>
+        (if <varname>object</varname> is not <literal>wal</literal>),
+        in the output of the <link linkend="pg-stat-get-backend-io">
         <function>pg_stat_get_backend_io()</function></link> function (if
         <varname>object</varname> is not <literal>wal</literal>), in the
         output of <xref linkend="sql-explain"/> when the <literal>BUFFERS</literal>
@@ -8872,7 +8874,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         measure the overhead of timing on your system.
         I/O timing information is displayed in
         <link linkend="monitoring-pg-stat-io-view">
-        <structname>pg_stat_io</structname></link> for the
+        <structname>pg_stat_io</structname></link> and
+        <link linkend="monitoring-pg-stat-io-histogram-view">
+        <structname>pg_stat_io_histogram</structname></link> for the
         <varname>object</varname> <literal>wal</literal> and in the output of
         the <link linkend="pg-stat-get-backend-io">
         <function>pg_stat_get_backend_io()</function></link> function for the
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index b3d53550688..4e2d8251c08 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -501,6 +501,17 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
      </entry>
      </row>
 
+     <row>
+      <entry><structname>pg_stat_io_histogram</structname><indexterm><primary>pg_stat_io_histogram</primary></indexterm></entry>
+      <entry>
+       One row for each combination of backend type, context, target object,
+       IO operation type and latency bucket (in microseconds) containing
+       cluster-wide I/O statistics.
+       See <link linkend="monitoring-pg-stat-io-histogram-view">
+       <structname>pg_stat_io_histogram</structname></link> for details.
+     </entry>
+     </row>
+
      <row>
       <entry><structname>pg_stat_replication_slots</structname><indexterm><primary>pg_stat_replication_slots</primary></indexterm></entry>
       <entry>One row per replication slot, showing statistics about the
@@ -698,7 +709,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The <structname>pg_stat_io</structname> and
-   <structname>pg_statio_</structname> set of views are useful for determining
+   <structname>pg_stat_io_histogram</structname> views are useful for determining
    the effectiveness of the buffer cache. They can be used to calculate a cache
    hit ratio. Note that while <productname>PostgreSQL</productname>'s I/O
    statistics capture most instances in which the kernel was invoked in order
@@ -707,6 +718,8 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    Users are advised to use the <productname>PostgreSQL</productname>
    statistics views in combination with operating system utilities for a more
    complete picture of their database's I/O performance.
+   Furthermore, the <structname>pg_stat_io_histogram</structname> view can be helpful
+   in identifying latency outliers for specific I/O operations.
   </para>
 
  </sect2>
@@ -3275,6 +3288,284 @@ description | Waiting for a newly initialized WAL file to reach durable storage
 
  </sect2>
 
+ <sect2 id="monitoring-pg-stat-io-histogram-view">
+  <title><structname>pg_stat_io_histogram</structname></title>
+
+  <indexterm>
+   <primary>pg_stat_io_histogram</primary>
+  </indexterm>
+
+  <para>
+   The <structname>pg_stat_io_histogram</structname> view contains one row for each
+   combination of backend type, target I/O object, I/O context, I/O operation
+   type and latency bucket, showing cluster-wide I/O statistics. Combinations
+   which do not make sense are omitted.
+  </para>
+
+  <para>
+   The view shows the I/O latency as perceived by the backend, not by the kernel
+   or the device. This is an important distinction when troubleshooting, as the
+   I/O latency observed by the backend may be affected by:
+   <itemizedlist>
+     <listitem>
+        <para>OS scheduler decisions and available CPU resources.</para>
+        <para>With AIO, time spent servicing other I/Os from the queue, which will often inflate the observed latency.</para>
+        <para>For writes, additional filesystem journaling operations.</para>
+     </listitem>
+  </itemizedlist>
+  </para>
+
+  <para>
+   Currently, I/O on relations (e.g. tables, indexes) and WAL activity are
+   tracked. However, relation I/O which bypasses shared buffers
+   (e.g. when moving a table from one tablespace to another) is currently
+   not tracked.
+  </para>
+
+  <table id="pg-stat-io-histogram-view" xreflabel="pg_stat_io_histogram">
+   <title><structname>pg_stat_io_histogram</structname> View</title>
+   <tgroup cols="1">
+    <thead>
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        Column Type
+       </para>
+       <para>
+        Description
+       </para>
+      </entry>
+     </row>
+    </thead>
+    <tbody>
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>backend_type</structfield> <type>text</type>
+       </para>
+       <para>
+        Type of backend (e.g. background worker, autovacuum worker). See <link
+        linkend="monitoring-pg-stat-activity-view">
+        <structname>pg_stat_activity</structname></link> for more information
+        on <varname>backend_type</varname>s. Some
+        <varname>backend_type</varname>s do not accumulate I/O operation
+        statistics and will not be included in the view.
+       </para>
+      </entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>object</structfield> <type>text</type>
+       </para>
+       <para>
+        Target object of an I/O operation. Possible values are:
+       <itemizedlist>
+        <listitem>
+         <para>
+          <literal>relation</literal>: Permanent relations.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>temp relation</literal>: Temporary relations.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>wal</literal>: Write Ahead Logs.
+         </para>
+        </listitem>
+       </itemizedlist>
+       </para>
+      </entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>context</structfield> <type>text</type>
+       </para>
+       <para>
+        The context of an I/O operation. Possible values are:
+       </para>
+       <itemizedlist>
+        <listitem>
+         <para>
+          <literal>normal</literal>: The default or standard
+          <varname>context</varname> for a type of I/O operation. For
+          example, by default, relation data is read into and written out from
+          shared buffers. Thus, reads and writes of relation data to and from
+          shared buffers are tracked in <varname>context</varname>
+          <literal>normal</literal>.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>init</literal>: I/O operations performed while creating the
+          WAL segments are tracked in <varname>context</varname>
+          <literal>init</literal>.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>vacuum</literal>: I/O operations performed outside of shared
+          buffers while vacuuming and analyzing permanent relations. Temporary
+          table vacuums use the same local buffer pool as other temporary table
+          I/O operations and are tracked in <varname>context</varname>
+          <literal>normal</literal>.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>bulkread</literal>: Certain large read I/O operations
+          done outside of shared buffers, for example, a sequential scan of a
+          large table.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>bulkwrite</literal>: Certain large write I/O operations
+          done outside of shared buffers, such as <command>COPY</command>.
+         </para>
+        </listitem>
+       </itemizedlist>
+      </entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>io_type</structfield> <type>text</type>
+       </para>
+       <para>
+        The type of I/O operation. Possible values are:
+       </para>
+       <itemizedlist>
+        <listitem>
+         <para>
+          <literal>evict</literal>: eviction from shared buffers cache.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>fsync</literal>: synchronization of modified kernel's
+          filesystem page cache with storage device.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>hit</literal>: shared buffers cache lookup hit.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>reuse</literal>: reuse of an existing buffer in a
+          size-limited ring buffer (applies to the <literal>bulkread</literal>,
+          <literal>bulkwrite</literal>, and <literal>vacuum</literal> contexts).
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>writeback</literal>: request that the kernel write the
+          described dirty data out to disk, preferably asynchronously.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>extend</literal>: add new zeroed blocks to the end of a file.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>read</literal>: read operations.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>write</literal>: write operations.
+         </para>
+        </listitem>
+       </itemizedlist>
+      </entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>bucket_latency_us</structfield> <type>int4range</type>
+       </para>
+       <para>
+        The latency range covered by this bucket, in microseconds.
+       </para>
+      </entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>bucket_count</structfield> <type>bigint</type>
+       </para>
+       <para>
+        Number of I/O operations whose latency fell into this bucket (i.e.
+        within <varname>bucket_latency_us</varname> microseconds).
+       </para>
+      </entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+       </para>
+       <para>
+        Time at which these statistics were last reset.
+       </para>
+      </entry>
+     </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   Some backend types never perform I/O operations on some I/O objects and/or
+   in some I/O contexts. Rows for valid combinations that have seen no I/O so
+   far will simply display zero bucket counts.
+  </para>
+
+  <para>
+   <structname>pg_stat_io_histogram</structname> can be used to identify
+   storage I/O issues.
+   For example:
+   <itemizedlist>
+    <listitem>
+     <para>
+      Abnormally high latency for <literal>fsync</literal> operations might
+      indicate I/O saturation, oversubscription or hardware connectivity issues.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      Unusually high latency for <literal>fsync</literal> operations by a
+      standby's startup process might be responsible for long commit durations
+      in synchronous replication setups.
+     </para>
+    </listitem>
+   </itemizedlist>
+  </para>
+
+  <note>
+   <para>
+    Columns tracking I/O wait time will only be non-zero when
+    <xref linkend="guc-track-io-timing"/> is enabled. The user should be
+    careful when referencing these columns in combination with their
+    corresponding I/O operations in case <varname>track_io_timing</varname>
+    was not enabled for the entire time since the last stats reset.
+   </para>
+  </note>
+ </sect2>
+
  <sect2 id="monitoring-pg-stat-bgwriter-view">
   <title><structname>pg_stat_bgwriter</structname></title>
 
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index f3b86b26be9..8b8c407e69f 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -832,8 +832,9 @@
    of times <function>XLogWrite</function> writes and
    <function>issue_xlog_fsync</function> syncs WAL data to disk are also
    counted as <varname>writes</varname> and <varname>fsyncs</varname>
-   in <structname>pg_stat_io</structname> for the <varname>object</varname>
-   <literal>wal</literal>, respectively.
+   in <structname>pg_stat_io</structname> and
+   <structname>pg_stat_io_histogram</structname> for the
+   <varname>object</varname> <literal>wal</literal>, respectively.
   </para>
 
   <para>
diff --git a/meson.build b/meson.build
index 2df54409ca6..00575624688 100644
--- a/meson.build
+++ b/meson.build
@@ -2045,6 +2045,7 @@ builtins = [
   'bswap32',
   'bswap64',
   'clz',
+  'clzl',
   'ctz',
   'constant_p',
   'frame_address',
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 2eda7d80d02..55c3ec4eaec 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1247,6 +1247,17 @@ SELECT
        b.stats_reset
 FROM pg_stat_get_io() b;
 
+CREATE VIEW pg_stat_io_histogram AS
+SELECT
+       b.backend_type,
+       b.object,
+       b.context,
+       b.io_type,
+       b.bucket_latency_us,
+       b.bucket_count,
+       b.stats_reset
+FROM pg_stat_get_io_histogram() b;
+
 CREATE VIEW pg_stat_wal AS
     SELECT
         w.wal_records,
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 28de24538dc..148a2a9c7d5 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -17,6 +17,7 @@
 #include "postgres.h"
 
 #include "executor/instrument.h"
+#include "port/pg_bitutils.h"
 #include "storage/bufmgr.h"
 #include "utils/pgstat_internal.h"
 
@@ -107,6 +108,32 @@ pgstat_prepare_io_time(bool track_io_guc)
 	return io_start;
 }
 
+#define MIN_PG_STAT_IO_HIST_LATENCY 8191
+static inline int
+get_bucket_index(uint64_t ns)
+{
+	const uint32_t max_index = PGSTAT_IO_HIST_BUCKETS - 1;
+
+	/* hopefully precomputed by the compiler: clzl(8191) = clzl(0x1FFF) */
+	const uint32_t min_latency_leading_zeros =
+		pg_leading_zero_bits64(MIN_PG_STAT_IO_HIST_LATENCY);
+
+	/*
+	 * Make sure tmp is at least 8191 (our minimum bucket size), as
+	 * __builtin_clzl is undefined for an argument of 0.
+	 */
+	uint64_t	tmp = ns | MIN_PG_STAT_IO_HIST_LATENCY;
+
+	/* count leading zeros */
+	int			leading_zeros = pg_leading_zero_bits64(tmp);
+
+	/* normalize the index */
+	uint32_t	index = min_latency_leading_zeros - leading_zeros;
+
+	/* clamp it to the maximum */
+	return (index > max_index) ? max_index : index;
+}
+
 /*
  * Like pgstat_count_io_op() except it also accumulates time.
  *
@@ -125,6 +152,7 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 	if (!INSTR_TIME_IS_ZERO(start_time))
 	{
 		instr_time	io_time;
+		int bucket_index;
 
 		INSTR_TIME_SET_CURRENT(io_time);
 		INSTR_TIME_SUBTRACT(io_time, start_time);
@@ -152,6 +180,10 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 		INSTR_TIME_ADD(PendingIOStats.pending_times[io_object][io_context][io_op],
 					   io_time);
 
+		/* calculate the bucket_index based on latency in nanoseconds (uint64) */
+		bucket_index = get_bucket_index(INSTR_TIME_GET_NANOSEC(io_time));
+		PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][bucket_index]++;
+
 		/* Add the per-backend count */
 		pgstat_count_backend_io_op_time(io_object, io_context, io_op,
 										io_time);
@@ -221,6 +253,10 @@ pgstat_io_flush_cb(bool nowait)
 
 				bktype_shstats->times[io_object][io_context][io_op] +=
 					INSTR_TIME_GET_MICROSEC(time);
+
+				for (int b = 0; b < PGSTAT_IO_HIST_BUCKETS; b++)
+					bktype_shstats->hist_time_buckets[io_object][io_context][io_op][b] +=
+						PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][b];
 			}
 		}
 	}
@@ -274,6 +310,33 @@ pgstat_get_io_object_name(IOObject io_object)
 	pg_unreachable();
 }
 
+const char *
+pgstat_get_io_op_name(IOOp io_op)
+{
+	switch (io_op)
+	{
+		case IOOP_EVICT:
+			return "evict";
+		case IOOP_FSYNC:
+			return "fsync";
+		case IOOP_HIT:
+			return "hit";
+		case IOOP_REUSE:
+			return "reuse";
+		case IOOP_WRITEBACK:
+			return "writeback";
+		case IOOP_EXTEND:
+			return "extend";
+		case IOOP_READ:
+			return "read";
+		case IOOP_WRITE:
+			return "write";
+	}
+
+	elog(ERROR, "unrecognized IOOp value: %d", io_op);
+	pg_unreachable();
+}
+
 void
 pgstat_io_init_shmem_cb(void *stats)
 {
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index b1df96e7b0b..ac08ab14195 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -18,6 +18,7 @@
 #include "access/xlog.h"
 #include "access/xlogprefetcher.h"
 #include "catalog/catalog.h"
+#include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
 #include "common/ip.h"
@@ -30,6 +31,7 @@
 #include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
+#include "utils/rangetypes.h"
 #include "utils/timestamp.h"
 
 #define UINT32_ACCESS_ONCE(var)		 ((uint32)(*((volatile uint32 *)&(var))))
@@ -1639,6 +1641,149 @@ pg_stat_get_backend_io(PG_FUNCTION_ARGS)
 	return (Datum) 0;
 }
 
+/*
+ * When adding a new column to the pg_stat_io_histogram view and the
+ * pg_stat_get_io_histogram() function, add a new enum value here above
+ * HIST_IO_NUM_COLUMNS.
+ */
+typedef enum hist_io_stat_col
+{
+	HIST_IO_COL_INVALID = -1,
+	HIST_IO_COL_BACKEND_TYPE,
+	HIST_IO_COL_OBJECT,
+	HIST_IO_COL_CONTEXT,
+	HIST_IO_COL_IOTYPE,
+	HIST_IO_COL_BUCKET_US,
+	HIST_IO_COL_COUNT,
+	HIST_IO_COL_RESET_TIME,
+	HIST_IO_NUM_COLUMNS
+} hist_io_stat_col;
+
+/*
+ * pg_stat_io_histogram_build_tuples
+ *
+ * Helper routine for pg_stat_get_io_histogram() and pg_stat_get_backend_io()
+ * filling a result tuplestore with one tuple for each object and each
+ * context supported by the caller, based on the contents of bktype_stats.
+ */
+static void
+pg_stat_io_histogram_build_tuples(ReturnSetInfo *rsinfo,
+						PgStat_BktypeIO *bktype_stats,
+						BackendType bktype,
+						TimestampTz stat_reset_timestamp)
+{
+	/* Get OID for int4range type */
+	Datum		bktype_desc = CStringGetTextDatum(GetBackendTypeDesc(bktype));
+	Oid			range_typid = TypenameGetTypid("int4range");
+	TypeCacheEntry *typcache = lookup_type_cache(range_typid, TYPECACHE_RANGE_INFO);
+
+	for (int io_obj = 0; io_obj < IOOBJECT_NUM_TYPES; io_obj++)
+	{
+		const char *obj_name = pgstat_get_io_object_name(io_obj);
+
+		for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+		{
+			const char *context_name = pgstat_get_io_context_name(io_context);
+
+			/*
+			 * Some combinations of BackendType, IOObject, and IOContext are
+			 * not valid for any type of IOOp. In such cases, omit the entire
+			 * row from the view.
+			 */
+			if (!pgstat_tracks_io_object(bktype, io_obj, io_context))
+				continue;
+
+			for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+			{
+				const char *op_name = pgstat_get_io_op_name(io_op);
+
+				for (int bucket = 0; bucket < PGSTAT_IO_HIST_BUCKETS; bucket++) {
+					Datum		values[HIST_IO_NUM_COLUMNS] = {0};
+					bool		nulls[HIST_IO_NUM_COLUMNS] = {0};
+					RangeBound	lower, upper;
+					RangeType	*range;
+
+					values[HIST_IO_COL_BACKEND_TYPE] = bktype_desc;
+					values[HIST_IO_COL_OBJECT] = CStringGetTextDatum(obj_name);
+					values[HIST_IO_COL_CONTEXT] = CStringGetTextDatum(context_name);
+					values[HIST_IO_COL_IOTYPE] = CStringGetTextDatum(op_name);
+
+					/* bucket's latency range in microseconds */
+					if (bucket == 0)
+						lower.val = Int32GetDatum(0);
+					else
+						lower.val = Int32GetDatum(1 << (2 + bucket));
+					lower.infinite = false;
+					lower.inclusive = true;
+					lower.lower = true;
+
+					if (bucket == PGSTAT_IO_HIST_BUCKETS - 1)
+						upper.infinite = true;
+					else
+					{
+						upper.val = Int32GetDatum(1 << (2 + bucket + 1));
+						upper.infinite = false;
+					}
+					/* half-open: [lower, upper) so adjacent buckets don't overlap */
+					upper.inclusive = false;
+					upper.lower = false;
+
+					range = make_range(typcache, &lower, &upper, false, NULL);
+					values[HIST_IO_COL_BUCKET_US] = RangeTypePGetDatum(range);
+
+					/* bucket count */
+					values[HIST_IO_COL_COUNT] = Int64GetDatum(
+						bktype_stats->hist_time_buckets[io_obj][io_context][io_op][bucket]);
+
+					if (stat_reset_timestamp != 0)
+						values[HIST_IO_COL_RESET_TIME] = TimestampTzGetDatum(stat_reset_timestamp);
+					else
+						nulls[HIST_IO_COL_RESET_TIME] = true;
+
+					tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+								 values, nulls);
+				}
+			}
+		}
+	}
+}
+
+Datum
+pg_stat_get_io_histogram(PG_FUNCTION_ARGS)
+{
+	ReturnSetInfo *rsinfo;
+	PgStat_IO  *backends_io_stats;
+
+	InitMaterializedSRF(fcinfo, 0);
+	rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+	backends_io_stats = pgstat_fetch_stat_io();
+
+	for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+	{
+		PgStat_BktypeIO *bktype_stats = &backends_io_stats->stats[bktype];
+
+		/*
+		 * In Assert builds, we can afford an extra loop through all of the
+		 * counters (in pgstat_bktype_io_stats_valid()), checking that only
+		 * expected stats are non-zero, since it keeps the non-Assert code
+		 * cleaner.
+		 */
+		Assert(pgstat_bktype_io_stats_valid(bktype_stats, bktype));
+
+		/*
+		 * For those BackendTypes without IO Operation stats, skip
+		 * representing them in the view altogether.
+		 */
+		if (!pgstat_tracks_io_bktype(bktype))
+			continue;
+
+		/* save tuples with data from this PgStat_BktypeIO */
+		pg_stat_io_histogram_build_tuples(rsinfo, bktype_stats, bktype,
+								backends_io_stats->stat_reset_timestamp);
+	}
+
+	return (Datum) 0;
+}
+
 /*
  * pg_stat_wal_build_tuple
  *
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 361e2cfffeb..3ba04f9e11f 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6038,6 +6038,15 @@
   proargnames => '{backend_type,object,context,reads,read_bytes,read_time,writes,write_bytes,write_time,writebacks,writeback_time,extends,extend_bytes,extend_time,hits,evictions,reuses,fsyncs,fsync_time,stats_reset}',
   prosrc => 'pg_stat_get_io' },
 
+{ oid => '6149', descr => 'statistics: per backend type IO latency histogram',
+  proname => 'pg_stat_get_io_histogram', prorows => '480', proretset => 't',
+  provolatile => 'v', proparallel => 'r', prorettype => 'record',
+  proargtypes => '',
+  proallargtypes => '{text,text,text,text,int4range,int8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o}',
+  proargnames => '{backend_type,object,context,io_type,bucket_latency_us,bucket_count,stats_reset}',
+  prosrc => 'pg_stat_get_io_histogram' },
+
 { oid => '6386', descr => 'statistics: backend IO statistics',
   proname => 'pg_stat_get_backend_io', prorows => '5', proretset => 't',
   provolatile => 'v', proparallel => 'r', prorettype => 'record',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0e9d2b4c623..8e06f3e05d2 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -326,11 +326,23 @@ typedef enum IOOp
 	(((unsigned int) (io_op)) < IOOP_NUM_TYPES && \
 	 ((unsigned int) (io_op)) >= IOOP_EXTEND)
 
+/*
+ * The bucket count balances speed against usefulness to users:
+ * 1. Cover both fast and slow device types (0.01ms - 15ms)
+ * 2. Also cover sporadic long-tail latencies (hardware issues,
+ *    delayed fsyncs, stuck I/O)
+ * 3. Keep the footprint small: 16 * sizeof(uint64) = 128 bytes,
+ *    i.e. two cachelines per (object, context, op) combination.
+ */
+#define PGSTAT_IO_HIST_BUCKETS 16
+
 typedef struct PgStat_BktypeIO
 {
 	uint64		bytes[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 	PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 	PgStat_Counter times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	uint64		hist_time_buckets[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES][PGSTAT_IO_HIST_BUCKETS];
 } PgStat_BktypeIO;
 
 typedef struct PgStat_PendingIO
@@ -338,6 +350,7 @@ typedef struct PgStat_PendingIO
 	uint64		bytes[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 	PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 	instr_time	pending_times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	uint64		pending_hist_time_buckets[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES][PGSTAT_IO_HIST_BUCKETS];
 } PgStat_PendingIO;
 
 typedef struct PgStat_IO
@@ -610,6 +623,7 @@ extern void pgstat_count_io_op_time(IOObject io_object, IOContext io_context,
 extern PgStat_IO *pgstat_fetch_stat_io(void);
 extern const char *pgstat_get_io_context_name(IOContext io_context);
 extern const char *pgstat_get_io_object_name(IOObject io_object);
+extern const char *pgstat_get_io_op_name(IOOp io_op);
 
 extern bool pgstat_tracks_io_bktype(BackendType bktype);
 extern bool pgstat_tracks_io_object(BackendType bktype,
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 0bca559caaa..4e94257682b 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -32,6 +32,35 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
 extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
 extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
 
+
+/*
+ * pg_leading_zero_bits64
+ *		Returns the number of leading 0-bits in "word", starting at the
+ *		most significant bit position.  "word" must not be 0; the result
+ *		of the __builtin_clz family is undefined for 0 (the fallback
+ *		returns 64).
+ */
+static inline int
+pg_leading_zero_bits64(uint64 word)
+{
+#ifdef HAVE__BUILTIN_CLZ
+	Assert(word != 0);
+
+#if SIZEOF_LONG == 8
+	return __builtin_clzl(word);
+#elif SIZEOF_LONG_LONG == 8
+	return __builtin_clzll(word);
+#else
+#error must have a working 64-bit integer datatype
+#endif
+#else
+	int n = 64;
+	uint64 y;
+	if (word == 0)
+		return 64;
+
+	y = word >> 32; if (y != 0) { n -= 32; word = y; }
+	y = word >> 16; if (y != 0) { n -= 16; word = y; }
+	y = word >> 8;  if (y != 0) { n -= 8;  word = y; }
+	y = word >> 4;  if (y != 0) { n -= 4;  word = y; }
+	y = word >> 2;  if (y != 0) { n -= 2;  word = y; }
+	y = word >> 1;  if (y != 0) { return n - 2; }
+	return n - 1;
+#endif
+}
+
 /*
  * pg_leftmost_one_pos32
  *		Returns the position of the most significant set bit in "word",
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index deb6e2ad6a9..e3be836a461 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1951,6 +1951,14 @@ pg_stat_io| SELECT backend_type,
     fsync_time,
     stats_reset
    FROM pg_stat_get_io() b(backend_type, object, context, reads, read_bytes, read_time, writes, write_bytes, write_time, writebacks, writeback_time, extends, extend_bytes, extend_time, hits, evictions, reuses, fsyncs, fsync_time, stats_reset);
+pg_stat_io_histogram| SELECT backend_type,
+    object,
+    context,
+    io_type,
+    bucket_latency_us,
+    bucket_count,
+    stats_reset
+   FROM pg_stat_get_io_histogram() b(backend_type, object, context, io_type, bucket_latency_us, bucket_count, stats_reset);
 pg_stat_progress_analyze| SELECT s.pid,
     s.datid,
     d.datname,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index cd00f35bf7a..4c95f09d651 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1765,6 +1765,29 @@ SELECT :my_io_stats_pre_reset > :my_io_stats_post_backend_reset;
  t
 (1 row)
 
+-- Check that pg_stat_io_histogram sees growing counts in its buckets.
+-- We could also test the checkpointer, but it often runs with fsync=off
+-- during the tests.
+SET track_io_timing TO 'on';
+SELECT sum(bucket_count) AS hist_bucket_count_sum FROM pg_stat_get_io_histogram()
+WHERE backend_type='client backend' AND object='relation' AND context='normal' \gset
+CREATE TABLE test_io_hist(id bigint);
+INSERT INTO test_io_hist SELECT generate_series(1, 100) s;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush 
+--------------------------
+ 
+(1 row)
+
+SELECT sum(bucket_count) AS hist_bucket_count_sum2 FROM pg_stat_get_io_histogram()
+WHERE backend_type='client backend' AND object='relation' AND context='normal' \gset
+SELECT :hist_bucket_count_sum2 > :hist_bucket_count_sum;
+ ?column? 
+----------
+ t
+(1 row)
+
+RESET track_io_timing;
 -- Check invalid input for pg_stat_get_backend_io()
 SELECT pg_stat_get_backend_io(NULL);
  pg_stat_get_backend_io 
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index 8768e0f27fd..063b1011d7e 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -841,6 +841,21 @@ SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) +
   FROM pg_stat_get_backend_io(pg_backend_pid()) \gset
 SELECT :my_io_stats_pre_reset > :my_io_stats_post_backend_reset;
 
+
+-- Check that pg_stat_io_histogram sees growing counts in its buckets.
+-- We could also test the checkpointer, but it often runs with fsync=off
+-- during the tests.
+SET track_io_timing TO 'on';
+SELECT sum(bucket_count) AS hist_bucket_count_sum FROM pg_stat_get_io_histogram()
+WHERE backend_type='client backend' AND object='relation' AND context='normal' \gset
+CREATE TABLE test_io_hist(id bigint);
+INSERT INTO test_io_hist SELECT generate_series(1, 100) s;
+SELECT pg_stat_force_next_flush();
+SELECT sum(bucket_count) AS hist_bucket_count_sum2 FROM pg_stat_get_io_histogram()
+WHERE backend_type='client backend' AND object='relation' AND context='normal' \gset
+SELECT :hist_bucket_count_sum2 > :hist_bucket_count_sum;
+RESET track_io_timing;
+
 -- Check invalid input for pg_stat_get_backend_io()
 SELECT pg_stat_get_backend_io(NULL);
 SELECT pg_stat_get_backend_io(0);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 77e3c04144e..15c16db8793 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3758,6 +3758,7 @@ gtrgm_consistent_cache
 gzFile
 heap_page_items_state
 help_handler
+histogram_io_stat_col
 hlCheck
 hstoreCheckKeyLen_t
 hstoreCheckValLen_t
-- 
2.43.0

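As a side note for reviewers: `get_bucket_index()` itself is not visible in this excerpt, but the power-of-two bucket mapping can be sketched in standalone C as below. The function names and exact boundaries here are illustrative placeholders, chosen only to be consistent with the int4range boundaries the view builds (bucket 0 covers [0, 8) us, bucket b covers [2^(b+2), 2^(b+3)) us, and the last bucket is open-ended):

```c
#include <assert.h>
#include <stdint.h>

#define PGSTAT_IO_HIST_BUCKETS 16

/* Portable count-leading-zeros for a 64-bit word; "word" must not be 0. */
static inline int
leading_zero_bits64(uint64_t word)
{
#if defined(__GNUC__) || defined(__clang__)
	return __builtin_clzll(word);
#else
	int			n = 0;

	for (uint64_t mask = (uint64_t) 1 << 63; (word & mask) == 0; mask >>= 1)
		n++;
	return n;
#endif
}

/*
 * Map an I/O latency in nanoseconds to a histogram bucket.  Latencies
 * below 8 us land in bucket 0; each later bucket doubles the covered
 * range; anything past the last boundary is clamped into the top bucket.
 */
static int
bucket_index(uint64_t latency_ns)
{
	uint64_t	latency_us = latency_ns / 1000;
	int			msb;

	if (latency_us < 8)
		return 0;
	msb = 63 - leading_zero_bits64(latency_us);
	/* 8..15 us has msb 3 -> bucket 1, 16..31 us -> bucket 2, ... */
	if (msb - 2 >= PGSTAT_IO_HIST_BUCKETS)
		return PGSTAT_IO_HIST_BUCKETS - 1;
	return msb - 2;
}
```

The key property is that finding a bucket is one division plus one clz instruction, cheap enough to run on every timed I/O.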
From fab13516302da9ddcb3fb7b7ec5699182c2a9ce6 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <[email protected]>
Date: Fri, 6 Mar 2026 08:25:54 +0100
Subject: [PATCH v7 2/6] Save memory in PendingBackendStats

---
 src/backend/utils/activity/pgstat_backend.c |  4 ++--
 src/include/pgstat.h                        | 16 ++++++++++++----
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/src/backend/utils/activity/pgstat_backend.c b/src/backend/utils/activity/pgstat_backend.c
index f2f8d3ff75f..4cd3fb923c9 100644
--- a/src/backend/utils/activity/pgstat_backend.c
+++ b/src/backend/utils/activity/pgstat_backend.c
@@ -167,7 +167,7 @@ pgstat_flush_backend_entry_io(PgStat_EntryRef *entry_ref)
 {
 	PgStatShared_Backend *shbackendent;
 	PgStat_BktypeIO *bktype_shstats;
-	PgStat_PendingIO pending_io;
+	PgStat_BackendPendingIO pending_io;
 
 	/*
 	 * This function can be called even if nothing at all has happened for IO
@@ -204,7 +204,7 @@ pgstat_flush_backend_entry_io(PgStat_EntryRef *entry_ref)
 	/*
 	 * Clear out the statistics buffer, so it can be re-used.
 	 */
-	MemSet(&PendingBackendStats.pending_io, 0, sizeof(PgStat_PendingIO));
+	MemSet(&PendingBackendStats.pending_io, 0, sizeof(PgStat_BackendPendingIO));
 
 	backend_has_iostats = false;
 }
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 8e06f3e05d2..9554de3a803 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -521,15 +521,23 @@ typedef struct PgStat_Backend
 } PgStat_Backend;
 
 /* ---------
- * PgStat_BackendPending	Non-flushed backend stats.
+ * PgStat_BackendPending(IO)	Non-flushed backend stats.
  * ---------
  */
+typedef struct PgStat_BackendPendingIO
+{
+	uint64		bytes[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	instr_time	pending_times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+} PgStat_BackendPendingIO;
+
 typedef struct PgStat_BackendPending
 {
 	/*
-	 * Backend statistics store the same amount of IO data as PGSTAT_KIND_IO.
-	 */
-	PgStat_PendingIO pending_io;
+	 * Backend statistics store almost the same amount of IO data as
+	 * PGSTAT_KIND_IO.  The only difference between PgStat_BackendPendingIO
+	 * and PgStat_PendingIO is that the latter also tracks IO latency
+	 * histograms.
+	 */
+	PgStat_BackendPendingIO pending_io;
 } PgStat_BackendPending;
 
 /*
-- 
2.43.0

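For a rough sense of the saving from this patch: once per-op latency histograms exist, they dominate the struct, so keeping them out of the per-backend pending stats shrinks it considerably. A toy model (the dimension constants below are made up, not the real IOOBJECT/IOCONTEXT/IOOP cardinalities):

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in dimensions; the real enums have different cardinalities. */
#define NOBJ 2
#define NCTX 4
#define NOP 9
#define NBUCKET 16

/* Stats including per-op latency histograms (like PGSTAT_KIND_IO). */
typedef struct WithHist
{
	uint64_t	counts[NOBJ][NCTX][NOP];
	uint64_t	hist[NOBJ][NCTX][NOP][NBUCKET];
} WithHist;

/* Per-backend pending stats without the histograms. */
typedef struct WithoutHist
{
	uint64_t	counts[NOBJ][NCTX][NOP];
} WithoutHist;
```

With 16 buckets per counter, the histogram array is 16x the size of the plain counters, i.e. the struct with histograms is 17x larger, which is why every per-backend copy matters.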
From bcb30aadc7a62f73d0958e2dc84aa285b2357ab8 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <[email protected]>
Date: Fri, 6 Mar 2026 12:09:10 +0100
Subject: [PATCH v7 4/6] Convert the PgStat_IO snapshot to a pointer to avoid
 a large static allocation when it is unused.

---
 src/backend/utils/activity/pgstat.c    |  9 ++++++++-
 src/backend/utils/activity/pgstat_io.c | 14 +++++++++++---
 src/include/utils/pgstat_internal.h    |  2 +-
 3 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index f015f217766..d61c50a4aef 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -1644,10 +1644,17 @@ pgstat_write_statsfile(void)
 
 		pgstat_build_snapshot_fixed(kind);
 		if (pgstat_is_kind_builtin(kind))
-			ptr = ((char *) &pgStatLocal.snapshot) + info->snapshot_ctl_off;
+		{
+			if (kind == PGSTAT_KIND_IO)
+				ptr = (char *) pgStatLocal.snapshot.io;
+			else
+				ptr = ((char *) &pgStatLocal.snapshot) + info->snapshot_ctl_off;
+		}
 		else
 			ptr = pgStatLocal.snapshot.custom_data[kind - PGSTAT_KIND_CUSTOM_MIN];
 
+		Assert(ptr != NULL);
+
 		fputc(PGSTAT_FILE_ENTRY_FIXED, fpout);
 		pgstat_write_chunk_s(fpout, &kind);
 		pgstat_write_chunk(fpout, ptr, info->shared_data_len);
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index ae689d3926e..8605ea65605 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -19,6 +19,7 @@
 #include "executor/instrument.h"
 #include "port/pg_bitutils.h"
 #include "storage/bufmgr.h"
+#include "utils/memutils.h"
 #include "utils/pgstat_internal.h"
 
 PgStat_PendingIO PendingIOStats;
@@ -199,7 +200,7 @@ pgstat_fetch_stat_io(void)
 {
 	pgstat_snapshot_fixed(PGSTAT_KIND_IO);
 
-	return &pgStatLocal.snapshot.io;
+	return pgStatLocal.snapshot.io;
 }
 
 /*
@@ -348,6 +349,9 @@ pgstat_io_init_shmem_cb(void *stats)
 
 	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
 		LWLockInitialize(&stat_shmem->locks[i], LWTRANCHE_PGSTATS_DATA);
+
+	/* this might end up being lazily allocated in pgstat_io_snapshot_cb() */
+	pgStatLocal.snapshot.io = NULL;
 }
 
 void
@@ -375,11 +379,15 @@ pgstat_io_reset_all_cb(TimestampTz ts)
 void
 pgstat_io_snapshot_cb(void)
 {
+	if (unlikely(pgStatLocal.snapshot.io == NULL))
+		pgStatLocal.snapshot.io = MemoryContextAllocZero(TopMemoryContext,
+				sizeof(PgStat_IO));
+
 	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
 	{
 		LWLock	   *bktype_lock = &pgStatLocal.shmem->io.locks[i];
 		PgStat_BktypeIO *bktype_shstats = &pgStatLocal.shmem->io.stats.stats[i];
-		PgStat_BktypeIO *bktype_snap = &pgStatLocal.snapshot.io.stats[i];
+		PgStat_BktypeIO *bktype_snap = &pgStatLocal.snapshot.io->stats[i];
 
 		LWLockAcquire(bktype_lock, LW_SHARED);
 
@@ -388,7 +396,7 @@ pgstat_io_snapshot_cb(void)
 		 * the reset timestamp as well.
 		 */
 		if (i == 0)
-			pgStatLocal.snapshot.io.stat_reset_timestamp =
+			pgStatLocal.snapshot.io->stat_reset_timestamp =
 				pgStatLocal.shmem->io.stats.stat_reset_timestamp;
 
 		/* using struct assignment due to better type safety */
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 9b8fbae00ed..407657e060c 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -600,7 +600,7 @@ typedef struct PgStat_Snapshot
 
 	PgStat_CheckpointerStats checkpointer;
 
-	PgStat_IO	io;
+	PgStat_IO	*io;
 
 	PgStat_SLRUStats slru[SLRU_NUM_ELEMENTS];
 
-- 
2.43.0

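The pattern in this patch, replacing a large zero-initialized static member with a pointer that is allocated on first snapshot, can be sketched in isolation as follows (the struct and function names here are placeholders, not the patch's code):

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for the large PgStat_IO snapshot struct. */
typedef struct Snapshot
{
	long		stats[1024];
} Snapshot;

static Snapshot *snapshot_io = NULL;	/* was: a huge static struct */

/*
 * Allocate the snapshot zeroed on first use, so backends that never
 * look at I/O stats pay nothing for it.
 */
static Snapshot *
snapshot_get(void)
{
	if (snapshot_io == NULL)
		snapshot_io = calloc(1, sizeof(Snapshot));
	return snapshot_io;
}
```

This trades the BSS page-fault cost in every backend (the concern raised upthread) for one explicit heap allocation in the few backends that actually build an I/O snapshot.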
From 067d7a08972ec8728212af63f9a4a852c6fe0345 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <[email protected]>
Date: Fri, 6 Mar 2026 11:26:19 +0100
Subject: [PATCH v7 3/6] Save memory in PendingIOStats

---
 src/backend/utils/activity/pgstat.c      | 10 ++++++++
 src/backend/utils/activity/pgstat_io.c   | 20 +++++++++-------
 src/include/pgstat.h                     |  8 ++++++-
 src/test/recovery/t/029_stats_restart.pl | 29 ++++++++++++++++++++++++
 src/test/regress/expected/stats.out      | 23 -------------------
 src/test/regress/sql/stats.sql           | 15 ------------
 6 files changed, 58 insertions(+), 47 deletions(-)

diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 11bb71cad5a..f015f217766 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -104,8 +104,10 @@
 #include <unistd.h>
 
 #include "access/xact.h"
+#include "access/xlog.h"
 #include "lib/dshash.h"
 #include "pgstat.h"
+#include "storage/bufmgr.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
@@ -671,6 +673,14 @@ pgstat_initialize(void)
 	/* Set up a process-exit hook to clean up */
 	before_shmem_exit(pgstat_shutdown_hook, 0);
 
+	/* Allocate I/O latency buckets only if we are going to populate it */
+	if (track_io_timing || track_wal_io_timing)
+		PendingIOStats.pending_hist_time_buckets = MemoryContextAllocZero(TopMemoryContext,
+																		  IOOBJECT_NUM_TYPES * IOCONTEXT_NUM_TYPES * IOOP_NUM_TYPES *
+																		  PGSTAT_IO_HIST_BUCKETS * sizeof(uint64));
+	else
+		PendingIOStats.pending_hist_time_buckets = NULL;
+
 #ifdef USE_ASSERT_CHECKING
 	pgstat_is_initialized = true;
 #endif
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 148a2a9c7d5..ae689d3926e 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -21,7 +21,7 @@
 #include "storage/bufmgr.h"
 #include "utils/pgstat_internal.h"
 
-static PgStat_PendingIO PendingIOStats;
+PgStat_PendingIO PendingIOStats;
 static bool have_iostats = false;
 
 /*
@@ -180,9 +180,11 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 		INSTR_TIME_ADD(PendingIOStats.pending_times[io_object][io_context][io_op],
 					   io_time);
 
-		/* calculate the bucket_index based on latency in nanoseconds (uint64) */
-		bucket_index = get_bucket_index(INSTR_TIME_GET_NANOSEC(io_time));
-		PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][bucket_index]++;
+		if (PendingIOStats.pending_hist_time_buckets != NULL)
+		{
+			/* compute the bucket index from the latency in nanoseconds */
+			bucket_index = get_bucket_index(INSTR_TIME_GET_NANOSEC(io_time));
+			PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][bucket_index]++;
+		}
 
 		/* Add the per-backend count */
 		pgstat_count_backend_io_op_time(io_object, io_context, io_op,
@@ -254,9 +256,10 @@ pgstat_io_flush_cb(bool nowait)
 				bktype_shstats->times[io_object][io_context][io_op] +=
 					INSTR_TIME_GET_MICROSEC(time);
 
-				for(int b = 0; b < PGSTAT_IO_HIST_BUCKETS; b++)
-					bktype_shstats->hist_time_buckets[io_object][io_context][io_op][b] +=
-						PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][b];
+				if (PendingIOStats.pending_hist_time_buckets != NULL)
+					for (int b = 0; b < PGSTAT_IO_HIST_BUCKETS; b++)
+						bktype_shstats->hist_time_buckets[io_object][io_context][io_op][b] +=
+							PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][b];
 			}
 		}
 	}
@@ -265,7 +268,8 @@ pgstat_io_flush_cb(bool nowait)
 
 	LWLockRelease(bktype_lock);
 
-	memset(&PendingIOStats, 0, sizeof(PendingIOStats));
+	/*
+	 * Zero the counters but preserve the lazily-allocated histogram array
+	 * pointer, which is deliberately the struct's last member.
+	 */
+	memset(&PendingIOStats, 0, offsetof(PgStat_PendingIO, pending_hist_time_buckets));
 
 	have_iostats = false;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 9554de3a803..59114f1bc3f 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -350,9 +350,15 @@ typedef struct PgStat_PendingIO
 	uint64		bytes[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 	PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 	instr_time	pending_times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
-	uint64		pending_hist_time_buckets[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES][PGSTAT_IO_HIST_BUCKETS];
+	/*
+	 * Dynamically allocated array of [IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES]
+	 * [IOOP_NUM_TYPES][PGSTAT_IO_HIST_BUCKETS], allocated only when
+	 * track_io_timing or track_wal_io_timing is enabled.
+	 */
+	uint64		(*pending_hist_time_buckets)[IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES][PGSTAT_IO_HIST_BUCKETS];
 } PgStat_PendingIO;
 
+extern PgStat_PendingIO PendingIOStats;
+
 typedef struct PgStat_IO
 {
 	TimestampTz stat_reset_timestamp;
diff --git a/src/test/recovery/t/029_stats_restart.pl b/src/test/recovery/t/029_stats_restart.pl
index cdc427dbc78..33939c8701a 100644
--- a/src/test/recovery/t/029_stats_restart.pl
+++ b/src/test/recovery/t/029_stats_restart.pl
@@ -293,7 +293,36 @@ cmp_ok(
 	$wal_restart_immediate->{reset},
 	"$sect: reset timestamp is new");
 
+
+## Test pg_stat_io_histogram.  The histogram buckets are allocated only at
+## backend start, and only when track_io_timing/track_wal_io_timing is set,
+## so enable both globally and restart so that new backends pick them up.
+$sect = "pg_stat_io_histogram";
+$node->append_conf('postgresql.conf', "track_io_timing = 'on'");
+$node->append_conf('postgresql.conf', "track_wal_io_timing = 'on'");
+$node->restart;
+
+
+## Check that pg_stat_io_histogram sees growing counts in its buckets.
+## We could also test the checkpointer, but it often runs with fsync=off
+## during the tests.
+my $countbefore = $node->safe_psql('postgres',
+	"SELECT sum(bucket_count) AS hist_bucket_count_sum FROM pg_stat_get_io_histogram() " .
+	"WHERE backend_type='client backend' AND object='relation' AND context='normal'");
+
+$node->safe_psql('postgres', "CREATE TABLE test_io_hist(id bigint);");
+$node->safe_psql('postgres', "INSERT INTO test_io_hist SELECT generate_series(1, 100) s;");
+$node->safe_psql('postgres', "SELECT pg_stat_force_next_flush();");
+
+my $countafter = $node->safe_psql('postgres',
+	"SELECT sum(bucket_count) AS hist_bucket_count_sum FROM pg_stat_get_io_histogram() " .
+	"WHERE backend_type='client backend' AND object='relation' AND context='normal'");
+
+cmp_ok(
+	$countafter, '>', $countbefore,
+	"pg_stat_io_histogram: latency buckets growing");
+
 $node->stop;
+
 done_testing();
 
 sub trigger_funcrel_stat
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 4c95f09d651..cd00f35bf7a 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1765,29 +1765,6 @@ SELECT :my_io_stats_pre_reset > :my_io_stats_post_backend_reset;
  t
 (1 row)
 
--- Check that pg_stat_io_histogram sees growing counts in its buckets.
--- We could also test the checkpointer, but it often runs with fsync=off
--- during the tests.
-SET track_io_timing TO 'on';
-SELECT sum(bucket_count) AS hist_bucket_count_sum FROM pg_stat_get_io_histogram()
-WHERE backend_type='client backend' AND object='relation' AND context='normal' \gset
-CREATE TABLE test_io_hist(id bigint);
-INSERT INTO test_io_hist SELECT generate_series(1, 100) s;
-SELECT pg_stat_force_next_flush();
- pg_stat_force_next_flush 
---------------------------
- 
-(1 row)
-
-SELECT sum(bucket_count) AS hist_bucket_count_sum2 FROM pg_stat_get_io_histogram()
-WHERE backend_type='client backend' AND object='relation' AND context='normal' \gset
-SELECT :hist_bucket_count_sum2 > :hist_bucket_count_sum;
- ?column? 
-----------
- t
-(1 row)
-
-RESET track_io_timing;
 -- Check invalid input for pg_stat_get_backend_io()
 SELECT pg_stat_get_backend_io(NULL);
  pg_stat_get_backend_io 
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index 063b1011d7e..8768e0f27fd 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -841,21 +841,6 @@ SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) +
   FROM pg_stat_get_backend_io(pg_backend_pid()) \gset
 SELECT :my_io_stats_pre_reset > :my_io_stats_post_backend_reset;
 
-
--- Check that pg_stat_io_histogram sees growing counts in its buckets.
--- We could also test the checkpointer, but it often runs with fsync=off
--- during the tests.
-SET track_io_timing TO 'on';
-SELECT sum(bucket_count) AS hist_bucket_count_sum FROM pg_stat_get_io_histogram()
-WHERE backend_type='client backend' AND object='relation' AND context='normal' \gset
-CREATE TABLE test_io_hist(id bigint);
-INSERT INTO test_io_hist SELECT generate_series(1, 100) s;
-SELECT pg_stat_force_next_flush();
-SELECT sum(bucket_count) AS hist_bucket_count_sum2 FROM pg_stat_get_io_histogram()
-WHERE backend_type='client backend' AND object='relation' AND context='normal' \gset
-SELECT :hist_bucket_count_sum2 > :hist_bucket_count_sum;
-RESET track_io_timing;
-
 -- Check invalid input for pg_stat_get_backend_io()
 SELECT pg_stat_get_backend_io(NULL);
 SELECT pg_stat_get_backend_io(0);
-- 
2.43.0

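The `memset(..., offsetof(...))` reset used in `pgstat_io_flush_cb()` above relies on the histogram pointer being the struct's last member. A minimal standalone illustration of that trick (names here are placeholders, not the patch's types):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Toy stand-in for PgStat_PendingIO: plain counters followed by a
 * lazily-allocated pointer that must survive the per-flush reset.
 */
typedef struct Pending
{
	uint64_t	counts[8];
	uint64_t   *hist;			/* deliberately the last member */
} Pending;

/* Zero the counters but keep the pointer: memset only up to its offset. */
static void
pending_reset(Pending *p)
{
	memset(p, 0, offsetof(Pending, hist));
}
```

If a new member is ever added after the pointer, this silently stops resetting it, so the layout constraint is worth a loud comment in the struct definition.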
From 73df7dd739cb2fd98e3745dbd4f65290531a262b Mon Sep 17 00:00:00 2001
From: Jakub Wartak <[email protected]>
Date: Fri, 6 Mar 2026 13:29:40 +0100
Subject: [PATCH v7 5/6] Condense PgStat_IO.stats[BACKEND_NUM_TYPES] array by
 using PGSTAT_USED_BACKEND_NUM_TYPES to be more memory efficient.

---
 src/backend/utils/activity/pgstat_io.c | 57 +++++++++++++++++++++++---
 src/backend/utils/adt/pgstatfuncs.c    | 22 ++++++----
 src/include/miscadmin.h                |  2 +-
 src/include/pgstat.h                   |  5 ++-
 4 files changed, 71 insertions(+), 15 deletions(-)

diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 8605ea65605..1e9bff4da41 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -225,13 +225,14 @@ pgstat_io_flush_cb(bool nowait)
 {
 	LWLock	   *bktype_lock;
 	PgStat_BktypeIO *bktype_shstats;
+	BackendType condensedBkType = pgstat_remap_condensed_bktype(MyBackendType);
 
 	if (!have_iostats)
 		return false;
 
 	bktype_lock = &pgStatLocal.shmem->io.locks[MyBackendType];
 	bktype_shstats =
-		&pgStatLocal.shmem->io.stats.stats[MyBackendType];
+		&pgStatLocal.shmem->io.stats.stats[condensedBkType];
 
 	if (!nowait)
 		LWLockAcquire(bktype_lock, LW_EXCLUSIVE);
@@ -360,7 +361,11 @@ pgstat_io_reset_all_cb(TimestampTz ts)
 	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
 	{
 		LWLock	   *bktype_lock = &pgStatLocal.shmem->io.locks[i];
-		PgStat_BktypeIO *bktype_shstats = &pgStatLocal.shmem->io.stats.stats[i];
+		BackendType bktype = pgstat_remap_condensed_bktype(i);
+		PgStat_BktypeIO *bktype_shstats;
+
+		if (bktype == -1)
+			continue;
+		bktype_shstats = &pgStatLocal.shmem->io.stats.stats[bktype];
 
 		LWLockAcquire(bktype_lock, LW_EXCLUSIVE);
 
@@ -386,8 +391,13 @@ pgstat_io_snapshot_cb(void)
 	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
 	{
 		LWLock	   *bktype_lock = &pgStatLocal.shmem->io.locks[i];
-		PgStat_BktypeIO *bktype_shstats = &pgStatLocal.shmem->io.stats.stats[i];
-		PgStat_BktypeIO *bktype_snap = &pgStatLocal.snapshot.io->stats[i];
+		BackendType bktype = pgstat_remap_condensed_bktype(i);
+		PgStat_BktypeIO *bktype_shstats;
+		PgStat_BktypeIO *bktype_snap;
+
+		if (bktype == -1)
+			continue;
+		bktype_shstats = &pgStatLocal.shmem->io.stats.stats[bktype];
+		bktype_snap = &pgStatLocal.snapshot.io->stats[bktype];
 
 		LWLockAcquire(bktype_lock, LW_SHARED);
 
@@ -419,7 +429,8 @@ pgstat_io_snapshot_cb(void)
 * Function returns true if BackendType participates in the cumulative stats
 * subsystem for IO and false if it does not.
 *
-* When adding a new BackendType, also consider adding relevant restrictions to
+* When adding a new BackendType, ensure that pgstat_remap_condensed_bktype()
+* is updated and also consider adding relevant restrictions to
 * pgstat_tracks_io_object() and pgstat_tracks_io_op().
 */
 bool
@@ -457,6 +468,42 @@ pgstat_tracks_io_bktype(BackendType bktype)
 	return false;
 }
 
+
+/*
+ * Remap sparse backend type IDs to contiguous ones. Keep in sync with enum
+ * BackendType and PGSTAT_USED_BACKEND_NUM_TYPES count.
+ *
+ * Returns -1 if the input ID is invalid or unused.
+ */
+int
+pgstat_remap_condensed_bktype(BackendType bktype)
+{
+	/* -1 here means it should not be used */
+	static const int mapping_table[BACKEND_NUM_TYPES] = {
+		-1, /* B_INVALID */
+		0,
+		-1, /* B_DEAD_END_BACKEND */
+		1,
+		2,
+		3,
+		4,
+		5,
+		6,
+		-1, /* B_ARCHIVER */
+		7,
+		8,
+		9,
+		10,
+		11,
+		12,
+		13,
+		-1  /* B_LOGGER */
+	};
+
+	if (bktype < 0 || bktype >= BACKEND_NUM_TYPES)
+		return -1;
+	return mapping_table[bktype];
+}
+
 /*
  * Some BackendTypes do not perform IO on certain IOObjects or in certain
  * IOContexts. Some IOObjects are never operated on in some IOContexts. Check
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index ac08ab14195..74f1351289a 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1578,9 +1578,12 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
 
 	backends_io_stats = pgstat_fetch_stat_io();
 
-	for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
 	{
+		BackendType bktype = pgstat_remap_condensed_bktype(i);
+		if (bktype == -1)
+			continue;
 		PgStat_BktypeIO *bktype_stats = &backends_io_stats->stats[bktype];
 
 		/*
 		 * In Assert builds, we can afford an extra loop through all of the
@@ -1588,17 +1591,17 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
 		 * expected stats are non-zero, since it keeps the non-Assert code
 		 * cleaner.
 		 */
-		Assert(pgstat_bktype_io_stats_valid(bktype_stats, bktype));
+		Assert(pgstat_bktype_io_stats_valid(bktype_stats, i));
 
 		/*
 		 * For those BackendTypes without IO Operation stats, skip
 		 * representing them in the view altogether.
 		 */
-		if (!pgstat_tracks_io_bktype(bktype))
+		if (!pgstat_tracks_io_bktype(i))
 			continue;
 
 		/* save tuples with data from this PgStat_BktypeIO */
-		pg_stat_io_build_tuples(rsinfo, bktype_stats, bktype,
+		pg_stat_io_build_tuples(rsinfo, bktype_stats, i,
 								backends_io_stats->stat_reset_timestamp);
 	}
 
@@ -1757,9 +1760,12 @@ pg_stat_get_io_histogram(PG_FUNCTION_ARGS)
 
 	backends_io_stats = pgstat_fetch_stat_io();
 
-	for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
 	{
+		BackendType bktype = pgstat_remap_condensed_bktype(i);
+		if (bktype == -1)
+			continue;
 		PgStat_BktypeIO *bktype_stats = &backends_io_stats->stats[bktype];
 
 		/*
 		 * In Assert builds, we can afford an extra loop through all of the
@@ -1767,17 +1773,17 @@ pg_stat_get_io_histogram(PG_FUNCTION_ARGS)
 		 * expected stats are non-zero, since it keeps the non-Assert code
 		 * cleaner.
 		 */
-		Assert(pgstat_bktype_io_stats_valid(bktype_stats, bktype));
+		Assert(pgstat_bktype_io_stats_valid(bktype_stats, i));
 
 		/*
 		 * For those BackendTypes without IO Operation stats, skip
 		 * representing them in the view altogether.
 		 */
-		if (!pgstat_tracks_io_bktype(bktype))
+		if (!pgstat_tracks_io_bktype(i))
 			continue;
 
 		/* save tuples with data from this PgStat_BktypeIO */
-		pg_stat_io_histogram_build_tuples(rsinfo, bktype_stats, bktype,
+		pg_stat_io_histogram_build_tuples(rsinfo, bktype_stats, i,
 								backends_io_stats->stat_reset_timestamp);
 	}
 
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index f16f35659b9..d0c62d3248e 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -332,7 +332,7 @@ extern void SwitchBackToLocalLatch(void);
  * MyBackendType indicates what kind of a backend this is.
  *
  * If you add entries, please also update the child_process_kinds array in
- * launch_backend.c.
+ * launch_backend.c, and update PGSTAT_USED_BACKEND_NUM_TYPES in pgstat.h.
  */
 typedef enum BackendType
 {
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 59114f1bc3f..22114d378bd 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -359,10 +359,12 @@ typedef struct PgStat_PendingIO
 
 extern PgStat_PendingIO PendingIOStats;
 
+/* This needs to stay in sync with pgstat_tracks_io_bktype() */
+#define PGSTAT_USED_BACKEND_NUM_TYPES (BACKEND_NUM_TYPES - 4)
 typedef struct PgStat_IO
 {
 	TimestampTz stat_reset_timestamp;
-	PgStat_BktypeIO stats[BACKEND_NUM_TYPES];
+	PgStat_BktypeIO stats[PGSTAT_USED_BACKEND_NUM_TYPES];
 } PgStat_IO;
 
 typedef struct PgStat_StatDBEntry
@@ -639,6 +641,7 @@ extern const char *pgstat_get_io_context_name(IOContext io_context);
 extern const char *pgstat_get_io_object_name(IOObject io_object);
 extern const char *pgstat_get_io_op_name(IOOp io_op);
 
+extern int pgstat_remap_condensed_bktype(BackendType bktype);
 extern bool pgstat_tracks_io_bktype(BackendType bktype);
 extern bool pgstat_tracks_io_object(BackendType bktype,
 									IOObject io_object, IOContext io_context);
-- 
2.43.0

From fc233581899610e5b96b0f561dd74bd60eac17e3 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <[email protected]>
Date: Fri, 6 Mar 2026 14:00:38 +0100
Subject: [PATCH v7 6/6] Further condense and reduce memory used by the
 pgstat_io(_histogram) subsystem by no longer tracking backend types that
 do practically no IO: autovacuum launcher and standalone backend.

---
 src/backend/utils/activity/pgstat_io.c   | 17 +++++++++++------
 src/include/pgstat.h                     |  2 +-
 src/test/recovery/t/029_stats_restart.pl |  5 -----
 src/test/regress/expected/stats.out      | 14 +-------------
 4 files changed, 13 insertions(+), 25 deletions(-)

diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 1e9bff4da41..6c11430ad94 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -73,6 +73,9 @@ pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op,
 	Assert((unsigned int) io_object < IOOBJECT_NUM_TYPES);
 	Assert((unsigned int) io_context < IOCONTEXT_NUM_TYPES);
 	Assert(pgstat_is_ioop_tracked_in_bytes(io_op) || bytes == 0);
+	if (unlikely(MyBackendType == B_STANDALONE_BACKEND ||
+				 MyBackendType == B_AUTOVAC_LAUNCHER))
+		return;
 	Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
 
 	PendingIOStats.counts[io_object][io_context][io_op] += cnt;
@@ -425,6 +427,9 @@ pgstat_io_snapshot_cb(void)
 * - Syslogger because it is not connected to shared memory
 * - Archiver because most relevant archiving IO is delegated to a
 *   specialized command or module
+* - Autovacuum launcher because it hardly performs any IO
+* - Standalone backend as it is only used in unusual maintenance
+*   scenarios
 *
 * Function returns true if BackendType participates in the cumulative stats
 * subsystem for IO and false if it does not.
@@ -446,9 +451,10 @@ pgstat_tracks_io_bktype(BackendType bktype)
 		case B_DEAD_END_BACKEND:
 		case B_ARCHIVER:
 		case B_LOGGER:
+		case B_AUTOVAC_LAUNCHER:
+		case B_STANDALONE_BACKEND:
 			return false;
 
-		case B_AUTOVAC_LAUNCHER:
 		case B_AUTOVAC_WORKER:
 		case B_BACKEND:
 		case B_BG_WORKER:
@@ -456,7 +462,6 @@ pgstat_tracks_io_bktype(BackendType bktype)
 		case B_CHECKPOINTER:
 		case B_IO_WORKER:
 		case B_SLOTSYNC_WORKER:
-		case B_STANDALONE_BACKEND:
 		case B_STARTUP:
 		case B_WAL_RECEIVER:
 		case B_WAL_SENDER:
@@ -482,20 +487,20 @@ pgstat_remap_condensed_bktype(BackendType bktype) {
 		-1, /* B_INVALID */
 		0,
 		-1, /* B_DEAD_END_BACKEND */
+		-1, /* B_AUTOVAC_LAUNCHER */
 		1,
 		2,
 		3,
 		4,
+		-1, /* B_STANDALONE_BACKEND */
+		-1, /* B_ARCHIVER */
 		5,
 		6,
-		-1, /* B_ARCHIVER */
 		7,
 		8,
-		8,
+		9,
 		10,
 		11,
-		12,
-		13,
 		-1  /* B_LOGGER */
 	};
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 22114d378bd..80476eda514 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -360,7 +360,7 @@ typedef struct PgStat_PendingIO
 extern PgStat_PendingIO PendingIOStats;
 
 /* This needs to stay in sync with pgstat_tracks_io_bktype() */
-#define PGSTAT_USED_BACKEND_NUM_TYPES (BACKEND_NUM_TYPES - 4)
+#define PGSTAT_USED_BACKEND_NUM_TYPES (BACKEND_NUM_TYPES - 6)
 typedef struct PgStat_IO
 {
 	TimestampTz stat_reset_timestamp;
diff --git a/src/test/recovery/t/029_stats_restart.pl b/src/test/recovery/t/029_stats_restart.pl
index 33939c8701a..681fb9ac16d 100644
--- a/src/test/recovery/t/029_stats_restart.pl
+++ b/src/test/recovery/t/029_stats_restart.pl
@@ -22,12 +22,7 @@ my $sect = "startup";
 
 # Check some WAL statistics after a fresh startup.  The startup process
 # should have done WAL reads, and initialization some WAL writes.
-my $standalone_io_stats = io_stats('init', 'wal', 'standalone backend');
 my $startup_io_stats = io_stats('normal', 'wal', 'startup');
-cmp_ok(
-	'0', '<',
-	$standalone_io_stats->{writes},
-	"$sect: increased standalone backend IO writes");
 cmp_ok(
 	'0', '<',
 	$startup_io_stats->{reads},
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index cd00f35bf7a..7cefd37a99a 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -16,11 +16,6 @@ SHOW track_counts;  -- must be on
 SELECT backend_type, object, context FROM pg_stat_io
   ORDER BY backend_type COLLATE "C", object COLLATE "C", context COLLATE "C";
 backend_type|object|context
-autovacuum launcher|relation|bulkread
-autovacuum launcher|relation|init
-autovacuum launcher|relation|normal
-autovacuum launcher|wal|init
-autovacuum launcher|wal|normal
 autovacuum worker|relation|bulkread
 autovacuum worker|relation|init
 autovacuum worker|relation|normal
@@ -67,13 +62,6 @@ slotsync worker|relation|vacuum
 slotsync worker|temp relation|normal
 slotsync worker|wal|init
 slotsync worker|wal|normal
-standalone backend|relation|bulkread
-standalone backend|relation|bulkwrite
-standalone backend|relation|init
-standalone backend|relation|normal
-standalone backend|relation|vacuum
-standalone backend|wal|init
-standalone backend|wal|normal
 startup|relation|bulkread
 startup|relation|bulkwrite
 startup|relation|init
@@ -95,7 +83,7 @@ walsummarizer|wal|init
 walsummarizer|wal|normal
 walwriter|wal|init
 walwriter|wal|normal
-(79 rows)
+(67 rows)
 \a
 -- ensure that both seqscan and indexscan plans are allowed
 SET enable_seqscan TO on;
-- 
2.43.0
