On Tue, Jun 10, 2025 at 9:40 AM Andrew Johnson <andr...@metronome.com> wrote:
>
> Hello hackers,
>
> I'd like to propose adding a new view named "pg_stat_multixact" to
> expose multixact member usage. This addresses a major monitoring gap
> that ultimately led to a production outage at Metronome [1].
>
> Problem
> Multixact membership exhaustion is an edge case that can cause write
> lockouts, but there's no visibility into membership space usage.
> Without any direct telemetry from the database, we're essentially
> flying blind. It is possible to estimate multixact membership usage
> through scanning the filesystem, but there are several drawbacks to
> that method that Naga Appani outlined in a previous thread [2].
>
> This complements Peter Geoghegan's recent thread about vacuum failsafe
> improvements [3], where Sami Imseih noted "exposing the members
> count... will be a good idea as well" [4].
>
> Solution
> - New view (pg_stat_multixact) with the columns "members" (bigint) and
> "update_timestamp" (timestamptz).
> - Updates member count and timestamp during multixact allocation and
> freeze threshold checks.
>
> I've attached a patch that:
> - Implements this view using pgstat patterns.
> - Includes isolation tests.
> - Includes documentation changes to monitoring.sgml.
>

Hi Andrew,

Thanks for referencing my earlier proposal and for working to improve
observability around MultiXact usage. It’s great to see more attention
on this area.

After quickly reviewing your patch, I wanted to share a few thoughts on
the overall approach.

I shared a patch [0] that adds a SQL-callable function exposing the same
counters via ReadMultiXactCounts() without added complexity. Since these
values are global, not aggregatable per backend or over time, and not
meaningfully resettable, introducing new statistics infrastructure may
be more than what’s needed, unless there's an additional use case I’m
overlooking.

A lightweight function seems better aligned with the nature of these
metrics and the operational use cases they serve, particularly
historical/ongoing diagnostics and periodic monitoring.

[0] https://www.postgresql.org/message-id/CA%2BQeY%2BDTggHskCXOa39nag2sFds9BD-7k__zPbvL-_VVyJw7Sg%40mail.gmail.com

Best regards,
Naga Appani

> I have also:
> - Tested initdb works
> - Ran make check-world with --enable-tap-tests to ensure all tests pass
>
> I'm aiming to get this into the upcoming CommitFest. I would
> appreciate your thoughts on this proposal and attached patch.
>
> [1]
> https://metronome.com/blog/root-cause-analysis-postgresql-multixact-member-exhaustion-incidents-may-2025
> [2]
> https://www.postgresql.org/message-id/flat/caldsspi3gh08ntccn44uveuaygot74su6uei_06quta5rmk...@mail.gmail.com#bfd9ae766ef42f7599258183aa8ddb3b
> [3]
> https://www.postgresql.org/message-id/cah2-wzmlpwjk3gbaxy8dhy+a-juz_6ugwfe6dke8b5-dtdv...@mail.gmail.com
> [4]
> https://www.postgresql.org/message-id/CAA5RZ0u43s4YbR%3D0mJ0_k3VGWjchJHhYnCoaZVzeLd3ccZtwhQ%40mail.gmail.com
>
> --
> Respectfully,
>
> Andrew Johnson
> Software Engineer
> Metronome, Inc.