To clear up any confusion, I tried the reductions as hinted in the error message and after a few iterations still faced the same error. Also, re. the sample configuration I provided - this is the result of said iterations after initially using the stock values from the PGDG RPM.
Jim On Fri, 9 Aug 2024 at 11:26, Jim Vanns <jva...@ilm.com> wrote: > > Hi pggen community! > > I am struggling with this error almost daily now and despite various efforts, > not succeeding in avoiding or dealing with it; > > ERROR: multixact "members" limit exceeded > DETAIL: This command would create a multixact with 2 members, but the > remaining space is only enough for 0 members. > HINT: Execute a database-wide VACUUM in database with OID 16467 with reduced > vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age > settings. > > Runtime details follow below, but before that; I am now seeing the above > error almost daily after approximately 12 hours of normal or expected > behaviour and throughput. Then it hits and all writes are blocked etc. and > the service is largely unusable/unable to recover. Restarting PG does allow > autovacuum processes to kick in with aggressive vacuuming to handle the > multixact freezing, but that isn't a suitable solution! Although having read > sources that now explain why multixact XIDs exist and when they're used, I am > not able to properly figure out how to plan for it or configure postgresql > appropriately to handle it given our workload. > > My questions are; > > 1) How should I be managing this? Although not new to PG, I am new to this > particular problem. > 2) How can I confirm what processes/tables are contributing to this multixact > "members" limit? > 3) What are the units for > vacuum_multixact_freeze_min_age/vacuum_multixact_freeze_table_age and how > should I be > setting them appropriately for my rates etc.? I can't really find > anything that explains this clearly. > 4) How can I check that autovacuum workers are specifically able to freeze > multixact XIDs and thus avoid this? > 5) Can I check if autovacuum is actually succeeding in its work? > > Schema (brief): > 10 tables > 1 table (the 'hottest) partitioned by 2h ranges; so an extra 12x > partitions/relations a day, though only the most recent one might be > considered 'active' > > System (brief): > PG: 15.5 w/ TimescaleDB 2.14 extension > Write heavy workload; > Mean Txn/s (low): 8k > Mean Txn/s (high): 10k > Mean rows/s: 100k > Concurrency: 32 threads (local socket sessions) for 'upserts' via primary > service plus auxiliary processes (background workers/autovacuum workers etc.) > > Pattern (brief): > COPY (binary) to session-local temporary tables (ON COMMIT DELETE) > INSERT FROM tt TO main_table(s) > UPDATE FROM tt TO main_table(s) > VACUUM tt (every 100k txns) > > Config (excerpt): > # - Transactions - (based on a period mean of ~8k txn/s) > # See/ref; > # www.postgresql.org/docs/15/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND > # > www.percona.com/blog/tuning-autovacuum-in-postgresql-and-autovacuum-internals > # blog.sentry.io/transaction-id-wraparound-in-postgres > # > https://blog.danslimmon.com/2023/12/11/concurrent-locks-and-multixacts-in-postgres > # > https://pganalyze.com/blog/5mins-postgres-multiXact-ids-foreign-keys-performance > # What unit is 'age' here? No. of. transactions? > vacuum_freeze_min_age = 28800000 # 1h @ 8k/s > vacuum_multixact_freeze_min_age = 28800000 # 1h @ 8k/s > autovacuum_freeze_max_age = 157600000 # 2h @ 8k/s > autovacuum_multixact_freeze_max_age = 57600000 # 2h @ 8k/s > vacuum_multixact_freeze_table_age = 115200000 # 4h @ 8k/s > > # We don't UPDATE or DELETE often; we never DELETE and only perform > # UPDATE operations every 6h due to internal cache(s). So we set this to > # zero so the formula doesn't consider it a major factor > autovacuum_vacuum_threshold = 0 # In tuples/rows UPDATEd or DELETEd > autovacuum_vacuum_insert_threshold = 5000 # In tuples/rows INSERTed > > autovacuum_vacuum_scale_factor = 0.1 # 10% > autovacuum_analyze_scale_factor = 0.1 # 10% > autovacuum_vacuum_insert_scale_factor = 0.1 # 10% > > autovacuum_naptime = 60 > autovacuum_max_workers = 8 > > # Give autovacuum more credits to ensure a better chance at scanning > autovacuum_vacuum_cost_limit = 2000 # 10x the default > > Thanks for your help and any guidance/knowledge you can share! > > Jim > > -- > Jim Vanns > Principal Production Engineer > Industrial Light & Magic, London -- Jim Vanns Principal Production Engineer Industrial Light & Magic, London