Hello,

Attached are the updated files to help reproduce the state mentioned in the previous email more easily.
- load-data.sql has the commands to insert a lot of dummy data that will overflow to TOAST, which is then deleted (via commands in the same file) to create a situation where a truncate will happen on a future vacuum (a rough SQL sketch of this setup appears after the quoted email below)
- Updated the Ruby script (script.rb) to make simple SELECT calls with high concurrency
- Renamed the POC patch to `v1`

Thank you
Shayon

On Tue, May 6, 2025 at 6:05 PM Shayon Mukherjee <shay...@gmail.com> wrote:
> Hello hackers,
>
> I'd like to discuss an issue we observed where the truncation phase of
> autovacuum can become significantly prolonged on the TOAST tables of busy
> systems due to repeated interruptions by lock waiters, leading to extended
> `AccessExclusiveLock` contention and impacting overall database and
> application availability.
>
> *Recap of the production incident*
> We experienced this issue directly when a `VACUUM` on a TOAST table took
> approximately 5 hours. The following shows the scale of pages eventually
> removed and the duration:
>
> ```
> automatic vacuum of table "tines.pg_toast.pg_toast_21059": index scans: 1
> pages: 26096516 removed, 1421195 remain, 0 skipped due to pins, 26090236 skipped frozen
> tuples: 4575083 removed, 26717710 remain, 21498 are dead but not yet removable, oldest xmin: 173810717
> index scan needed: 1151751 pages from table (4.19% of total) had 4629490 dead item identifiers removed
> index "pg_toast_21059_index": pages: 303420 in total, 14038 newly deleted, 298668 currently deleted, 298418 reusable
> I/O timings: read: 18428758.270 ms, write: 0.000 ms
> avg read rate: 10.730 MB/s, avg write rate: 0.000 MB/s
> buffer usage: 5132686 hits, 25714763 misses, 0 dirtied
> WAL usage: 0 records, 0 full page images, 0 bytes
> system usage: CPU: user: 327.79 s, system: 251.11 s, elapsed: 18722.21 s
> ```
>
> During this 5-hour window, read queries against the primary and replica
> databases (on AWS Aurora) were frequently failing due to lock timeouts
> (lock_timeout was set to 1s), consistent with the `AccessExclusiveLock`
> being held and released repeatedly by the truncation part of the
> autovacuum process.
>
> This led me to dig deeper into the problem and the order of events, which
> brings me to:
>
> *Understanding the problem*
> My understanding is that the PostgreSQL autovacuum process, specifically
> its lazy variant (`lazy_vacuum_heap`), performs a few passes to reclaim
> space. One critical phase, if deemed beneficial, is the truncation of
> empty pages from the end of a relation, handled by the
> `lazy_truncate_heap` function.
>
> As I traced through the code, the interaction occurs as follows:
>
> - To physically truncate a relation, `VACUUM` must acquire an
>   `AccessExclusiveLock`.
> - After acquiring the `AccessExclusiveLock`, `VACUUM` performs a
>   verification step by scanning the relation backwards from its apparent
>   end (`count_nondeletable_pages`). This is necessary because other
>   transactions might have added data to these trailing pages.
> - While holding the `AccessExclusiveLock` and performing this backward
>   scan, `VACUUM` periodically checks whether any other database sessions
>   are waiting for locks on the same relation using
>   `LockHasWaitersRelation(vacrel->rel, AccessExclusiveLock)`. This check
>   occurs at intervals defined by `VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL`
>   (20ms) and roughly every 32 blocks scanned.
> - Interruption and Retry Mechanism:
>   - If `LockHasWaitersRelation` returns `true` during the backward scan,
>     `count_nondeletable_pages` sets a flag (`lock_waiter_detected = true`)
>     and returns, effectively "suspending" its scan to allow waiting
>     sessions to proceed.
>   - Back in `lazy_truncate_heap`, if `lock_waiter_detected` is true,
>     `VACUUM` releases the `AccessExclusiveLock`.
>   - `lazy_truncate_heap` contains a `do...while` loop. If
>     `lock_waiter_detected` was true AND there are still potentially
>     truncatable pages (`new_rel_pages > vacrel->nonempty_pages`), this
>     loop causes `VACUUM` to retry the entire truncation sequence: attempt
>     to re-acquire the `AccessExclusiveLock` (with its own timeout of
>     `VACUUM_TRUNCATE_LOCK_TIMEOUT`, default 5s), restart the
>     `count_nondeletable_pages` backward scan, and so on.
>
> - Next comes the "stuck in a loop" problem, and this is where things get
>   interesting:
>   - On a very busy system with constant concurrent access, `VACUUM` can
>     repeatedly acquire the `AccessExclusiveLock`, start the backward scan,
>     detect waiters, release the lock, and retry. Each retry of
>     `count_nondeletable_pages` involves re-reading potentially many blocks
>     from disk while holding the exclusive lock, which aggravates the lock
>     contention.
>   - This cycle leads to prolonged `VACUUM` operations, significant
>     repeated I/O, and extended lock contention, reducing database
>     availability and causing query timeouts for other sessions (e.g.,
>     `canceling statement due to lock timeout`, `LockNotAvailable`). This
>     is particularly acute for the TOAST tables in my case, which sadly
>     happen to be very large (for now).
>
> *Testing the theory and patch*
> To validate this understanding, I was able to reproduce a similar scenario
> by loading a test table with a significant amount of TOAST-able data and
> then simulating a busy read workload against it concurrently with a
> `VACUUM` operation. I have attached a SQL script to load the data and a
> Ruby script to simulate the concurrent workload, in case anyone is
> curious. The tests demonstrated how the truncation phase could indeed
> enter extended periods of repeatedly attempting to acquire the
> `AccessExclusiveLock` and performing the backward scan, only to be
> interrupted by the simulated concurrent workload.
>
> Considering the potential for the truncation phase to cause prolonged lock
> contention under these conditions, I am wondering whether it would be
> beneficial to explicitly cap the number of retries for the truncation
> process within the `do...while` loop in `lazy_truncate_heap` when
> interruptions by lock waiters occur.
>
> Because I was already too deep in the code, I decided to write a small
> patch to introduce a limit on the retries and observe the behavior.
> Specifically, it adds a counter `truncate_interruption_retry_count` and
> checks it against a new constant `VACUUM_TRUNCATE_INTERRUPTION_MAX_RETRIES`
> (which for now I have set to 3). If the backward scan in
> `count_nondeletable_pages` is interrupted by lock waiters this many times,
> `VACUUM` will cease further truncation attempts for the current `VACUUM`
> cycle on that table.
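For anyone who wants to watch this particular phase on their own system, a query along these lines (assuming a PostgreSQL version with `pg_stat_progress_vacuum`; the query itself is just my own observation aid and not part of the patch) shows vacuums currently sitting in the truncation phase together with the number of sessions queued behind locks on the same relation:

```
SELECT p.pid,
       p.relid::regclass AS relation,
       p.phase,
       a.xact_start,
       count(w.pid) AS waiting_sessions
FROM pg_stat_progress_vacuum p
JOIN pg_stat_activity a ON a.pid = p.pid
LEFT JOIN pg_locks w
       ON w.relation = p.relid
      AND w.database = p.datid
      AND NOT w.granted
WHERE p.phase = 'truncating heap'
GROUP BY p.pid, p.relid, p.phase, a.xact_start;
```

During the repeated suspend/retry cycle described above, the `waiting_sessions` count tends to climb every time the backward scan resumes while holding the `AccessExclusiveLock`.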
>
> I was able to perform the test on a similar setup and the results are as
> follows (notice the `stopping truncate after 3 retries due to repeated
> conflicting lock requests` message coming from the patch):
>
> ```
> VACUUM (VERBOSE) toast_target_table;
> INFO: vacuuming "postgres.public.toast_target_table"
> INFO: table "toast_target_table": suspending truncate due to conflicting lock request
> INFO: table "toast_target_table": truncated 144032 to 143744 pages
> INFO: table "toast_target_table": suspending truncate due to conflicting lock request
> INFO: table "toast_target_table": truncated 143744 to 143456 pages
> INFO: table "toast_target_table": suspending truncate due to conflicting lock request
> INFO: table "toast_target_table": truncated 143456 to 143168 pages
> INFO: table "toast_target_table": stopping truncate after 3 retries due to repeated conflicting lock requests
> INFO: finished vacuuming "postgres.public.toast_target_table": index scans: 0
> pages: 864 removed, 143168 remain, 1 scanned (0.00% of total), 0 eagerly scanned
> tuples: 0 removed, 7544 remain, 0 are dead but not yet removable
> removable cutoff: 2985896, which was 1 XIDs old when operation ended
> frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
> visibility map: 0 pages set all-visible, 0 pages set all-frozen (0 were all-visible)
> index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
> index "toast_target_table_pkey": pages: 28132 in total, 0 newly deleted, 27646 currently deleted, 27646 reusable
> avg read rate: 13.423 MB/s, avg write rate: 0.006 MB/s
> buffer usage: 28172 hits, 28571 reads, 12 dirtied
> WAL usage: 22 records, 15 full page images, 120289 bytes, 0 buffers full
> system usage: CPU: user: 0.13 s, system: 3.55 s, elapsed: 16.62 s
> INFO: vacuuming "postgres.pg_toast.pg_toast_649469"
> INFO: table "pg_toast_649469": suspending truncate due to conflicting lock request
> INFO: table "pg_toast_649469": truncated 2553888 to 2552416 pages
> INFO: table "pg_toast_649469": suspending truncate due to conflicting lock request
> INFO: table "pg_toast_649469": truncated 2552416 to 2550560 pages
> INFO: table "pg_toast_649469": suspending truncate due to conflicting lock request
> INFO: table "pg_toast_649469": truncated 2550560 to 2548832 pages
> INFO: table "pg_toast_649469": stopping truncate after 3 retries due to repeated conflicting lock requests
> INFO: finished vacuuming "postgres.pg_toast.pg_toast_649469": index scans: 0
> pages: 5056 removed, 2548832 remain, 1 scanned (0.00% of total), 0 eagerly scanned
> tuples: 0 removed, 14008 remain, 0 are dead but not yet removable
> removable cutoff: 2985897, which was 1 XIDs old when operation ended
> new relfrozenxid: 2985897, which is 2 XIDs ahead of previous value
> frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
> visibility map: 0 pages set all-visible, 0 pages set all-frozen (0 were all-visible)
> index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
> index "pg_toast_649469_index": pages: 56259 in total, 0 newly deleted, 55293 currently deleted, 55293 reusable
> avg read rate: 5.067 MB/s, avg write rate: 0.002 MB/s
> buffer usage: 56371 hits, 61075 reads, 19 dirtied
> WAL usage: 28 records, 20 full page images, 165414 bytes, 0 buffers full
> system usage: CPU: user: 1.67 s, system: 59.10 s, elapsed: 94.16 s
> ```
>
> All that said, I wanted to bring some visibility into this part of the
> problem and also
> learn from the community on potential ways we could solve it. One way to
> solve this is to cap the number of retries, as in the patch I attached.
> The reason I feel something like this is beneficial is that it would
> prioritize overall system availability by preventing the lazy truncate
> phase from repeatedly acquiring the `AccessExclusiveLock` and holding it
> for extended periods of time.
>
> The patch is definitely more of a proof of concept for now (it lacks
> tests and maybe some configurable parameters?), and, more importantly, I
> am curious to hear from the community whether this feels like a
> worthwhile change or approach (assuming my understanding of the problem
> makes sense), or whether there are other ways to solve this problem.
>
> Thank you
> Shayon Mukherjee
>
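For anyone who prefers not to open the attachment, the setup in load-data.sql boils down to something like the sketch below. This is a simplified illustration rather than the literal file contents; the table and column names (toast_target_table, id, payload) match what the attached Ruby script expects, while the row counts and payload sizes are just illustrative.

```
-- Table used by the attached workload script.
CREATE TABLE IF NOT EXISTS toast_target_table (
    id      bigserial PRIMARY KEY,
    payload text
);

-- Force large values out of line (uncompressed) so they land in the TOAST
-- table rather than being compressed into the main heap.
ALTER TABLE toast_target_table ALTER COLUMN payload SET STORAGE EXTERNAL;

-- Each row gets ~4 kB of random hex, comfortably above the ~2 kB TOAST
-- threshold. The row count here is illustrative; the more data, the longer
-- the backward scan in count_nondeletable_pages takes.
INSERT INTO toast_target_table (payload)
SELECT string_agg(md5(random()::text), '')
FROM generate_series(1, 200000) AS r(i),
     generate_series(1, 128) AS c(j)
GROUP BY r.i;

-- Delete the bulk of the rows from the end of the table so the trailing
-- heap and TOAST pages become empty and a later VACUUM decides that
-- truncation is worthwhile.
DELETE FROM toast_target_table
WHERE id > (SELECT max(id) / 20 FROM toast_target_table);
```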
# To run this script you will need Ruby installed, then:
# 1. Ensure you have the 'pg' gem installed: gem install pg
# 2. Set environment variables for database connection if not using defaults:
#    PGHOST, PGPORT, PGDATABASE, PGUSER, PGPASSWORD
# 3. Execute the script: ruby workload_simulator_toast_lock_timeout.rb

require 'pg'
require 'thread'

# --- Configuration ---
DB_PARAMS = {
  host: ENV['PGHOST'] || 'localhost',
  port: ENV['PGPORT'] || 5432,
  dbname: ENV['PGDATABASE'] || 'postgres',
  user: ENV['PGUSER'] || 'shayon',
  password: ENV['PGPASSWORD']
}.freeze

TABLE_NAME = 'toast_target_table'.freeze
NUM_WORKER_THREADS = 30
WORK_DURATION_SECONDS = 7200
LOCK_TIMEOUT_MS = 500

def attempt_toast_access_with_timeout(conn, table_name, target_id, timeout_ms)
  begin
    conn.transaction do |txn_conn|
      txn_conn.exec("SET LOCAL lock_timeout = '#{timeout_ms}ms';")
      result = txn_conn.exec_params("SELECT length(payload) FROM #{table_name} WHERE id = $1", [target_id])
      puts "[#{Thread.current.object_id}] Accessed ID #{target_id}, result: #{result.to_a}"
    end
    true
  rescue PG::LockNotAvailable => e
    puts "[Worker #{Thread.current.object_id}] Lock timeout (#{timeout_ms}ms) accessing TOAST for ID #{target_id} on table #{table_name}."
    false
  rescue PG::QueryCanceled => e
    puts "[Worker #{Thread.current.object_id}] Query canceled (likely lock timeout: #{timeout_ms}ms) for ID #{target_id} on table #{table_name}."
    false
  rescue PG::ConnectionBad => e
    puts "[Worker #{Thread.current.object_id}] Connection bad: #{e.message.lines.first.strip}"
    raise
  rescue PG::Error => e
    puts "[Worker #{Thread.current.object_id}] Other PG::Error for ID #{target_id} on table #{table_name}: #{e.message.lines.first.strip}"
    false
  end
end

def worker_task(worker_id, table_name)
  conn = nil
  connection_retries = 0
  max_connection_retries = 5
  max_id_cache = nil
  last_max_id_check = Time.now - 60
  ops_this_connection = 0

  loop do
    begin
      conn = PG.connect(DB_PARAMS)
      connection_retries = 0
      ops_this_connection = 0
      start_time = Time.now

      while (Time.now - start_time) < WORK_DURATION_SECONDS
        if ops_this_connection % 500 == 0
          if !max_id_cache || (Time.now - last_max_id_check > 10)
            max_id_result = conn.query("SELECT MAX(id) FROM #{table_name}")
            if max_id_result && max_id_result.ntuples > 0 && max_id_result.getvalue(0, 0)
              current_max = max_id_result.getvalue(0, 0).to_i
              max_id_cache = current_max if current_max > 0
              last_max_id_check = Time.now
            else
              max_id_cache = nil
            end
          end
        end

        target_id = max_id_cache && max_id_cache > 0 ? rand(1..max_id_cache) : rand(1..100)
        attempt_toast_access_with_timeout(conn, table_name, target_id, LOCK_TIMEOUT_MS)
        ops_this_connection += 1
      end
      break
    rescue PG::ConnectionBad
      conn&.close
      connection_retries += 1
      if connection_retries > max_connection_retries
        puts "[Worker #{worker_id}] Max connection retries reached. Exiting."
        break
      end
      wait_time = 2**connection_retries
      puts "[Worker #{worker_id}] Connection lost. Retrying in #{wait_time}s..."
      sleep wait_time
    ensure
      conn&.close
    end
  end
end

# --- Main Execution ---
puts "Starting #{NUM_WORKER_THREADS} EXTREME TOAST ACCESS workload generator threads."
puts "Queries will have a lock timeout of #{LOCK_TIMEOUT_MS}ms."
puts "Targeting table: #{TABLE_NAME} on #{DB_PARAMS[:host]}:#{DB_PARAMS[:port]}/#{DB_PARAMS[:dbname]}"
puts "---"

threads = []
NUM_WORKER_THREADS.times do |i|
  threads << Thread.new do
    worker_task(i + 1, TABLE_NAME)
  end
end

threads.each(&:join)
puts "All workload threads finished."
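While the workload script is running, a couple of plain catalog queries (using the same toast_target_table name the scripts assume) are handy for confirming the setup and then kicking off the vacuum under observation; the VACUUM (VERBOSE) command is the same one whose output is quoted above.

```
-- Confirm the DELETE left a large amount of dead data for VACUUM to reclaim.
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'toast_target_table';

-- Heap + TOAST size; it should shrink once the truncation phase completes.
SELECT pg_size_pretty(pg_table_size('toast_target_table')) AS heap_plus_toast;

-- With the Ruby workload running in other sessions, start the vacuum whose
-- truncation behavior is under discussion.
VACUUM (VERBOSE) toast_target_table;
```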
load-data.sql
Description: application/sql
From adca67867ff5f37ee8c97879a1ec5dc2bf0150f0 Mon Sep 17 00:00:00 2001
From: Shayon Mukherjee <shayonj@gmail.com>
Date: Mon, 5 May 2025 15:32:41 -0400
Subject: [PATCH v1] Limit VACUUM truncation retries after scan interruption

---
 src/backend/access/heap/vacuumlazy.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f28326bad0..a91c1b97cb 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -179,6 +179,8 @@
 #define VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL	20	/* ms */
 #define VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL	50	/* ms */
 #define VACUUM_TRUNCATE_LOCK_TIMEOUT		5000	/* ms */
+/* Max retries for the truncation attempt if the backward scan is interrupted */
+#define VACUUM_TRUNCATE_INTERRUPTION_MAX_RETRIES	3
 
 /*
  * Threshold that controls whether we bypass index vacuuming and heap
@@ -3213,6 +3215,7 @@ lazy_truncate_heap(LVRelState *vacrel)
 	BlockNumber new_rel_pages;
 	bool		lock_waiter_detected;
 	int			lock_retry;
+	int			truncate_interruption_retry_count = 0;
 
 	/* Report that we are now truncating */
 	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
@@ -3227,6 +3230,19 @@ lazy_truncate_heap(LVRelState *vacrel)
 	 */
 	do
 	{
+		/*
+		 * Check if we've retried too many times due to interruptions. We
+		 * check here to allow at least one full attempt even if
+		 * max_truncate_retries is set low.
+		 */
+		if (truncate_interruption_retry_count >= VACUUM_TRUNCATE_INTERRUPTION_MAX_RETRIES)
+		{
+			ereport(vacrel->verbose ? INFO : DEBUG2,
+					(errmsg("table \"%s\": stopping truncate after %d retries due to repeated conflicting lock requests",
+							vacrel->relname, truncate_interruption_retry_count)));
+			break;
+		}
+
 		/*
 		 * We need full exclusive lock on the relation in order to do
 		 * truncation. If we can't get it, give up rather than waiting --- we
@@ -3329,6 +3345,14 @@ lazy_truncate_heap(LVRelState *vacrel)
 						vacrel->relname,
 						orig_rel_pages, new_rel_pages)));
 		orig_rel_pages = new_rel_pages;
+
+		/*
+		 * Increment retry count only if we were interrupted and will loop
+		 * again
+		 */
+		if (lock_waiter_detected && new_rel_pages > vacrel->nonempty_pages)
+			truncate_interruption_retry_count++;
+
 	} while (new_rel_pages > vacrel->nonempty_pages && lock_waiter_detected);
 }
 
--
2.39.5 (Apple Git-154)