Hello hackers, I'd like to discuss an issue we observed where the truncation phase of autovacuum can become significantly prolonged on the TOAST tables of busy systems due to repeated interruptions by lock waiters. This leads to extended `AccessExclusiveLock` contention and impacts overall database and application availability.
*Recap of the production incident*

We experienced this issue directly when a `VACUUM` on a TOAST table took approximately 5 hours. The following log shows the scale of the pages eventually removed and the duration:

```
automatic vacuum of table "tines.pg_toast.pg_toast_21059": index scans: 1
pages: 26096516 removed, 1421195 remain, 0 skipped due to pins, 26090236 skipped frozen
tuples: 4575083 removed, 26717710 remain, 21498 are dead but not yet removable, oldest xmin: 173810717
index scan needed: 1151751 pages from table (4.19% of total) had 4629490 dead item identifiers removed
index "pg_toast_21059_index": pages: 303420 in total, 14038 newly deleted, 298668 currently deleted, 298418 reusable
I/O timings: read: 18428758.270 ms, write: 0.000 ms
avg read rate: 10.730 MB/s, avg write rate: 0.000 MB/s
buffer usage: 5132686 hits, 25714763 misses, 0 dirtied
WAL usage: 0 records, 0 full page images, 0 bytes
system usage: CPU: user: 327.79 s, system: 251.11 s, elapsed: 18722.21 s
```

During this 5-hour window, read queries against the primary and replica databases (on AWS Aurora) frequently failed due to lock timeouts (`lock_timeout` was set to 1s), consistent with the `AccessExclusiveLock` being acquired and released repeatedly by the truncation part of the autovacuum process. This pushed me to build a better understanding of the problem and the order of events, which brings me to:

*Understanding the problem*

My understanding is that the PostgreSQL autovacuum process, specifically its lazy variant (`lazy_vacuum_heap`), performs a few passes to reclaim space. One critical phase, performed if it is deemed beneficial, is the truncation of empty pages from the end of a relation, handled by the `lazy_truncate_heap` function. As I traced through the code, the interaction occurs as follows (a condensed sketch of this control flow appears right after the list):

- To physically truncate a relation, `VACUUM` must acquire an `AccessExclusiveLock`.
- After acquiring the `AccessExclusiveLock`, `VACUUM` performs a verification step by scanning the relation backwards from its apparent end (`count_nondeletable_pages`). This is necessary because other transactions might have added data to these trailing pages.
- While holding the `AccessExclusiveLock` and performing this backward scan, `VACUUM` periodically checks whether any other database sessions are waiting for locks on the same relation using `LockHasWaitersRelation(vacrel->rel, AccessExclusiveLock)`. This check occurs at intervals defined by `VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL` (20ms) and roughly every 32 blocks scanned.
- Interruption and retry mechanism:
  - If `LockHasWaitersRelation` returns `true` during the backward scan, `count_nondeletable_pages` sets a flag (`lock_waiter_detected = true`) and returns, effectively "suspending" its scan to allow the waiting sessions to proceed.
  - Back in `lazy_truncate_heap`, if `lock_waiter_detected` is true, `VACUUM` releases the `AccessExclusiveLock`.
  - `lazy_truncate_heap` contains a `do...while` loop. If `lock_waiter_detected` was true AND there are still potentially truncatable pages (`new_rel_pages > vacrel->nonempty_pages`), this loop causes `VACUUM` to retry the entire truncation sequence: attempt to re-acquire the `AccessExclusiveLock` (with its own timeout of `VACUUM_TRUNCATE_LOCK_TIMEOUT`, default 5s), restart the `count_nondeletable_pages` backward scan, and so on.
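For anyone who wants to follow along in the source, here is a heavily condensed paraphrase of that control flow from `lazy_truncate_heap()` in src/backend/access/heap/vacuumlazy.c, as I read it. Error reporting, progress reporting, and the lock-acquisition polling loop are elided, and `try_acquire_exclusive_lock_with_timeout()` is just a stand-in name I am using for that polling loop, so please treat this as a sketch rather than the verbatim code:

```
/*
 * Condensed paraphrase of lazy_truncate_heap(); many details elided.
 * vacrel, orig_rel_pages, new_rel_pages and lock_waiter_detected come
 * from the enclosing function.  try_acquire_exclusive_lock_with_timeout()
 * stands in for the ConditionalLockRelation() polling loop, which retries
 * every VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL (50ms) and gives up after
 * VACUUM_TRUNCATE_LOCK_TIMEOUT (5s).
 */
do
{
    lock_waiter_detected = false;

    if (!try_acquire_exclusive_lock_with_timeout(vacrel->rel))
        return;                 /* could not get the lock; stop truncating */

    /*
     * Backward scan from the apparent end of the relation, verifying that
     * the trailing pages are still empty.  Inside this scan, roughly every
     * 32 blocks and at most once per VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL
     * (20ms), it does:
     *
     *     if (LockHasWaitersRelation(vacrel->rel, AccessExclusiveLock))
     *     {
     *         *lock_waiter_detected = true;
     *         return blkno;    // suspend the scan at the current block
     *     }
     */
    new_rel_pages = count_nondeletable_pages(vacrel, &lock_waiter_detected);

    if (new_rel_pages >= orig_rel_pages)
    {
        /* nothing truncatable after all */
        UnlockRelation(vacrel->rel, AccessExclusiveLock);
        return;
    }

    /* Truncate what was verified so far, then let the waiters in. */
    RelationTruncate(vacrel->rel, new_rel_pages);
    UnlockRelation(vacrel->rel, AccessExclusiveLock);
    orig_rel_pages = new_rel_pages;
} while (new_rel_pages > vacrel->nonempty_pages && lock_waiter_detected);
```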
Next comes the "stuck in a loop" problem, and this is where things get interesting:

- On a very busy system with constant concurrent access, `VACUUM` can repeatedly acquire the `AccessExclusiveLock`, start the backward scan, detect waiters, release the lock, and retry. Each retry of `count_nondeletable_pages` involves re-reading potentially many blocks from disk while holding the exclusive lock, which aggravates the lock contention.
- This cycle leads to prolonged `VACUUM` operations, significant repeated I/O, and extended lock contention, reducing database availability and causing query timeouts for other sessions (e.g., `canceling statement due to lock timeout`, `LockNotAvailable`). This is particularly acute for the TOAST tables in my case, which sadly happen to be very large (for now).

*Testing the theory and patch*

To validate this understanding, I was able to reproduce a similar scenario by loading a test table with a significant amount of TOAST-able data and then simulating a busy read workload against it concurrently with a `VACUUM` operation. I have attached a SQL script to load the data and a Ruby script to simulate the concurrent workload, in case anyone is curious. The tests demonstrated how the truncation phase could indeed enter extended periods of repeatedly attempting to acquire the `AccessExclusiveLock` and performing the backward scan, only to be interrupted by the simulated concurrent workload.

Considering the potential for the truncation phase to cause prolonged lock contention under these conditions, I am wondering whether it would be beneficial to explicitly cap the number of retries for the truncation process within the `do...while` loop in `lazy_truncate_heap` when interruptions by lock waiters occur. Because I was already deep in the code, I wrote a small patch to introduce such a limit and observe the behavior. Specifically, it adds a counter, `truncate_interruption_retry_count`, and checks it against a new constant, `VACUUM_TRUNCATE_INTERRUPTION_MAX_RETRIES` (which for now I have set to 3). If the backward scan in `count_nondeletable_pages` is interrupted by lock waiters this many times, `VACUUM` ceases further truncation attempts for the current `VACUUM` cycle on that table. A rough sketch of the change follows.
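To make the shape of the change concrete, here is roughly what the patch does. This is a simplified sketch of my proof of concept rather than the exact diff, so the precise placement of the check inside `lazy_truncate_heap` may differ slightly:

```
/* New constant next to the existing truncate-related ones. */
#define VACUUM_TRUNCATE_INTERRUPTION_MAX_RETRIES	3

/* In lazy_truncate_heap(), following the loop structure shown earlier. */
int		truncate_interruption_retry_count = 0;

do
{
	/* ... acquire the lock, run count_nondeletable_pages(), truncate ... */

	if (lock_waiter_detected &&
		++truncate_interruption_retry_count >= VACUUM_TRUNCATE_INTERRUPTION_MAX_RETRIES)
	{
		ereport(vacrel->verbose ? INFO : DEBUG2,
				(errmsg("table \"%s\": stopping truncate after %d retries due to repeated conflicting lock requests",
						vacrel->relname,
						truncate_interruption_retry_count)));
		break;
	}
} while (new_rel_pages > vacrel->nonempty_pages && lock_waiter_detected);
```

The effect is simply that any remaining empty trailing pages are left for a later `VACUUM` of the table to truncate, trading a bit of delayed space reclamation for bounded lock contention in the current cycle.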
I performed the test on a similar setup, and the results are as follows (notice the `stopping truncate after 3 retries due to repeated conflicting lock requests` messages coming from the patch):

```
VACUUM (VERBOSE) toast_target_table;
INFO: vacuuming "postgres.public.toast_target_table"
INFO: table "toast_target_table": suspending truncate due to conflicting lock request
INFO: table "toast_target_table": truncated 144032 to 143744 pages
INFO: table "toast_target_table": suspending truncate due to conflicting lock request
INFO: table "toast_target_table": truncated 143744 to 143456 pages
INFO: table "toast_target_table": suspending truncate due to conflicting lock request
INFO: table "toast_target_table": truncated 143456 to 143168 pages
INFO: table "toast_target_table": stopping truncate after 3 retries due to repeated conflicting lock requests
INFO: finished vacuuming "postgres.public.toast_target_table": index scans: 0
pages: 864 removed, 143168 remain, 1 scanned (0.00% of total), 0 eagerly scanned
tuples: 0 removed, 7544 remain, 0 are dead but not yet removable
removable cutoff: 2985896, which was 1 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
visibility map: 0 pages set all-visible, 0 pages set all-frozen (0 were all-visible)
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
index "toast_target_table_pkey": pages: 28132 in total, 0 newly deleted, 27646 currently deleted, 27646 reusable
avg read rate: 13.423 MB/s, avg write rate: 0.006 MB/s
buffer usage: 28172 hits, 28571 reads, 12 dirtied
WAL usage: 22 records, 15 full page images, 120289 bytes, 0 buffers full
system usage: CPU: user: 0.13 s, system: 3.55 s, elapsed: 16.62 s
INFO: vacuuming "postgres.pg_toast.pg_toast_649469"
INFO: table "pg_toast_649469": suspending truncate due to conflicting lock request
INFO: table "pg_toast_649469": truncated 2553888 to 2552416 pages
INFO: table "pg_toast_649469": suspending truncate due to conflicting lock request
INFO: table "pg_toast_649469": truncated 2552416 to 2550560 pages
INFO: table "pg_toast_649469": suspending truncate due to conflicting lock request
INFO: table "pg_toast_649469": truncated 2550560 to 2548832 pages
INFO: table "pg_toast_649469": stopping truncate after 3 retries due to repeated conflicting lock requests
INFO: finished vacuuming "postgres.pg_toast.pg_toast_649469": index scans: 0
pages: 5056 removed, 2548832 remain, 1 scanned (0.00% of total), 0 eagerly scanned
tuples: 0 removed, 14008 remain, 0 are dead but not yet removable
removable cutoff: 2985897, which was 1 XIDs old when operation ended
new relfrozenxid: 2985897, which is 2 XIDs ahead of previous value
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
visibility map: 0 pages set all-visible, 0 pages set all-frozen (0 were all-visible)
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
index "pg_toast_649469_index": pages: 56259 in total, 0 newly deleted, 55293 currently deleted, 55293 reusable
avg read rate: 5.067 MB/s, avg write rate: 0.002 MB/s
buffer usage: 56371 hits, 61075 reads, 19 dirtied
WAL usage: 28 records, 20 full page images, 165414 bytes, 0 buffers full
system usage: CPU: user: 1.67 s, system: 59.10 s, elapsed: 94.16 s
```

All that said, I wanted to bring some visibility into this part of the problem and also learn from the community about potential ways we could solve it.
One way to solve this is to cap the number of retries, as in the patch I attached. I feel something like this would be beneficial because it prioritizes overall system availability by preventing the lazy truncation from acquiring the `AccessExclusiveLock` and holding it, across repeated attempts, for an extended period of time. The patch is definitely more of a proof of concept for now (it lacks tests, and perhaps the limit should be a configurable parameter?), and more importantly I am curious to hear from the community whether this feels like a worthwhile change or approach, assuming my understanding of the problem makes sense. Or perhaps there are other ways to solve this problem as well?

Thank you
Shayon Mukherjee
# To run this script you will need Ruby installed, then:
# 1. Ensure you have the 'pg' gem installed: gem install pg
# 2. Set environment variables for database connection if not using defaults:
#    PGHOST, PGPORT, PGDATABASE, PGUSER, PGPASSWORD
# 3. Execute the script: ruby workload_simulator_toast_lock_timeout.rb

require 'pg'
require 'thread'

# --- Configuration ---
DB_PARAMS = {
  host: ENV['PGHOST'] || 'localhost',
  port: ENV['PGPORT'] || 5432,
  dbname: ENV['PGDATABASE'] || 'postgres',
  user: ENV['PGUSER'] || 'shayon',
  password: ENV['PGPASSWORD']
}.freeze

TABLE_NAME = 'toast_target_table'.freeze
NUM_WORKER_THREADS = 15
WORK_DURATION_SECONDS = 7200
LOCK_TIMEOUT_MS = 500

# Runs a single TOAST-detoasting SELECT inside a transaction with a short lock_timeout.
def attempt_toast_access_with_timeout(conn, table_name, target_id, timeout_ms)
  begin
    conn.transaction do |txn_conn|
      txn_conn.exec("SET LOCAL lock_timeout = '#{timeout_ms}ms';")
      result = txn_conn.exec_params("SELECT length(payload) FROM #{table_name} WHERE id = $1", [target_id])
      puts "[#{Thread.current.object_id}] Accessed ID #{target_id}, result: #{result.to_a}"
    end
    true
  rescue PG::LockNotAvailable => e
    puts "[Worker #{Thread.current.object_id}] Lock timeout (#{timeout_ms}ms) accessing TOAST for ID #{target_id} on table #{table_name}."
    false
  rescue PG::QueryCanceled => e
    puts "[Worker #{Thread.current.object_id}] Query canceled (likely lock timeout: #{timeout_ms}ms) for ID #{target_id} on table #{table_name}."
    false
  rescue PG::ConnectionBad => e
    puts "[Worker #{Thread.current.object_id}] Connection bad: #{e.message.lines.first.strip}"
    raise
  rescue PG::Error => e
    puts "[Worker #{Thread.current.object_id}] Other PG::Error for ID #{target_id} on table #{table_name}: #{e.message.lines.first.strip}"
    false
  end
end

# Each worker opens its own connection and queries random rows for
# WORK_DURATION_SECONDS, reconnecting with exponential backoff if the
# connection drops.
def worker_task(worker_id, table_name)
  conn = nil
  connection_retries = 0
  max_connection_retries = 5
  max_id_cache = nil
  last_max_id_check = Time.now - 60
  ops_this_connection = 0

  loop do
    begin
      conn = PG.connect(DB_PARAMS)
      connection_retries = 0
      ops_this_connection = 0
      start_time = Time.now

      while (Time.now - start_time) < WORK_DURATION_SECONDS
        # Refresh the cached MAX(id) periodically so workers target existing rows.
        if ops_this_connection % 500 == 0
          if !max_id_cache || (Time.now - last_max_id_check > 10)
            max_id_result = conn.query("SELECT MAX(id) FROM #{table_name}")
            if max_id_result && max_id_result.ntuples > 0 && max_id_result.getvalue(0,0)
              current_max = max_id_result.getvalue(0,0).to_i
              max_id_cache = current_max if current_max > 0
              last_max_id_check = Time.now
            else
              max_id_cache = nil
            end
          end
        end

        target_id = max_id_cache && max_id_cache > 0 ? rand(1..max_id_cache) : rand(1..100)
        attempt_toast_access_with_timeout(conn, table_name, target_id, LOCK_TIMEOUT_MS)
        ops_this_connection += 1
      end
      break
    rescue PG::ConnectionBad
      conn&.close
      connection_retries += 1
      if connection_retries > max_connection_retries
        puts "[Worker #{worker_id}] Max connection retries reached. Exiting."
        break
      end
      wait_time = 2**connection_retries
      puts "[Worker #{worker_id}] Connection lost. Retrying in #{wait_time}s..."
      sleep wait_time
    ensure
      conn&.close
    end
  end
end

# --- Main Execution ---
puts "Starting #{NUM_WORKER_THREADS} EXTREME TOAST ACCESS workload generator threads."
puts "Queries will have a lock timeout of #{LOCK_TIMEOUT_MS}ms."
puts "Targeting table: #{TABLE_NAME} on #{DB_PARAMS[:host]}:#{DB_PARAMS[:port]}/#{DB_PARAMS[:dbname]}"
puts "---"

threads = []
NUM_WORKER_THREADS.times do |i|
  threads << Thread.new do
    worker_task(i + 1, TABLE_NAME)
  end
end

threads.each(&:join)
puts "All workload threads finished."
v18-0001-Limit-VACUUM-truncation-retries-after-scan-inter.patch
load-data.sql