Hello,

Attached are the updated files to help reproduce the state mentioned in the
previous email more easily.

- load_data.sql contains the commands to insert a large amount of dummy data
that will overflow to TOAST
  - That data is then deleted (via commands in the same file) to create a
situation where a truncate will happen on a future vacuum
- Updated the Ruby script.rb to make simple SELECT calls with high concurrency
- Renamed the POC patch to `v1`

Thank you
Shayon

On Tue, May 6, 2025 at 6:05 PM Shayon Mukherjee <shay...@gmail.com> wrote:

> Hello hackers,
>
> I'd like to discuss an issue we observed where the truncation phase of
> autovacuum can become significantly prolonged on toast tables of busy
> systems due to repeated interruptions by lock waiters, leading to extended
> `AccessExclusiveLock` contention and impacting overall database &
> application availability.
>
>
> *Recap of the production incident*
> We experienced this issue directly where a `VACUUM` on a TOAST table took
> approximately 5 hours. The following shows the scale of pages eventually
> removed and the duration:
>
> ```
> automatic vacuum of table "tines.pg_toast.pg_toast_21059": index scans: 1
> pages: 26096516 removed, 1421195 remain, 0 skipped due to pins, 26090236
> skipped frozen
> tuples: 4575083 removed, 26717710 remain, 21498 are dead but not yet
> removable, oldest xmin: 173810717
> index scan needed: 1151751 pages from table (4.19% of total) had 4629490
> dead item identifiers removed
> index "pg_toast_21059_index": pages: 303420 in total, 14038 newly deleted,
> 298668 currently deleted, 298418 reusable
> I/O timings: read: 18428758.270 ms, write: 0.000 ms
> avg read rate: 10.730 MB/s, avg write rate: 0.000 MB/s
> buffer usage: 5132686 hits, 25714763 misses, 0 dirtied
> WAL usage: 0 records, 0 full page images, 0 bytes
> system usage: CPU: user: 327.79 s, system: 251.11 s, elapsed: 18722.21 s
> ```
>
> During this 5-hour window, read queries against primary and replica
> databases (on AWS Aurora) were frequently failing due to lock timeouts
> (lock_timeout was set to 1s), consistent with the `AccessExclusiveLock`
> being held and released repeatedly by the truncation part of the autovacuum
> process.
>
> This led me to develop a better understanding of the problem and the order
> of events, which brings me to:
>
> *Understanding the problem*
> My understanding is that the PostgreSQL autovacuum process, specifically
> its lazy variant (`lazy_vacuum_heap`), performs a few passes to reclaim
> space. One critical phase, if deemed beneficial, is the truncation of empty
> pages from the end of a relation, handled by the `lazy_truncate_heap`
> function.
>
> As I traced through the code, the interaction occurs as follows (a
> simplified sketch of the control flow appears after this list):
>
> - To physically truncate a relation, `VACUUM` must acquire an
> `AccessExclusiveLock`.
> - After acquiring the `AccessExclusiveLock`, `VACUUM` performs a
> verification step by scanning the relation backwards from its apparent end
> (`count_nondeletable_pages`). This is necessary because other transactions
> might have added data to these trailing pages.
> - While holding the `AccessExclusiveLock` and performing this backward
> scan, `VACUUM` periodically checks if any other database sessions are
> waiting for locks on the same relation using
> `LockHasWaitersRelation(vacrel->rel, AccessExclusiveLock)`. This check
> occurs at intervals defined by `VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL` (20ms)
> and roughly every 32 blocks scanned.
> - Interruption and Retry Mechanism:
>     - If `LockHasWaitersRelation` returns `true` during the backward scan,
> `count_nondeletable_pages` sets a flag (`lock_waiter_detected = true`) and
> returns, effectively "suspending" its scan to allow waiting sessions to
> proceed.
>     - Back in `lazy_truncate_heap`, if `lock_waiter_detected` is true,
> `VACUUM` releases the `AccessExclusiveLock`.
>     - `lazy_truncate_heap` contains a `do...while` loop. If
> `lock_waiter_detected` was true AND there are still potentially truncatable
> pages (`new_rel_pages > vacrel->nonempty_pages`), this loop causes `VACUUM`
> to retry the entire truncation sequence: attempt to re-acquire the
> `AccessExclusiveLock` (with its own timeout of
> `VACUUM_TRUNCATE_LOCK_TIMEOUT`, default 5s), restart the
> `count_nondeletable_pages` backward scan, and so on.
>
> - Next comes the "stuck in a loop" problem, and this is where things get
> interesting:
>     - On a very busy system with constant concurrent access, `VACUUM` can
> repeatedly acquire the `AccessExclusiveLock`, start the backward scan,
> detect waiters, release the lock, and retry. Each retry of
> `count_nondeletable_pages` involves re-reading potentially many blocks from
> disk while holding the exclusive lock, which aggravates the lock contention.
>     - This cycle leads to prolonged `VACUUM` operations, significant
> repeated I/O, and extended lock contention, reducing database availability
> and causing query timeouts for other sessions (e.g., `canceling statement
> due to lock timeout`, `LockNotAvailable`). This is particularly acute for
> the TOAST tables in my case, which sadly happen to be very large (for now).
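>
> To make that retry cycle concrete, here is a minimal, standalone C sketch
> of the control flow as I understand it. The helper names, constants, and
> page counts are made up for illustration (they are not PostgreSQL APIs),
> and the real ~20ms timer check is collapsed into a simple block counter, so
> treat this as a simulation of the loop shape rather than the actual code in
> `vacuumlazy.c`:
>
> ```
> #include <stdbool.h>
> #include <stdio.h>
>
> #define CHECK_INTERVAL_BLOCKS 32    /* stand-in for the 20ms / 32-block check */
>
> /* Hypothetical stubs standing in for lock manager calls. */
> static bool try_acquire_exclusive_lock(void) { return true; }
> static bool lock_has_waiters(void)
> {
>     static int calls = 0;
>     return (++calls % 10) == 0;     /* pretend a reader shows up periodically */
> }
>
> /* Backward scan from the end of the relation; bail out if waiters appear. */
> static long scan_backwards(long rel_pages, long nonempty_pages,
>                            bool *lock_waiter_detected)
> {
>     long scanned = 0;
>
>     *lock_waiter_detected = false;
>     for (long blkno = rel_pages; blkno > nonempty_pages; blkno--)
>     {
>         /* ... read the block; stop here if it still holds live tuples ... */
>         if (++scanned % CHECK_INTERVAL_BLOCKS == 0 && lock_has_waiters())
>         {
>             *lock_waiter_detected = true;
>             return blkno;           /* "suspend" the scan at this point */
>         }
>     }
>     return nonempty_pages;
> }
>
> int main(void)
> {
>     long rel_pages = 5000;          /* arbitrary small numbers for the demo */
>     long nonempty_pages = 1000;
>     bool lock_waiter_detected;
>     int  attempt = 0;
>
>     do
>     {
>         if (!try_acquire_exclusive_lock())
>             return 0;               /* give up quietly, as VACUUM does */
>
>         long new_rel_pages = scan_backwards(rel_pages, nonempty_pages,
>                                             &lock_waiter_detected);
>
>         /* ... truncate to new_rel_pages, then release the lock ... */
>         printf("attempt %d: truncated %ld to %ld pages (waiters: %s)\n",
>                ++attempt, rel_pages, new_rel_pages,
>                lock_waiter_detected ? "yes" : "no");
>         rel_pages = new_rel_pages;
>     } while (lock_waiter_detected && rel_pages > nonempty_pages);
>
>     return 0;
> }
> ```
>
> With the stub reporting waiters every so often, each attempt only chips a
> few hundred pages off the end before suspending, which is exactly the
> repeated acquire/scan/release pattern shown in the logs above.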
>
>
> *Testing the theory and patch*
> To validate this understanding, I was able to reproduce a similar scenario
> by loading a test table with a significant amount of TOAST-able data and
> then simulating a busy read workload against it concurrently with a
> `VACUUM` operation. I have attached a SQL script to load the data and a
> Ruby script to simulate the concurrent workload, if anyone is curious. The
> tests demonstrated how the truncation phase could indeed enter extended
> periods of repeatedly attempting to acquire the `AccessExclusiveLock` and
> performing the backward scan, only to be interrupted by the simulated
> concurrent workload.
>
> Considering the potential for the truncation phase to cause prolonged lock
> contention under these conditions, I am wondering if it would be beneficial
> to explicitly cap the number of retries for the truncation process within
> the `do...while` loop in `lazy_truncate_heap` when interruptions by lock
> waiters occur?
>
> Because I was already too deep in the code, I decided to write a small
> patch to introduce a limit to the retries and observe the behavior.
> Specifically, it adds a counter `truncate_interruption_retry_count` and
> checks it against a new constant `VACUUM_TRUNCATE_INTERRUPTION_MAX_RETRIES`
> (which for now I have set to 3). If the backward scan in
> `count_nondeletable_pages` is interrupted by lock waiters this many times,
> `VACUUM` will cease further truncation attempts for the current `VACUUM`
> cycle on that table.
>
> I was able to perform the test on a similar setup and the results are as
> follows (notice the `stopping truncate after 3 retries due to repeated
> conflicting lock requests` coming from the patch):
>
> ```
> VACUUM (VERBOSE) toast_target_table;
> INFO:  vacuuming "postgres.public.toast_target_table"
> INFO:  table "toast_target_table": suspending truncate due to conflicting
> lock request
> INFO:  table "toast_target_table": truncated 144032 to 143744 pages
> INFO:  table "toast_target_table": suspending truncate due to conflicting
> lock request
> INFO:  table "toast_target_table": truncated 143744 to 143456 pages
> INFO:  table "toast_target_table": suspending truncate due to conflicting
> lock request
> INFO:  table "toast_target_table": truncated 143456 to 143168 pages
> INFO:  table "toast_target_table": stopping truncate after 3 retries due
> to repeated conflicting lock requests
> INFO:  finished vacuuming "postgres.public.toast_target_table": index
> scans: 0
> pages: 864 removed, 143168 remain, 1 scanned (0.00% of total), 0 eagerly
> scanned
> tuples: 0 removed, 7544 remain, 0 are dead but not yet removable
> removable cutoff: 2985896, which was 1 XIDs old when operation ended
> frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
> visibility map: 0 pages set all-visible, 0 pages set all-frozen (0 were
> all-visible)
> index scan not needed: 0 pages from table (0.00% of total) had 0 dead item
> identifiers removed
> index "toast_target_table_pkey": pages: 28132 in total, 0 newly deleted,
> 27646 currently deleted, 27646 reusable
> avg read rate: 13.423 MB/s, avg write rate: 0.006 MB/s
> buffer usage: 28172 hits, 28571 reads, 12 dirtied
> WAL usage: 22 records, 15 full page images, 120289 bytes, 0 buffers full
> system usage: CPU: user: 0.13 s, system: 3.55 s, elapsed: 16.62 s
> INFO:  vacuuming "postgres.pg_toast.pg_toast_649469"
> INFO:  table "pg_toast_649469": suspending truncate due to conflicting
> lock request
> INFO:  table "pg_toast_649469": truncated 2553888 to 2552416 pages
> INFO:  table "pg_toast_649469": suspending truncate due to conflicting
> lock request
> INFO:  table "pg_toast_649469": truncated 2552416 to 2550560 pages
> INFO:  table "pg_toast_649469": suspending truncate due to conflicting
> lock request
> INFO:  table "pg_toast_649469": truncated 2550560 to 2548832 pages
> INFO:  table "pg_toast_649469": stopping truncate after 3 retries due to
> repeated conflicting lock requests
> INFO:  finished vacuuming "postgres.pg_toast.pg_toast_649469": index
> scans: 0
> pages: 5056 removed, 2548832 remain, 1 scanned (0.00% of total), 0 eagerly
> scanned
> tuples: 0 removed, 14008 remain, 0 are dead but not yet removable
> removable cutoff: 2985897, which was 1 XIDs old when operation ended
> new relfrozenxid: 2985897, which is 2 XIDs ahead of previous value
> frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
> visibility map: 0 pages set all-visible, 0 pages set all-frozen (0 were
> all-visible)
> index scan not needed: 0 pages from table (0.00% of total) had 0 dead item
> identifiers removed
> index "pg_toast_649469_index": pages: 56259 in total, 0 newly deleted,
> 55293 currently deleted, 55293 reusable
> avg read rate: 5.067 MB/s, avg write rate: 0.002 MB/s
> buffer usage: 56371 hits, 61075 reads, 19 dirtied
> WAL usage: 28 records, 20 full page images, 165414 bytes, 0 buffers full
> system usage: CPU: user: 1.67 s, system: 59.10 s, elapsed: 94.16 s
> ```
>
> All that said, I wanted to bring some visibility into this part of the
> problem and also learn from the community on potential ways we could solve
> it. One way to solve this is capping the number of retries like in the
> patch I attached. The reason I feel something like this is beneficial is
> that it would prioritize overall system availability by preventing the lazy
> truncation phase from repeatedly acquiring the `AccessExclusiveLock` and
> holding it for extended periods of time.
>
> The patch is definitely more of a proof of concept for now (it lacks some
> tests and maybe even some configurable params?), and, more importantly, I
> am curious to hear from the community whether this feels like a worthwhile
> change or approach (assuming my understanding of the problem makes sense).
> Or perhaps there are other ways to solve this problem as well?
>
> Thank you
> Shayon Mukherjee
>
# To run this script you will need Ruby installed, then:
# 1. Ensure you have the 'pg' gem installed: gem install pg
# 2. Set environment variables for database connection if not using defaults:
#    PGHOST, PGPORT, PGDATABASE, PGUSER, PGPASSWORD
# 3. Execute the script: ruby workload_simulator_toast_lock_timeout.rb

require 'pg'
require 'thread'

# --- Configuration ---
DB_PARAMS = {
  host: ENV['PGHOST'] || 'localhost',
  port: ENV['PGPORT'] || 5432,
  dbname: ENV['PGDATABASE'] || 'postgres',
  user: ENV['PGUSER'] || 'shayon',
  password: ENV['PGPASSWORD']
}.freeze

TABLE_NAME = 'toast_target_table'.freeze
NUM_WORKER_THREADS = 30
WORK_DURATION_SECONDS = 7200
LOCK_TIMEOUT_MS = 500

def attempt_toast_access_with_timeout(conn, table_name, target_id, timeout_ms)
  begin
    conn.transaction do |txn_conn|
      txn_conn.exec("SET LOCAL lock_timeout = '#{timeout_ms}ms';")
      result = txn_conn.exec_params("SELECT length(payload) FROM #{table_name} WHERE id = $1", [target_id])
      puts "[#{Thread.current.object_id}] Accessed ID #{target_id}, result: #{result.to_a}"
    end
    true
  rescue PG::LockNotAvailable => e
    puts "[Worker #{Thread.current.object_id}] Lock timeout (#{timeout_ms}ms) accessing TOAST for ID #{target_id} on table #{table_name}."
    false
  rescue PG::QueryCanceled => e
    puts "[Worker #{Thread.current.object_id}] Query canceled (likely lock timeout: #{timeout_ms}ms) for ID #{target_id} on table #{table_name}."
    false
  rescue PG::ConnectionBad => e
    puts "[Worker #{Thread.current.object_id}] Connection bad: #{e.message.lines.first.strip}"
    raise
  rescue PG::Error => e
    puts "[Worker #{Thread.current.object_id}] Other PG::Error for ID #{target_id} on table #{table_name}: #{e.message.lines.first.strip}"
    false
  end
end

def worker_task(worker_id, table_name)
  conn = nil
  connection_retries = 0
  max_connection_retries = 5
  max_id_cache = nil
  last_max_id_check = Time.now - 60
  ops_this_connection = 0

  loop do
    begin
      conn = PG.connect(DB_PARAMS)
      connection_retries = 0
      ops_this_connection = 0
      start_time = Time.now

      while (Time.now - start_time) < WORK_DURATION_SECONDS
        # Periodically refresh the cached MAX(id) so reads target existing rows.
        if ops_this_connection % 500 == 0
          if !max_id_cache || (Time.now - last_max_id_check > 10)
            max_id_result = conn.exec("SELECT MAX(id) FROM #{table_name}")
            if max_id_result && max_id_result.ntuples > 0 && max_id_result.getvalue(0, 0)
              current_max = max_id_result.getvalue(0, 0).to_i
              max_id_cache = current_max if current_max > 0
              last_max_id_check = Time.now
            else
              max_id_cache = nil
            end
          end
        end

        # Pick a random id; fall back to a small range until the cache is populated.
        target_id = max_id_cache && max_id_cache > 0 ? rand(1..max_id_cache) : rand(1..100)

        attempt_toast_access_with_timeout(conn, table_name, target_id, LOCK_TIMEOUT_MS)

        ops_this_connection += 1
      end
      break
    rescue PG::ConnectionBad
      conn&.close
      connection_retries += 1
      if connection_retries > max_connection_retries
        puts "[Worker #{worker_id}] Max connection retries reached. Exiting."
        break
      end
      wait_time = 2**connection_retries
      puts "[Worker #{worker_id}] Connection lost. Retrying in #{wait_time}s..."
      sleep wait_time
    ensure
      # Avoid double-closing a connection the rescue branch already finished.
      conn.close if conn && !conn.finished?
    end
  end
end

# --- Main Execution ---
puts "Starting #{NUM_WORKER_THREADS} EXTREME TOAST ACCESS workload generator threads."
puts "Queries will have a lock timeout of #{LOCK_TIMEOUT_MS}ms."
puts "Targeting table: #{TABLE_NAME} on #{DB_PARAMS[:host]}:#{DB_PARAMS[:port]}/#{DB_PARAMS[:dbname]}"
puts "---"

threads = []
NUM_WORKER_THREADS.times do |i|
  threads << Thread.new do
    worker_task(i + 1, TABLE_NAME)
  end
end

threads.each(&:join)
puts "All workload threads finished."

Attachment: load-data.sql
Description: application/sql

From adca67867ff5f37ee8c97879a1ec5dc2bf0150f0 Mon Sep 17 00:00:00 2001
From: Shayon Mukherjee <shayonj@gmail.com>
Date: Mon, 5 May 2025 15:32:41 -0400
Subject: [PATCH v1] Limit VACUUM truncation retries after scan interruption

---
 src/backend/access/heap/vacuumlazy.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f28326bad0..a91c1b97cb 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -179,6 +179,8 @@
 #define VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL		20	/* ms */
 #define VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL		50	/* ms */
 #define VACUUM_TRUNCATE_LOCK_TIMEOUT			5000	/* ms */
+/* Max retries for the truncation attempt if the backward scan is interrupted */
+#define VACUUM_TRUNCATE_INTERRUPTION_MAX_RETRIES	3
 
 /*
  * Threshold that controls whether we bypass index vacuuming and heap
@@ -3213,6 +3215,7 @@ lazy_truncate_heap(LVRelState *vacrel)
 	BlockNumber new_rel_pages;
 	bool		lock_waiter_detected;
 	int			lock_retry;
+	int			truncate_interruption_retry_count = 0;
 
 	/* Report that we are now truncating */
 	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
@@ -3227,6 +3230,19 @@ lazy_truncate_heap(LVRelState *vacrel)
 	 */
 	do
 	{
+		/*
+		 * Check if we've retried too many times due to interruptions. We
+		 * check here to allow at least one full attempt even if
+		 * VACUUM_TRUNCATE_INTERRUPTION_MAX_RETRIES is set low.
+		 */
+		if (truncate_interruption_retry_count >= VACUUM_TRUNCATE_INTERRUPTION_MAX_RETRIES)
+		{
+			ereport(vacrel->verbose ? INFO : DEBUG2,
+					(errmsg("table \"%s\": stopping truncate after %d retries due to repeated conflicting lock requests",
+							vacrel->relname, truncate_interruption_retry_count)));
+			break;
+		}
+
 		/*
 		 * We need full exclusive lock on the relation in order to do
 		 * truncation. If we can't get it, give up rather than waiting --- we
@@ -3329,6 +3345,14 @@ lazy_truncate_heap(LVRelState *vacrel)
 						vacrel->relname,
 						orig_rel_pages, new_rel_pages)));
 		orig_rel_pages = new_rel_pages;
+
+		/*
+		 * Increment retry count only if we were interrupted and will loop
+		 * again
+		 */
+		if (lock_waiter_detected && new_rel_pages > vacrel->nonempty_pages)
+			truncate_interruption_retry_count++;
+
 	} while (new_rel_pages > vacrel->nonempty_pages && lock_waiter_detected);
 }
 
-- 
2.39.5 (Apple Git-154)
