Hello hackers,

I'd like to discuss an issue we observed where the truncation phase of
autovacuum can become significantly prolonged on TOAST tables in busy
systems due to repeated interruptions by lock waiters, leading to
extended `AccessExclusiveLock` contention and reduced database and
application availability.


*Recap of the production incident*
We hit this issue directly when a `VACUUM` on a TOAST table took
approximately 5 hours. The log below shows the scale of the pages
eventually removed and the overall duration:

```
automatic vacuum of table "tines.pg_toast.pg_toast_21059": index scans: 1
pages: 26096516 removed, 1421195 remain, 0 skipped due to pins, 26090236
skipped frozen
tuples: 4575083 removed, 26717710 remain, 21498 are dead but not yet
removable, oldest xmin: 173810717
index scan needed: 1151751 pages from table (4.19% of total) had 4629490
dead item identifiers removed
index "pg_toast_21059_index": pages: 303420 in total, 14038 newly deleted,
298668 currently deleted, 298418 reusable
I/O timings: read: 18428758.270 ms, write: 0.000 ms
avg read rate: 10.730 MB/s, avg write rate: 0.000 MB/s
buffer usage: 5132686 hits, 25714763 misses, 0 dirtied
WAL usage: 0 records, 0 full page images, 0 bytes
system usage: CPU: user: 327.79 s, system: 251.11 s, elapsed: 18722.21 s
```

During this 5-hour window, read queries against primary and replica
databases (on AWS Aurora) were frequently failing due to lock timeouts
(lock_timeout was set to 1s), consistent with the `AccessExclusiveLock`
being held and released repeatedly by the truncation part of the autovacuum
process.

That incident led me to dig deeper into the problem and the order of
events, which brings me to:

*Understanding the problem*
My understanding is that the PostgreSQL autovacuum process, specifically
its lazy variant (`lazy_vacuum_heap`), performs multiple passes to
reclaim space. One critical phase, if deemed beneficial, is the
truncation of empty pages from the end of the relation, handled by the
`lazy_truncate_heap` function.

As I traced through the code, the interaction occurs as follows (a
simplified sketch of the control flow follows the list):

- To physically truncate a relation, `VACUUM` must acquire an
`AccessExclusiveLock`.
- After acquiring the `AccessExclusiveLock`, `VACUUM` performs a
verification step by scanning the relation backwards from its apparent end
(`count_nondeletable_pages`). This is necessary because other transactions
might have added data to these trailing pages.
- While holding the `AccessExclusiveLock` and performing this backward
scan, `VACUUM` periodically checks if any other database sessions are
waiting for locks on the same relation using
`LockHasWaitersRelation(vacrel->rel, AccessExclusiveLock)`. This check
occurs at intervals defined by `VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL` (20ms)
and roughly every 32 blocks scanned.
- Interruption and Retry Mechanism:
    - If `LockHasWaitersRelation` returns `true` during the backward scan,
`count_nondeletable_pages` sets a flag (`lock_waiter_detected = true`) and
returns, effectively "suspending" its scan to allow waiting sessions to
proceed.
    - Back in `lazy_truncate_heap`, if `lock_waiter_detected` is true,
`VACUUM` releases the `AccessExclusiveLock`.
    - `lazy_truncate_heap` contains a `do...while` loop. If
`lock_waiter_detected` was true AND there are still potentially truncatable
pages (`new_rel_pages > vacrel->nonempty_pages`), this loop causes `VACUUM`
to retry the entire truncation sequence: attempt to re-acquire the
`AccessExclusiveLock` (with its own timeout of
`VACUUM_TRUNCATE_LOCK_TIMEOUT`, default 5s), restart the
`count_nondeletable_pages` backward scan, and so on.

- Next comes the "stuck in a loop" problem, and this is where things get
interesting:
    - On a very busy system with constant concurrent access, `VACUUM` can
repeatedly acquire the `AccessExclusiveLock`, start the backward scan,
detect waiters, release the lock, and retry. Each retry of
`count_nondeletable_pages` involves re-reading potentially many blocks
from disk while holding the exclusive lock, which aggravates the lock
contention.
    - This cycle leads to prolonged `VACUUM` operations, significant
repeated I/O, and extended lock contention, reducing database
availability and causing query timeouts for other sessions (e.g.,
`canceling statement due to lock timeout`, `LockNotAvailable`). This is
particularly acute for the TOAST table in my case, which sadly happens
to be very large (for now).
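
For reference, here is a heavily simplified paraphrase of the control
flow I am describing, based on my reading of `lazy_truncate_heap` in
vacuumlazy.c (this is a sketch of my understanding, not the actual
source; error handling, progress reporting, and logging are omitted):

```
/*
 * Simplified paraphrase of lazy_truncate_heap()'s retry loop -- a sketch
 * of my reading of vacuumlazy.c, not the actual source.
 */
do
{
    /*
     * Take AccessExclusiveLock without blocking; poll every
     * VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL (50ms) and give up on truncation
     * entirely once VACUUM_TRUNCATE_LOCK_TIMEOUT (5s) has passed.
     */
    lock_waiter_detected = false;
    lock_retry = 0;
    while (!ConditionalLockRelation(vacrel->rel, AccessExclusiveLock))
    {
        if (++lock_retry > (VACUUM_TRUNCATE_LOCK_TIMEOUT /
                            VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL))
            return;     /* "stopping truncate due to conflicting lock request" */
        /* sleep for VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL, then retry */
    }

    /*
     * Backward scan from the apparent end of the relation to find the
     * truncation point.  Roughly every 32 blocks (and no more often than
     * every VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL, 20ms) it calls
     * LockHasWaitersRelation(); if someone is queued behind our lock, it
     * stops early and sets lock_waiter_detected.
     */
    new_rel_pages = count_nondeletable_pages(vacrel, &lock_waiter_detected);

    if (new_rel_pages >= orig_rel_pages)
    {
        /* nothing to truncate after all */
        UnlockRelation(vacrel->rel, AccessExclusiveLock);
        return;
    }

    RelationTruncate(vacrel->rel, new_rel_pages);
    UnlockRelation(vacrel->rel, AccessExclusiveLock);
    orig_rel_pages = new_rel_pages;

    /*
     * If the scan was cut short by a lock waiter and there still appear
     * to be truncatable pages, go around again -- with no cap on the
     * number of iterations.
     */
} while (new_rel_pages > vacrel->nonempty_pages && lock_waiter_detected);
```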


*Testing the theory and patch*
To validate this understanding, I was able to reproduce a similar
scenario by loading a test table with a significant amount of TOAST-able
data and then simulating a busy read workload against it concurrently
with a `VACUUM` operation. I have attached a SQL script to load the data
and a Ruby script to simulate the concurrent workload, in case anyone is
curious. The tests demonstrated how the truncation phase could indeed
enter extended periods of repeatedly attempting to acquire the
`AccessExclusiveLock` and performing the backward scan, only to be
interrupted by the simulated concurrent workload.

Considering the potential for the truncation phase to cause prolonged
lock contention under these conditions, I am wondering whether it would
be beneficial to explicitly cap the number of retries for the truncation
process within the `do...while` loop in `lazy_truncate_heap` when
interruptions by lock waiters occur.

Because I was already too deep in the code, I decided to write a small
patch to introduce a limit on the retries and observe the behavior.
Specifically, it adds a counter `truncate_interruption_retry_count` and
checks it against a new constant `VACUUM_TRUNCATE_INTERRUPTION_MAX_RETRIES`
(which for now I have set to 3). If the backward scan in
`count_nondeletable_pages` is interrupted by lock waiters this many times,
`VACUUM` will cease further truncation attempts for the current `VACUUM`
cycle on that table.
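
The change is roughly of the following shape (a sketch only, using the
names described above; the attached patch is the actual version and may
differ in its details):

```
/*
 * Sketch of the retry cap in lazy_truncate_heap() -- the attached patch
 * is the authoritative version; details here may differ.
 */
#define VACUUM_TRUNCATE_INTERRUPTION_MAX_RETRIES    3

    int         truncate_interruption_retry_count = 0;

    do
    {
        /*
         * Existing logic: acquire AccessExclusiveLock, run
         * count_nondeletable_pages(), truncate, release the lock.
         */

        if (lock_waiter_detected &&
            ++truncate_interruption_retry_count >=
            VACUUM_TRUNCATE_INTERRUPTION_MAX_RETRIES)
        {
            ereport(vacrel->verbose ? INFO : DEBUG2,
                    (errmsg("table \"%s\": stopping truncate after %d retries due to repeated conflicting lock requests",
                            vacrel->relname,
                            truncate_interruption_retry_count)));
            break;
        }
    } while (new_rel_pages > vacrel->nonempty_pages && lock_waiter_detected);
```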

I was able to perform the test on a similar setup and the results are as
follows (notice the `stopping truncate after 3 retries due to repeated
conflicting lock requests` messages coming from the patch):

```
VACUUM (VERBOSE) toast_target_table;
INFO:  vacuuming "postgres.public.toast_target_table"
INFO:  table "toast_target_table": suspending truncate due to conflicting
lock request
INFO:  table "toast_target_table": truncated 144032 to 143744 pages
INFO:  table "toast_target_table": suspending truncate due to conflicting
lock request
INFO:  table "toast_target_table": truncated 143744 to 143456 pages
INFO:  table "toast_target_table": suspending truncate due to conflicting
lock request
INFO:  table "toast_target_table": truncated 143456 to 143168 pages
INFO:  table "toast_target_table": stopping truncate after 3 retries due to
repeated conflicting lock requests
INFO:  finished vacuuming "postgres.public.toast_target_table": index
scans: 0
pages: 864 removed, 143168 remain, 1 scanned (0.00% of total), 0 eagerly
scanned
tuples: 0 removed, 7544 remain, 0 are dead but not yet removable
removable cutoff: 2985896, which was 1 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
visibility map: 0 pages set all-visible, 0 pages set all-frozen (0 were
all-visible)
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item
identifiers removed
index "toast_target_table_pkey": pages: 28132 in total, 0 newly deleted,
27646 currently deleted, 27646 reusable
avg read rate: 13.423 MB/s, avg write rate: 0.006 MB/s
buffer usage: 28172 hits, 28571 reads, 12 dirtied
WAL usage: 22 records, 15 full page images, 120289 bytes, 0 buffers full
system usage: CPU: user: 0.13 s, system: 3.55 s, elapsed: 16.62 s
INFO:  vacuuming "postgres.pg_toast.pg_toast_649469"
INFO:  table "pg_toast_649469": suspending truncate due to conflicting lock
request
INFO:  table "pg_toast_649469": truncated 2553888 to 2552416 pages
INFO:  table "pg_toast_649469": suspending truncate due to conflicting lock
request
INFO:  table "pg_toast_649469": truncated 2552416 to 2550560 pages
INFO:  table "pg_toast_649469": suspending truncate due to conflicting lock
request
INFO:  table "pg_toast_649469": truncated 2550560 to 2548832 pages
INFO:  table "pg_toast_649469": stopping truncate after 3 retries due to
repeated conflicting lock requests
INFO:  finished vacuuming "postgres.pg_toast.pg_toast_649469": index scans:
0
pages: 5056 removed, 2548832 remain, 1 scanned (0.00% of total), 0 eagerly
scanned
tuples: 0 removed, 14008 remain, 0 are dead but not yet removable
removable cutoff: 2985897, which was 1 XIDs old when operation ended
new relfrozenxid: 2985897, which is 2 XIDs ahead of previous value
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
visibility map: 0 pages set all-visible, 0 pages set all-frozen (0 were
all-visible)
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item
identifiers removed
index "pg_toast_649469_index": pages: 56259 in total, 0 newly deleted,
55293 currently deleted, 55293 reusable
avg read rate: 5.067 MB/s, avg write rate: 0.002 MB/s
buffer usage: 56371 hits, 61075 reads, 19 dirtied
WAL usage: 28 records, 20 full page images, 165414 bytes, 0 buffers full
system usage: CPU: user: 1.67 s, system: 59.10 s, elapsed: 94.16 s
```

All that said, I wanted to bring some visibility to this part of the
problem and also learn from the community about potential ways we could
solve it. One option is to cap the number of retries, as in the attached
patch. The reason I feel something like this is beneficial is that it
prioritizes overall system availability by preventing the lazy
truncation phase from repeatedly acquiring and holding the
`AccessExclusiveLock` over an extended period of time.

The patch is definitely more of a proof of concept for now (it lacks
tests and perhaps some configurable parameters), and more importantly I
am curious to hear from the community whether this feels like a
worthwhile change or approach, assuming my understanding of the problem
makes sense. Or perhaps there are other ways to solve this problem?

Thank you
Shayon Mukherjee
# To run this script you will need Ruby installed, then:
# 1. Ensure you have the 'pg' gem installed: gem install pg
# 2. Set environment variables for database connection if not using defaults:
#    PGHOST, PGPORT, PGDATABASE, PGUSER, PGPASSWORD
# 3. Execute the script: ruby workload_simulator_toast_lock_timeout.rb

require 'pg'
require 'thread'

# --- Configuration ---
DB_PARAMS = {
  host: ENV['PGHOST'] || 'localhost',
  port: ENV['PGPORT'] || 5432,
  dbname: ENV['PGDATABASE'] || 'postgres',
  user: ENV['PGUSER'] || 'shayon',
  password: ENV['PGPASSWORD']
}.freeze

TABLE_NAME = 'toast_target_table'.freeze
NUM_WORKER_THREADS = 15
WORK_DURATION_SECONDS = 7200
LOCK_TIMEOUT_MS = 500

def attempt_toast_access_with_timeout(conn, table_name, target_id, timeout_ms)
  begin
    conn.transaction do |txn_conn|
      txn_conn.exec("SET LOCAL lock_timeout = '#{timeout_ms}ms';")
      result = txn_conn.exec_params("SELECT length(payload) FROM #{table_name} WHERE id = $1", [target_id])
      puts "[#{Thread.current.object_id}] Accessed ID #{target_id}, result: #{result.to_a}"
    end
    true
  rescue PG::LockNotAvailable => e
    puts "[Worker #{Thread.current.object_id}] Lock timeout (#{timeout_ms}ms) accessing TOAST for ID #{target_id} on table #{table_name}."
    false
  rescue PG::QueryCanceled => e
    puts "[Worker #{Thread.current.object_id}] Query canceled (likely lock timeout: #{timeout_ms}ms) for ID #{target_id} on table #{table_name}."
    false
  rescue PG::ConnectionBad => e
    puts "[Worker #{Thread.current.object_id}] Connection bad: #{e.message.lines.first.strip}"
    raise
  rescue PG::Error => e
    puts "[Worker #{Thread.current.object_id}] Other PG::Error for ID #{target_id} on table #{table_name}: #{e.message.lines.first.strip}"
    false
  end
end

def worker_task(worker_id, table_name)
  conn = nil
  connection_retries = 0
  max_connection_retries = 5
  max_id_cache = nil
  last_max_id_check = Time.now - 60
  ops_this_connection = 0

  loop do
    begin
      conn = PG.connect(DB_PARAMS)
      connection_retries = 0
      ops_this_connection = 0
      start_time = Time.now

      while (Time.now - start_time) < WORK_DURATION_SECONDS
        if ops_this_connection % 500 == 0
          if !max_id_cache || (Time.now - last_max_id_check > 10)
            max_id_result = conn.exec("SELECT MAX(id) FROM #{table_name}")
            if max_id_result && max_id_result.ntuples > 0 && max_id_result.getvalue(0, 0)
              current_max = max_id_result.getvalue(0, 0).to_i
              max_id_cache = current_max if current_max > 0
              last_max_id_check = Time.now
            else
              max_id_cache = nil
            end
          end
        end

        target_id = max_id_cache && max_id_cache > 0 ? rand(1..max_id_cache) : rand(1..100)

        attempt_toast_access_with_timeout(conn, table_name, target_id, LOCK_TIMEOUT_MS)

        ops_this_connection += 1
      end
      break
    rescue PG::ConnectionBad
      connection_retries += 1
      if connection_retries > max_connection_retries
        puts "[Worker #{worker_id}] Max connection retries reached. Exiting."
        break
      end
      wait_time = 2**connection_retries
      puts "[Worker #{worker_id}] Connection lost. Retrying in #{wait_time}s..."
      sleep wait_time
    ensure
      # Close the connection once per iteration (guard against double-close).
      conn.close if conn && !conn.finished?
    end
  end
end

# --- Main Execution ---
puts "Starting #{NUM_WORKER_THREADS} EXTREME TOAST ACCESS workload generator threads."
puts "Queries will have a lock timeout of #{LOCK_TIMEOUT_MS}ms."
puts "Targeting table: #{TABLE_NAME} on #{DB_PARAMS[:host]}:#{DB_PARAMS[:port]}/#{DB_PARAMS[:dbname]}"
puts "---"

threads = []
NUM_WORKER_THREADS.times do |i|
  threads << Thread.new do
    worker_task(i + 1, TABLE_NAME)
  end
end

threads.each(&:join)
puts "All workload threads finished."

Attachment: v18-0001-Limit-VACUUM-truncation-retries-after-scan-inter.patch
Description: Binary data

Attachment: load-data.sql
Description: Binary data
