This patch adds a few examples to demonstrate the following:

* The existence of the ctid column on every table
* The utility of ctds in self joins
* A practical usage of SKIP LOCKED

The reasoning for this is a bit long, but if you're interested, keep
reading.

In the past, there has been a desire to see a LIMIT clause of some sort on
UPDATE and DELETE statements. The reason for this usually stems from having
a large archive or backfill operation that if done in one single
transaction would overwhelm normal operations, either by the transaction
failing outright, locking too many rows, flooding the WAL causing replica
lag, or starving other processes of limited I/O.

The reasons for not adding a LIMIT clause are pretty straightforward: it
isn't in the SQL Standard, and UPDATE/DELETE operations are unordered
operations, so updating 1000 rows randomly isn't a great idea. The people
wanting the LIMIT clause were undeterred by this, because they know that
they intend to keep issuing updates until they run out of rows to update.

Given these limitations, I would write something like this:

WITH doomed AS (
    SELECT t.id
    FROM my_table AS t
    WHERE t.expiration_date < :'some_archive_date'
    FOR UPDATE SKIP LOCKED
    LIMIT 1000 )
DELETE FROM my_table
WHERE id IN (SELECT id FROM doomed );

This wouldn't interfere with any other updates, so I felt good about it
running when the system was not-too-busy. I'd then write a script to run
that in a loop, with sleeps to allow the replicas a chance to catch their
breath. Then, when the rowcount finally dipped below 1000, I'd issue the
final

DELETE FROM my_table WHERE expiration_date < :'some_archive_date';

And this was ok, because at that point I have good reason to believe that
there are at most 1000 rows lingering out there, so waiting on locks for
those was no big deal.

But a query like this involves one scan along one index (or worse, a seq
scan) followed by another scan, either index or seq. Either way, we're
taking up a lot of cache with rows we don't even care about.

Then in v12, the query planner got hip to bitmap tidscans, allowing for
this optimization:

WITH doomed AS (
    SELECT t.ctid AS tid
    FROM my_table AS t
    WHERE t.expiration_date < :'some_archive_date'
    FOR UPDATE SKIP LOCKED
    LIMIT 1000 )
DELETE FROM my_table
USING doomed WHERE my_table.ctid = doomed.tid;

And this works pretty well, especially if you set up a partial index to
meet the quals in the CTE. But we don't document this anywhere, and until
UPDATE and DELETE get a LIMIT clause, we probably should document this
workaround.
From 209fd8abe50603e85ca0cc07aecd72b87889e757 Mon Sep 17 00:00:00 2001
From: coreyhuinker <corey.huinker@gmail.com>
Date: Tue, 13 Jun 2023 13:00:40 -0400
Subject: [PATCH v1] Add examples that highlight the usage of the system column
 ctid in self-joins.

Currently we do not show any examples of using ctid anywhere, nor do we
address the often-requested but problematic use case of having a LIMIT
clause on UPDATE and DELETE statements. These examples are a subtle way
of addressing both those concerns.
---
 doc/src/sgml/ref/delete.sgml | 24 ++++++++++++++++++++++++
 doc/src/sgml/ref/select.sgml | 21 +++++++++++++++++++++
 doc/src/sgml/ref/update.sgml | 23 +++++++++++++++++++++++
 3 files changed, 68 insertions(+)

diff --git a/doc/src/sgml/ref/delete.sgml b/doc/src/sgml/ref/delete.sgml
index 1b81b4e7d7..cca9138843 100644
--- a/doc/src/sgml/ref/delete.sgml
+++ b/doc/src/sgml/ref/delete.sgml
@@ -234,6 +234,30 @@ DELETE FROM films
    In some cases the join style is easier to write or faster to
    execute than the sub-select style.
   </para>
+  <para>
+   In situations where a single operation would consume too many resources, it
+   may be desirable to break up a large <command>DELETE</command> into multiple
+   separate commands. The <glossterm linkend="glossary-sql-standard">SQL
+   standard</glossterm> does not define a <literal>LIMIT</literal> clause for
+   <command>DELETE</command> operations, but it is possible get the equivalent
+   functionality through the <literal>USING</literal> clause to a
+   <link linkend="queries-with">Common Table Expression</link> which identifies
+   a subset of rows to be deleted, locks those rows, and returns their system
+   column <link linkend="ddl-system-columns-ctid">ctid</link> values:
+<programlisting>
+WITH delete_batch AS (
+  SELECT l.ctid
+  FROM user_logs AS l
+  WHERE l.status = 'archived'
+  ORDER BY l.creation_date
+  LIMIT 10000
+  FOR UPDATE )
+DELETE FROM user_logs AS ul
+USING delete_branch AS del
+WHERE ul.ctid = del.ctid
+</programlisting>
+  This allows for flexible search criteria within the CTE and an efficient self-join.
+  </para>
  </refsect1>
 
  <refsect1>
diff --git a/doc/src/sgml/ref/select.sgml b/doc/src/sgml/ref/select.sgml
index 0ee0cc7e64..9d7c3d5c41 100644
--- a/doc/src/sgml/ref/select.sgml
+++ b/doc/src/sgml/ref/select.sgml
@@ -1676,6 +1676,27 @@ SELECT * FROM (SELECT * FROM mytable FOR UPDATE) ss WHERE col1 = 5;
     condition is not textually within the sub-query.
    </para>
 
+   <para>
+    In cases where a <acronym>DML</acronym> operation involving many rows
+    must be performed, and that table experiences numerous other simultaneous
+    <acronym>DML</acronym> operations, a <literal>FOR UPDATE</literal> clause
+    used in conjunction with <literal>SKIP LOCKED</literal> can be useful for
+    performing partial <acronym>DML</acronym> operations:
+
+<programlisting>
+WITH mods AS (SELECT ctid FROM mytable
+              WHERE status = 'active' AND retries > 10
+              ORDER BY id FOR UPDATE SKIP LOCKED)
+UPDATE mytable SET status = 'failed'
+FROM mods WHERE mytable.ctid = mods.ctid
+</programlisting>
+
+    This allows the <acronym>DML</acronym> operation to be performed in parts, avoiding locking,
+    until such time as the set of rows that remain to be modified is small enough
+    that the locking will not affect overall performance, at which point the same
+    statement can be issued without the <literal>SKIP LOCKED</literal> clause to ensure
+    that no rows were overlooked.
+   </para>
   <para>
    Previous releases failed to preserve a lock which is upgraded by a later
    savepoint.  For example, this code:
diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index 2ab24b0523..c045d5dd49 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -442,6 +442,29 @@ COMMIT;
 <programlisting>
 UPDATE films SET kind = 'Dramatic' WHERE CURRENT OF c_films;
 </programlisting></para>
+
+  <para>
+   To break up a large <command>UPDATE</command> into more manageable pieces,
+   it is possible to do a self-join on the
+   <link linkend="ddl-system-columns-ctid">ctid</link> system column using
+   <link linkend="queries-with">Common Table Expression</link> to limit the
+   number of rows to be updated:
+<programlisting>
+WITH exceeded_max_retries AS (
+  SELECT w.ctid
+  FROM work_item AS w
+  WHERE w.status = 'active'
+  AND w.num_retries > 10
+  ORDER BY w.retry_timestamp
+  FOR UPDATE
+  LIMIT 5000
+)
+UPDATE work_item
+SET status = 'failed'
+FROM exceeded_max_retries AS emr
+WHERE work_item.ctid = emr.ctid
+</programlisting>
+  </para>
  </refsect1>
 
  <refsect1>
-- 
2.41.0

Reply via email to