A Case For Inlining Immediate Referential Integrity Checks
----------------------------------------------------------
The following is an overview of how Postgres currently implements
referential integrity, some of the problems with that architecture,
attempted solutions for those problems, and a suggestion of another
possible solution.

Notes On Notation and Referential Integrity In General
------------------------------------------------------

All referential integrity is ultimately of this form:

    R(X) => T(Y)

where one referencing table R has a set of columns X that references
a set of columns Y which comprise a unique constraint on a target
table T. Note that the Y-set of columns is usually the primary key of
T, but does not have to be.

Referential integrity checks fall into two basic categories, Insert
and Update, and can either be checked Immediately following the
statement, or be Deferred to the end of the transaction.

The Insert check is fairly straightforward. Any insert to R, or
update of R that modifies [1] any column in X, is checked to see if
all of the X columns are NOT NULL, and if so, a lookup is done on T
to find a row with a matching Y tuple. If none is found, an error is
raised.

The Update check is more complicated, as it covers any UPDATE
operation that modifies [1] any column in Y where all of the values
of Y are NOT NULL, as well as any DELETE operation where all of the
columns of Y are NOT NULL. For any Update check, the table R is
scanned for any X tuples matching the previous Y values of the
affected row, and for any matches found, an action is taken. That
action can be to fail the operation (NO ACTION, RESTRICT), to update
the X values to fixed values (SET NULL, SET DEFAULT), or to delete
those rows in R (CASCADE).

Current Implementation
----------------------

Currently, these operations are handled via per-row triggers. In the
general case, one trigger is placed on R for INSERT operations, one
trigger is placed on T for DELETE operations, and an additional
trigger is placed on T for UPDATE operations that affect any column
of Y.

The Insert trigger functions invoke the C function RI_FKey_check()
[2]. The trigger is fired unconditionally, and the trigger itself
determines whether there is a referential integrity check to be made
or not. Ultimately this trigger invokes an SPI query of the form

    SELECT 1 FROM <T> WHERE (<X = Y>) FOR KEY SHARE

This query is generally quite straightforward to the planner, as it
becomes either a scan of a single unique index, or a partition search
followed by a scan of a single unique index. The operation succeeds
if a row is found, and fails if it does not.

The Update trigger functions are implemented with a set of C
functions RI_[noaction|restrict|cascade|setnull|setdefault]_[upd|del]()
[3]. These functions each generate a variation of an SPI query in one
of the following forms:

    cascade:           DELETE FROM <R> WHERE <X = Y>
    restrict/noaction: SELECT 1 FROM <R> WHERE <X = Y> FOR KEY SHARE
    setnull:           UPDATE <R> SET x1 = NULL, ... WHERE <X = Y>
    setdefault:        UPDATE <R> SET x1 = DEFAULT, ... WHERE <X = Y>

These triggers are either executed at statement time (Immediate) or
are queued for execution as a part of the transaction commit
(Deferred).
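To make that concrete, here is a minimal sketch. The table and column
names (customers, orders) are hypothetical, and the commented queries
simply paraphrase the trigger-issued forms shown above:

    -- Hypothetical example: orders is R, customers is T,
    -- X = (customer_id), Y = (id).
    CREATE TABLE customers (
        id bigint PRIMARY KEY
    );

    CREATE TABLE orders (
        order_id    bigint PRIMARY KEY,
        customer_id bigint REFERENCES customers (id)
    );

    -- Each INSERT INTO orders fires a row trigger that effectively runs:
    --     SELECT 1 FROM customers WHERE id = $1 FOR KEY SHARE;
    -- Each row deleted from customers fires a trigger that, for the
    -- default NO ACTION, effectively runs:
    --     SELECT 1 FROM orders WHERE customer_id = $1 FOR KEY SHARE;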
Problems With The Current Implementation
----------------------------------------

The main problems with this architecture come down to visibility and
performance.

The foremost problem with this implementation is that these extra
queries are not visible to the end user in any way. It is possible to
infer that these functions executed by looking at the constraint
definitions and comparing pg_stat_user_tables or
pg_stat_user_indexes before and after the operation, but in general
the time spent in these functions accrues to the DML statement
(Immediate) or COMMIT statement (Deferred) without any insight into
what took place. This is especially vexing in situations where an
operation as simple as "DELETE FROM highly_referenced_table WHERE id
= 1" hits the primary key index, but takes several seconds to run.

The performance of Insert operations is generally not too bad, in
that the query boils down to an Index Scan for a single row. The
problem, however, is that this query must be executed for every row
inserted. The query itself is only planned once, and that query plan
is cached for later re-use. That removes some of the query overhead,
but it also incurs a growing cache of plans which can create memory
pressure if the number of foreign keys is large, and indeed this has
become a problem for at least one customer [4]. Some profiling of the
RI check indicated that about half of the time of the insert was
spent in SPI functions that could be bypassed if the C function
called index_beginscan and index_rescan directly [5]. These
indications bore out when Amit Langote wrote a patch [6] which finds
the designated index from the constraint (with some drilling through
partitions if need be) and then invokes the scan functions directly.
This method showed about a halving of the time involved, while also
avoiding the memory pressure from many cached plans [7].

The performance of Delete operations is far less certain, and the
potential performance impact is far greater. There are four main
reasons for this. First, there is no guarantee of a suitable index on
the R table, let alone an optimal one, so a Sequential Scan is
possible. Second, the operation may match multiple rows in the R
table, so the scan must exhaust the whole table/index of R. Third,
this already sub-optimal operation must be performed for every row
affected by the Delete operation, with no ability to coordinate
between row triggers, which means that in a pathological case a
large table is sequentially scanned once per row deleted. Lastly, the
cascade, setnull and setdefault variants have the potential to fire
more triggers.

Attempts at Statement Level Triggers
------------------------------------

Back in 2016, Kevin Grittner made a mock-up of handling RI delete
checks with statement-level triggers [8] and got a 98% reduction in
execution time. The source of the benefit was not hard to see: doing
one query filtering for N values is almost always faster than doing N
queries filtering for one value each. It especially helps the most
pathological case: where the row-trigger implementation would do a
Sequential Scan per row, a single hash join against the transition
table becomes possible.

It was with that thinking that I made my own attempt at
re-implementing RI with statement-level triggers [9]. This effort was
incomplete, and raised questions from Alvaro Herrera [10] and Kevin
Grittner [11]. Also, the implementation was only seeing about a 25%
improvement instead of the 98% that was hoped for. Antonin Houska
found some bugs in the patch [12] and suggested keeping row-level
triggers. The chief source of baggage seemed to be that the
transition tables contained the entire before/after rows, not just
the columns needed to process the trigger, and that level of memory
consumption was quite significant.
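To illustrate the set-based idea, here is a user-level sketch of the
concept, not the internal implementation from either patch. It
continues the hypothetical customers/orders example above, imagining
the NO ACTION check performed once per statement via a transition
table. All names are made up:

    -- A statement-level RESTRICT-style check: one probe of the
    -- referencing table for all deleted keys at once, instead of one
    -- probe per deleted row.
    CREATE FUNCTION customers_restrict_del() RETURNS trigger AS $$
    BEGIN
        IF EXISTS (SELECT 1
                     FROM orders o
                     JOIN old_customers d ON o.customer_id = d.id) THEN
            RAISE EXCEPTION 'orders still reference deleted customers';
        END IF;
        RETURN NULL;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER customers_del_stmt
        AFTER DELETE ON customers
        REFERENCING OLD TABLE AS old_customers
        FOR EACH STATEMENT
        EXECUTE FUNCTION customers_restrict_del();

With N rows deleted in one statement, the join above can be planned
as a single hash join against the transition table, which is
essentially the effect Kevin's mock-up exploited.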
Some time later, Antonin followed through with his idea for how to
improve the patch [13], which sparked a lively discussion where it
was observed that this discussion had happened before, in other
forms, at least as far back as 2013 [14]. The chief challenge is how
to optimize a many-row update without penalizing a single-row update.
All of that discussion, however, was about the architecture and
performance of triggers, without any mention of improving the
visibility of RI checks to the user.

Current State Of Affairs
------------------------

Amit's patch to remove all SPI from Insert triggers [6] appears to
work, is about as minimally invasive as possible under the
circumstances, and I think it is important that it be included for
v14. However, concerns have been raised about the patch bypassing
permission checks, and about how it would handle foreign tables,
alternate access methods, etc.

Even if all issues are resolved with Amit's patch, it does nothing
for Updates, and nothing for visibility. As was mentioned before, the
chief problem for Update performance is that statement-level triggers
are too much overhead for single-row updates, and row-level triggers
need extra overhead to see if they can find common cause with other
trigger firings. It has occurred to me that the solution to both of
these issues is to, where possible, roll the trigger operation into
the query itself, and in cases where that isn't possible, at least
indicate that a trigger could be fired.

The Proposed Changes
--------------------

The changes I'm proposing boil down to the following:

1. Add informational nodes to query plans to indicate that a certain
   named trigger would be fired, whether it is a row or statement
   trigger, and in the case of per-row triggers, how many firings are
   expected.
2. To the extent possible, accrue time/cost expended on triggers to
   that node.
3. Make the planner capable of seeing these triggers, and where
   possible, bring the logic of the RI check inside the query itself.

Adding Trigger Nodes to A Query Plan
------------------------------------

When a query that modifies a table is being planned, the planner
would inspect the triggers on that table, and would create a node for
each named trigger that would be fired.

1. Disabled triggers are ignored outright.
2. User-defined triggers and Deferred RI triggers will simply be
   named, along with an estimate of the number of firings that will
   occur.
3. RI Insert triggers that are in immediate mode will be inlined as a
   node ReferentialIntegrityInsert.
4. RI Update/Delete triggers of type ON DELETE RESTRICT and ON DELETE
   NO ACTION that are in immediate mode will be inlined as a node
   ReferentialIntegrityDeleteCheck.
5. RI Update/Delete triggers of type CASCADE will be inlined as a
   node ReferentialIntegrityDeleteCascade.
6. RI Update/Delete triggers of type SET NULL and SET DEFAULT will be
   inlined as a node ReferentialIntegrityUpdateSet.

Note that the nodes ReferentialIntegrityDeleteCascade and
ReferentialIntegrityUpdateSet will themselves modify tables which can
in turn have RI constraints that would create more
ReferentialIntegrityDeleteCascade and ReferentialIntegrityUpdateSet
nodes. In theory, this could continue infinitely. Practically
speaking, such situations either have a DEFERRED constraint somewhere
in the loop, or one of the references starts off initially NULL.
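As an example of the kind of loop in question, consider this
hypothetical pair of tables, each of which cascades deletes into the
other:

    -- Hypothetical circular cascade: deleting from a cascades into b,
    -- which cascades back into a, and so on.
    CREATE TABLE a (
        id   bigint PRIMARY KEY,
        b_id bigint
    );
    CREATE TABLE b (
        id   bigint PRIMARY KEY,
        a_id bigint REFERENCES a (id) ON DELETE CASCADE
    );
    ALTER TABLE a
        ADD CONSTRAINT a_b_id_fkey
        FOREIGN KEY (b_id) REFERENCES b (id) ON DELETE CASCADE;

Naively inlining DELETE FROM a would nest a
ReferentialIntegrityDeleteCascade node for b, which would nest
another for a, and so on without bound.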
Still, the planner can't see that, so it makes sense to put a
depth-limit on such things, and any UPDATEs/DELETEs that cascade
beyond that limit are simply not inlined, and are instead queued the
way user-defined immediate triggers are.

ReferentialIntegrityInsert Node
-------------------------------

This node does a lookup of every tuple in the input set against the
unique index (normal or partitioned) on the referenced table. The
node would have the option of first doing a distinct on the set of
inputs, or instead might choose to cache the values already looked up
to avoid duplicate index traversals, in which case the operation
would more closely resemble a LEFT OUTER JOIN where, instead of
returning a null row on a miss, the query would instead fail with an
RI violation.

ReferentialIntegrityDeleteCheck Node
------------------------------------

In many ways, this node is the opposite of
ReferentialIntegrityInsert, in the sense that it fails if there IS a
match found rather than if there is not. The power of this node lies
in the fact that the planner can see how many tuples will be input,
and can decide on a nested loop traversal, a hash join, or a merge
join if amenable indexes are present.

ReferentialIntegrityDeleteCascade Node
--------------------------------------

This node amounts to

    DELETE FROM referencing_table WHERE id IN ( set_of_input_tuples )

with no returning clause and no need for one. The node cannot itself
filter any rows, nor can it raise any errors, but the RI constraints
of referencing_table might raise errors.

ReferentialIntegrityUpdateSet Node
----------------------------------

This node amounts to

    UPDATE referencing_table SET col = NULL|DEFAULT
     WHERE id IN ( set_of_input_tuples )

with no returning clause and no need for one. The node cannot itself
filter any rows, nor can it raise any errors, but the RI constraints
of referencing_table might raise errors.

Advantages To This Approach
---------------------------

1. Visibility into triggers is provided, regardless of which triggers
   are eligible for inlining.
2. All triggers have a fall-back case of operating exactly as they do
   now.
3. Inlining could be enabled/disabled by a session setting or other
   GUC.
4. The planner itself has visibility into the number of rows
   affected, and therefore can choose single-row vs set operations as
   appropriate.

What Might This Look Like?
--------------------------

Say there's a fact table class_registration with foreign keys to the
dimension tables class, student, and semester.
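For concreteness, assume a schema like this. The column names and the
ON DELETE CASCADE on the class reference are my assumptions, and the
constraint names are chosen to match the plans below:

    -- Hypothetical dimension and fact tables.
    CREATE TABLE class (
        class_id bigint CONSTRAINT class_pk PRIMARY KEY
    );
    CREATE TABLE student (
        student_id bigint NOT NULL CONSTRAINT student_uq UNIQUE
    );
    CREATE TABLE semester (
        semester_id bigint CONSTRAINT semester_pk PRIMARY KEY
    );
    CREATE TABLE class_registration (
        class_id    bigint REFERENCES class (class_id) ON DELETE CASCADE,
        student_id  bigint REFERENCES student (student_id),
        semester_id bigint REFERENCES semester (semester_id)
    );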
Given that, an insert would show something like this:

    EXPLAIN INSERT INTO class_registration VALUES (...);
                                         QUERY PLAN
    ------------------------------------------------------------------------------------------
     Insert on class_registration  (cost=0.00..0.01 rows=1 width=40)
       ->  Result  (cost=0.00..0.01 rows=1 width=4)
       ->  ReferentialIntegrityInsert: class_registration_class_id_fkey
             ->  IndexOnlyScan class_pk (...)
       ->  ReferentialIntegrityInsert: class_registration_student_id_fkey
             ->  IndexOnlyScan student_uq (...)
       ->  ReferentialIntegrityInsert: class_registration_semester_id_fkey
             ->  IndexOnlyScan semester_pk (...)

Or, in the case of deferred constraints:

    EXPLAIN INSERT INTO class_registration VALUES (...);
                                         QUERY PLAN
    ------------------------------------------------------------------------------------------
     Insert on class_registration  (cost=0.00..0.01 rows=1 width=40)
       ->  Result  (cost=0.00..0.01 rows=1 width=4)
       ->  ForeignKey Check: class_registration_class_id_fkey (RI_FKey_check_ins)
             ->  Row Deferred (Execute 1 Time)
       ->  ForeignKey Check: class_registration_student_id_fkey (RI_FKey_check_ins)
             ->  Row Deferred (Execute 1 Time)
       ->  ForeignKey Check: class_registration_semester_id_fkey (RI_FKey_check_ins)
             ->  Row Deferred (Execute 1 Time)

Is there anything actionable for the user? Not there, but now they
can see the extra lifting they were asking the database to perform.
The actual tuning opportunities would be on DELETEs and UPDATEs:

    EXPLAIN DELETE FROM class WHERE class_id <= 4;
                                                   QUERY PLAN
    -----------------------------------------------------------------------------------------------------------
     Delete on class  (cost=0.00..0.04 rows=3 width=60)
       ->  Index Only Scan on class_pk (...)
       ->  ReferentialIntegrityDeleteCascade: class_registration_class_id_fkey
             Hash Cond: (r1.class_id = class.class_id)
             ->  Hash (...)
                   ->  Seq Scan on class_registration r1 (...)
             ->  Hash (...)
                   ->  Materialize Affected Tuples

The reader could see that an index was missing, and work to fix it.
The planner, for its part, saw that a Seq Scan would be needed, but
rather than scanning once per row as per-row triggers would have
done, it does the scan once, hashes it, and then does a comparison
against a similar hash of the rows deleted. It's possible that the
plan would also view the DELETE as a CTE which is then fed as input
to progressive layers of RI checks.

Conclusion
----------

I think there are improvements to be made in RI checks in terms of
performance, visibility, and tunability. It is my hope that this
sparks discussion that leads to better performance and visibility of
Referential Integrity.

Footnotes:

[1] Updates that set the column value to the value it already has do
    not require an integrity check.
[2] https://doxygen.postgresql.org/ri__triggers_8c.html#a14c6f5e65d657bd5b4fd45769b4c0197
[3] https://doxygen.postgresql.org/ri__triggers_8c.html#a43b79b7b3f05fc8bec34e5fb6c37ba47
    is the first alphabetically
[4] https://www.postgresql.org/message-id/cakkq508z6r5e3jdqhfpwszsajlpho3oyyoamfesaupto5vg...@mail.gmail.com
[5] https://www.postgresql.org/message-id/20201126.121818.26523414172308697.horikyota....@gmail.com
[6] https://www.postgresql.org/message-id/ca+hiwqgkfjfydeq5vhph6eqpkjsbfpddy+j-kxyfepqedts...@mail.gmail.com
[7] https://www.postgresql.org/message-id/CAKkQ50_h8TcBkY5KYQfneejrZ_d3veFcK3nGmN-WxucEu_QrCw%40mail.gmail.com
[8] https://www.postgresql.org/message-id/CACjxUsM4s9%3DCUmPU4YFOYiD5f%3D2ULVDBjuFSo20Twe7KbUe8Mw%40mail.gmail.com
[9] https://www.postgresql.org/message-id/CADkLM=dfunhniz9pop1pqa+hph4t9wuwnjwsf6uavnxcgua...@mail.gmail.com
[10] https://www.postgresql.org/message-id/20181217172729.mjfkflaelii2boaj%40alvherre.pgsql
[11] https://www.postgresql.org/message-id/CACjxUsOY-CXoNMPit%2Bk1PC_5LjYkvYPz2VJwK5YDDtzRh4J7vw%40mail.gmail.com
[12] https://www.postgresql.org/message-id/17100.1550662686%40localhost
[13] https://www.postgresql.org/message-id/1813.1586363881@antos
[14] https://www.postgresql.org/message-id/flat/CA%2BU5nMLM1DaHBC6JXtUMfcG6f7FgV5mPSpufO7GRnbFKkF2f7g%40mail.gmail.com