Forking this thread https://www.postgresql.org/message-id/20181227132417.xe3oagawina7775b%40alvherre.pgsql
On Wed, Dec 26, 2018 at 01:09:39PM -0500, Robert Haas wrote: > ALTER TABLE already has a lot of logic that is oriented towards being > able to do multiple things at the same time. If we added CLUSTER, > VACUUM FULL, and REINDEX to that set, then you could, say, change a > data type, cluster, and change tablespaces all in a single SQL > command. On Thu, Dec 27, 2018 at 10:24:17AM -0300, Alvaro Herrera wrote: > I think it would be valuable to have those ALTER TABLE variants that rewrite > the table do so using the cluster order, if there is one, instead of the heap > order, which is what it does today. That's a neat idea. I haven't yet fit all of ALTERs processing logic in my head ... but there's an issue that ALTER (unlike CLUSTER) needs to deal with column type promotion, so the indices may need to be dropped and recreated. The table rewrite happens AFTER dropping indices (and all other processing), but the clustered index can't be scanned if it's just been dropped. I handled that by using a tuplesort, same as heapam_relation_copy_for_cluster. Experimental patch attached. With clustered ALTER: template1=# DROP TABLE t; CREATE TABLE t AS SELECT generate_series(1,999)i; CREATE INDEX ON t(i DESC); ALTER TABLE t CLUSTER ON t_i_idx; ALTER TABLE t ALTER i TYPE bigint; SELECT * FROM t LIMIT 9; DROP TABLE SELECT 999 CREATE INDEX ALTER TABLE ALTER TABLE i ----- 999 998 997 996 995 994 993 992 991 (9 rows) 0001 patch is stolen from the nearby thread: https://www.postgresql.org/message-id/flat/20200207143935.GP403%40telsasoft.com It doesn't make much sense for ALTER to use a clustered index when rewriting a table, if doesn't also go to the effort to preserve the cluster property when rebuilding its indices. 0002 patch is included and not squished with 0003 to show the original implementation using an index scan (by not dropping indices on the old table, and breaking various things), and the evolution to tuplesort. Note, this doesn't use clustered order when rewriting only due to tablespace change. Alter currently does an AM specific block copy without looking at tuples. But I think it'd be possible to use tuplesort and copy if desired.
>From f93bbd6c30e883068f46ff86def28d0e66aea4f5 Mon Sep 17 00:00:00 2001 From: Justin Pryzby <pryz...@telsasoft.com> Date: Fri, 7 Feb 2020 22:06:57 -0600 Subject: [PATCH v1 1/3] Preserve CLUSTER ON index during ALTER rewrite Amit Langote and Justin Pryzby https://www.postgresql.org/message-id/flat/20200202161718.GI13621%40telsasoft.com --- src/backend/commands/tablecmds.c | 39 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 39 insertions(+) diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index b7c8d66..642a85c 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -490,6 +490,7 @@ static void ATPostAlterTypeParse(Oid oldId, Oid oldRelId, Oid refRelId, static void RebuildConstraintComment(AlteredTableInfo *tab, int pass, Oid objid, Relation rel, List *domname, const char *conname); +static void PreserveClusterOn(AlteredTableInfo *tab, int pass, Oid indoid); static void TryReuseIndex(Oid oldId, IndexStmt *stmt); static void TryReuseForeignKey(Oid oldId, Constraint *con); static ObjectAddress ATExecAlterColumnGenericOptions(Relation rel, const char *colName, @@ -11838,6 +11839,9 @@ ATPostAlterTypeParse(Oid oldId, Oid oldRelId, Oid refRelId, char *cmd, newcmd->def = (Node *) stmt; tab->subcmds[AT_PASS_OLD_INDEX] = lappend(tab->subcmds[AT_PASS_OLD_INDEX], newcmd); + + /* Preserve index's indisclustered property, if set. */ + PreserveClusterOn(tab, AT_PASS_OLD_INDEX, oldId); } else if (IsA(stm, AlterTableStmt)) { @@ -11874,6 +11878,9 @@ ATPostAlterTypeParse(Oid oldId, Oid oldRelId, Oid refRelId, char *cmd, rel, NIL, indstmt->idxname); + + /* Preserve index's indisclustered property, if set. */ + PreserveClusterOn(tab, AT_PASS_OLD_INDEX, indoid); } else if (cmd->subtype == AT_AddConstraint) { @@ -11997,6 +12004,38 @@ RebuildConstraintComment(AlteredTableInfo *tab, int pass, Oid objid, } /* + * For a table's index that is to be recreated due to PostAlterType + * processing, preserve its indisclustered property by issuing ALTER TABLE + * CLUSTER ON command on the table that will run after the command to recreate + * the index. + */ +static void +PreserveClusterOn(AlteredTableInfo *tab, int pass, Oid indoid) +{ + HeapTuple indexTuple; + Form_pg_index indexForm; + + Assert(OidIsValid(indoid)); + Assert(pass == AT_PASS_OLD_INDEX); + + indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(indoid)); + if (!HeapTupleIsValid(indexTuple)) + elog(ERROR, "cache lookup failed for index %u", indoid); + indexForm = (Form_pg_index) GETSTRUCT(indexTuple); + + if (indexForm->indisclustered) + { + AlterTableCmd *newcmd = makeNode(AlterTableCmd); + + newcmd->subtype = AT_ClusterOn; + newcmd->name = get_rel_name(indoid); + tab->subcmds[pass] = lappend(tab->subcmds[pass], newcmd); + } + + ReleaseSysCache(indexTuple); +} + +/* * Subroutine for ATPostAlterTypeParse(). Calls out to CheckIndexCompatible() * for the real analysis, then mutates the IndexStmt based on that verdict. */ -- 2.7.4
>From 50d1f41b7794d947293f5cec0d9d8406cbf05bbc Mon Sep 17 00:00:00 2001 From: Justin Pryzby <pryz...@telsasoft.com> Date: Sat, 1 Feb 2020 15:49:47 -0600 Subject: [PATCH v1 2/3] Allow ALTER TABLE to do an index scan, like CLUSTER The idea is to implement table rewrite in ALTER more similar to what CLUSTER does, to allow clustering during table-rewriting ALTER. Note, this does not do the AM-specific visibility checks that CLUSTER does, so will not clean up dead tuples. The indices are normally dropped before doing table rewrite, so index scan would be impossible. As a POC, this early implementation avoids dropping the indices. --- src/backend/commands/tablecmds.c | 67 +++++++++++++++++++++++++++++++++++++--- 1 file changed, 62 insertions(+), 5 deletions(-) diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 642a85c..44f94d2 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -364,6 +364,7 @@ static void ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode, AlterTableUtilityContext *context); static void ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode); +static Oid cluster_index(Relation rel); static AlteredTableInfo *ATGetQueueEntry(List **wqueue, Relation rel); static void ATSimplePermissions(Relation rel, int allowed_targets); static void ATWrongRelkindError(Relation rel, int allowed_targets); @@ -4294,7 +4295,7 @@ ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode, * multiple columns of a table are altered). */ if (pass == AT_PASS_ALTER_TYPE) - ATPostAlterTypeCleanup(wqueue, tab, lockmode); + ; // ATPostAlterTypeCleanup(wqueue, tab, lockmode); relation_close(rel, NoLock); } @@ -5005,6 +5006,36 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode, } } +/* Return the OID of the index on which the table was previously (SET) clustered */ +/* stolen from alter.c */ +static Oid cluster_index(Relation rel) +{ + ListCell *index; + + /* We need to find the index that has indisclustered set. */ + foreach(index, RelationGetIndexList(rel)) + { + HeapTuple idxtuple; + Form_pg_index indexForm; + Oid indexOid; + + indexOid = lfirst_oid(index); + idxtuple = SearchSysCache1(INDEXRELID, + ObjectIdGetDatum(indexOid)); + if (!HeapTupleIsValid(idxtuple)) + elog(ERROR, "cache lookup failed for index %u", indexOid); + indexForm = (Form_pg_index) GETSTRUCT(idxtuple); + if (indexForm->indisclustered) + { + ReleaseSysCache(idxtuple); + return indexOid; + } + ReleaseSysCache(idxtuple); + } + + return InvalidOid; +} + /* * ATRewriteTable: scan or rewrite one table * @@ -5015,6 +5046,7 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode) { Relation oldrel; Relation newrel; + Relation index; TupleDesc oldTupDesc; TupleDesc newTupDesc; bool needscan = false; @@ -5026,6 +5058,7 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode) BulkInsertState bistate; int ti_options; ExprState *partqualstate = NULL; + Oid OIDindex; /* * Open the relation(s). We have surely already locked the existing @@ -5040,6 +5073,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode) else newrel = NULL; + OIDindex = cluster_index(oldrel); + index = OidIsValid(OIDindex) ? index_open(OIDindex, lockmode) : NULL; + /* * Prepare a BulkInsertState and options for table_tuple_insert. Because * we're building a new heap, we can skip WAL-logging and fsync it to disk @@ -5129,7 +5165,8 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode) ExprContext *econtext; TupleTableSlot *oldslot; TupleTableSlot *newslot; - TableScanDesc scan; + TableScanDesc tblscan; + IndexScanDesc indscan; MemoryContext oldCxt; List *dropped_attrs = NIL; ListCell *lc; @@ -5205,7 +5242,13 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode) * checking all the constraints. */ snapshot = RegisterSnapshot(GetLatestSnapshot()); - scan = table_beginscan(oldrel, snapshot, 0, NULL); + if (index) { + indscan = index_beginscan(oldrel, index, snapshot, 0, 0); // SnapshotAny? + tblscan = NULL; + } else { + tblscan = table_beginscan(oldrel, snapshot, 0, NULL); + indscan = NULL; + } /* * Switch to per-tuple memory context and reset it for each tuple @@ -5213,10 +5256,19 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode) */ oldCxt = MemoryContextSwitchTo(GetPerTupleMemoryContext(estate)); - while (table_scan_getnextslot(scan, ForwardScanDirection, oldslot)) + for (;;) { TupleTableSlot *insertslot; + if (tblscan) { + if (!table_scan_getnextslot(tblscan, ForwardScanDirection, oldslot)) + break; + } else { + /* indscan */ + if (!index_getnext_slot(indscan, ForwardScanDirection, oldslot)) + break; + } + if (tab->rewrite > 0) { /* Extract data from old tuple */ @@ -5362,7 +5414,10 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode) } MemoryContextSwitchTo(oldCxt); - table_endscan(scan); + if (tblscan) + table_endscan(tblscan); + if (indscan) + index_endscan(indscan); UnregisterSnapshot(snapshot); ExecDropSingleTupleTableSlot(oldslot); @@ -5373,6 +5428,8 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode) FreeExecutorState(estate); table_close(oldrel, NoLock); + if (index) + table_close(index, NoLock); if (newrel) { FreeBulkInsertState(bistate); -- 2.7.4
>From 59864324c953137555af058413135e80f947f1ce Mon Sep 17 00:00:00 2001 From: Justin Pryzby <pryz...@telsasoft.com> Date: Mon, 3 Feb 2020 20:03:28 -0600 Subject: [PATCH v1 3/3] Use tuplesort rather than index scan.. This is largely stolen from heapam_relation_copy_for_cluster, which can either do an indexscan (if it's cheaper) or tablescan+sort. This resolves the hack in the previous commit by using tuplesort, since indexes have already been dropped. --- src/backend/commands/tablecmds.c | 135 ++++++++++++++++++++------------------- 1 file changed, 70 insertions(+), 65 deletions(-) diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 44f94d2..5698cd7 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -98,6 +98,7 @@ #include "utils/snapmgr.h" #include "utils/syscache.h" #include "utils/timestamp.h" +#include "utils/tuplesort.h" #include "utils/typcache.h" /* @@ -176,6 +177,7 @@ typedef struct AlteredTableInfo List *changedConstraintDefs; /* string definitions of same */ List *changedIndexOids; /* OIDs of indexes to rebuild */ List *changedIndexDefs; /* string definitions of same */ + Tuplesortstate *tuplesort; /* sorted for rewrite */ } AlteredTableInfo; /* Struct describing one new constraint to check in Phase 3 scan */ @@ -364,7 +366,6 @@ static void ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode, AlterTableUtilityContext *context); static void ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode); -static Oid cluster_index(Relation rel); static AlteredTableInfo *ATGetQueueEntry(List **wqueue, Relation rel); static void ATSimplePermissions(Relation rel, int allowed_targets); static void ATWrongRelkindError(Relation rel, int allowed_targets); @@ -4295,7 +4296,7 @@ ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode, * multiple columns of a table are altered). */ if (pass == AT_PASS_ALTER_TYPE) - ; // ATPostAlterTypeCleanup(wqueue, tab, lockmode); + ATPostAlterTypeCleanup(wqueue, tab, lockmode); relation_close(rel, NoLock); } @@ -5006,36 +5007,6 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode, } } -/* Return the OID of the index on which the table was previously (SET) clustered */ -/* stolen from alter.c */ -static Oid cluster_index(Relation rel) -{ - ListCell *index; - - /* We need to find the index that has indisclustered set. */ - foreach(index, RelationGetIndexList(rel)) - { - HeapTuple idxtuple; - Form_pg_index indexForm; - Oid indexOid; - - indexOid = lfirst_oid(index); - idxtuple = SearchSysCache1(INDEXRELID, - ObjectIdGetDatum(indexOid)); - if (!HeapTupleIsValid(idxtuple)) - elog(ERROR, "cache lookup failed for index %u", indexOid); - indexForm = (Form_pg_index) GETSTRUCT(idxtuple); - if (indexForm->indisclustered) - { - ReleaseSysCache(idxtuple); - return indexOid; - } - ReleaseSysCache(idxtuple); - } - - return InvalidOid; -} - /* * ATRewriteTable: scan or rewrite one table * @@ -5046,7 +5017,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode) { Relation oldrel; Relation newrel; - Relation index; TupleDesc oldTupDesc; TupleDesc newTupDesc; bool needscan = false; @@ -5058,7 +5028,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode) BulkInsertState bistate; int ti_options; ExprState *partqualstate = NULL; - Oid OIDindex; /* * Open the relation(s). We have surely already locked the existing @@ -5073,9 +5042,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode) else newrel = NULL; - OIDindex = cluster_index(oldrel); - index = OidIsValid(OIDindex) ? index_open(OIDindex, lockmode) : NULL; - /* * Prepare a BulkInsertState and options for table_tuple_insert. Because * we're building a new heap, we can skip WAL-logging and fsync it to disk @@ -5165,8 +5131,7 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode) ExprContext *econtext; TupleTableSlot *oldslot; TupleTableSlot *newslot; - TableScanDesc tblscan; - IndexScanDesc indscan; + TableScanDesc scan; MemoryContext oldCxt; List *dropped_attrs = NIL; ListCell *lc; @@ -5242,13 +5207,7 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode) * checking all the constraints. */ snapshot = RegisterSnapshot(GetLatestSnapshot()); - if (index) { - indscan = index_beginscan(oldrel, index, snapshot, 0, 0); // SnapshotAny? - tblscan = NULL; - } else { - tblscan = table_beginscan(oldrel, snapshot, 0, NULL); - indscan = NULL; - } + scan = table_beginscan(oldrel, snapshot, 0, NULL); /* * Switch to per-tuple memory context and reset it for each tuple @@ -5256,19 +5215,10 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode) */ oldCxt = MemoryContextSwitchTo(GetPerTupleMemoryContext(estate)); - for (;;) + while (table_scan_getnextslot(scan, ForwardScanDirection, oldslot)) { TupleTableSlot *insertslot; - if (tblscan) { - if (!table_scan_getnextslot(tblscan, ForwardScanDirection, oldslot)) - break; - } else { - /* indscan */ - if (!index_getnext_slot(indscan, ForwardScanDirection, oldslot)) - break; - } - if (tab->rewrite > 0) { /* Extract data from old tuple */ @@ -5403,10 +5353,15 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode) RelationGetRelationName(oldrel)))); } - /* Write the tuple out to the new relation */ - if (newrel) - table_tuple_insert(newrel, insertslot, mycid, + /* Write the tuple out to the new relation, or to tuplesort */ + if (newrel) { + if (tab->tuplesort) { + HeapTuple tuple = ExecFetchSlotHeapTuple(insertslot, false, NULL); + tuplesort_putheaptuple(tab->tuplesort, tuple); + } else + table_tuple_insert(newrel, insertslot, mycid, ti_options, bistate); + } ResetExprContext(econtext); @@ -5414,10 +5369,47 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode) } MemoryContextSwitchTo(oldCxt); - if (tblscan) - table_endscan(tblscan); - if (indscan) - index_endscan(indscan); + table_endscan(scan); + + if (newrel && tab->tuplesort) + { + TupleDesc oldTupDesc = RelationGetDescr(oldrel); + TupleDesc newTupDesc = RelationGetDescr(newrel); + + int natts; + Datum *values; + bool *isnull; + HeapTuple tuple; + + /* Sort and then write out from the tuplesort to the table */ + tuplesort_performsort(tab->tuplesort); + + /* Preallocate values/isnull arrays */ + natts = newTupDesc->natts; + values = (Datum *) palloc(natts * sizeof(Datum)); + isnull = (bool *) palloc(natts * sizeof(bool)); + + while ((tuple = tuplesort_getheaptuple(tab->tuplesort, true)) != NULL) + { + CHECK_FOR_INTERRUPTS(); + + /* stolen from reform_and_rewrite_tuple */ + heap_deform_tuple(tuple, oldTupDesc, values, isnull); + /* Be sure to null out any dropped columns */ + for (int i = 0; i < newTupDesc->natts; i++) + if (TupleDescAttr(newTupDesc, i)->attisdropped) + isnull[i] = true; + + ExecClearTuple(newslot); + memcpy(newslot->tts_values, values, sizeof(Datum)*natts); + memcpy(newslot->tts_isnull, isnull, sizeof(bool)*natts); + ExecStoreVirtualTuple(newslot); + table_tuple_insert(newrel, newslot, mycid, ti_options, bistate); + + } + tuplesort_end(tab->tuplesort); + } + UnregisterSnapshot(snapshot); ExecDropSingleTupleTableSlot(oldslot); @@ -5428,8 +5420,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode) FreeExecutorState(estate); table_close(oldrel, NoLock); - if (index) - table_close(index, NoLock); if (newrel) { FreeBulkInsertState(bistate); @@ -11785,12 +11775,27 @@ ATPostAlterTypeCleanup(List **wqueue, AlteredTableInfo *tab, LOCKMODE lockmode) { Oid oldId = lfirst_oid(oid_item); Oid relid; + Relation ind; relid = IndexGetRelation(oldId, false); ATPostAlterTypeParse(oldId, relid, InvalidOid, (char *) lfirst(def_item), wqueue, lockmode, tab->rewrite); + + ind = index_open(oldId, ShareUpdateExclusiveLock); + if (ind->rd_index->indisclustered) { + Relation rel = table_open(relid, ShareUpdateExclusiveLock); + TupleDesc oldDesc = RelationGetDescr(rel); + table_close(rel, NoLock); + /* TODO: parallelize ? */ + tab->tuplesort = tuplesort_begin_cluster(oldDesc, ind, maintenance_work_mem, NULL, false); + index_close(ind, NoLock); + } else { + tab->tuplesort = NULL; + index_close(ind, NoLock); + } + ObjectAddressSet(obj, RelationRelationId, oldId); add_exact_object_address(&obj, objects); } -- 2.7.4