The current way to expand inherited tables, including partitioned tables,
is to use either find_all_inheritors() or find_inheritance_children()
depending on the context.  They return child table OIDs in the (ascending)
order of those OIDs, which means the callers that need to lock the child
tables can do so without worrying about the possibility of deadlock in
some concurrent execution of that piece of code.  That's good.

For partitioned tables, there is a possibility of returning child table
(partition) OIDs in the partition bound order, which in addition to
preventing the possibility of deadlocks during concurrent locking, seems
potentially useful for other caller-specific optimizations.  For example,
tuple-routing code can use that ordering to implement a binary-search-based
partition lookup.  For one more example, refer to the "UPDATE
partition key" thread where it's becoming clear that it would be nice if
the planner had put the partitions in bound order in the ModifyTable that
it creates for UPDATE of partitioned tables [1].
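
To illustrate the tuple-routing point, here is a minimal standalone sketch
(not code from the patches) of why bound order matters.  It assumes a
simplified model of range partitioning in which each partition is described
only by an exclusive integer upper bound; find_partition_for_key() and its
parameters are hypothetical names, not PostgreSQL functions.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical, simplified model (not PostgreSQL code): partition i accepts
 * keys in [upper_bounds[i-1], upper_bounds[i]), and partition 0 accepts
 * everything below upper_bounds[0].  The array must be sorted ascending,
 * i.e. the partitions must be listed in partition bound order.
 *
 * Returns the index of the partition that accepts key, or -1 if key is not
 * below any upper bound.
 */
int
find_partition_for_key(const int *upper_bounds, int nparts, int key)
{
    int lo = 0;
    int hi = nparts;            /* search the half-open range [lo, hi) */

    assert(nparts >= 0);
    assert(upper_bounds != NULL || nparts == 0);

    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;

        if (key < upper_bounds[mid])
            hi = mid;           /* key belongs to mid or an earlier partition */
        else
            lo = mid + 1;       /* key is at or past mid's upper bound */
    }

    return (lo < nparts) ? lo : -1;
}
```

The lookup takes O(log n) comparisons, but only because the array is sorted
by bound; a list in OID order would force a linear scan.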

So attached are two WIP patches:

0001 implements two interface functions:

  List *get_all_partition_oids(Oid relid, int lockmode)
  List *get_partition_oids(Oid relid, int lockmode)

that resemble find_all_inheritors() and find_inheritance_children(),
respectively, but expect that users call them only for partitioned tables.
Needless to say, OIDs are returned in a canonical order determined by
that of the partition bounds and the way the partition tree structure is
traversed (top-down, breadth-first, left-to-right).  The patch replaces all the
calls of the old interface functions with the respective new ones for
partitioned table parents.  That means expand_inherited_rtentry (among
others) now calls get_all_partition_oids() if the RTE is for a partitioned
table and find_all_inheritors() otherwise.
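
The resulting order can be sketched with a toy model.  This is only my
reading of the traversal described above, written as standalone C: ToyRel
and toy_all_partition_oids() are hypothetical stand-ins, not PostgreSQL
structures or functions.  The root comes first, then the other partitioned
tables in breadth-first order, then the leaf partitions in the order the
traversal encountered them.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PARTS 8
#define MAX_RELS  32

/* Hypothetical stand-in for a relation; not a PostgreSQL structure. */
typedef struct ToyRel
{
    unsigned int oid;
    int          nparts;                    /* 0 means leaf in this toy model */
    const struct ToyRel *parts[MAX_PARTS];  /* children in partition bound order */
} ToyRel;

/*
 * Fill out[] with the root's OID, then the OIDs of partitioned descendants
 * in breadth-first order, then the OIDs of all leaf partitions.  Returns the
 * number of OIDs written.
 */
size_t
toy_all_partition_oids(const ToyRel *root, unsigned int *out, size_t out_cap)
{
    const ToyRel *queue[MAX_RELS];  /* partitioned tables, in visit order */
    unsigned int  leaves[MAX_RELS];
    size_t        head = 0, tail = 0, nleaves = 0, n = 0, i;

    assert(root != NULL && out != NULL);
    queue[tail++] = root;

    while (head < tail)
    {
        const ToyRel *rel = queue[head++];
        int           j;

        /* Visit children left to right, i.e. in bound order. */
        for (j = 0; j < rel->nparts; j++)
        {
            const ToyRel *child = rel->parts[j];

            if (child->nparts > 0)
            {
                assert(tail < MAX_RELS);
                queue[tail++] = child;
            }
            else
            {
                assert(nleaves < MAX_RELS);
                leaves[nleaves++] = child->oid;
            }
        }
    }

    assert(tail + nleaves <= out_cap);
    for (i = 0; i < tail; i++)
        out[n++] = queue[i]->oid;
    for (i = 0; i < nleaves; i++)
        out[n++] = leaves[i];

    return n;
}
```

Note that the result depends only on the tree shape and the bound order of
each level, never on the numeric values of the OIDs.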

In its implementation, get_all_partition_oids() calls
RelationGetPartitionDispatchInfo(), which is useful to generate the result
list in the desired partition bound order.  But the current interface and
implementation of RelationGetPartitionDispatchInfo() needs some rework,
because it's too closely coupled with the executor's tuple routing code.

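Applying just 0001 will satisfy the requirements stated in [1], but the
code is not pretty as is: get_all_partition_oids() has to release resources
that RelationGetPartitionDispatchInfo() acquires only for tuple routing.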

So, 0002 is a patch to refactor RelationGetPartitionDispatchInfo() and
relevant data structures.  For example, PartitionDispatchData has now been
simplified to contain only the partition key, partition descriptor and
indexes array, whereas previously it also stored the relation descriptor,
partition key execution state, tuple table slot, and tuple conversion map,
which are required for tuple routing.  RelationGetPartitionDispatchInfo()
no longer generates those things, but returns just enough information so
that a caller can generate and manage those things by itself.  This
simplification makes it less cumbersome to call
RelationGetPartitionDispatchInfo() in other places.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CA%2BTgmoajC0J50%3D2FqnZLvB10roY%2B68HgFWhso%3DV_StkC6PWujQ%40mail.gmail.com
From 9674053fd1e57a480d8a42585cb10421e2c76a70 Mon Sep 17 00:00:00 2001
From: amit <amitlangot...@gmail.com>
Date: Wed, 2 Aug 2017 17:14:59 +0900
Subject: [PATCH 1/3] Add get_all_partition_oids and get_partition_oids

Their respective counterparts find_all_inheritors() and
find_inheritance_children() read the pg_inherits catalog directly and
build the result list in ascending order of the child OIDs.

get_all_partition_oids() and get_partition_oids() form their result
by reading the partition OIDs from the PartitionDesc contained in the
relcache.  Hence, the order of OIDs in the resulting list is based
on that of the partition bounds.  In the case of get_all_partition_oids(),
which traverses the whole tree, the order is also determined by the
fact that the tree is traversed in a breadth-first manner.
---
 contrib/sepgsql/dml.c                  |   4 +-
 src/backend/catalog/partition.c        |  84 ++++++++++++++++++++++
 src/backend/commands/analyze.c         |   8 ++-
 src/backend/commands/lockcmds.c        |   6 +-
 src/backend/commands/publicationcmds.c |   9 ++-
 src/backend/commands/tablecmds.c       | 124 +++++++++++++++++++++++++--------
 src/backend/commands/vacuum.c          |   7 +-
 src/backend/optimizer/prep/prepunion.c |   6 +-
 src/include/catalog/partition.h        |   3 +
 9 files changed, 213 insertions(+), 38 deletions(-)

diff --git a/contrib/sepgsql/dml.c b/contrib/sepgsql/dml.c
index b643720e36..62d6610c43 100644
--- a/contrib/sepgsql/dml.c
+++ b/contrib/sepgsql/dml.c
@@ -332,8 +332,10 @@ sepgsql_dml_privileges(List *rangeTabls, bool abort_on_violation)
                 */
                if (!rte->inh)
                        tableIds = list_make1_oid(rte->relid);
-               else
+               else if (rte->relkind != RELKIND_PARTITIONED_TABLE)
                        tableIds = find_all_inheritors(rte->relid, NoLock, NULL);
+               else
+                       tableIds = get_all_partition_oids(rte->relid, NoLock);
 
                foreach(li, tableIds)
                {
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index dcc7f8af27..614b2f79f2 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -1150,6 +1150,90 @@ RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
        return pd;
 }
 
+/*
+ * get_all_partition_oids - returns the list of all partitions in the
+ *                          partition tree rooted at relid
+ *
+ * OIDs in the list are ordered canonically using the partition bound order,
+ * while the tree is being traversed in a breadth-first manner.  Actually,
+ * this is just a wrapper on top of RelationGetPartitionDispatchInfo.
+ *
+ * All the partitions are locked with lockmode.  We assume that relid has been
+ * locked by the caller with lockmode.
+ */
+List *get_all_partition_oids(Oid relid, int lockmode)
+{
+       List   *result = NIL;
+       List   *leaf_part_oids = NIL;
+       ListCell *lc;
+       Relation        rel;
+       int                     num_parted;
+       PartitionDispatch *pds;
+       int                     i;
+
+       /* caller should've locked already */
+       rel = heap_open(relid, NoLock);
+       pds = RelationGetPartitionDispatchInfo(rel, lockmode, &num_parted,
+                                              &leaf_part_oids);
+
+       /*
+        * First append the OIDs of all the partitions that are partitioned
+        * tables themselves, starting with relid itself.
+        */
+       result = lappend_oid(result, relid);
+       for (i = 1; i < num_parted; i++)
+       {
+               result = lappend_oid(result, RelationGetRelid(pds[i]->reldesc));
+
+               /*
+                * To avoid leaking resources, release them.  This works around
+                * the existing interface of RelationGetPartitionDispatchInfo(),
+                * which acquires these resources and leaves it to the caller to
+                * release them.
+                */
+               heap_close(pds[i]->reldesc, NoLock);
+               if (pds[i]->tupmap)
+                       pfree(pds[i]->tupmap);
+               ExecDropSingleTupleTableSlot(pds[i]->tupslot);
+       }
+       heap_close(rel, NoLock);
+
+       /* Leaf partitions were not locked; do so now. */
+       if (lockmode != NoLock)
+       {
+               foreach(lc, leaf_part_oids)
+                       LockRelationOid(lfirst_oid(lc), lockmode);
+       }
+
+       /* Return after concatenating the leaf partition OIDs. */
+       return list_concat(result, leaf_part_oids);
+}
+
+/*
+ * get_partition_oids - returns a list of OIDs of partitions of relid
+ *
+ * OIDs come from the relcache's PartitionDesc, so they are in partition bound
+ * order.  Note that lockmode is currently ignored; partitions are not locked.
+ */
+List *get_partition_oids(Oid relid, int lockmode)
+{
+       List   *result = NIL;
+       Relation rel;
+       int             i;
+       PartitionDesc partdesc;
+
+       /* caller should've locked already */
+       rel = heap_open(relid, NoLock);
+       partdesc = RelationGetPartitionDesc(rel);
+       for (i = 0; i < partdesc->nparts; i++)
+       {
+               result = lappend_oid(result, partdesc->oids[i]);
+       }
+       heap_close(rel, NoLock);
+
+       return result;
+}
+
 /* Module-local functions */
 
 /*
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 2b638271b3..f3c1893b12 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1281,8 +1281,12 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
         * Find all members of inheritance set.  We only need AccessShareLock on
         * the children.
         */
-       tableOIDs =
-               find_all_inheritors(RelationGetRelid(onerel), AccessShareLock, NULL);
+       if (onerel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+               tableOIDs = find_all_inheritors(RelationGetRelid(onerel),
+                                               AccessShareLock, NULL);
+       else
+               tableOIDs = get_all_partition_oids(RelationGetRelid(onerel),
+                                                  AccessShareLock);
 
        /*
         * Check that there's at least one descendant, else fail.  This could
diff --git a/src/backend/commands/lockcmds.c b/src/backend/commands/lockcmds.c
index 9fe9e022b0..29a9ef82b2 100644
--- a/src/backend/commands/lockcmds.c
+++ b/src/backend/commands/lockcmds.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include "catalog/namespace.h"
+#include "catalog/partition.h"
 #include "catalog/pg_inherits_fn.h"
 #include "commands/lockcmds.h"
 #include "miscadmin.h"
@@ -112,7 +113,10 @@ LockTableRecurse(Oid reloid, LOCKMODE lockmode, bool nowait)
        List       *children;
        ListCell   *lc;
 
-       children = find_inheritance_children(reloid, NoLock);
+       if (get_rel_relkind(reloid) != RELKIND_PARTITIONED_TABLE)
+               children = find_inheritance_children(reloid, NoLock);
+       else
+               children = get_partition_oids(reloid, NoLock);
 
        foreach(lc, children)
        {
diff --git a/src/backend/commands/publicationcmds.c b/src/backend/commands/publicationcmds.c
index 610cb499d2..ab7423577f 100644
--- a/src/backend/commands/publicationcmds.c
+++ b/src/backend/commands/publicationcmds.c
@@ -515,8 +515,13 @@ OpenTableList(List *tables)
                        ListCell   *child;
                        List       *children;
 
-                       children = find_all_inheritors(myrelid, ShareUpdateExclusiveLock,
-                                                      NULL);
+                       if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+                               children = find_all_inheritors(myrelid,
+                                                              ShareUpdateExclusiveLock,
+                                                              NULL);
+                       else
+                               children = get_all_partition_oids(myrelid,
+                                                                 ShareUpdateExclusiveLock);
 
                        foreach(child, children)
                        {
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 7859ef13ac..332697c095 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1231,7 +1231,12 @@ ExecuteTruncate(TruncateStmt *stmt)
                        ListCell   *child;
                        List       *children;
 
-                       children = find_all_inheritors(myrelid, AccessExclusiveLock, NULL);
+                       if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+                               children = find_all_inheritors(myrelid,
+                                                              AccessExclusiveLock, NULL);
+                       else
+                               children = get_all_partition_oids(myrelid,
+                                                                 AccessExclusiveLock);
 
                        foreach(child, children)
                        {
@@ -2555,8 +2560,11 @@ renameatt_internal(Oid myrelid,
                 * calls to renameatt() can determine whether there are any parents
                 * outside the inheritance hierarchy being processed.
                 */
-               child_oids = find_all_inheritors(myrelid, AccessExclusiveLock,
-                                                &child_numparents);
+               if (targetrelation->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+                       child_oids = find_all_inheritors(myrelid, AccessExclusiveLock,
+                                                        &child_numparents);
+               else
+                       child_oids = get_all_partition_oids(myrelid, AccessExclusiveLock);
 
                /*
                 * find_all_inheritors does the recursive search of the inheritance
@@ -2581,6 +2589,10 @@ renameatt_internal(Oid myrelid,
                 * tables; else the rename would put them out of step.
                 *
                 * expected_parents will only be 0 if we are not already recursing.
+                *
+                * We don't bother to distinguish between find_inheritance_children's
+                * and get_partition_oids's results unlike in most other places,
+                * because we're not concerned about the order of OIDs here.
                 */
                if (expected_parents == 0 &&
                        find_inheritance_children(myrelid, NoLock) != NIL)
@@ -2765,8 +2777,13 @@ rename_constraint_internal(Oid myrelid,
                        ListCell   *lo,
                                           *li;
 
-                       child_oids = find_all_inheritors(myrelid, AccessExclusiveLock,
-                                                        &child_numparents);
+                       Assert(targetrelation != NULL);
+                       if (targetrelation->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+                               child_oids = find_all_inheritors(myrelid, AccessExclusiveLock,
+                                                                &child_numparents);
+                       else
+                               child_oids = get_all_partition_oids(myrelid,
+                                                                   AccessExclusiveLock);
 
                        forboth(lo, child_oids, li, child_numparents)
                        {
@@ -2781,6 +2798,12 @@ rename_constraint_internal(Oid myrelid,
                }
                else
                {
+                       /*
+                        * We don't bother to distinguish between
+                        * find_inheritance_children's and get_partition_oids's results
+                        * unlike in most other places, because we're not concerned about
+                        * the order of OIDs here.
+                        */
                        if (expected_parents == 0 &&
                                find_inheritance_children(myrelid, NoLock) != NIL)
                                ereport(ERROR,
@@ -4790,7 +4813,10 @@ ATSimpleRecursion(List **wqueue, Relation rel,
                ListCell   *child;
                List       *children;
 
-               children = find_all_inheritors(relid, lockmode, NULL);
+               if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+                       children = find_all_inheritors(relid, lockmode, NULL);
+               else
+                       children = get_all_partition_oids(relid, lockmode);
 
                /*
                 * find_all_inheritors does the recursive search of the inheritance
@@ -5183,6 +5209,10 @@ ATExecAddColumn(List **wqueue, AlteredTableInfo *tab, Relation rel,
        /*
         * Cannot add identity column if table has children, because identity does
         * not inherit.  (Adding column and identity separately will work.)
+        *
+        * We don't bother to distinguish between find_inheritance_children's and
+        * get_partition_oids's results unlike in most other places, because we're
+        * not concerned about the order of OIDs here.
         */
        if (colDef->identity &&
                recurse &&
@@ -5390,9 +5420,12 @@ ATExecAddColumn(List **wqueue, AlteredTableInfo *tab, Relation rel,
        /*
         * Propagate to children as appropriate.  Unlike most other ALTER
         * routines, we have to do this one level of recursion at a time; we can't
-        * use find_all_inheritors to do it in one pass.
+        * use find_all_inheritors or get_all_partition_oids to do it in one pass.
         */
-       children = find_inheritance_children(RelationGetRelid(rel), lockmode);
+       if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+               children = find_inheritance_children(RelationGetRelid(rel), lockmode);
+       else
+               children = get_partition_oids(RelationGetRelid(rel), lockmode);
 
        /*
         * If we are told not to recurse, there had better not be any child
@@ -6509,9 +6542,12 @@ ATExecDropColumn(List **wqueue, Relation rel, const char *colName,
        /*
         * Propagate to children as appropriate.  Unlike most other ALTER
         * routines, we have to do this one level of recursion at a time; we can't
-        * use find_all_inheritors to do it in one pass.
+        * use find_all_inheritors or get_all_partition_oids to do it in one pass.
         */
-       children = find_inheritance_children(RelationGetRelid(rel), lockmode);
+       if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+               children = find_inheritance_children(RelationGetRelid(rel), lockmode);
+       else
+               children = get_partition_oids(RelationGetRelid(rel), lockmode);
 
        if (children)
        {
@@ -6943,9 +6979,12 @@ ATAddCheckConstraint(List **wqueue, AlteredTableInfo *tab, Relation rel,
        /*
         * Propagate to children as appropriate.  Unlike most other ALTER
         * routines, we have to do this one level of recursion at a time; we can't
-        * use find_all_inheritors to do it in one pass.
+        * use find_all_inheritors or get_all_partition_oids to do it in one pass.
         */
-       children = find_inheritance_children(RelationGetRelid(rel), lockmode);
+       if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+               children = find_inheritance_children(RelationGetRelid(rel), lockmode);
+       else
+               children = get_partition_oids(RelationGetRelid(rel), lockmode);
 
        /*
         * Check if ONLY was specified with ALTER TABLE.  If so, allow the
@@ -7663,8 +7702,14 @@ ATExecValidateConstraint(Relation rel, char *constrName, bool recurse,
                         * shouldn't try to look for it in the children.
                         */
                        if (!recursing && !con->connoinherit)
-                               children = find_all_inheritors(RelationGetRelid(rel),
-                                                              lockmode, NULL);
+                       {
+                               if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+                                       children = find_all_inheritors(RelationGetRelid(rel),
+                                                                      lockmode, NULL);
+                               else
+                                       children = get_all_partition_oids(RelationGetRelid(rel),
+                                                                         lockmode);
+                       }
 
                        /*
                         * For CHECK constraints, we must ensure that we only mark the
@@ -8544,12 +8589,14 @@ ATExecDropConstraint(Relation rel, const char *constrName,
        /*
         * Propagate to children as appropriate.  Unlike most other ALTER
         * routines, we have to do this one level of recursion at a time; we can't
-        * use find_all_inheritors to do it in one pass.
+        * use find_all_inheritors or get_all_partition_oids to do it in one pass.
         */
-       if (!is_no_inherit_constraint)
+       if (is_no_inherit_constraint)
+               children = NIL;
+       else if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
                children = find_inheritance_children(RelationGetRelid(rel), lockmode);
        else
-               children = NIL;
+               children = get_partition_oids(RelationGetRelid(rel), lockmode);
 
        /*
         * For a partitioned table, if partitions exist and we are told not to
@@ -8836,7 +8883,10 @@ ATPrepAlterColumnType(List **wqueue,
                ListCell   *child;
                List       *children;
 
-               children = find_all_inheritors(relid, lockmode, NULL);
+               if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+                       children = find_all_inheritors(relid, lockmode, NULL);
+               else
+                       children = get_all_partition_oids(relid, lockmode);
 
                /*
                 * find_all_inheritors does the recursive search of the inheritance
@@ -8886,6 +8936,11 @@ ATPrepAlterColumnType(List **wqueue,
                        relation_close(childrel, NoLock);
                }
        }
+       /*
+        * We don't bother to distinguish between find_inheritance_children's and
+        * get_partition_oids's results unlike in most other places, because we're
+        * not concerned about the order of OIDs here.
+        */
        else if (!recursing &&
                         find_inheritance_children(RelationGetRelid(rel), NoLock) != NIL)
                ereport(ERROR,
@@ -10996,6 +11051,7 @@ ATExecAddInherit(Relation child_rel, RangeVar *parent, LOCKMODE lockmode)
         *
         * We use weakest lock we can on child's children, namely AccessShareLock.
         */
+       Assert(child_rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE);
        children = find_all_inheritors(RelationGetRelid(child_rel),
                                       AccessShareLock, NULL);
 
@@ -13421,7 +13477,7 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd)
 {
        Relation        attachrel,
                                catalog;
-       List       *attachrel_children;
+       List       *attachrel_children = NIL;
        TupleConstr *attachrel_constr;
        List       *partConstraint,
                           *existConstraint;
@@ -13501,15 +13557,20 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd)
         * table, nor its partitions.  But we cannot risk a deadlock by taking a
         * weaker lock now and the stronger one only when needed.
         */
-       attachrel_children = find_all_inheritors(RelationGetRelid(attachrel),
-                                                AccessExclusiveLock, NULL);
-       if (list_member_oid(attachrel_children, RelationGetRelid(rel)))
-               ereport(ERROR,
-                               (errcode(ERRCODE_DUPLICATE_TABLE),
-                                errmsg("circular inheritance not allowed"),
-                                errdetail("\"%s\" is already a child of \"%s\".",
-                                                  RelationGetRelationName(rel),
-                                                  RelationGetRelationName(attachrel))));
+       if (attachrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+       {
+               Oid             attachrel_oid = RelationGetRelid(attachrel);
+
+               attachrel_children = get_all_partition_oids(attachrel_oid,
+                                                           AccessExclusiveLock);
+               if (list_member_oid(attachrel_children, RelationGetRelid(rel)))
+                       ereport(ERROR,
+                                       (errcode(ERRCODE_DUPLICATE_TABLE),
+                                        errmsg("circular inheritance not allowed"),
+                                        errdetail("\"%s\" is already a child of \"%s\".",
+                                                          RelationGetRelationName(rel),
+                                                          RelationGetRelationName(attachrel))));
+       }
 
        /* Temp parent cannot have a partition that is itself not a temp */
        if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP &&
@@ -13707,6 +13768,13 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd)
                /* Constraints proved insufficient, so we need to scan the table. */
                ListCell   *lc;
 
+               /*
+                * If attachrel isn't partitioned, attachrel_children would be empty.
+                * We still need to process attachrel itself, so initialize.
+                */
+               if (attachrel_children == NIL)
+                       attachrel_children = list_make1_oid(RelationGetRelid(attachrel));
+
                foreach(lc, attachrel_children)
                {
                        AlteredTableInfo *tab;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index faa181207a..7bea95d9c5 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -31,6 +31,7 @@
 #include "access/transam.h"
 #include "access/xact.h"
 #include "catalog/namespace.h"
+#include "catalog/partition.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_inherits_fn.h"
 #include "catalog/pg_namespace.h"
@@ -423,14 +424,14 @@ get_rel_oids(Oid relid, const RangeVar *vacrel)
 
                /*
                 * Make relation list entries for this guy and its partitions, if any.
-                * Note that the list returned by find_all_inheritors() include the
-                * passed-in OID at its head.  Also note that we did not request a
+                * Note that the list returned by get_all_partition_oids() includes
+                * the passed-in OID at its head.  Also note that we did not request a
                 * lock to be taken to match what would be done otherwise.
                 */
                oldcontext = MemoryContextSwitchTo(vac_context);
                if (include_parts)
                        oid_list = list_concat(oid_list,
-                                                                  find_all_inheritors(relid, NoLock, NULL));
+                                                                  get_all_partition_oids(relid, NoLock));
                else
                        oid_list = lappend_oid(oid_list, relid);
                MemoryContextSwitchTo(oldcontext);
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index cf46b74782..398bdd598a 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -33,6 +33,7 @@
 #include "access/heapam.h"
 #include "access/htup_details.h"
 #include "access/sysattr.h"
+#include "catalog/partition.h"
 #include "catalog/pg_inherits_fn.h"
 #include "catalog/pg_type.h"
 #include "miscadmin.h"
@@ -1418,7 +1419,10 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
                lockmode = AccessShareLock;
 
        /* Scan for all members of inheritance set, acquire needed locks */
-       inhOIDs = find_all_inheritors(parentOID, lockmode, NULL);
+       if (rte->relkind != RELKIND_PARTITIONED_TABLE)
+               inhOIDs = find_all_inheritors(parentOID, lockmode, NULL);
+       else
+               inhOIDs = get_all_partition_oids(parentOID, lockmode);
 
        /*
         * Check that there's at least one descendant, else treat as no-child
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 434ded37d7..e6314fbaa2 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -85,6 +85,9 @@ extern List *map_partition_varattnos(List *expr, int target_varno,
 extern List *RelationGetPartitionQual(Relation rel);
 extern Expr *get_partition_qual_relid(Oid relid);
 
+extern List *get_all_partition_oids(Oid relid, int lockmode);
+extern List *get_partition_oids(Oid relid, int lockmode);
+
 /* For tuple routing */
 extern PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
                                                                 int lockmode, int *num_parted,
-- 
2.11.0

From f869287c25397a39a50acadd34e5e1677e3ce858 Mon Sep 17 00:00:00 2001
From: amit <amitlangot...@gmail.com>
Date: Mon, 24 Jul 2017 18:59:57 +0900
Subject: [PATCH 2/3] Decouple RelationGetPartitionDispatchInfo() from executor

Currently it and the structures it generates, viz. PartitionDispatch
objects, are too closely coupled with the executor's tuple-routing code.
In particular, it is undesirable that it makes the caller responsible for
releasing some resources, such as relcache references and tuple table
slots.  That makes it harder to use in places other than where it's
currently being used.

After this refactoring, ExecSetupPartitionTupleRouting() now needs to
do some of the work that was previously done in
RelationGetPartitionDispatchInfo() and get_all_partition_oids() no
longer needs to do some things that it used to.
---
 src/backend/catalog/partition.c        | 367 +++++++++++++++++----------------
 src/backend/commands/copy.c            |  35 ++--
 src/backend/executor/execMain.c        | 158 ++++++++++++--
 src/backend/executor/nodeModifyTable.c |  29 ++-
 src/include/catalog/partition.h        |  52 ++---
 src/include/executor/executor.h        |   4 +-
 src/include/nodes/execnodes.h          |  53 ++++-
 7 files changed, 426 insertions(+), 272 deletions(-)

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 614b2f79f2..2a6ad70719 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -105,6 +105,24 @@ typedef struct PartitionRangeBound
        bool            lower;                  /* this is the lower (vs upper) bound */
 } PartitionRangeBound;
 
+/*-----------------------
+ * PartitionDispatchData - information of partitions of one partitioned table
+ *                                                in a partition tree
+ *
+ *     partkey         Partition key of the table
+ *     partdesc        Partition descriptor of the table
+ *     indexes         Array with partdesc->nparts members (for details on what the
+ *                             individual value represents, see the comments in
+ *                             RelationGetPartitionDispatchInfo())
+ *-----------------------
+ */
+typedef struct PartitionDispatchData
+{
+       PartitionKey    partkey;        /* Points into the table's relcache entry */
+       PartitionDesc   partdesc;       /* Ditto */
+       int                        *indexes;
+} PartitionDispatchData;
+
 static int32 qsort_partition_list_value_cmp(const void *a, const void *b,
                                                           void *arg);
 static int32 qsort_partition_rbound_cmp(const void *a, const void *b,
@@ -976,178 +994,167 @@ get_partition_qual_relid(Oid relid)
 }
 
 /*
- * Append OIDs of rel's partitions to the list 'partoids' and for each OID,
- * append pointer rel to the list 'parents'.
- */
-#define APPEND_REL_PARTITION_OIDS(rel, partoids, parents) \
-       do\
-       {\
-               int             i;\
-               for (i = 0; i < (rel)->rd_partdesc->nparts; i++)\
-               {\
-                       (partoids) = lappend_oid((partoids), 
(rel)->rd_partdesc->oids[i]);\
-                       (parents) = lappend((parents), (rel));\
-               }\
-       } while(0)
-
-/*
  * RelationGetPartitionDispatchInfo
- *             Returns information necessary to route tuples down a partition tree
+ *             Returns necessary information for each partition in the partition
+ *             tree rooted at rel
  *
- * All the partitions will be locked with lockmode, unless it is NoLock.
- * A list of the OIDs of all the leaf partitions of rel is returned in
- * *leaf_part_oids.
+ * Information returned includes the following: *ptinfos contains a list of
+ * PartitionedTableInfo objects, one for each partitioned table (with at least
+ * one member, that is, one for the root partitioned table); *leaf_part_oids
+ * contains a list of the OIDs of all the leaf partitions.
+ *
+ * Note that we lock only those partitions that are partitioned tables, because
+ * we need to look at their relcache entries to get their PartitionKey and
+ * PartitionDesc. It's the caller's responsibility to lock the leaf partitions
+ * that will actually be accessed during a given query.
  */
-PartitionDispatch *
+void
 RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
-                                                                int 
*num_parted, List **leaf_part_oids)
+                                                                List 
**ptinfos, List **leaf_part_oids)
 {
-       PartitionDispatchData **pd;
-       List       *all_parts = NIL,
-                          *all_parents = NIL,
-                          *parted_rels,
-                          *parted_rel_parents;
+       List       *all_parts,
+                          *all_parents;
        ListCell   *lc1,
                           *lc2;
        int                     i,
-                               k,
                                offset;
 
        /*
-        * Lock partitions and make a list of the partitioned ones to prepare
-        * their PartitionDispatch objects below.
+        * We rely on the relcache to traverse the partition tree, building
+        * both the leaf partition OIDs list and the PartitionedTableInfo list.
+        * Starting with the root partitioned table for which we already have the
+        * relcache entry, we look at its partition descriptor to get the
+        * partition OIDs.  For partitions that are themselves partitioned tables,
+        * we get their relcache entries after locking them with lockmode and
+        * queue their partitions to be looked at later.  Leaf partitions are
+        * added to the result list without locking.  For each partitioned table,
+        * we build a PartitionedTableInfo object and add it to the other result
+        * list.
         *
-        * Cannot use find_all_inheritors() here, because then the order of OIDs
-        * in parted_rels list would be unknown, which does not help, because we
-        * assign indexes within individual PartitionDispatch in an order that is
-        * predetermined (determined by the order of OIDs in individual partition
-        * descriptors).
+        * Since RelationBuildPartitionDesc() puts partitions in a canonical
+        * order determined by comparing partition bounds, we can rely on
+        * concurrent backends seeing the partitions in the same order, which
+        * ensures that there are no deadlocks when locking the partitions.
         */
-       *num_parted = 1;
-       parted_rels = list_make1(rel);
-       /* Root partitioned table has no parent, so NULL for parent */
-       parted_rel_parents = list_make1(NULL);
-       APPEND_REL_PARTITION_OIDS(rel, all_parts, all_parents);
+       i = offset = 0;
+       *ptinfos = *leaf_part_oids = NIL;
+
+       /* Start with the root table. */
+       all_parts = list_make1_oid(RelationGetRelid(rel));
+       all_parents = list_make1_oid(InvalidOid);
        forboth(lc1, all_parts, lc2, all_parents)
        {
-               Relation        partrel = heap_open(lfirst_oid(lc1), lockmode);
-               Relation        parent = lfirst(lc2);
-               PartitionDesc partdesc = RelationGetPartitionDesc(partrel);
+               Oid             partrelid = lfirst_oid(lc1);
+               Oid             parentrelid = lfirst_oid(lc2);
 
-               /*
-                * If this partition is a partitioned table, add its children 
to the
-                * end of the list, so that they are processed as well.
-                */
-               if (partdesc)
+               if (get_rel_relkind(partrelid) == RELKIND_PARTITIONED_TABLE)
                {
-                       (*num_parted)++;
-                       parted_rels = lappend(parted_rels, partrel);
-                       parted_rel_parents = lappend(parted_rel_parents, 
parent);
-                       APPEND_REL_PARTITION_OIDS(partrel, all_parts, 
all_parents);
-               }
-               else
-                       heap_close(partrel, NoLock);
+                       int             j,
+                                       k;
+                       Relation                partrel;
+                       PartitionKey    partkey;
+                       PartitionDesc   partdesc;
+                       PartitionedTableInfo   *ptinfo;
+                       PartitionDispatch               pd;
+
+                       if (partrelid != RelationGetRelid(rel))
+                               partrel = heap_open(partrelid, lockmode);
+                       else
+                               partrel = rel;
 
-               /*
-                * We keep the partitioned ones open until we're done using the
-                * information being collected here (for example, see
-                * ExecEndModifyTable).
-                */
-       }
+                       partkey = RelationGetPartitionKey(partrel);
+                       partdesc = RelationGetPartitionDesc(partrel);
+
+                       ptinfo = (PartitionedTableInfo *)
+                                                                       
palloc0(sizeof(PartitionedTableInfo));
+                       ptinfo->relid = partrelid;
+                       ptinfo->parentid = parentrelid;
+
+                       ptinfo->pd = pd = (PartitionDispatchData *)
+                                                                       
palloc0(sizeof(PartitionDispatchData));
+                       pd->partkey = partkey;
 
-       /*
-        * We want to create two arrays - one for leaf partitions and another 
for
-        * partitioned tables (including the root table and internal 
partitions).
-        * While we only create the latter here, leaf partition array of 
suitable
-        * objects (such as, ResultRelInfo) is created by the caller using the
-        * list of OIDs we return.  Indexes into these arrays get assigned in a
-        * breadth-first manner, whereby partitions of any given level are 
placed
-        * consecutively in the respective arrays.
-        */
-       pd = (PartitionDispatchData **) palloc(*num_parted *
-                                                                               
   sizeof(PartitionDispatchData *));
-       *leaf_part_oids = NIL;
-       i = k = offset = 0;
-       forboth(lc1, parted_rels, lc2, parted_rel_parents)
-       {
-               Relation        partrel = lfirst(lc1);
-               Relation        parent = lfirst(lc2);
-               PartitionKey partkey = RelationGetPartitionKey(partrel);
-               TupleDesc       tupdesc = RelationGetDescr(partrel);
-               PartitionDesc partdesc = RelationGetPartitionDesc(partrel);
-               int                     j,
-                                       m;
-
-               pd[i] = (PartitionDispatch) 
palloc(sizeof(PartitionDispatchData));
-               pd[i]->reldesc = partrel;
-               pd[i]->key = partkey;
-               pd[i]->keystate = NIL;
-               pd[i]->partdesc = partdesc;
-               if (parent != NULL)
-               {
                        /*
-                        * For every partitioned table other than root, we must 
store a
-                        * tuple table slot initialized with its tuple 
descriptor and a
-                        * tuple conversion map to convert a tuple from its 
parent's
-                        * rowtype to its own. That is to make sure that we are 
looking at
-                        * the correct row using the correct tuple descriptor 
when
-                        * computing its partition key for tuple routing.
-                        */
-                       pd[i]->tupslot = MakeSingleTupleTableSlot(tupdesc);
-                       pd[i]->tupmap = 
convert_tuples_by_name(RelationGetDescr(parent),
-                                                                               
                   tupdesc,
-                                                                               
                   gettext_noop("could not convert row type"));
-               }
-               else
-               {
-                       /* Not required for the root partitioned table */
-                       pd[i]->tupslot = NULL;
-                       pd[i]->tupmap = NULL;
-               }
-               pd[i]->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
+                        * Pin the partition descriptor before stashing the 
references to the
+                        * information contained in it into this 
PartitionDispatch object.
+                        *
+                       PinPartitionDesc(partdesc);*/
+                       pd->partdesc = partdesc;
 
-               /*
-                * Indexes corresponding to the internal partitions are 
multiplied by
-                * -1 to distinguish them from those of leaf partitions.  
Encountering
-                * an index >= 0 means we found a leaf partition, which is 
immediately
-                * returned as the partition we are looking for.  A negative 
index
-                * means we found a partitioned table, whose PartitionDispatch 
object
-                * is located at the above index multiplied back by -1.  Using 
the
-                * PartitionDispatch object, search is continued further down 
the
-                * partition tree.
-                */
-               m = 0;
-               for (j = 0; j < partdesc->nparts; j++)
-               {
-                       Oid                     partrelid = partdesc->oids[j];
+                       /*
+                        * The values contained in the following array correspond to
+                        * indexes of this table's partitions in the global sequence of
+                        * all the partitions contained in the partition tree rooted at
+                        * rel, traversed in a breadth-first manner.  The values should be
+                        * such that we will be able to distinguish the leaf partitions
+                        * from the non-leaf partitions, because they are returned to
+                        * the caller in separate structures from where they will be
+                        * accessed.  The way that's done is described below:
+                        *
+                        * Leaf partition OIDs are put into the global leaf_part_oids list,
+                        * and for each one, the value stored is its ordinal position in
+                        * the list minus 1.
+                        *
+                        * PartitionedTableInfo objects corresponding to partitions that
+                        * are partitioned tables are put into the global ptinfos list,
+                        * and for each one, the value stored is its ordinal position in
+                        * the list minus 1, multiplied by -1.  The root table occupies
+                        * the first position but is nobody's partition, so the value
+                        * stored for a partitioned table is always negative.
+                        *
+                        * So while looking at the values in the indexes array, if one
+                        * gets zero or a positive value, then it's a leaf partition;
+                        * otherwise, it's a partitioned table.
+                        */
+                       pd->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
 
-                       if (get_rel_relkind(partrelid) != 
RELKIND_PARTITIONED_TABLE)
-                       {
-                               *leaf_part_oids = lappend_oid(*leaf_part_oids, 
partrelid);
-                               pd[i]->indexes[j] = k++;
-                       }
-                       else
+                       k = 0;
+                       for (j = 0; j < partdesc->nparts; j++)
                        {
+                               Oid                     partrelid = partdesc->oids[j];
+
                                /*
-                                * offset denotes the number of partitioned tables of upper
-                                * levels including those of the current level.  Any partition
-                                * of this table must belong to the next level and hence will
-                                * be placed after the last partitioned table of this level.
+                                * Queue this partition so that it will be processed later
+                                * by the outer loop.
                                 */
-                               pd[i]->indexes[j] = -(1 + offset + m);
-                               m++;
+                               all_parts = lappend_oid(all_parts, partrelid);
+                               all_parents = lappend_oid(all_parents,
+                                                                                 RelationGetRelid(partrel));
+
+                               if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
+                               {
+                                       *leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
+                                       pd->indexes[j] = i++;
+                               }
+                               else
+                               {
+                                       /*
+                                        * offset denotes the number of partitioned tables, other
+                                        * than the root, that were queued before we started on
+                                        * this table.  k counts the number of partitions of this
+                                        * table found so far to be partitioned tables.
+                                        */
+                                       pd->indexes[j] = -(1 + offset + k);
+                                       k++;
+                               }
                        }
-               }
-               i++;
 
-               /*
-                * This counts the number of partitioned tables at upper levels
-                * including those of the current level.
-                */
-               offset += m;
+                       offset += k;
+
+                       /*
+                        * Release the relation descriptor.  The lock that we have on the
+                        * table will keep the PartitionDesc that is pointing into
+                        * RelationData intact, a pointer to which we hope to keep
+                        * through this transaction's commit.
+                        * (XXX - how true is that?)
+                        */
+                       if (partrel != rel)
+                               heap_close(partrel, NoLock);
+
+                       *ptinfos = lappend(*ptinfos, ptinfo);
+               }
        }
 
-       return pd;
+       Assert(i == list_length(*leaf_part_oids));
+       Assert((offset + 1) == list_length(*ptinfos));
 }
 
 /*
@@ -1164,45 +1171,38 @@ RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
 List *get_all_partition_oids(Oid relid, int lockmode)
 {
        List   *result = NIL;
+       List   *ptinfo = NIL;
        List   *leaf_part_oids = NIL;
        ListCell *lc;
-       Relation        rel;
-       int                     num_parted;
-       PartitionDispatch *pds;
-       int                     i;
+       Relation rel;
 
        /* caller should've locked already */
        rel = heap_open(relid, NoLock);
-       pds = RelationGetPartitionDispatchInfo(rel, lockmode, &num_parted,
-                                                                               
   &leaf_part_oids);
+
+       /*
+        * Get the information about the partition tree.  All the partitioned
+        * tables in the tree are locked, but not the leaf partitions, which
+        * we lock below.
+        */
+       RelationGetPartitionDispatchInfo(rel, lockmode, &ptinfo, 
&leaf_part_oids);
+       heap_close(rel, NoLock);
 
        /*
         * First append the OIDs of all the partitions that are partitioned
-        * tables themselves, starting with relid itself.
+        * tables themselves.
         */
-       result = lappend_oid(result, relid);
-       for (i = 1; i < num_parted; i++)
+       foreach (lc, ptinfo)
        {
-               result = lappend_oid(result, RelationGetRelid(pds[i]->reldesc));
+               PartitionedTableInfo *ptinfo = lfirst(lc);
 
-               /*
-                * To avoid leaking resources, release them.  This is to work 
around
-                * the existing interface of RelationGetPartitionDispatchInfo() 
that
-                * acquires these resources at the mercy of the caller to 
release
-                * them.
-                */
-               heap_close(pds[i]->reldesc, NoLock);
-               if (pds[i]->tupmap)
-                       pfree(pds[i]->tupmap);
-               ExecDropSingleTupleTableSlot(pds[i]->tupslot);
+               result = lappend_oid(result, ptinfo->relid);
        }
-       heap_close(rel, NoLock);
 
-       /* Leaf partitions were not locked; do so now. */
-       foreach(lc, leaf_part_oids)
+       /* Lock leaf partitions, if requested. */
+       foreach (lc, leaf_part_oids)
        {
                if (lockmode != NoLock)
-               LockRelationOid(lfirst_oid(lc), lockmode);
+                       LockRelationOid(lfirst_oid(lc), lockmode);
        }
 
        /* Return after concatening the leaf partition OIDs. */
@@ -1948,7 +1948,7 @@ generate_partition_qual(Relation rel)
  * ----------------
  */
 void
-FormPartitionKeyDatum(PartitionDispatch pd,
+FormPartitionKeyDatum(PartitionKeyInfo *keyinfo,
                                          TupleTableSlot *slot,
                                          EState *estate,
                                          Datum *values,
@@ -1957,20 +1957,21 @@ FormPartitionKeyDatum(PartitionDispatch pd,
        ListCell   *partexpr_item;
        int                     i;
 
-       if (pd->key->partexprs != NIL && pd->keystate == NIL)
+       if (keyinfo->key->partexprs != NIL && keyinfo->keystate == NIL)
        {
                /* Check caller has set up context correctly */
                Assert(estate != NULL &&
                           GetPerTupleExprContext(estate)->ecxt_scantuple == 
slot);
 
                /* First time through, set up expression evaluation state */
-               pd->keystate = ExecPrepareExprList(pd->key->partexprs, estate);
+               keyinfo->keystate = ExecPrepareExprList(keyinfo->key->partexprs,
+                                                                               
                estate);
        }
 
-       partexpr_item = list_head(pd->keystate);
-       for (i = 0; i < pd->key->partnatts; i++)
+       partexpr_item = list_head(keyinfo->keystate);
+       for (i = 0; i < keyinfo->key->partnatts; i++)
        {
-               AttrNumber      keycol = pd->key->partattrs[i];
+               AttrNumber      keycol = keyinfo->key->partattrs[i];
                Datum           datum;
                bool            isNull;
 
@@ -2007,13 +2008,13 @@ FormPartitionKeyDatum(PartitionDispatch pd,
  * the latter case.
  */
 int
-get_partition_for_tuple(PartitionDispatch *pd,
+get_partition_for_tuple(PartitionTupleRoutingInfo **ptrinfos,
                                                TupleTableSlot *slot,
                                                EState *estate,
-                                               PartitionDispatchData 
**failed_at,
+                                               PartitionTupleRoutingInfo 
**failed_at,
                                                TupleTableSlot **failed_slot)
 {
-       PartitionDispatch parent;
+       PartitionTupleRoutingInfo *parent;
        Datum           values[PARTITION_MAX_KEYS];
        bool            isnull[PARTITION_MAX_KEYS];
        int                     cur_offset,
@@ -2024,11 +2025,11 @@ get_partition_for_tuple(PartitionDispatch *pd,
        TupleTableSlot *ecxt_scantuple_old = ecxt->ecxt_scantuple;
 
        /* start with the root partitioned table */
-       parent = pd[0];
+       parent = ptrinfos[0];
        while (true)
        {
-               PartitionKey key = parent->key;
-               PartitionDesc partdesc = parent->partdesc;
+               PartitionKey  key = parent->pd->partkey;
+               PartitionDesc partdesc = parent->pd->partdesc;
                TupleTableSlot *myslot = parent->tupslot;
                TupleConversionMap *map = parent->tupmap;
 
@@ -2060,7 +2061,7 @@ get_partition_for_tuple(PartitionDispatch *pd,
                 * So update ecxt_scantuple accordingly.
                 */
                ecxt->ecxt_scantuple = slot;
-               FormPartitionKeyDatum(parent, slot, estate, values, isnull);
+               FormPartitionKeyDatum(parent->keyinfo, slot, estate, values, 
isnull);
 
                if (key->strategy == PARTITION_STRATEGY_RANGE)
                {
@@ -2131,13 +2132,13 @@ get_partition_for_tuple(PartitionDispatch *pd,
                        *failed_slot = slot;
                        break;
                }
-               else if (parent->indexes[cur_index] >= 0)
+               else if (parent->pd->indexes[cur_index] >= 0)
                {
-                       result = parent->indexes[cur_index];
+                       result = parent->pd->indexes[cur_index];
                        break;
                }
                else
-                       parent = pd[-parent->indexes[cur_index]];
+                       parent = ptrinfos[-parent->pd->indexes[cur_index]];
        }
 
 error_exit:
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 53e296559a..b3de3de454 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -165,8 +165,8 @@ typedef struct CopyStateData
        bool            volatile_defexprs;      /* is any of defexprs volatile? 
*/
        List       *range_table;
 
-       PartitionDispatch *partition_dispatch_info;
-       int                     num_dispatch;   /* Number of entries in the 
above array */
+       PartitionTupleRoutingInfo **ptrinfos;
+       int                     num_parted;             /* Number of entries in 
the above array */
        int                     num_partitions; /* Number of members in the 
following arrays */
        ResultRelInfo *partitions;      /* Per partition result relation */
        TupleConversionMap **partition_tupconv_maps;
@@ -1425,7 +1425,7 @@ BeginCopy(ParseState *pstate,
                /* Initialize state for CopyFrom tuple routing. */
                if (is_from && rel->rd_rel->relkind == 
RELKIND_PARTITIONED_TABLE)
                {
-                       PartitionDispatch *partition_dispatch_info;
+                       PartitionTupleRoutingInfo **ptrinfos;
                        ResultRelInfo *partitions;
                        TupleConversionMap **partition_tupconv_maps;
                        TupleTableSlot *partition_tuple_slot;
@@ -1434,13 +1434,13 @@ BeginCopy(ParseState *pstate,
 
                        ExecSetupPartitionTupleRouting(rel,
                                                                                
   1,
-                                                                               
   &partition_dispatch_info,
+                                                                               
   &ptrinfos,
                                                                                
   &partitions,
                                                                                
   &partition_tupconv_maps,
                                                                                
   &partition_tuple_slot,
                                                                                
   &num_parted, &num_partitions);
-                       cstate->partition_dispatch_info = 
partition_dispatch_info;
-                       cstate->num_dispatch = num_parted;
+                       cstate->ptrinfos = ptrinfos;
+                       cstate->num_parted = num_parted;
                        cstate->partitions = partitions;
                        cstate->num_partitions = num_partitions;
                        cstate->partition_tupconv_maps = partition_tupconv_maps;
@@ -2495,7 +2495,7 @@ CopyFrom(CopyState cstate)
        if ((resultRelInfo->ri_TrigDesc != NULL &&
                 (resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
                  resultRelInfo->ri_TrigDesc->trig_insert_instead_row)) ||
-               cstate->partition_dispatch_info != NULL ||
+               cstate->ptrinfos != NULL ||
                cstate->volatile_defexprs)
        {
                useHeapMultiInsert = false;
@@ -2573,7 +2573,7 @@ CopyFrom(CopyState cstate)
                ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 
                /* Determine the partition to heap_insert the tuple into */
-               if (cstate->partition_dispatch_info)
+               if (cstate->ptrinfos)
                {
                        int                     leaf_part_index;
                        TupleConversionMap *map;
@@ -2587,7 +2587,7 @@ CopyFrom(CopyState cstate)
                         * partition, respectively.
                         */
                        leaf_part_index = ExecFindPartition(resultRelInfo,
-                                                                               
                cstate->partition_dispatch_info,
+                                                                               
                cstate->ptrinfos,
                                                                                
                slot,
                                                                                
                estate);
                        Assert(leaf_part_index >= 0 &&
@@ -2818,23 +2818,20 @@ CopyFrom(CopyState cstate)
 
        ExecCloseIndices(resultRelInfo);
 
-       /* Close all the partitioned tables, leaf partitions, and their indices 
*/
-       if (cstate->partition_dispatch_info)
+       /* Close all the leaf partitions and their indices */
+       if (cstate->ptrinfos)
        {
                int                     i;
 
                /*
-                * Remember cstate->partition_dispatch_info[0] corresponds to the root
-                * partitioned table, which we must not try to close, because it is
-                * the main target table of COPY that will be closed eventually by
-                * DoCopy().  Also, tupslot is NULL for the root partitioned table.
+                * cstate->ptrinfos[0] corresponds to the root partitioned table, for
+                * which we didn't create tupslot.
                 */
-               for (i = 1; i < cstate->num_dispatch; i++)
+               for (i = 1; i < cstate->num_parted; i++)
                {
-                       PartitionDispatch pd = 
cstate->partition_dispatch_info[i];
+                       PartitionTupleRoutingInfo *ptrinfo = 
cstate->ptrinfos[i];
 
-                       heap_close(pd->reldesc, NoLock);
-                       ExecDropSingleTupleTableSlot(pd->tupslot);
+                       ExecDropSingleTupleTableSlot(ptrinfo->tupslot);
                }
                for (i = 0; i < cstate->num_partitions; i++)
                {
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index c11aa4fe21..0379e489d9 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -3214,8 +3214,8 @@ EvalPlanQualEnd(EPQState *epqstate)
  * tuple routing for partitioned tables
  *
  * Output arguments:
- * 'pd' receives an array of PartitionDispatch objects with one entry for
- *             every partitioned table in the partition tree
+ * 'ptrinfos' receives an array of PartitionTupleRoutingInfo objects with one
+ *             entry for each partitioned table in the partition tree
  * 'partitions' receives an array of ResultRelInfo objects with one entry for
  *             every leaf partition in the partition tree
  * 'tup_conv_maps' receives an array of TupleConversionMap objects with one
@@ -3237,7 +3237,7 @@ EvalPlanQualEnd(EPQState *epqstate)
 void
 ExecSetupPartitionTupleRouting(Relation rel,
                                                           Index resultRTindex,
-                                                          PartitionDispatch 
**pd,
+                                                          
PartitionTupleRoutingInfo ***ptrinfos,
                                                           ResultRelInfo 
**partitions,
                                                           TupleConversionMap 
***tup_conv_maps,
                                                           TupleTableSlot 
**partition_tuple_slot,
@@ -3245,13 +3245,135 @@ ExecSetupPartitionTupleRouting(Relation rel,
 {
        TupleDesc       tupDesc = RelationGetDescr(rel);
        List       *leaf_parts;
+       List       *ptinfos = NIL;
        ListCell   *cell;
        int                     i;
        ResultRelInfo *leaf_part_rri;
+       Relation        parent;
 
-       /* Get the tuple-routing information and lock partitions */
-       *pd = RelationGetPartitionDispatchInfo(rel, RowExclusiveLock, 
num_parted,
-                                                                               
   &leaf_parts);
+       /*
+        * Get information about the partition tree.  All the partitioned
+        * tables in the tree are locked, but not the leaf partitions.  We
+        * lock them while building their ResultRelInfos below.
+        */
+       RelationGetPartitionDispatchInfo(rel, RowExclusiveLock,
+                                                                        
&ptinfos, &leaf_parts);
+
+       /*
+        * The ptinfos list contains PartitionedTableInfo objects for all the
+        * partitioned tables in the partition tree.  Using the information
+        * therein, we construct an array of PartitionTupleRoutingInfo objects
+        * to be used during tuple-routing.
+        */
+       *num_parted = list_length(ptinfos);
+       *ptrinfos = (PartitionTupleRoutingInfo **) palloc0(*num_parted *
+                                                                               
sizeof(PartitionTupleRoutingInfo *));
+       /*
+        * Free the ptinfos List structure itself as we go through (open-coded
+        * list_free).
+        */
+       i = 0;
+       cell = list_head(ptinfos);
+       parent = NULL;
+       while (cell)
+       {
+               ListCell   *tmp = cell;
+               PartitionedTableInfo *ptinfo = lfirst(tmp),
+                                                        *next_ptinfo;
+               Relation                partrel;
+               PartitionTupleRoutingInfo *ptrinfo;
+
+               if (lnext(tmp))
+                       next_ptinfo = lfirst(lnext(tmp));
+
+               /* As mentioned above, the partitioned tables have been locked. */
+               if (ptinfo->relid != RelationGetRelid(rel))
+                       partrel = heap_open(ptinfo->relid, NoLock);
+               else
+                       partrel = rel;
+
+               ptrinfo = (PartitionTupleRoutingInfo *)
+                                                       
palloc0(sizeof(PartitionTupleRoutingInfo));
+               ptrinfo->relid = ptinfo->relid;
+
+               /* Stash a reference to this PartitionDispatch. */
+               ptrinfo->pd = ptinfo->pd;
+
+               /* State for extracting partition key from tuples will go here. */
+               ptrinfo->keyinfo = (PartitionKeyInfo *)
+                                                               
palloc0(sizeof(PartitionKeyInfo));
+               ptrinfo->keyinfo->key = RelationGetPartitionKey(partrel);
+               ptrinfo->keyinfo->keystate = NIL;
+
+               /*
+                * For every partitioned table other than root, we must store a tuple
+                * table slot initialized with its tuple descriptor and a tuple
+                * conversion map to convert a tuple from its parent's rowtype to its
+                * own.  That is to make sure that we are looking at the correct row
+                * using the correct tuple descriptor when computing its partition key
+                * for tuple routing.
+                */
+               if (ptinfo->parentid != InvalidOid)
+               {
+                       TupleDesc       tupdesc = RelationGetDescr(partrel);
+
+                       /* Open the parent relation descriptor if not already done. */
+                       if (ptinfo->parentid == RelationGetRelid(rel))
+                       {
+                               parent = rel;
+                       }
+                       else if (parent == NULL)
+                       {
+                               /* Locked by RelationGetPartitionDispatchInfo(). */
+                               parent = heap_open(ptinfo->parentid, NoLock);
+                       }
+
+                       ptrinfo->tupslot = MakeSingleTupleTableSlot(tupdesc);
+                       ptrinfo->tupmap = 
convert_tuples_by_name(RelationGetDescr(parent),
+                                                                               
                         tupdesc,
+                                                                 
gettext_noop("could not convert row type"));
+
+                       /*
+                        * Close the parent descriptor, if the next partitioned table in
+                        * the list is not a sibling, because it will have a different
+                        * parent if so.
+                        */
+                       if (parent && parent != rel && (lnext(tmp) == NULL ||
+                               next_ptinfo->parentid != ptinfo->parentid))
+                       {
+                               heap_close(parent, NoLock);
+                               parent = NULL;
+                       }
+
+                       /*
+                        * Release the relation descriptor.  The lock that we have on the
+                        * table will keep the PartitionDesc that is pointing into
+                        * RelationData intact, a pointer to which we hope to keep
+                        * through this transaction's commit.
+                        * (XXX - how true is that?)
+                        */
+                       if (partrel != rel)
+                               heap_close(partrel, NoLock);
+               }
+               else
+               {
+                       /* Not required for the root partitioned table */
+                       ptrinfo->tupslot = NULL;
+                       ptrinfo->tupmap = NULL;
+               }
+
+               (*ptrinfos)[i++] = ptrinfo;
+
+               /* Free the ListCell. */
+               cell = lnext(cell);
+               pfree(tmp);
+       }
+
+       /* Free the List itself. */
+       if (ptinfos)
+               pfree(ptinfos);
+
+       /* For leaf partitions, we build ResultRelInfos and TupleConversionMaps. */
        *num_partitions = list_length(leaf_parts);
        *partitions = (ResultRelInfo *) palloc(*num_partitions *
                                                                                
   sizeof(ResultRelInfo));
@@ -3274,11 +3396,11 @@ ExecSetupPartitionTupleRouting(Relation rel,
                TupleDesc       part_tupdesc;
 
                /*
-                * We locked all the partitions above including the leaf 
partitions.
-                * Note that each of the relations in *partitions are eventually
-                * closed by the caller.
+                * RelationGetPartitionDispatchInfo didn't lock the leaf partitions,
+                * so lock here.  Note that each of the relations in *partitions is
+                * eventually closed (when the plan is shut down, for instance).
                 */
-               partrel = heap_open(lfirst_oid(cell), NoLock);
+               partrel = heap_open(lfirst_oid(cell), RowExclusiveLock);
                part_tupdesc = RelationGetDescr(partrel);
 
                /*
@@ -3291,7 +3413,7 @@ ExecSetupPartitionTupleRouting(Relation rel,
                 * partition from the parent's type to the partition's.
                 */
                (*tup_conv_maps)[i] = convert_tuples_by_name(tupDesc, 
part_tupdesc,
-                                                                               
                         gettext_noop("could not convert row type"));
+                                                                
gettext_noop("could not convert row type"));
 
                InitResultRelInfo(leaf_part_rri,
                                                  partrel,
@@ -3325,11 +3447,13 @@ ExecSetupPartitionTupleRouting(Relation rel,
  * by get_partition_for_tuple() unchanged.
  */
 int
-ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
-                                 TupleTableSlot *slot, EState *estate)
+ExecFindPartition(ResultRelInfo *resultRelInfo,
+                                 PartitionTupleRoutingInfo **ptrinfos,
+                                 TupleTableSlot *slot,
+                                 EState *estate)
 {
        int                     result;
-       PartitionDispatchData *failed_at;
+       PartitionTupleRoutingInfo *failed_at;
        TupleTableSlot *failed_slot;
 
        /*
@@ -3339,7 +3463,7 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, 
PartitionDispatch *pd,
        if (resultRelInfo->ri_PartitionCheck)
                ExecPartitionCheck(resultRelInfo, slot, estate);
 
-       result = get_partition_for_tuple(pd, slot, estate,
+       result = get_partition_for_tuple(ptrinfos, slot, estate,
                                                                         
&failed_at, &failed_slot);
        if (result < 0)
        {
@@ -3349,9 +3473,9 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, 
PartitionDispatch *pd,
                char       *val_desc;
                ExprContext *ecxt = GetPerTupleExprContext(estate);
 
-               failed_rel = failed_at->reldesc;
+               failed_rel = heap_open(failed_at->relid, NoLock);
                ecxt->ecxt_scantuple = failed_slot;
-               FormPartitionKeyDatum(failed_at, failed_slot, estate,
+               FormPartitionKeyDatum(failed_at->keyinfo, failed_slot, estate,
                                                          key_values, 
key_isnull);
                val_desc = ExecBuildSlotPartitionKeyDescription(failed_rel,
                                                                                
                                key_values,
diff --git a/src/backend/executor/nodeModifyTable.c 
b/src/backend/executor/nodeModifyTable.c
index 30add8e3c7..00cbee4fb6 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -277,7 +277,7 @@ ExecInsert(ModifyTableState *mtstate,
        resultRelInfo = estate->es_result_relation_info;
 
        /* Determine the partition to heap_insert the tuple into */
-       if (mtstate->mt_partition_dispatch_info)
+       if (mtstate->mt_ptrinfos)
        {
                int                     leaf_part_index;
                TupleConversionMap *map;
@@ -291,7 +291,7 @@ ExecInsert(ModifyTableState *mtstate,
                 * respectively.
                 */
                leaf_part_index = ExecFindPartition(resultRelInfo,
-                                                                               
        mtstate->mt_partition_dispatch_info,
+                                                                               
        mtstate->mt_ptrinfos,
                                                                                
        slot,
                                                                                
        estate);
                Assert(leaf_part_index >= 0 &&
@@ -1486,7 +1486,7 @@ ExecSetupTransitionCaptureState(ModifyTableState 
*mtstate, EState *estate)
                int             numResultRelInfos;
 
                /* Find the set of partitions so that we can find their 
TupleDescs. */
-               if (mtstate->mt_partition_dispatch_info != NULL)
+               if (mtstate->mt_ptrinfos != NULL)
                {
                        /*
                         * For INSERT via partitioned table, so we need 
TupleDescs based
@@ -1910,7 +1910,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, 
int eflags)
        if (operation == CMD_INSERT &&
                rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
        {
-               PartitionDispatch *partition_dispatch_info;
+               PartitionTupleRoutingInfo **ptrinfos;
                ResultRelInfo *partitions;
                TupleConversionMap **partition_tupconv_maps;
                TupleTableSlot *partition_tuple_slot;
@@ -1919,13 +1919,13 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, 
int eflags)
 
                ExecSetupPartitionTupleRouting(rel,
                                                                           
node->nominalRelation,
-                                                                          
&partition_dispatch_info,
+                                                                          
&ptrinfos,
                                                                           
&partitions,
                                                                           
&partition_tupconv_maps,
                                                                           
&partition_tuple_slot,
                                                                           
&num_parted, &num_partitions);
-               mtstate->mt_partition_dispatch_info = partition_dispatch_info;
-               mtstate->mt_num_dispatch = num_parted;
+               mtstate->mt_ptrinfos = ptrinfos;
+               mtstate->mt_num_parted = num_parted;
                mtstate->mt_partitions = partitions;
                mtstate->mt_num_partitions = num_partitions;
                mtstate->mt_partition_tupconv_maps = partition_tupconv_maps;
@@ -2335,19 +2335,16 @@ ExecEndModifyTable(ModifyTableState *node)
        }
 
        /*
-        * Close all the partitioned tables, leaf partitions, and their indices
+        * Close all the leaf partitions and their indices.
         *
-        * Remember node->mt_partition_dispatch_info[0] corresponds to the root
-        * partitioned table, which we must not try to close, because it is the
-        * main target table of the query that will be closed by ExecEndPlan().
-        * Also, tupslot is NULL for the root partitioned table.
+        * node->mt_ptrinfos[0] corresponds to the root partitioned table, for
+        * which we didn't create tupslot.
         */
-       for (i = 1; i < node->mt_num_dispatch; i++)
+       for (i = 1; i < node->mt_num_parted; i++)
        {
-               PartitionDispatch pd = node->mt_partition_dispatch_info[i];
+               PartitionTupleRoutingInfo *ptrinfo = node->mt_ptrinfos[i];
 
-               heap_close(pd->reldesc, NoLock);
-               ExecDropSingleTupleTableSlot(pd->tupslot);
+               ExecDropSingleTupleTableSlot(ptrinfo->tupslot);
        }
        for (i = 0; i < node->mt_num_partitions; i++)
        {
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index e6314fbaa2..98dcd246b4 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -39,36 +39,23 @@ typedef struct PartitionDescData
 
 typedef struct PartitionDescData *PartitionDesc;
 
-/*-----------------------
- * PartitionDispatch - information about one partitioned table in a partition
- * hierarchy required to route a tuple to one of its partitions
- *
- *     reldesc         Relation descriptor of the table
- *     key                     Partition key information of the table
- *     keystate        Execution state required for expressions in the 
partition key
- *     partdesc        Partition descriptor of the table
- *     tupslot         A standalone TupleTableSlot initialized with this 
table's tuple
- *                             descriptor
- *     tupmap          TupleConversionMap to convert from the parent's rowtype 
to
- *                             this table's rowtype (when extracting the 
partition key of a
- *                             tuple just before routing it through this table)
- *     indexes         Array with partdesc->nparts members (for details on what
- *                             individual members represent, see how they are 
set in
- *                             RelationGetPartitionDispatchInfo())
- *-----------------------
+typedef struct PartitionDispatchData *PartitionDispatch;
+
+/*
+ * Information about one partitioned table in a given partition tree
  */
-typedef struct PartitionDispatchData
+typedef struct PartitionedTableInfo
 {
-       Relation        reldesc;
-       PartitionKey key;
-       List       *keystate;           /* list of ExprState */
-       PartitionDesc partdesc;
-       TupleTableSlot *tupslot;
-       TupleConversionMap *tupmap;
-       int                *indexes;
-} PartitionDispatchData;
+       Oid                             relid;
+       Oid                             parentid;
 
-typedef struct PartitionDispatchData *PartitionDispatch;
+       /*
+        * This contains information about bounds of the partitions of this
+        * table and about where individual partitions are placed in the global
+        * partition tree.
+        */
+       PartitionDispatch pd;
+} PartitionedTableInfo;
 
 extern void RelationBuildPartitionDesc(Relation relation);
 extern bool partition_bounds_equal(PartitionKey key,
@@ -85,21 +72,20 @@ extern List *map_partition_varattnos(List *expr, int 
target_varno,
 extern List *RelationGetPartitionQual(Relation rel);
 extern Expr *get_partition_qual_relid(Oid relid);
 
+extern void RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
+                                                                List 
**ptinfos, List **leaf_part_oids);
 extern List *get_all_partition_oids(Oid relid, int lockmode);
 extern List *get_partition_oids(Oid relid, int lockmode);
 
 /* For tuple routing */
-extern PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
-                                                                int lockmode, 
int *num_parted,
-                                                                List 
**leaf_part_oids);
-extern void FormPartitionKeyDatum(PartitionDispatch pd,
+extern void FormPartitionKeyDatum(PartitionKeyInfo *keyinfo,
                                          TupleTableSlot *slot,
                                          EState *estate,
                                          Datum *values,
                                          bool *isnull);
-extern int get_partition_for_tuple(PartitionDispatch *pd,
+extern int get_partition_for_tuple(PartitionTupleRoutingInfo **ptrinfos,
                                                TupleTableSlot *slot,
                                                EState *estate,
-                                               PartitionDispatchData 
**failed_at,
+                                               PartitionTupleRoutingInfo 
**failed_at,
                                                TupleTableSlot **failed_slot);
 #endif                                                 /* PARTITION_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 60326f9d03..6e1d3a6d2f 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -208,13 +208,13 @@ extern void EvalPlanQualSetTuple(EPQState *epqstate, 
Index rti,
 extern HeapTuple EvalPlanQualGetTuple(EPQState *epqstate, Index rti);
 extern void ExecSetupPartitionTupleRouting(Relation rel,
                                                           Index resultRTindex,
-                                                          PartitionDispatch 
**pd,
+                                                          
PartitionTupleRoutingInfo ***ptrinfos,
                                                           ResultRelInfo 
**partitions,
                                                           TupleConversionMap 
***tup_conv_maps,
                                                           TupleTableSlot 
**partition_tuple_slot,
                                                           int *num_parted, int 
*num_partitions);
 extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
-                                 PartitionDispatch *pd,
+                                 PartitionTupleRoutingInfo **ptrinfos,
                                  TupleTableSlot *slot,
                                  EState *estate);
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 35c28a6143..1514d62f52 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -414,6 +414,55 @@ typedef struct ResultRelInfo
        Relation        ri_PartitionRoot;
 } ResultRelInfo;
 
+/* Forward declarations, to avoid including other headers */
+typedef struct PartitionKeyData *PartitionKey;
+typedef struct PartitionDispatchData *PartitionDispatch;
+
+/*
+ * PartitionKeyInfo - execution state for the partition key of a
+ *                                               partitioned table
+ *
+ * keystate is the execution state required for expressions contained in the
+ * partition key.  It is NIL until initialized by FormPartitionKeyDatum() if
+ * and when it is called; for example, during tuple routing through a given
+ * partitioned table.
+ */
+typedef struct PartitionKeyInfo
+{
+       PartitionKey    key;            /* Points into the table's relcache entry */
+       List               *keystate;
+} PartitionKeyInfo;
+
+/*
+ * PartitionTupleRoutingInfo - information required for tuple-routing
+ *                             through one partitioned table in a partition
+ *                             tree
+ */
+typedef struct PartitionTupleRoutingInfo
+{
+       /* OID of the table */
+       Oid                             relid;
+
+       /* Information about the table's partitions */
+       PartitionDispatch       pd;
+
+       /* See comment above the definition of PartitionKeyInfo */
+       PartitionKeyInfo   *keyinfo;
+
+       /*
+        * A standalone TupleTableSlot initialized with this table's tuple
+        * descriptor
+        */
+       TupleTableSlot *tupslot;
+
+       /*
+        * TupleConversionMap to convert from the parent's rowtype to this table's
+        * rowtype (when extracting the partition key of a tuple just before
+        * routing it through this table)
+        */
+       TupleConversionMap *tupmap;
+} PartitionTupleRoutingInfo;
+
 /* ----------------
  *       EState information
  *
@@ -970,9 +1019,9 @@ typedef struct ModifyTableState
        TupleTableSlot *mt_existing;    /* slot to store existing target tuple 
in */
        List       *mt_excludedtlist;   /* the excluded pseudo relation's tlist 
 */
        TupleTableSlot *mt_conflproj;   /* CONFLICT ... SET ... projection 
target */
-       struct PartitionDispatchData **mt_partition_dispatch_info;
        /* Tuple-routing support info */
-       int                     mt_num_dispatch;        /* Number of entries in 
the above array */
+       struct PartitionTupleRoutingInfo **mt_ptrinfos;
+       int                     mt_num_parted;          /* Number of entries in 
the above array */
        int                     mt_num_partitions;      /* Number of members in 
the following
                                                                         * 
arrays */
        ResultRelInfo *mt_partitions;   /* Per partition result relation */
-- 
2.11.0
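
For readers who want the gist of how the array of per-table routing
entries is meant to be walked, here is a standalone toy sketch; it is not
code from the patch.  It assumes integer range keys, and it assumes the
convention that a non-negative entry in indexes[] names a leaf partition
while a negative entry is the negated position of a sub-partitioned table
in the array (root at position 0).  RoutingNode, find_bound(), route_key()
and sample_tree are all hypothetical names made up for illustration.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for PartitionTupleRoutingInfo. */
typedef struct RoutingNode
{
	int			nparts;		/* number of partitions of this table */
	const int  *upper;		/* exclusive upper bounds, in ascending order */
	const int  *indexes;	/* >= 0: leaf partition index (assumption);
							 * < 0: negated position of a partitioned
							 * table in the nodes array (assumption) */
} RoutingNode;

/* Binary search for the first partition whose upper bound exceeds key. */
static int
find_bound(const RoutingNode *node, int key)
{
	int			lo = 0;
	int			hi = node->nparts;

	while (lo < hi)
	{
		int			mid = lo + (hi - lo) / 2;

		if (key < node->upper[mid])
			hi = mid;
		else
			lo = mid + 1;
	}
	return (lo < node->nparts) ? lo : -1;
}

/*
 * Descend from the root (nodes[0]) until a leaf is reached.  Returns the
 * leaf index, or -1 with *failed_at set to the position of the partitioned
 * table at which no partition accepted the key.
 */
static int
route_key(const RoutingNode *nodes, int key, int *failed_at)
{
	int			cur = 0;

	for (;;)
	{
		int			pos = find_bound(&nodes[cur], key);

		if (pos < 0)
		{
			*failed_at = cur;
			return -1;
		}
		if (nodes[cur].indexes[pos] >= 0)
			return nodes[cur].indexes[pos];
		cur = -nodes[cur].indexes[pos];
	}
}

/* Toy tree: root split at 10/20/30; the middle range is sub-partitioned. */
static const int root_upper[] = {10, 20, 30};
static const int root_indexes[] = {0, -1, 3};
static const int sub_upper[] = {15, 18};
static const int sub_indexes[] = {1, 2};
static const RoutingNode sample_tree[] = {
	{3, root_upper, root_indexes},
	{2, sub_upper, sub_indexes},
};
```

With sample_tree, key 12 descends from the root into the sub-partitioned
table and lands in leaf 1, whereas key 19 finds no partition in that
sub-partitioned table, so route_key() returns -1 and reports position 1
through failed_at, much as failed_at is used for error reporting in
ExecFindPartition().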

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
