Hi Amit, I was just reading through your patches and here are some quick review comments for 0001-Catalog-and-DDL-for-partitioned-tables-17.patch.
Review comments for 0001-Catalog-and-DDL-for-partitioned-tables-17.patch: 1) @@ -1102,9 +1104,10 @@ heap_create_with_catalog(const char *relname, { /* Use binary-upgrade override for pg_class.oid/relfilenode? */ if (IsBinaryUpgrade && - (relkind == RELKIND_RELATION || relkind == RELKIND_SEQUENCE || - relkind == RELKIND_VIEW || relkind == RELKIND_MATVIEW || - relkind == RELKIND_COMPOSITE_TYPE || relkind == RELKIND_FOREIGN_TABLE)) + (relkind == RELKIND_RELATION || relkind == RELKIND_PARTITIONED_TABLE || + relkind == RELKIND_SEQUENCE || relkind == RELKIND_VIEW || + relkind == RELKIND_MATVIEW || relkind == RELKIND_COMPOSITE_TYPE || + relkind == RELKIND_FOREIGN_TABLE)) You should add the RELKIND_PARTITIONED_TABLE at the end of if condition that will make diff minimal. While reading through the patch I noticed that there is inconsistency - someplace its been added at the end and at few places its at the start. I think you can make add it at the end of condition and be consistent with each place. 2) + /* + * We need to transform the raw parsetrees corresponding to partition + * expressions into executable expression trees. Like column defaults + * and CHECK constraints, we could not have done the transformation + * earlier. + */ Additional space before "Like column defaults". 3) - char relkind; + char relkind, + expected_relkind; Newly added variable should be define separately with its type. Something like: char relkind; + char expected_relkind; 4) a) + /* Prevent partitioned tables from becoming inheritance parents */ + if (parent_rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) + ereport(ERROR, + (errcode(ERRCODE_WRONG_OBJECT_TYPE), + errmsg("cannot inherit from partitioned table \"%s\"", + parent->relname))); + need alignment for last line. b) + atttuple = SearchSysCacheAttName(RelationGetRelid(rel), pelem->name); + if (!HeapTupleIsValid(atttuple)) + ereport(ERROR, + (errcode(ERRCODE_UNDEFINED_COLUMN), + errmsg("column \"%s\" named in partition key does not exist", + pelem->name))); + attform = (Form_pg_attribute) GETSTRUCT(atttuple); + + if (attform->attnum <= 0) + ereport(ERROR, + (errcode(ERRCODE_UNDEFINED_COLUMN), + errmsg("cannot use system column \"%s\" in partition key", + pelem->name))); need alignment for last line of ereport c) + /* Disallow ROW triggers on partitioned tables */ + if (stmt->row && rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) + ereport(ERROR, + (errcode(ERRCODE_WRONG_OBJECT_TYPE), + errmsg("\"%s\" is a partitioned table", + RelationGetRelationName(rel)), + errdetail("Partitioned tables cannot have ROW triggers."))); need alignment 5) @@ -2512,6 +2579,7 @@ transformAlterTableStmt(Oid relid, AlterTableStmt *stmt, cxt.blist = NIL; cxt.alist = NIL; cxt.pkey = NULL; + cxt.ispartitioned = rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE; I think adding bracket will look code more clear. + cxt.ispartitioned = (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE); 6) + * RelationBuildPartitionKey + * Build and attach to relcache partition key data of relation + * + * Partitioning key data is stored in CacheMemoryContext to ensure it survives + * as long as the relcache. To avoid leaking memory in that context in case + * of an error partway through this function, we build the structure in the + * working context (which must be short-lived) and copy the completed + * structure into the cache memory. extra space before "To avoid leaking memory" 7) + /* variable-length fields start here, but we allow direct access to partattrs */ + int2vector partattrs; /* attribute numbers of columns in the Why partattrs is allow direct access - its not really clear from the comments. I will continue reading more patch and testing functionality.. will share the comments as I have it. Thanks, On Tue, Nov 22, 2016 at 2:45 PM, Amit Langote <langote_amit...@lab.ntt.co.jp > wrote: > > Updated patches attached. I merged what used to be 0006 and 0007 into one. > > On 2016/11/19 2:23, Robert Haas wrote: > > On Fri, Nov 18, 2016 at 5:59 AM, Amit Langote > > <langote_amit...@lab.ntt.co.jp> wrote: > >> Oh but wait, that means I can insert rows with NULLs in the range > >> partition key if I choose to insert it directly into the partition, > >> whereas I have been thinking all this while that there could never be > >> NULLs in the partition key of a range partition. What's more, > >> get_qual_for_partbound() (patch 0003) emits a IS NOT NULL constraint for > >> every partition key column in case of a range partition. Is that > >> wrongheaded altogether? (also see my reply to your earlier message > about > >> NULLs in the range partition key) > > > > The easiest thing to do might be to just enforce that all of the > > partition key columns have to be not-null when the range-partitioned > > table is defined, and reject any attempt to DROP NOT NULL on them > > later. That's probably better that shoehorning it into the table > > constraint. > > Agreed that forcing range partitioning columns to be NOT NULL during table > creation would be a better approach. But then we would have to reject > using expressions in the range partition key, right? > > >> Thanks for the idea below! > >> > >>> 1. Forget the idea of a tree. Instead, let the total number of tables > >>> in the partitioning hierarchy be N and let the number of those that > > [ ... ] > > >>> > >>> No recursion, minimal pointer chasing, no linked lists. The whole > >>> thing is basically trivial aside from the cost of > >>> list/range_partition_for_tuple itself; optimizing that is a different > >>> project. I might have some details slightly off here, but hopefully > >>> you can see what I'm going for: you want to keep the computation that > >>> happens in get_partition_for_tuple() to an absolute minimum, and > >>> instead set things up in advance so that getting the partition for a > >>> tuple is FAST. And you want the data structures that you are using in > >>> that process to be very compact, hence arrays instead of linked lists. > >> > >> This sounds *much* better. Here is a quick attempt at coding the design > >> you have outlined above in the attached latest set of patches. > > > > That shrank both 0006 and 0007 substantially, and it should be faster, > > too. I bet you can shrink them further: > > Some changes described below have reduced the size to a certain degree. > > > > > - Why is PartitionKeyExecInfo a separate structure and why does it > > have a NodeTag? I bet you can dump the node tag, merge it into > > PartitionDispatch, and save some more code and some more > > pointer-chasing. > > OK, I merged the fields of what used to be PartitionKeyExecInfo into > PartitionDispatchData as the latter's new fields key and keystate. > > > - I still think it's a seriously bad idea for list partitioning and > > range partitioning to need different code-paths all over the place > > here. List partitions support nulls but not multi-column partitioning > > keys and range partitions support multi-column partitioning keys but > > not nulls, but you could use an internal structure that supports both. > > Then you wouldn't need partition_list_values_bsearch and also > > partition_rbound_bsearch; you could have one kind of bound structure > > that can be bsearch'd for either list or range. You might even be > > able to unify list_partition_for_tuple and range_partition_for_tuple > > although that looks a little harder. In either case, you bsearch for > > the greatest value <= the value you have. The only difference is that > > for list partitioning, you have to enforce at the end that it is an > > equal value, whereas for range partitioning less-than-or-equal-to is > > enough. But you should still be able to arrange for more code > > sharing. > > I have considered these suggestions in the latest patch. Now instead of > PartitionListInfo, PartitionRangeInfo, and BoundCollectionData structs, > there is only one PartitionBoundInfo which consolidates the partition > bound information of a partitioned table. Some of the fields are > applicable only to one of list or range case; for example, null-accepting > list partition index, infinite status of individual range datums. > > Also, there is now only one binary search function named > partition_bound_bsearch() which invokes a comparison function named > partition_bound_cmp(). The former searches a probe (a partition bound or > tuple) within a PartitionBoundInfo, which is passed all the way down to > the comparison function. > > Also, we no longer have list_partition_for_tuple() and > range_partition_for_tuple(). Instead, in get_partition_for_tuple() > itself, there is a bsearch followed by list and range partitioning > specific steps based on the returned offset. > > > - I don't see why you need the bound->lower stuff any more. If > > rangeinfo.bounds[offset] is a lower bound for a partition, then > > rangeinfo.bounds[offset+1] is either (a) the upper bound for that > > partition and the partition is followed by a "gap" or (b) both the > > upper bound for that partition and the lower bound for the next > > partition. With the inclusive/exclusive bound stuff gone, every range > > bound has the same sense: if the probed value is <= the bound then > > we're supposed to be a lower-numbered partition, but if > then we're > > supposed to be in this partition or a higher-numbered one. > > OK, I've managed to get rid of lower. At least it is no longer kept in > the new relcache struct PartitionBoundInfo. It is still kept in > PartitionRangeBound which is used to hold individual range bounds when > sorting them (during relcache build). Comparisons invoked during the > aforementioned sorting step still need to distinguish between lower and > upper bounds (such that '1)' < '[1'). > > Tuple-routing no longer needs to look at lower. In that case, what you > described above applies. > > As a result, one change became necessary: to how we flag individual range > bound datum as infinite or not. Previously, it was a regular Boolean > value (either infinite or not) and to distinguish +infinity from > -infinity, we looked at whether the bound is lower or upper (the lower > flag). Now, instead, the variable holding the status of individual range > bound datum is set to a ternary value: RANGE_DATUM_FINITE (0), > RANGE_DATUM_NEG_INF (1), and RANGE_DATUM_POS_INF (2), which still fits in > a bool. Upon encountering an infinite range bound datum, whether it's > negative or positive infinity derives the comparison result. Consider the > following example: > > partition p1 from (1, unbounded) to (1, 1); > partition p2 from (1, 1) to (1, 10); > partition p3 from (1, 10) to (1, unbounded); > partition p4 from (2, unbounded) to (2, 1); > ... so on > > In this case, we need to be able to conclude, say, (1, -inf) < (1, 15) < > (1, +inf), so that tuple (1, 15) is assigned to the proper partition. > > Does this last thing sound reasonable? > > Thanks, > Amit > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > > -- Rushabh Lathia