Apologies if I've made some of these comments before and/or missed comments you've made on these topics. The size of this patch set is so large that it's hard to keep track of everything.
Re-reviewing 0001: + indicate which table columns are used as partition key. For example, s/are used as/are part of the/ + third table columns make up the partition key. A zero in this array + indicates that the corresponding partition key column is an expression + over the table columns, rather than a simple column reference. I think you can leave out "over the table columns". + columns or expressions forms the <firstterm>partitioning key</firstterm> s/forms/form/ + The table itself is empty. A data row inserted into the table is routed s/The table/The partitioned table/ + * Anything mentioned in the expressions. We must ignore the column + * references which will count as self-dependency items; in this case, + * the depender is the table itself (there is no such thing as partition + * key object). "depender" is not really a word, and the parenthetical note isn't very clear. Maybe: We must ignore the column references, which will depend on the table itself; there is no separate partition key object. + heap_close(pg_partitioned_table, RowExclusiveLock); It seems like it would be safe to do this right after CatalogUpdateIndexes(pg_partitioned_table, tuple), and I'd recommend that you do. Not for performance or anything, but just to keep related code together. /* * Resolve possibly-defaulted operator class specification */ -static Oid +Oid GetIndexOpClass(List *opclass, Oid attrType, Perhaps we should rename this function to ResolveOpClass, since it's now going to be used for both indexes and partitioning keys. + * Sets *is_expr if attnum is found to be referenced in some partition key + * expression. is_expr doesn't seem quite as clear as, say, used_by_expr or used_in_expr. Also, the specification for this function doesn't seem to be very clear about what this is supposed to do if the same column is both an explicit partitioning column and also used in an expression, and the code looks like it'll return with *is_expr set based on whichever use it encounters first. If that's intended behavior, maybe add a comment like: It's possible for a column to be used both directly and as part of a partition key expression; if that happens, *is_expr may end up as either true or false. That's OK for current uses of this function, because *is_expr is only used to tailor the error message text. + if (is_expr) + *is_expr = false; + if (attnum == partattno) + return true; I think you should adjust this (and the other bit in the same function) so that you don't set *is_expr until you're committed to returning. + index = -1; + while ((index = bms_next_member(expr_attrs, index)) > 0) + { + AttrNumber attno = index + FirstLowInvalidHeapAttributeNumber; + + if (attno == attnum) + return true; + } How about bms_is_member(expr_attrs, attnum - FirstLowInvalidHeapAttributeNumber), instead of looping? + errmsg("cannot reference relation \"%s\"", RelationGetRelationName(pkrel)), + errdetail("Referencing partitioned tables in foreign key constraints is not supported."))); I think you could get rid of the errdetail and just have the error message be "cannot reference partitioned table \"%s\"". + errmsg("column \"%s\" appears twice in partition key", pelem->name), It could be there three times! How about column \"%s\" appears more than once in partition key? (I see that you seem to have adapted this from some code in parse_utilcmd.c, which perhaps should also be adjusted, but that's a job for another patch.) + /* + * Strip any top-level COLLATE clause. This ensures that we treat + * "x COLLATE y" and "(x COLLATE y)" alike. + */ But you don't, right? Unless I am confused, which is possible, the latter COLLATE will be ignored, while the former one will set the collation to be used in the context of partitioning comparisons. Re-reviewing 0002: + if (fout->remoteVersion >= 100000) + { + PQExpBuffer acl_subquery = createPQExpBuffer(); + PQExpBuffer racl_subquery = createPQExpBuffer(); + PQExpBuffer initacl_subquery = createPQExpBuffer(); + PQExpBuffer initracl_subquery = createPQExpBuffer(); + + PQExpBuffer attacl_subquery = createPQExpBuffer(); + PQExpBuffer attracl_subquery = createPQExpBuffer(); + PQExpBuffer attinitacl_subquery = createPQExpBuffer(); + PQExpBuffer attinitracl_subquery = createPQExpBuffer(); It seems unnecessary to repeat all of this. The only differences between the 10000 version and the 9600 version are: 60,61c60 < "AS changed_acl, " < "CASE WHEN c.relkind = 'P' THEN pg_catalog.pg_get_partkeydef(c.oid) ELSE NULL END AS partkeydef " --- > "AS changed_acl " 73c72 < "WHERE c.relkind in ('%c', '%c', '%c', '%c', '%c', '%c', '%c') " --- > "WHERE c.relkind in ('%c', '%c', '%c', '%c', '%c', '%c') " 87,88c86 < RELKIND_MATVIEW, RELKIND_FOREIGN_TABLE, < RELKIND_PARTITIONED_TABLE); --- > RELKIND_MATVIEW, RELKIND_FOREIGN_TABLE); But none of that is really a problem. Sure, the 'P' case will never arise in 9.6, but so what? I'd really like to not keep duplicating these increasingly-complex hunks of code if we can find some way to avoid that. /* find all the inheritance information */ - appendPQExpBufferStr(query, "SELECT inhrelid, inhparent FROM pg_inherits"); + appendPQExpBufferStr(query, + "SELECT inhrelid, inhparent " + "FROM pg_inherits " + "WHERE inhparent NOT IN (SELECT oid FROM pg_class WHERE relkind = 'P')"); I think you need to update the comment. "Find inheritance information, excluding implicit inheritance via partitioning. We're not interested in that case because $REASON." Re-reviewing 0003: + <para> + If this table is a partition, one cannot perform <literal>DROP NOT NULL</> + on a column if it is marked not null in the parent table. + </para> This would, presumably, also be true for inheritance. I think we could just leave this out. + as partition of the target table. The partition bound specification must s/as partition/as a partition/ + correspond to the partitioning method and partitioning key of the target I think that in most places were are referring to the "partitioning method" (with ing) but the "partition key" (without ing). Let's try to be consistent. + table. The table to be attached must have all the columns as the target + table and no more; moreover, the column types must also match. Also, it + must have all the matching constraints as the target table. s/all the columns/all of the same columns/ The second sentence doesn't seem quite grammatical. And why would that be true anyway? Partitions can have extra constraints, and if they lack constraints that are present on the partitioned table, those constraints will be added and validated, right? + A full table scan is performed on the table being attached to check that + existing rows in the table are within the specified partition bound. + If it is known in advance that no partition bound violating rows are + present in the table, the above scan can be skipped by specifying the + <literal>NO VALIDATE</> option. + Default behavior is to perform the scan, as if the <literal>VALIDATE</literal> + option were specified. I don't think it's OK to allow the partition to be added if it contains rows that might not be valid. We are generally vary wary about adding options that can cause data integrity failures and I think we shouldn't start here, either. On the other hand, it's also not desirable for adding a partition to take O(n) time in all cases. So what would be nice is if the new partition could self-certify that contains no problematic rows by having a constraint that matches the new partitioning constraint. Then we can skip the scan if that constraint is already present. + inherited columns. One can also specify table constraints, in addition Delete comma. + to those inherited from the parent. If a check constraint with the name + matching one of the parent's constraint is specified, it is merged with + the latter, provided the specified condition is same. Doesn't that effectively delete the merged constraint? Suggest: "If the parent already has a check constraint with the same name as a constraint specified for the child, the conditions must be the same." +) FOR VALUES IN ('los angeles', 'san fransisco'); That's not you you spell San Francisco. + Create partition of a list partitioned table that itself is further + partitioned and then create its partition: s/itself is/is itself/ s/then create its partition/then add a partition to it/ + if (!is_local || con->coninhcount == 0) + con->coninhcount++; I would think that you could get rid of the "if" and just say con->coninhcount = 1. It seems to me (and perhaps the comment could say this) that for a partitioned table, we can simplify the handling of coninhcount and conislocal. Since partitioning hierarchies don't allow multiple inheritance, any inherited constraint must be inherited exactly once. Since a partition cannot have inheritance children -- except by being partitioned itself -- there is no point in tracking conislocal, so we always store it as false. + +void +StorePartitionBound(Relation rel, Node *bound) Header comment, please! + (void) SysCacheGetAttr(RELOID, tuple, Anum_pg_class_relpartbound, + &isnull); + Assert(isnull); We try not to do unnecessary work in non-assert-enabled builds solely for the benefit of assert-enabled builds. We also try not to end up with variables that are only used in assert-enabled builds but not marked PG_USED_FOR_ASSERTS_ONLY, because those tend to cause compiler warnings. I'm not sure an compiler would be smart enough to warn about this, but I suggest adding an #ifdef USE_ASSERT_CHECKING with a block inside where the offending variable is declared. Actually, I think you need to move a bit more code. Hmm. Something like: #ifdef USE_ASSERT_CHECKING { Form_pg_class classForm; bool isnull; classForm = (Form_pg_class) GETSTRUCT(tuple); Assert(!classForm->relispartition); (void) SysCacheGetAttr(RELOID, tuple, Anum_pg_class_relpartbound, &isnull); Assert(isnull); } #endif + * are same in common cases, of which we only store one. "and we only store one of them." + /* + * Never put a null into the values array, flag + * instead for the code further down below where + * we construct the actual relcache struct. + */ + found_null_partition = true; + null_partition_index = i; How about we elog(ERROR, ...) if found_null_partition is already set? + foreach(cell, non_null_values) + { + PartitionListValue *src = lfirst(cell); + + all_values[i] = (PartitionListValue *) + palloc(sizeof(PartitionListValue)); + all_values[i]->value = datumCopy(src->value, + key->parttypbyval[0], + key->parttyplen[0]); + all_values[i]->index = src->index; + i++; + } Why do we need to datumCopy() here when we just did that in the previous loop? Why did the previous loop need to datumCopy(), anyway? Can't we just pass the same datums around? I understand that you need to copy the data into the correct context at the end, but doing two copies prior to that seems excessive. +get_leaf_partition_oids(Oid relid, int lockmode) Instead of implementing this, can you just use find_all_inheritors()? + /* + * Is lower_val = upper_val? + */ + if (lower_val && upper_val) So, that comment does not actually match that code. That if-test is just checking whether both bounds are finite. What I think you should be explaining here is that if lower_val and upper_val are both non-infinite, and if the happen to be equal, we want to emit an equality test rather than a >= test plus a <= test because $REASON. + operoid = get_opfamily_member(key->partopfamily[col], + key->parttypid[col], + key->parttypid[col], + strategy); + if (!OidIsValid(operoid)) + { + operoid = get_opfamily_member(key->partopfamily[col], + key->partopcintype[col], + key->partopcintype[col], + strategy); + *need_relabel = true; + } There's a comment explaining THAT you do this ("Use either the column type as the operator datatype or opclass's declared input type.") but, to repeat a complaint that I've made often, nothing explaining why. In this particular case, what's not clear - in my opinion - is why you need to try two different possibilities, and why at least one of those possibilities is guaranteed to work. I gather that if the opfamily doesn't contain an operator for the actual type of the partitioning column, you think it will certainly contain one for the input type of the operator class (which seems right), and that the input type of the operator class will be binary-compatible with the type of the partitioning column (which is less-obviously true, and needs some explanation). I also think that this function should elog(ERROR, ...) if by any chance the second get_opfamily_member() call also fails. Otherwise the error might occur quite a bit downstream and be hard to understand. + ReleaseSysCache(tuple); + heap_close(parent, AccessShareLock); I think you had better not release the lock here - i.e. pass NoLock. We don't release locks on tables until transaction commit, except for catalog tables. This also comes up in, at least, transformPartitionOf(). + PartitionRangeBound *b1 = (*(PartitionRangeBound *const *) a); + PartitionRangeBound *b2 = (*(PartitionRangeBound *const *) b); + PartitionKey key = (PartitionKey) arg; + + return partition_rbound_cmp(key, b1, b2); Whitespace. + * as partition and schema consists of columns definitions corresponding the schema consists - if (recurse) + /* Force inheritance recursion, if partitioned table. */ + if (recurse || rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) I would instead error out if ONLY is specified for a partitioned table. + /* + * If the table is source table of ATTACH PARTITION command, following + * check is unnecessary. + */ As usual, comment should say WHY. + if (partqualstate && !ExecQual(partqualstate, econtext, true)) + ereport(ERROR, + (errcode(ERRCODE_CHECK_VIOLATION), + errmsg("child table contains a row violating partition bound specification"))); Why not mimic the existing phrasing? "partition constraint is violated by some row" What happens if you try to attach a table as a partition of itself or one of its ancestors? - errmsg("column \"%s\" in child table must be marked NOT NULL", - attributeName))); + errmsg("column \"%s\" in child table must be marked NOT NULL", + attributeName))); Whitespace-only hunk; revert. You cannot fight the power of pgindent. + errmsg("cannot attach table that is a inheritance child as partition"))); an inheritance child + errmsg("cannot attach a temporary relation of another session as partition "))); Extra space. + errdetail("Table being attached should contain only the columns" + " present in parent."))); Suggest: "New partition should contain only..." Also, don't break error messages into multiple strings. Make it one long string and let pgindent deal. + | FOR VALUES START '(' range_datum_list ')' lb_inc + END_P '(' range_datum_list ')' ub_inc Just a random idea. Would for VALUES FROM ( ... ) TO ( ... ) be more idiomatic than START and END? +static void +transformPartitionOf(CreateStmtContext *cxt, Node *bound) +{ + TupleDesc tupdesc; + int i; + RangeVar *part = cxt->relation; + RangeVar *partof = linitial(cxt->inhRelations); + Relation parentRel; + + parentRel = heap_openrv(partof, AccessShareLock); + If there's anyway that we might do heap_openrv() on the same RangeVar at multiple places in the code, it presents security and integrity hazards because the referent of that RangeVar might change in the meantime. I suspect there is such a risk here - won't we need to open the relation again when we actually want to do the operation? Also, we should never acquire a lower-level lock on a relation and then, further downstream, acquire a higher-level lock on the same object. To do so creates a deadlock hazard. That seems like it might be a problem here, too. + /*if (ldatum->infinite && rdatum->infinite) + ereport(ERROR, + (errcode(ERRCODE_INVALID_TABLE_DEFINITION), + errmsg("both START and END datum for column \"%s\" cannot be UNBOUNDED", + colname), + parser_errposition(cxt->pstate, rdatum->location)));*/ Commented-out code is bad. + if (list_length(spec->lowerdatums) > partnatts) + ereport(ERROR, + (errcode(ERRCODE_INVALID_TABLE_DEFINITION), + errmsg("START has more values than the length of the partition key"), + parser_errposition(cxt->pstate, + exprLocation(list_nth(spec->lowerdatums, + list_length(spec->lowerdatums) - 1))))); + else if (list_length(spec->lowerdatums) < partnatts) + ereport(ERROR, + (errcode(ERRCODE_INVALID_TABLE_DEFINITION), + errmsg("START has fewer values than the length of the partition key"), + parser_errposition(cxt->pstate, + exprLocation(list_nth(spec->lowerdatums, + list_length(spec->lowerdatums) - 1))))); It would be worth looking for a way to unify these cases. Like "START must specify exactly one value per partitioning column". + * Same oids? No mind order - in the list case, it matches the order + * in which partition oids are returned by a pg_inherits scan, whereas + * in the range case, they are in order of ranges of individual + * partitions. XXX - is the former unsafe? Probably. It likely depends on the physical ordering of tuples in the table, which can change. + * BoundCollection encapsulates a set of partition bounds of either physical + * or logical relations. It is associated with a partitioned relation of + * which the aforementioned relations are partitions. "physical or logical relations" is unfamiliar terminology. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers