On 06-01-2015 PM 03:40, Amit Langote wrote: > > I agree that while we are discussing these points, we could also be > discussing how we solve problems of existing partitioning implementation > using whatever the above things end up being. Proposed approaches to > solve those problems might be useful to drive the first step as well or > perhaps that's how it should be done anyway. >
I realize the discussion has not quite brought us to *conclusions* so far though surely there has been valuable input from people. Anyway, starting a new thread with the summary of what has been (please note that the order of listing the points does not necessarily connote the priority): * It has been repeatedly pointed out that we may want to decouple partitioning from inheritance because implementing partitioning as an extension of inheritance mechanism means that we have to keep all the existing semantics which might limit what we want to do with the special case of it which is partitioning; in other words, we would find ourselves in difficult position where we have to inject a special case code into a very generalized mechanism that is inheritance. Specifically, do we regard a partitions as pg_inherits children of its partitioning parent? * Syntax: do we want to make it similar to one of the many other databases out there? Or we could invent our own? I like the syntax that Robert suggested that covers the cases of RANGE and LIST partitioning without actually having to use those keywords explicitly; something like the following: CREATE TABLE parent PARTITION ON (column [ USING opclass ] [, ... ]); CREATE TABLE child PARTITION OF parent_name FOR VALUES { (value, ...) [ TO (value, ...) ] } So instead of making a hard distinction between range and list partitioning, you can say: CREATE TABLE child_name PARTITION OF parent_name FOR VALUES (3, 5, 7); wherein, child is effectively a LIST partition CREATE TABLE child PARTITION OF parent_name FOR VALUES (8) TO (12); wherein, child is effectively a RANGE partition on one column CREATE TABLE child PARTITION OF parent_name FOR VALUES(20, 120) TO (30, 130); wherein, child is effectively a RANGE partition on two columns I wonder if we could add a clause like DISTRIBUTED BY to complement PARTITION ON that represents a hash distributed/partitioned table (that could be a syntax to support sharded tables maybe; we would definitely want to move ahead in that direction I guess) * Catalog: We would like to have a catalog structure suitable to implement capabilities like multi-column partitioning, sub-partitioning (with arbitrary number of levels in the hierarchy). I had suggested that we create two new catalogs viz. pg_partitioned_rel, pg_partition_def to store metadata about a partition key of a partitioned relation and partition bound info of a partition, respectively. Also, see the point about on-disk representation of partition bounds * It is desirable to treat partitions as pg_class relations with perhaps a new relkind(s). We may want to choose an implementation where only the lowest level relations in a partitioning hierarchy have storage; those at the upper layers are mere placeholder relations though of course with associated constraints determined by partitioning criteria (with appropriate metadata entered into the additional catalogs). I am not quite sure if each kind of the relations involved in the partitioning scheme have separate namespaces and, if they are, how we implement that * In the initial implementation, we could just live with partitioning on a set of columns (and not arbitrary expressions of them) * We perhaps do not need multi-column LIST partitions as they are not very widely used and may complicate the implementation * There are a number of suggestions about how we represent partition bounds (on-disk) - pg_node_tree, RECORD (a composite type or the rowtype associated with the relation itself), etc. Important point to consider here may be that partition key may contain more than one column * How we represent partition definition in memory (for a given partitioned relation) - important point to remember is that such a representation should be efficient to iterate through or binary-searchable. Also see the points about tuple-routing and partition-pruning * Overflow/catchall partition: it seems we do not want/need them. It might seem desirable for example in cases where a big transaction enters a large number of tuples all but one of which find a defined partition; we may not want to error out in such case but instead enter that erring tuple into the overflow partition instead. If we choose to implement that, we would like to also implement the capability to move the tuples into the appropriate partition once it's defined. Related is the notion of automatically creating partitions if one is not already defined for a just entered tuple; but there may be locking troubles if many concurrent sessions try to do that * Tuple-routing: based on the internal representation of partition bounds for the partitions of a given partitioned table, there should be a way to map a just entered tuple to partition id it belongs to. Below mentioned BRIN-like machinery could be made to work * Partition-pruning: again, based on the internal representation of partition bounds for the partitions of a given partitioned table, there should be a way to prune partitions deemed unnecessary per scan quals. One notable suggestion is to consider BRIN (-like) machinery. For example, it is able to tell from the scan quals whether a particular block range of a given heap needs to be scanned or not based on summary info index tuple for the block range. Though, the interface is currently suitable to cover a single heap with blocks in range 0 to N-1 of that heap. What we are looking for here is a hypothetical PartitionMemTuple (or PartitionBound) that is a summary of a whole relation (in this case, the partition) NOT a block range. But I guess the infrastructure is generalized enough that we could make that work. Related then would be an equivalent of ScanKey for the partitioning case. Just as ScanKeyData has correspondence with the index being used, the hypothetical PartitionScanKeyData (which may be an entirely bad/half-baked idea!) would represent the application of comparison operator between table column (partitioning key column) and a constant (as per quals). Please help bridge the gap in my understanding of these points. I hope we can put the discussion on a concrete footing so that it leads to a way towards implementation sooner than later. Some points need more immediate attention as we would like to first tackle the issue of partition metadata. Reusing existing infrastructure should be encouraged with obvious enhancements as we find fit. I am beginning to feel there is a need to prototype a good enough solution that incorporates the suggestions that have been already provided or will be provided. It may be the only way forward though I think it definitely worthwhile to spend some time to arrive at such a set of good enough ideas on various aspects. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers