On Fri, Nov 18, 2016 at 5:59 AM, Amit Langote <langote_amit...@lab.ntt.co.jp> wrote: > Oh but wait, that means I can insert rows with NULLs in the range > partition key if I choose to insert it directly into the partition, > whereas I have been thinking all this while that there could never be > NULLs in the partition key of a range partition. What's more, > get_qual_for_partbound() (patch 0003) emits a IS NOT NULL constraint for > every partition key column in case of a range partition. Is that > wrongheaded altogether? (also see my reply to your earlier message about > NULLs in the range partition key)
The easiest thing to do might be to just enforce that all of the partition key columns have to be not-null when the range-partitioned table is defined, and reject any attempt to DROP NOT NULL on them later. That's probably better that shoehorning it into the table constraint. > Thanks for the idea below! > >> 1. Forget the idea of a tree. Instead, let the total number of tables >> in the partitioning hierarchy be N and let the number of those that >> are partitioned be K. Assign each partitioned table in the hierarchy >> an index between 0 and K-1. Make your top level data structure (in >> lieu of PartitionTreeNodeData) be an array of K PartitionDispatch >> objects, with the partitioning root in entry 0 and the rest in the >> remaining entries. >> >> 2. Within each PartitionDispatch object, store (a) a pointer to a >> PartitionDesc and (b) an array of integers of length equal to the >> PartitionDesc's nparts value. Each integer i, if non-negative, is the >> final return value for get_partition_for_tuple. If i == -1, tuple >> routing fails. If i < -1, we must next route using the subpartition >> whose PartitionDesc is at index -(i+1). Arrange for the array to be >> in the same order the PartitionDesc's OID list. >> >> 3. Now get_partition_for_tuple looks something like this: >> >> K = 0 >> loop: >> pd = PartitionDispatch[K] >> idx = list/range_partition_for_tuple(pd->partdesc, ...) >> if (idx >= -1) >> return idx >> K = -(idx + 1) >> >> No recursion, minimal pointer chasing, no linked lists. The whole >> thing is basically trivial aside from the cost of >> list/range_partition_for_tuple itself; optimizing that is a different >> project. I might have some details slightly off here, but hopefully >> you can see what I'm going for: you want to keep the computation that >> happens in get_partition_for_tuple() to an absolute minimum, and >> instead set things up in advance so that getting the partition for a >> tuple is FAST. And you want the data structures that you are using in >> that process to be very compact, hence arrays instead of linked lists. > > This sounds *much* better. Here is a quick attempt at coding the design > you have outlined above in the attached latest set of patches. That shrank both 0006 and 0007 substantially, and it should be faster, too. I bet you can shrink them further: - Why is PartitionKeyExecInfo a separate structure and why does it have a NodeTag? I bet you can dump the node tag, merge it into PartitionDispatch, and save some more code and some more pointer-chasing. - I still think it's a seriously bad idea for list partitioning and range partitioning to need different code-paths all over the place here. List partitions support nulls but not multi-column partitioning keys and range partitions support multi-column partitioning keys but not nulls, but you could use an internal structure that supports both. Then you wouldn't need partition_list_values_bsearch and also partition_rbound_bsearch; you could have one kind of bound structure that can be bsearch'd for either list or range. You might even be able to unify list_partition_for_tuple and range_partition_for_tuple although that looks a little harder. In either case, you bsearch for the greatest value <= the value you have. The only difference is that for list partitioning, you have to enforce at the end that it is an equal value, whereas for range partitioning less-than-or-equal-to is enough. But you should still be able to arrange for more code sharing. - I don't see why you need the bound->lower stuff any more. If rangeinfo.bounds[offset] is a lower bound for a partition, then rangeinfo.bounds[offset+1] is either (a) the upper bound for that partition and the partition is followed by a "gap" or (b) both the upper bound for that partition and the lower bound for the next partition. With the inclusive/exclusive bound stuff gone, every range bound has the same sense: if the probed value is <= the bound then we're supposed to be a lower-numbered partition, but if > then we're supposed to be in this partition or a higher-numbered one. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers