On Tue, Aug 7, 2018 at 9:29 AM Andres Freund <and...@anarazel.de> wrote:
> One approach would be to make sure that everything relying on
> rt_partdesc staying the same stores its value in a local variable, and
> then *not* free the old version of rt_partdesc (etc) when the refcount
> > 0, but delay that to the RelationClose() that makes refcount reach
> 0. That'd be the start of a framework for more such concurrent
> handling.
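To make sure we're talking about the same mechanism, here is a rough sketch of how I read that suggestion; rd_pdoldcxt and old_pdcxt are invented names, and this assumes the superseded partition descriptor lives in a memory context of its own:

    /* Relcache rebuild path: instead of freeing the superseded
     * descriptor immediately, park it if the entry is still pinned.
     * rd_pdoldcxt is a hypothetical new relcache field. */
    if (relation->rd_refcnt > 0)
    {
        /* Chain any previously parked context under this one, so the
         * whole chain is freed together later. */
        if (relation->rd_pdoldcxt != NULL)
            MemoryContextSetParent(relation->rd_pdoldcxt, old_pdcxt);
        relation->rd_pdoldcxt = old_pdcxt;
    }
    else
        MemoryContextDelete(old_pdcxt);

    /* In RelationClose(), once the refcount reaches zero, it is safe
     * to drop everything that was parked. */
    RelationDecrementReferenceCount(relation);
    if (relation->rd_refcnt == 0 && relation->rd_pdoldcxt != NULL)
    {
        MemoryContextDelete(relation->rd_pdoldcxt);
        relation->rd_pdoldcxt = NULL;
    }

With something like that, code that stashes the PartitionDesc pointer in a local variable can keep using it for as long as it holds the relation open.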
Some analysis of possible trouble spots:

- get_partition_dispatch_recurse and ExecCreatePartitionPruneState both call RelationGetPartitionDesc. Presumably, this means that if the partition descriptor gets updated on the fly, the tuple routing and partition dispatch code could end up with different ideas about which partitions exist. I think this should be fixed somehow, so that we only call RelationGetPartitionDesc once per query and use the result for everything.

- expand_inherited_rtentry checks RelationGetPartitionDesc(oldrelation) != NULL. If so, it calls expand_partitioned_rtentry, which fetches the same PartitionDesc again. We can probably just do this once in the caller and pass the result down.

- set_relation_partition_info also calls RelationGetPartitionDesc. Off-hand, I think this code runs after expand_inherited_rtentry. Not sure what to do about this. I'm not sure what the consequences would be if this function and that one had different ideas about the partition descriptor.

- tablecmds.c is pretty free about calling RelationGetPartitionDesc repeatedly, but it probably doesn't matter. If we're doing some kind of DDL that depends on the contents of the partition descriptor, we *had better* be holding a lock strong enough to prevent the partition descriptor from being changed by somebody else at the same time. Allowing a partition to be added concurrently with DML is one thing; allowing a partition to be added concurrently with adding another partition is a whole different level of insanity. I think we'd be best advised not to go down that rathole - among other concerns, how would you even guarantee that the partitions being added didn't overlap?

Generally:

Is it really OK to postpone freeing the old partition descriptor until the relation reference count goes to 0? I wonder whether there are cases where this could lead to tons of copies of the partition descriptor floating around at the same time, gobbling up precious cache memory. My first thought was that this would be pretty easy: just create a lot of new partitions one by one while some long-running transaction is open. But the actual result in that case depends on the behavior of the backend running the transaction. If it just ignores the new partitions and sticks with the partition descriptor it has got, then probably nothing else will request the new partition descriptor either, and there will be no accumulation of memory. However, if it tries to absorb the updated partition descriptor, but without being certain that the old one can be freed, then we'd have a query-lifespan memory leak which is quadratic in the number of new partitions.

Maybe even that would be OK -- we could suppose that the number of new partitions would probably not be all THAT crazy large, and the constant factor not too bad, so maybe you'd leak a couple of MB for the length of the query, but no more.

However, I wonder if it would be better to give each PartitionDescData its own refcnt, so that it can be freed immediately when the refcnt goes to zero. That would oblige every caller of RelationGetPartitionDesc() to later call something like ReleasePartitionDesc(). We could catch failures to do that by keeping all the PartitionDesc objects so far created in a list. When the main entry's refcnt goes to 0, cross-check that this list is empty; if not, then the remaining entries have non-zero refcnts that were leaked. We could emit a WARNING as we do in similar cases.
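To make the refcount idea concrete, a rough sketch follows. The refcnt field, the function form of RelationGetPartitionDesc (today it is just a macro over rd_partdesc), and ReleasePartitionDesc() are all invented, and the leak cross-check is only described in the comments:

    /* Existing fields, plus a hypothetical pin count. */
    typedef struct PartitionDescData
    {
        int         nparts;         /* number of partitions */
        Oid        *oids;           /* OIDs of partitions */
        PartitionBoundInfo boundinfo;   /* collection of partition bounds */
        int         refcnt;         /* hypothetical: number of pins */
    } PartitionDescData;

    /* Would become a function that pins the descriptor. */
    PartitionDesc
    RelationGetPartitionDesc(Relation rel)
    {
        PartitionDesc partdesc = rel->rd_partdesc;

        partdesc->refcnt++;     /* caller must ReleasePartitionDesc() */
        return partdesc;
    }

    void
    ReleasePartitionDesc(PartitionDesc partdesc)
    {
        Assert(partdesc->refcnt > 0);
        if (--partdesc->refcnt == 0)
            pfree(partdesc);    /* and its substructures, in reality */
    }

    /*
     * For leak detection, the relcache entry could keep a list of every
     * PartitionDesc created for it; when the entry's own refcount drops
     * to 0, any list member that still has refcnt > 0 was leaked, and
     * we could emit a WARNING, as we do in similar cases.
     */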
In general, I think something along the lines you are suggesting here is the right place to start attacking this problem. Effectively, we immunize the system against the possibility of new entries showing up in the partition descriptor while concurrent DML is running; the semantics are that the new partitions are ignored for the duration of currently-running queries. This seems to allow for painless creation or addition of new partitions in normal cases, but not when a default partition exists. In that case, using the old PartitionDesc is outright wrong, because adding a new toplevel partition changes the default partition's partition constraint. We can't insert into the default partition a tuple that under the updated table definition needs to go someplace else.

It seems like the best way to account for that is to reduce the lock level on the partitioned table to ShareUpdateExclusiveLock, but leave the lock level on any default partition as AccessExclusiveLock (because we are modifying a constraint on it). We would also need to leave the lock level on the new partition as AccessExclusiveLock (because we are adding a constraint on it). Not perfect, for sure, but not bad for a first patch, either; it would improve things for users in a bunch of practical cases.
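As a rough sketch, the locking in the ATTACH PARTITION path would then look something like this (hypothetical: today tablecmds.c takes AccessExclusiveLock on the parent as well, and the variable names here are just illustrative):

    /* Parent: reduced from AccessExclusiveLock in the current code. */
    parent = heap_open(parentOid, ShareUpdateExclusiveLock);

    /* The table being attached acquires a new partition constraint,
     * so it keeps the strongest lock. */
    attachrel = heap_open(attachrelOid, AccessExclusiveLock);

    /* A default partition's implicit constraint changes when a new
     * partition is added, so it keeps the strongest lock too. */
    defaultPartOid = get_default_partition_oid(parentOid);
    if (OidIsValid(defaultPartOid))
        defaultrel = heap_open(defaultPartOid, AccessExclusiveLock);

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company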