Robert Haas <robertmh...@gmail.com> writes:
> On Fri, Mar 15, 2019 at 3:45 PM Tom Lane <t...@sss.pgh.pa.us> wrote:
>> I agree that copying data isn't great.  What I don't agree is that
>> this solution is better.
> I am not very convinced that it makes sense to lump something like
> RelationGetIndexAttrBitmap in with something like
> RelationGetPartitionDesc.  RelationGetIndexAttrBitmap is returning a
> single Bitmapset, whereas the data structure RelationGetPartitionDesc
> is vastly larger and more complex.  You can't say "well, if it's best
> to copy 32 bytes of data out of the relcache every time we need it,
> it must also be right to copy 10k or 100k of data out of the relcache
> every time we need it."

I did not say that.  What I did say is that they're both correct
solutions.  Copying large data structures is clearly a potential
performance problem, but that doesn't mean we can take correctness
shortcuts.

> If we want an at-least-somewhat unified solution here, I think we
> need to bite the bullet and make the planner hold a reference to the
> relcache throughout planning.  (The executor already keeps it open, I
> believe.)  Otherwise, how is the relcache supposed to know when it
> can throw stuff away and when it can't?

The real problem with that is that we *still* won't know whether it's
okay to throw stuff away or not.  The issue with these subsidiary data
structures is exactly that we're trying to reduce the lock strength
required for changing one of them, and as soon as you make that lock
strength less than AEL, you have the problem that that value may need
to change in relcache entries despite them being pinned.  The code I'm
complaining about is trying to devise a shortcut solution to exactly
that situation ... and it's not a good shortcut.

> The only alternative seems to be to have each subsystem hold its own
> reference count, as I did with the PartitionDirectory stuff, which is
> not something we'd want to scale out.

Well, we clearly don't want to devise a separate solution for every
subsystem, but it doesn't seem out of the question to me that we could
build a "reference counted cache sub-structure" mechanism and apply it
to each of these relcache fields.  Maybe it could be unified with the
existing code in the typcache that does a similar thing.  Sure, this'd
be a fair amount of work, but we've done it before.  Syscache entries
didn't use to have reference counts, for example.

BTW, the problem I have with the PartitionDirectory stuff is exactly
that it *isn't* a reference-counted solution.  If it were, we'd not
have this problem of not knowing when we could free rd_pdcxt.

>> I especially don't care for the much-less-than-half-baked kluge of
>> chaining the old rd_pdcxt onto the new one and hoping that it will
>> go away at a suitable time.

> It will go away at a suitable time, but maybe not at the soonest
> suitable time.  It wouldn't be hard to improve that, though.  The
> easiest thing to do, I think, would be to have an rd_oldpdcxt and
> stuff the old context there.  If there already is one there, parent
> the new one under it.  When RelationDecrementReferenceCount reduces
> the reference count to zero, destroy anything found in rd_oldpdcxt.

Meh.  While it seems likely that that would mask most practical
problems, it still seems like covering up a wart with a dirty bandage.
In particular, that fix doesn't fix anything unless relcache reference
counts go to zero pretty quickly --- which I'll just note is directly
contrary to your enthusiasm for holding relcache pins longer.

I think that what we ought to do for v12 is have PartitionDirectory
copy the data, and then in v13 work on creating real reference-count
infrastructure that would allow eliminating the copy steps with full
safety.
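To make that concrete, here is a rough sketch of the sort of
reference-counted sub-structure I have in mind.  Every type and
function name below is hypothetical, not an existing API; this is a
shape, not a design:

    /*
     * Hypothetical sketch only.  A relcache field would point at one of
     * these instead of at a bare PartitionDesc.  The relcache holds one
     * reference; each consumer that wants the data to stay put takes
     * another.  Nothing is freed until the last reference is released,
     * so the relcache can swap in a new version without yanking the old
     * one out from under anybody.
     */
    typedef struct RefcountedPartDesc
    {
        int           refcnt;    /* number of live references */
        MemoryContext context;   /* context holding pd (and this struct) */
        PartitionDesc pd;        /* the shared data structure */
    } RefcountedPartDesc;

    static void
    PinPartDesc(RefcountedPartDesc *rpd)
    {
        rpd->refcnt++;
    }

    static void
    UnpinPartDesc(RefcountedPartDesc *rpd)
    {
        Assert(rpd->refcnt > 0);
        if (--rpd->refcnt == 0)
            MemoryContextDelete(rpd->context);  /* frees pd, and rpd too */
    }

A production version would also need resource-owner integration so
that pins can't leak across an error, which is part of why this is a
fair amount of work.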
The $64 question is whether that really would cause unacceptable
performance problems.  To look into that, I made the attached WIP
patches.  (These are functionally complete, but I didn't bother for
instance with removing the hunk that 898e5e329 added to relcache.c,
and the comments need work, etc.)  The first one just changes the
PartitionDirectory code to do that, and then the second one
micro-optimizes partition_bounds_copy() to make it somewhat less
expensive, mostly by collapsing lots of small palloc's into one big
one.  What I get for test cases like [1] is

single-partition SELECT, hash partitioning:

    N       tps, HEAD       tps, patch
    2       11426.243754    11448.615193
    8       11254.833267    11374.278861
    32      11288.329114    11371.942425
    128     11222.329256    11185.845258
    512     11001.177137    10572.917288
    1024    10612.456470     9834.172965
    4096     8819.110195     7021.864625
    8192     7372.611355     5276.130161

single-partition SELECT, range partitioning:

    N       tps, HEAD       tps, patch
    2       11037.855338    11153.595860
    8       11085.218022    11019.132341
    32      10994.348207    10935.719951
    128     10884.417324    10532.685237
    512     10635.583411     9578.108915
    1024    10407.286414     8689.585136
    4096     8361.463829     5139.084405
    8192     7075.880701     3442.542768

Now certainly these numbers suggest that avoiding the copy could be
worth our trouble, but these results are still several orders of
magnitude better than where we were two weeks ago [2].  Plus, this is
an extreme case that's not really representative of real-world usage,
since the test tables have neither indexes nor any data.  In practical
situations the baseline would be lower and the dropoff less bad.  So
I don't feel bad about shipping v12 with these sorts of numbers and
hoping for more improvement later.

			regards, tom lane

[1] https://www.postgresql.org/message-id/3529.1554051867%40sss.pgh.pa.us
[2] https://www.postgresql.org/message-id/0F97FA9ABBDBE54F91744A9B37151A512BAC60%40g01jpexmbkw24
diff --git a/src/backend/partitioning/partdesc.c b/src/backend/partitioning/partdesc.c
index 4d6595b..96129e7 100644
--- a/src/backend/partitioning/partdesc.c
+++ b/src/backend/partitioning/partdesc.c
@@ -43,7 +43,6 @@ typedef struct PartitionDirectoryData
 typedef struct PartitionDirectoryEntry
 {
 	Oid			reloid;
-	Relation	rel;
 	PartitionDesc pd;
 } PartitionDirectoryEntry;
 
@@ -235,6 +234,34 @@ RelationBuildPartitionDesc(Relation rel)
 }
 
 /*
+ * copyPartitionDesc
+ *
+ * Should be somewhere else perhaps?
+ */
+static PartitionDesc
+copyPartitionDesc(PartitionDesc src, PartitionKey key)
+{
+	PartitionDesc partdesc;
+	int			nparts = src->nparts;
+
+	partdesc = (PartitionDescData *) palloc0(sizeof(PartitionDescData));
+	partdesc->nparts = nparts;
+	/* If there are no partitions, the rest of the partdesc can stay zero */
+	if (nparts > 0)
+	{
+		partdesc->oids = (Oid *) palloc(nparts * sizeof(Oid));
+		memcpy(partdesc->oids, src->oids, nparts * sizeof(Oid));
+
+		partdesc->is_leaf = (bool *) palloc(nparts * sizeof(bool));
+		memcpy(partdesc->is_leaf, src->is_leaf, nparts * sizeof(bool));
+
+		partdesc->boundinfo = partition_bounds_copy(src->boundinfo, key);
+	}
+
+	return partdesc;
+}
+
+/*
  * CreatePartitionDirectory
  *		Create a new partition directory object.
  */
@@ -265,7 +292,7 @@ CreatePartitionDirectory(MemoryContext mcxt)
  *
  * The purpose of this function is to ensure that we get the same
  * PartitionDesc for each relation every time we look it up.  In the
- * face of current DDL, different PartitionDescs may be constructed with
+ * face of concurrent DDL, different PartitionDescs may be constructed with
 * different views of the catalog state, but any single particular OID
 * will always get the same PartitionDesc for as long as the same
 * PartitionDirectory is used.
@@ -281,13 +308,16 @@ PartitionDirectoryLookup(PartitionDirectory pdir, Relation rel)
 	if (!found)
 	{
 		/*
-		 * We must keep a reference count on the relation so that the
-		 * PartitionDesc to which we are pointing can't get destroyed.
+		 * Currently, neither RelationGetPartitionDesc nor
+		 * RelationGetPartitionKey can leak any memory; but if they start
+		 * doing so, do those calls before switching into pdir_mcxt.
 		 */
-		RelationIncrementReferenceCount(rel);
-		pde->rel = rel;
-		pde->pd = RelationGetPartitionDesc(rel);
+		MemoryContext oldcontext = MemoryContextSwitchTo(pdir->pdir_mcxt);
+
+		pde->pd = copyPartitionDesc(RelationGetPartitionDesc(rel),
+									RelationGetPartitionKey(rel));
 		Assert(pde->pd != NULL);
+		MemoryContextSwitchTo(oldcontext);
 	}
 	return pde->pd;
 }
@@ -296,17 +326,11 @@ PartitionDirectoryLookup(PartitionDirectory pdir, Relation rel)
 * DestroyPartitionDirectory
 *		Destroy a partition directory.
 *
- * Release the reference counts we're holding.
+ * This is a no-op at present; caller is supposed to release the memory context.
 */
 void
 DestroyPartitionDirectory(PartitionDirectory pdir)
 {
-	HASH_SEQ_STATUS status;
-	PartitionDirectoryEntry *pde;
-
-	hash_seq_init(&status, pdir->pdir_hash);
-	while ((pde = hash_seq_search(&status)) != NULL)
-		RelationDecrementReferenceCount(pde->rel);
 }
 
 /*
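For clarity, the lifecycle a caller would follow under this patch
looks roughly like the following (a usage sketch, not code from the
patch; the caller supplies and owns the memory context):

    /* Sketch of caller usage; pdir_cxt's lifetime bounds all the copies. */
    MemoryContext pdir_cxt = AllocSetContextCreate(CurrentMemoryContext,
                                                   "PartitionDirectory",
                                                   ALLOCSET_DEFAULT_SIZES);
    PartitionDirectory pdir = CreatePartitionDirectory(pdir_cxt);

    PartitionDesc pd = PartitionDirectoryLookup(pdir, rel); /* private copy */

    /* ... repeated lookups of the same OID return the same copy ... */

    DestroyPartitionDirectory(pdir);    /* now a no-op */
    MemoryContextDelete(pdir_cxt);      /* frees hash table and all copies */

Since the directory holds copies rather than pins, no relation needs
to stay open or reference-counted on the directory's behalf.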
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index c8770cc..b7ddbe5 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -782,10 +782,13 @@ partition_bounds_copy(PartitionBoundInfo src,
 					  PartitionKey key)
 {
 	PartitionBoundInfo dest;
-	int			i;
+	bool		hash_part = (key->strategy == PARTITION_STRATEGY_HASH);
+	Datum	   *alldatums;
 	int			ndatums;
 	int			partnatts;
+	int			natts;
 	int			num_indexes;
+	int			i;
 
 	dest = (PartitionBoundInfo) palloc(sizeof(PartitionBoundInfoData));
 
@@ -793,65 +796,69 @@ partition_bounds_copy(PartitionBoundInfo src,
 	ndatums = dest->ndatums = src->ndatums;
 	partnatts = key->partnatts;
 
-	num_indexes = get_partition_bound_num_indexes(src);
-
 	/* List partitioned tables have only a single partition key. */
 	Assert(key->strategy != PARTITION_STRATEGY_LIST || partnatts == 1);
 
-	dest->datums = (Datum **) palloc(sizeof(Datum *) * ndatums);
-
 	if (src->kind != NULL)
 	{
+		PartitionRangeDatumKind *allkinds;
+
 		dest->kind = (PartitionRangeDatumKind **) palloc(ndatums *
 														 sizeof(PartitionRangeDatumKind *));
+		allkinds = (PartitionRangeDatumKind *) palloc(ndatums * partnatts *
+													  sizeof(PartitionRangeDatumKind));
+
 		for (i = 0; i < ndatums; i++)
 		{
-			dest->kind[i] = (PartitionRangeDatumKind *) palloc(partnatts *
-															   sizeof(PartitionRangeDatumKind));
+			dest->kind[i] = allkinds;
+			allkinds += partnatts;
 
 			memcpy(dest->kind[i], src->kind[i],
-				   sizeof(PartitionRangeDatumKind) * key->partnatts);
+				   sizeof(PartitionRangeDatumKind) * partnatts);
 		}
 	}
 	else
 		dest->kind = NULL;
 
+	dest->datums = (Datum **) palloc(sizeof(Datum *) * ndatums);
+
+	/*
+	 * For hash partitioning, datums arrays will have two elements - modulus
+	 * and remainder.
+	 */
+	natts = hash_part ? 2 : partnatts;
+
+	alldatums = (Datum *) palloc(sizeof(Datum) * natts * ndatums);
+
 	for (i = 0; i < ndatums; i++)
 	{
-		int			j;
+		dest->datums[i] = alldatums;
+		alldatums += natts;
 
 		/*
-		 * For a corresponding to hash partition, datums array will have two
-		 * elements - modulus and remainder.
+		 * For hash partitioning, we know that both datums are present and
+		 * pass-by-value (since they're int4s), so just memcpy the datum
+		 * array.  Otherwise we have to apply datumCopy.
 		 */
-		bool		hash_part = (key->strategy == PARTITION_STRATEGY_HASH);
-		int			natts = hash_part ? 2 : partnatts;
-
-		dest->datums[i] = (Datum *) palloc(sizeof(Datum) * natts);
-
-		for (j = 0; j < natts; j++)
+		if (hash_part)
 		{
-			bool		byval;
-			int			typlen;
-
-			if (hash_part)
-			{
-				typlen = sizeof(int32); /* Always int4 */
-				byval = true;	/* int4 is pass-by-value */
-			}
-			else
+			memcpy(dest->datums[i], src->datums[i], 2 * sizeof(Datum));
+		}
+		else
+		{
+			for (int j = 0; j < natts; j++)
 			{
-				byval = key->parttypbyval[j];
-				typlen = key->parttyplen[j];
+				if (dest->kind == NULL ||
+					dest->kind[i][j] == PARTITION_RANGE_DATUM_VALUE)
+					dest->datums[i][j] = datumCopy(src->datums[i][j],
												   key->parttypbyval[j],
												   key->parttyplen[j]);
 			}
-
-			if (dest->kind == NULL ||
-				dest->kind[i][j] == PARTITION_RANGE_DATUM_VALUE)
-				dest->datums[i][j] = datumCopy(src->datums[i][j],
-											   byval, typlen);
 		}
 	}
 
+	num_indexes = get_partition_bound_num_indexes(src);
+
 	dest->indexes = (int *) palloc(sizeof(int) * num_indexes);
 	memcpy(dest->indexes, src->indexes, sizeof(int) * num_indexes);
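The essential trick in this second patch, stripped of its context, is
to replace ndatums small pallocs with one block allocation that gets
carved up.  A stand-alone illustration of the pattern (hypothetical
variable names):

    /*
     * One palloc instead of ndatums small ones; each per-row pointer is
     * doled out of the single block.  Freeing the containing memory
     * context still releases everything at once, same as before.
     */
    Datum **rows = (Datum **) palloc(ndatums * sizeof(Datum *));
    Datum  *block = (Datum *) palloc(ndatums * natts * sizeof(Datum));

    for (int i = 0; i < ndatums; i++)
    {
        rows[i] = block;
        block += natts;
    }

This saves both the palloc call cycles and the per-chunk overhead that
many tiny allocations incur.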