Re: UniqueKey on Partitioned table.
On Tue, Apr 6, 2021 at 6:55 PM David Rowley wrote:
>
> On Tue, 6 Apr 2021 at 22:31, Andy Fan wrote:
> > On Wed, Mar 31, 2021 at 9:12 PM Ashutosh Bapat wrote:
> >> I think the reason we add ECs for sort expressions is to use
> >> transitive relationships. The EC may start with a single member but
> >> later in the planning that member might find partners which are all
> >> equivalent. A result ordered by one is also ordered by the other. The
> >> same logic applies to UniqueKey as well, doesn't it? If, in a result, a
> >> set of columns makes a row unique, the set of columns represented by
> >> the other EC members should be unique. Though a key will start as a
> >> singleton, it might find EC partners later and thus the unique key will
> >> transition to all the members. With that logic UniqueKey should use
> >> just ECs instead of bare expressions.
> >
> > TBH, I haven't thought about this too hard, but I think when we build the
> > UniqueKey, all the ECs have been built already. So can you think of a
> > case where we start with an EC with a single member at the beginning and
> > get more members later for UniqueKey cases?
>
> I don't really know if it matters which order things happen. We still
> end up with a single EC containing {a,b} whether we process ORDER BY b
> or WHERE a=b first.

I think it is time to talk about this again. Take the below query as an
example:

SELECT * FROM t1, t2 WHERE t1.pk = t2.pk;

When I populate_baserel_uniquekeys for t1, we already have
EC{Members={t1.pk, t2.pk}} in root->eq_classes. Then I use this EC
directly for t1's UniqueKey. The result is:

T1's UniqueKey : [ EC{Members={t1.pk, t2.pk}} ].

*Would this be OK since at the baserel level, the "t1.pk = t2.pk" is not
executed yet?*

I tried the below example to test how PathKey is maintained.

CREATE TABLE t1 (a INT, b INT);
CREATE TABLE t2 (a INT, b INT);
CREATE INDEX ON t1(b);

SELECT * FROM t1, t2 WHERE t1.b = t2.b and t1.b > 3;

Then we can get t1's path:

Index Scan on (b), where PathKey.pk_class includes 2 members {t1.b, t2.b}
even before the join.

So it looks like the answer to my question should be "yes"? Hope I have
made myself clear.

-- 
Best Regards
Andy Fan (https://www.aliyun.com/)
Re: UniqueKey on Partitioned table.
On Sat, 17 Jul 2021 at 19:32, Andy Fan wrote:
> SELECT * FROM t1, t2 WHERE t1.pk = t2.pk;
>
> Then when I populate_baserel_uniquekeys for t1, we already have
> EC{Members={t1.pk, t2.pk}} in root->eq_classes. Then I use
> this EC directly for t1's UniqueKey. The result is:
>
> T1's UniqueKey : [ EC{Members={t1.pk, t2.pk}} ].
>
> *Would this be OK since at the baserel level, the "t1.pk = t2.pk" is not
> executed yet?*
>
> I tried the below example to test how PathKey is maintained.
>
> CREATE TABLE t1 (a INT, b INT);
> CREATE TABLE t2 (a INT, b INT);
> CREATE INDEX ON t1(b);
>
> SELECT * FROM t1, t2 WHERE t1.b = t2.b and t1.b > 3;
>
> then we can get t1's path:
>
> Index Scan on (b), where PathKey.pk_class includes 2 members {t1.b, t2.b}
> even before the join.
>
> So it looks like the answer to my question should be "yes"? Hope I have
> made myself clear.

I don't see the problem. The reason PathKeys use EquivalenceClasses is
so that queries like:

SELECT * FROM tab WHERE a=b ORDER BY b;

can see that they're also ordered by a. This is useful because if there
happens to be an index on tab(a) then we can use it to provide the
required ordering for this query.

We'll want the same with UniqueKeys. The same thing there looks like:

CREATE TABLE tab (a int primary key, b int not null);

select distinct b from tab where a=b;

Since we have the EquivalenceClass with {a,b} stored in the UniqueKey,
we should be able to execute this without doing any distinct operation.

David
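[Editorial note: for readers following along, here is a rough sketch of the
kind of structure being discussed - a UniqueKey carrying EquivalenceClasses
rather than bare expressions, so that "WHERE a = b" lets either column prove
uniqueness, just as PathKeys do for ordering. The field names below are
illustrative assumptions, not the actual patch.]

/*
 * Sketch only, not the actual patch: a UniqueKey holding EquivalenceClass
 * pointers instead of bare expressions.  With the EC {a, b} stored here,
 * "select distinct b from tab where a=b" can be proven unique via the
 * primary key on a, even though the key was built from column a.
 */
typedef struct UniqueKey
{
	NodeTag		type;
	List	   *eclasses;		/* EquivalenceClasses whose members, taken
								 * together, are unique in the relation */
} UniqueKey;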
Re: [PATCH] Use optimized single-datum tuplesort in ExecSort
On Sat, 17 Jul 2021 at 01:14, James Coleman wrote:
> The only remaining question I have is whether or not costing needs to
> change, given the very significant speedup for datum sort.

I'm looking at cost_tuplesort and the only thing that I think might make
sense would be to adjust how the input_bytes value is calculated. For now,
that's done with the following function that's used in quite a number of
places.

static double
relation_byte_size(double tuples, int width)
{
    return tuples * (MAXALIGN(width) + MAXALIGN(SizeofHeapTupleHeader));
}

It seems, at least in the case of Sort, that using SizeofHeapTupleHeader
is just always wrong, as it should be SizeofMinimalTupleHeader. I know
that's also the case for Memoize. I've not checked the other locations.

The only thing I can really see that we might do would be to not add the
MAXALIGN(SizeofHeapTupleHeader) when there's just a single column. We'd
need to pass down the number of attributes from create_sort_path() so we'd
know when and when not to add that. I'm not saying that we should do this.
I'm just saying that I don't really see what else we might do.

I can imagine another patch might just want to do a complete overhaul of
all locations that use relation_byte_size(). There are various things that
function just does not account for, e.g. the fact that we allocate chunks
in powers of 2 and that there's a chunk header added on. Of course,
"width" is just an estimate, so maybe trying to calculate something too
precisely wouldn't be too wise. However, there's a bit of a
chicken-and-egg problem there, as there'd be little incentive to improve
"width" unless we started making more accurate use of the value.

Anyway, none of the above takes into account that the Datum sort is just a
little faster. The only thing that exists in the existing cost model that
we could use to adjust the cost of an in-memory sort is the
comparison_cost. The problem there is that the comparison is exactly the
same in both Datum and Tuple sorts. The only thing that really changes
between Datum and Tuple sort is the fact that we don't make a MinimalTuple
when doing a Datum sort. The cost model, unfortunately, does not account
for that.

That kinda makes me think that we should do nothing, as if we start to
account for making MinimalTuples then we'll just penalise Tuple sorts and
that might cause someone to be upset.

David
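[Editorial note: to make the idea above concrete, here is a rough sketch of
how the input-size estimate could distinguish the single-column datum case,
assuming the number of attributes were passed down from create_sort_path().
Neither this helper nor that parameter exists in the planner today.]

/*
 * Hypothetical helper, for illustration only.  A single-attribute sort can
 * use the datum path and builds no MinimalTuple, so charge no per-tuple
 * header; otherwise charge the MinimalTuple header rather than the
 * HeapTuple header that relation_byte_size() currently assumes.
 */
static double
sort_input_byte_size(double tuples, int width, int nattrs)
{
    if (nattrs == 1)
        return tuples * MAXALIGN(width);

    return tuples * (MAXALIGN(width) + MAXALIGN(SizeofMinimalTupleHeader));
}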
Re: pg_dump new feature: exporting functions only. Bad or good idea ?
> On Sat, Jul 10, 2021 at 5:06 AM Tomas Vondra <tomas.von...@enterprisedb.com> wrote:
> On 7/10/21 1:43 AM, Tom Lane wrote:
>> Tomas Vondra writes:
>>> The main question I have is whether this should include procedures.
>>
>> I feel a bit uncomfortable about sticking this sort of limited-purpose
>> selectivity mechanism into pg_dump. I'd rather see a general filter
>> method that can select object(s) of any type. Pavel was doing some
>> work towards that awhile ago, though I think he got frustrated about
>> the lack of consensus on details. Which is a problem, but I don't
>> think the solution is to accrue more and more independently-designed-
>> and-implemented features that each solve some subset of the big problem.
>
> I'm not against introducing such a general filter mechanism, but why
> should it block this patch? I'd understand it if the patch was adding a lot
> of code, but that's not the case - it's tiny. And we already have
> multiple filter options (to pick tables, schemas, extensions, ...).
> And if there's no consensus on the details of Pavel's patch after multiple
> commitfests, how likely is it that it'll start moving forward?

It seems to me that pg_dump already has plenty of limited-purpose options
for selectivity, so adding this patch seems to fit in with the norm. I'm in
favor of this patch because it works in the same way the community is
already used to and meets the need.

The other patch referenced upstream appears to be intended to solve a
specific problem encountered with very long commands. It is also
introducing a new way of working with pg_dump via a config file, and
honestly I've never wanted that type of feature. Not saying that wouldn't
be useful, but that has not been a pain point for me and seems overly
complex. For the use cases I imagine this patch will help with, being
required to go through a config file seems excessive vs pg_dump
--functions-only.

> On Fri, Jul 9, 2021 at 4:43 PM Tomas Vondra wrote:
>
> The main question I have is whether this should include procedures. I'd
> probably argue procedures should be considered different from functions
> (i.e. requiring a separate --procedures-only option), because it pretty
> much is meant to be a separate object type. We don't allow calling DROP
> FUNCTION on a procedure, etc. It'd be silly to introduce an unnecessary
> ambiguity in pg_dump and have to deal with it sometime later.

+1

*Ryan Lambert*
RustProof Labs
Re: postgres_fdw - make cached connection functions tests meaningful
On Thu, Jul 15, 2021 at 3:48 AM Tom Lane wrote:
>
> Bharath Rupireddy writes:
> > While working on [1], I got to know that there is a new GUC
> > debug_invalidate_system_caches_always that has been introduced in v14.
> > It can be used to switch off cache invalidation in
> > CLOBBER_CACHE_ALWAYS builds, which makes cache-sensitive tests stable.
> > Using this GUC, it is quite possible to make cached connection
> > management function tests more meaningful by returning original
> > values (true/false, all the output columns) instead of SELECT 1.
>
> Note that this needs an update in the wake of d68a00391.
>
> More generally, though, I am not sure that I believe the premise of
> this patch. AFAICS it's assuming that forcing debug_discard_caches
> off guarantees zero cache flushes, which it does not. (If it could,
> we wouldn't need the whole thing; the point of that variable is to
> deterministically force flushes which would otherwise be
> nondeterministic, not nonexistent.)

Can the setting debug_discard_caches = 0 still result in extra
flushes/discards (beyond the regular cache flushes/discards that happen
because of ALTERs or changes in the cached elements)? My understanding was
that debug_discard_caches = 0 disables all the extra flushes in
clobber-cache builds. If my understanding wasn't right, isn't it good to
mention it somewhere in the documentation or in the source code?

> Even in a contrib test that
> seemingly has nothing else running, background activity such as
> autovacuum could result in surprises. So I fear that what you have
> got here is a patch that will work 99% of the time; which is not
> good enough for the buildfarm.

If the setting debug_discard_caches = 0 can still allow at least a few
extra cache flushes, I don't mind withdrawing this patch.

Regards,
Bharath Rupireddy.
Re: What are exactly bootstrap processes, auxiliary processes, standalone backends, normal backends(user sessions)?
On 2021-Jul-17, Bharath Rupireddy wrote:

> On Thu, Jul 15, 2021 at 8:17 PM Justin Pryzby wrote:
> >
> > On Fri, Jul 09, 2021 at 09:24:19PM +0530, Bharath Rupireddy wrote:
> > > I've always had a hard time distinguishing various types of
> > > processes/terms used in postgres. I look at the source code every time
> > > to understand them, yet I don't feel satisfied with my understanding.
> > > I request any hacker (having a better idea than me) to help me with
> > > what each different process does and how they are different from each
> > > other? Of course, I'm clear with normal backends (user sessions), bg
> > > workers, but the others need a bit more understanding.
> >
> > It sounds like something that should be in the glossary, which currently
> > refers to but doesn't define "auxiliary processes".
>
> Thanks. I strongly feel that it should be documented somewhere. I will
> be happy if someone with a clear idea about these various processes
> does it.

Auxiliary process

    Process of an instance in charge of some specific, hardcoded background
    task. Examples are the startup process, the WAL receiver (but not the
    WAL senders), the WAL writer, the archiver, the background writer, the
    checkpointer, the statistics collector, and the logger.

We should probably include individual glossary entries for those that
don't already have one. Maybe revise the entries for ones that do so that
they start with "An auxiliary process that ..."

I just realized that the autovac launcher is not nominally an auxiliary
process (per SubPostmasterMain), which is odd since notionally it clearly
is. Maybe we should include it in the list anyway, with something like
"and the autovacuum launcher (but not the autovacuum workers)".

-- 
Álvaro Herrera       39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"I am amazed at [the pgsql-sql] mailing list for the wonderful support, and
lack of hesitasion in answering a lost soul's question, I just wished the
rest of the mailing list could be like this."                      (Fotis)
(http://archives.postgresql.org/pgsql-sql/2006-06/msg00265.php)
Re: What are exactly bootstrap processes, auxiliary processes, standalone backends, normal backends(user sessions)?
On Sat, Jul 17, 2021 at 10:45:52AM -0400, Alvaro Herrera wrote:
> On 2021-Jul-17, Bharath Rupireddy wrote:
>
> > On Thu, Jul 15, 2021 at 8:17 PM Justin Pryzby wrote:
> > >
> > > On Fri, Jul 09, 2021 at 09:24:19PM +0530, Bharath Rupireddy wrote:
> > > > I've always had a hard time distinguishing various types of
> > > > processes/terms used in postgres. I look at the source code every time
> > > > to understand them, yet I don't feel satisfied with my understanding.
> > > > I request any hacker (having a better idea than me) to help me with
> > > > what each different process does and how they are different from each
> > > > other? Of course, I'm clear with normal backends (user sessions), bg
> > > > workers, but the others need a bit more understanding.
> > >
> > > It sounds like something that should be in the glossary, which currently
> > > refers to but doesn't define "auxiliary processes".
> >
> > Thanks. I strongly feel that it should be documented somewhere. I will
> > be happy if someone with a clear idea about these various processes
> > does it.
>
> Auxiliary process
>
>     Process of an instance in

I think "of an instance" is a distraction here, even if technically
accurate.

> charge of some specific, hardcoded background task. Examples are

And I think "hardcoded" doesn't mean anything beyond what "specific" means.
Maybe you'd say: .. process which handles a specific, central task for the
cluster instance.

Thanks,
-- 
Justin
Re: What are exactly bootstrap processes, auxiliary processes, standalone backends, normal backends(user sessions)?
Justin Pryzby writes: > On Sat, Jul 17, 2021 at 10:45:52AM -0400, Alvaro Herrera wrote: >> Process of an instance in >> charge of some specific, hardcoded background task. Examples are > And I think "hardcoded" doesn't mean anything beyond what "specific" means. > Maybe you'd say: .. process which handles a specific, central task for the > cluster instance. Meh. "Specific" and "background" both seem to be useful terms here. I do not think "central" is a useful adjective. regards, tom lane
Re: postgres_fdw - make cached connection functions tests meaningful
Bharath Rupireddy writes:
> On Thu, Jul 15, 2021 at 3:48 AM Tom Lane wrote:
>> More generally, though, I am not sure that I believe the premise of
>> this patch. AFAICS it's assuming that forcing debug_discard_caches
>> off guarantees zero cache flushes, which it does not.

> Can the setting debug_discard_caches = 0 still make extra
> flushes/discards (not the regular cache flushes/discards that happen
> because of alters or changes in the cached elements)? My understanding
> was that debug_discard_caches = 0 disables all the extra flushes with
> clobber cache builds. If my understanding wasn't right, isn't it good
> to mention it somewhere in the documentation or in the source code?

The reason for this mechanism is that cache flushes can be triggered at
any time by sinval messages from other processes (e.g., background
autovacuums). Setting debug_discard_caches allows us to exercise that
possibility exhaustively and make sure that the code is proof against
cache entries disappearing unexpectedly. Not setting debug_discard_caches
doesn't mean that that can't happen, only that you can't predict when it
will happen.

			regards, tom lane
Re: A micro-optimisation for ProcSendSignal()
Hi Thomas, You might have missed a spot to initialize SERIALIZABLE_XACT->pgprocno in InitPredicateLocks(), so: + PredXact->OldCommittedSxact->pgprocno = INVALID_PGPROCNO; Slightly tangential: we should add a comment to PGPROC.pgprocno, for more immediate understandability: + int pgprocno; /* index of this PGPROC in ProcGlobal->allProcs */ Also, why don't we take the opportunity to get rid of SERIALIZABLEXACT->pid? We took a stab. Attached is v2 of your patch with these changes. Regards, Ashwin and Deep From 08d5ab9f498140e1c3a8f6f00d1acd25b3fb8ba0 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Thu, 11 Mar 2021 23:09:11 +1300 Subject: [PATCH v2 1/1] Optimize ProcSendSignal(). Instead of referring to target processes by pid, use pgprocno so that we don't have to scan the ProcArray and keep track of the startup process. Discussion: https://postgr.es/m/CA%2BhUKGLYRyDaneEwz5Uya_OgFLMx5BgJfkQSD%3Dq9HmwsfRRb-w%40mail.gmail.com --- src/backend/access/transam/xlog.c | 1 - src/backend/storage/buffer/buf_init.c | 3 +- src/backend/storage/buffer/bufmgr.c | 10 ++--- src/backend/storage/lmgr/predicate.c | 23 ++- src/backend/storage/lmgr/proc.c | 49 ++- src/backend/utils/adt/lockfuncs.c | 8 +++- src/include/storage/buf_internals.h | 2 +- src/include/storage/predicate_internals.h | 2 +- src/include/storage/proc.h| 8 +--- 9 files changed, 34 insertions(+), 72 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index c7c928f50bd..2bbeaaf1b90 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -7309,7 +7309,6 @@ StartupXLOG(void) */ if (ArchiveRecoveryRequested && IsUnderPostmaster) { - PublishStartupProcessInformation(); EnableSyncRequestForwarding(); SendPostmasterSignal(PMSIGNAL_RECOVERY_STARTED); bgwriterLaunched = true; diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c index a299be10430..b9a83c0e9b9 100644 --- a/src/backend/storage/buffer/buf_init.c +++ b/src/backend/storage/buffer/buf_init.c @@ -16,6 +16,7 @@ #include "storage/buf_internals.h" #include "storage/bufmgr.h" +#include "storage/proc.h" BufferDescPadded *BufferDescriptors; char *BufferBlocks; @@ -118,7 +119,7 @@ InitBufferPool(void) CLEAR_BUFFERTAG(buf->tag); pg_atomic_init_u32(&buf->state, 0); - buf->wait_backend_pid = 0; + buf->wait_backend_pgprocno = INVALID_PGPROCNO; buf->buf_id = i; diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 86ef607ff38..26122418faf 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -1890,11 +1890,11 @@ UnpinBuffer(BufferDesc *buf, bool fixOwner) BUF_STATE_GET_REFCOUNT(buf_state) == 1) { /* we just released the last pin other than the waiter's */ -int wait_backend_pid = buf->wait_backend_pid; +int wait_backend_pgprocno = buf->wait_backend_pgprocno; buf_state &= ~BM_PIN_COUNT_WAITER; UnlockBufHdr(buf, buf_state); -ProcSendSignal(wait_backend_pid); +ProcSendSignal(wait_backend_pgprocno); } else UnlockBufHdr(buf, buf_state); @@ -3995,7 +3995,7 @@ UnlockBuffers(void) * got a cancel/die interrupt before getting the signal. 
*/ if ((buf_state & BM_PIN_COUNT_WAITER) != 0 && - buf->wait_backend_pid == MyProcPid) + buf->wait_backend_pgprocno == MyProc->pgprocno) buf_state &= ~BM_PIN_COUNT_WAITER; UnlockBufHdr(buf, buf_state); @@ -4131,7 +4131,7 @@ LockBufferForCleanup(Buffer buffer) LockBuffer(buffer, BUFFER_LOCK_UNLOCK); elog(ERROR, "multiple backends attempting to wait for pincount 1"); } - bufHdr->wait_backend_pid = MyProcPid; + bufHdr->wait_backend_pgprocno = MyProc->pgprocno; PinCountWaitBuf = bufHdr; buf_state |= BM_PIN_COUNT_WAITER; UnlockBufHdr(bufHdr, buf_state); @@ -4202,7 +4202,7 @@ LockBufferForCleanup(Buffer buffer) */ buf_state = LockBufHdr(bufHdr); if ((buf_state & BM_PIN_COUNT_WAITER) != 0 && - bufHdr->wait_backend_pid == MyProcPid) + bufHdr->wait_backend_pgprocno == MyProc->pgprocno) buf_state &= ~BM_PIN_COUNT_WAITER; UnlockBufHdr(bufHdr, buf_state); diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c index d493aeef0fc..d76632bb56a 100644 --- a/src/backend/storage/lmgr/predicate.c +++ b/src/backend/storage/lmgr/predicate.c @@ -1276,7 +1276,7 @@ InitPredicateLocks(void) PredXact->OldCommittedSxact->finishedBefore = InvalidTransactionId; PredXact->OldCommittedSxact->xmin = InvalidTransactionId; PredXact->OldCommittedSxact->flags = SXACT_FLAG_COMMITTED; - PredXact->OldCommittedSxact->pid = 0; + PredXact->OldCommittedSxact->pgprocno = INVALID_PGPROCNO; } /* This never changes, so let's keep a local copy. */ OldCommittedSxact = PredXact->OldCommittedSxact; @@ -1635,8 +16
Re: UniqueKey on Partitioned table.
On Sat, Jul 17, 2021 at 3:45 PM David Rowley wrote:
>
> On Sat, 17 Jul 2021 at 19:32, Andy Fan wrote:
> > SELECT * FROM t1, t2 WHERE t1.pk = t2.pk;
> >
> > Then when I populate_baserel_uniquekeys for t1, we already have
> > EC{Members={t1.pk, t2.pk}} in root->eq_classes. Then I use
> > this EC directly for t1's UniqueKey. The result is:
> >
> > T1's UniqueKey : [ EC{Members={t1.pk, t2.pk}} ].
> >
> > *Would this be OK since at the baserel level, the "t1.pk = t2.pk" is not
> > executed yet?*
> >
> > I tried the below example to test how PathKey is maintained.
> > CREATE TABLE t1 (a INT, b INT);
> > CREATE TABLE t2 (a INT, b INT);
> > CREATE INDEX ON t1(b);
> >
> > SELECT * FROM t1, t2 WHERE t1.b = t2.b and t1.b > 3;
> >
> > then we can get t1's path:
> >
> > Index Scan on (b), where PathKey.pk_class includes 2 members {t1.b, t2.b}
> > even before the join.
> >
> > So it looks like the answer to my question should be "yes"? Hope I have
> > made myself clear.
>
> I don't see the problem.

Thanks for double-checking; that removes a big blocker for my development.
I'll submit a new patch very soon.

> The reason PathKeys use EquivalenceClasses is
> so that queries like: SELECT * FROM tab WHERE a=b ORDER BY b; can see
> that they're also ordered by a. This is useful because if there
> happens to be an index on tab(a) then we can use it to provide the
> required ordering for this query.
>
> We'll want the same with UniqueKeys. The same thing there looks like:
>
> CREATE TABLE tab (a int primary key, b int not null);
>
> select distinct b from tab where a=b;
>
> Since we have the EquivalenceClass with {a,b} stored in the UniqueKey,
> then we should be able to execute this without doing any distinct
> operation.
>
> David

-- 
Best Regards
Andy Fan (https://www.aliyun.com/)
slab allocator performance issues
Hi,

I just tried to use the slab allocator for a case where aset.c was bloating
memory usage substantially. First: It worked wonders for memory usage,
nearly eliminating overhead. But it turned out to cause a *substantial*
slowdown. With aset the allocator is barely in the profile. With slab the
profile is dominated by allocator performance.

slab:

NOTICE:  00000: 100000000 ordered insertions in 5.216287 seconds, 19170724/sec
LOCATION:  bfm_test_insert_bulk, radix.c:2880

  Overhead  Command   Shared Object      Symbol
+   28.27%  postgres  postgres           [.] SlabAlloc
+    9.64%  postgres  bdbench.so         [.] bfm_delete
+    9.03%  postgres  bdbench.so         [.] bfm_set
+    8.39%  postgres  bdbench.so         [.] bfm_lookup
+    8.36%  postgres  bdbench.so         [.] bfm_set_leaf.constprop.0
+    8.16%  postgres  libc-2.31.so       [.] __memmove_avx_unaligned_erms
+    6.88%  postgres  bdbench.so         [.] bfm_delete_leaf
+    3.24%  postgres  libc-2.31.so       [.] _int_malloc
+    2.58%  postgres  bdbench.so         [.] bfm_tests
+    2.33%  postgres  postgres           [.] SlabFree
+    1.29%  postgres  libc-2.31.so       [.] _int_free
+    1.09%  postgres  libc-2.31.so       [.] unlink_chunk.constprop.0

aset:

NOTICE:  00000: 100000000 ordered insertions in 2.082602 seconds, 48016848/sec
LOCATION:  bfm_test_insert_bulk, radix.c:2880

+   16.43%  postgres  bdbench.so         [.] bfm_lookup
+   15.38%  postgres  bdbench.so         [.] bfm_delete
+   12.82%  postgres  libc-2.31.so       [.] __memmove_avx_unaligned_erms
+   12.65%  postgres  bdbench.so         [.] bfm_set
+   12.15%  postgres  bdbench.so         [.] bfm_set_leaf.constprop.0
+   10.57%  postgres  bdbench.so         [.] bfm_delete_leaf
+    4.05%  postgres  bdbench.so         [.] bfm_tests
+    2.93%  postgres  [kernel.vmlinux]   [k] clear_page_erms
+    1.59%  postgres  postgres           [.] AllocSetAlloc
+    1.15%  postgres  bdbench.so         [.] memmove@plt
+    1.06%  postgres  bdbench.so         [.] bfm_grow_leaf_16

OS:

NOTICE:  00000: 100000000 ordered insertions in 2.089790 seconds, 47851690/sec
LOCATION:  bfm_test_insert_bulk, radix.c:2880

That is somewhat surprising - part of the promise of a slab allocator is
that it's fast...

This is caused by multiple issues, I think. Some of which seem fairly easy
to fix.

1) If allocations are short-lived, slab.c can end up constantly
freeing/initializing blocks. That requires fairly expensively iterating
over all potential chunks in the block and initializing them, just to then
free that memory again after a small number of allocations. The extreme
case of this is when there are phases of alloc/free of a single allocation.

I "fixed" this by adding a few && slab->nblocks > 1 in SlabFree() and the
problem vastly reduced. Instead of a 0.4x slowdown it's 0.88x. Of course
that only works if the problem is with the only block, so it's not a great
approach. Perhaps just keeping the last allocated block around would work?

2) SlabChunkIndex() in SlabFree() is slow. It requires a 64bit division,
taking up ~50% of the cycles in SlabFree(). A 64bit div, according to [1],
has a latency of 35-88 cycles on skylake-x (and a reverse throughput of
21-83, i.e. no parallelism). While it's getting a bit faster on icelake /
zen 3, it's still slow enough there to be very worrisome.

I don't see a way to get around the division while keeping the freelist
structure as is. But:

ISTM that we only need the index because of the free-chunk list, right? Why
don't we make the chunk list use actual pointers? Is the concern that that'd
increase the minimum allocation size? If so, I see two ways around that:
First, we could make the index just the offset from the start of the block,
which is much cheaper to calculate. Second, we could store the next pointer
in SlabChunk->slab/block instead (making it a union) - while on the
freelist we don't need to dereference those, right?

I suspect both would also make the block initialization a bit cheaper.

That should also accelerate SlabBlockGetChunk(), which currently shows up
as an imul, which isn't exactly fast either (and uses a lot of execution
ports).

3) Every free/alloc needing to unlink from slab->freelist[i] and then
relink shows up prominently in profiles. That's ~tripling the number of
cachelines touched in the happy path, with unpredictable accesses to boot.

Perhaps we could reduce the precision of slab->freelist indexing to
amortize that cost? I.e. not have one slab->freelist entry for each nfree,
but instead have an upper limit on the number of freelists?

4) Less of a performance issue, and more of a usability one: The constant
block size strikes me as problematic. Most users of an allocator sometimes
deal with a small amount of data, and sometimes with a large amount.

Greetings,

Andres Freund

[1] https://www.agner.org/optimize/instruction_tables.pdf
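[Editorial note: as an illustration of the "store a next pointer instead of
an index" idea in point 2, the chunk header could look roughly like this.
The union member and the per-block firstfree field are assumptions for the
sketch, not current slab.c code.]

/*
 * Sketch only.  While a chunk is allocated its header points back to the
 * owning block; while it sits on the block's freelist the same word holds
 * the next free chunk, so freeing needs no division to compute an index.
 */
typedef struct SlabChunk
{
	union
	{
		SlabBlock  *block;			/* valid while the chunk is allocated */
		struct SlabChunk *nextfree;	/* valid while on the block's freelist */
	}			hdr;
	SlabContext *slab;				/* owning context */
	/* payload follows */
} SlabChunk;

static inline void
SlabPushFreeChunk(SlabBlock *block, SlabChunk *chunk)
{
	chunk->hdr.nextfree = block->firstfree;	/* 'firstfree' is assumed here */
	block->firstfree = chunk;
	block->nfree++;
}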
Re: slab allocator performance issues
Hi,

On 2021-07-17 12:43:33 -0700, Andres Freund wrote:
> 2) SlabChunkIndex() in SlabFree() is slow. It requires a 64bit division,
> taking up ~50% of the cycles in SlabFree(). A 64bit div, according to [1],
> has a latency of 35-88 cycles on skylake-x (and a reverse throughput of
> 21-83, i.e. no parallelism). While it's getting a bit faster on icelake /
> zen 3, it's still slow enough there to be very worrisome.
>
> I don't see a way to get around the division while keeping the freelist
> structure as is. But:
>
> ISTM that we only need the index because of the free-chunk list, right? Why
> don't we make the chunk list use actual pointers? Is the concern that that'd
> increase the minimum allocation size? If so, I see two ways around that:
> First, we could make the index just the offset from the start of the block,
> which is much cheaper to calculate. Second, we could store the next pointer
> in SlabChunk->slab/block instead (making it a union) - while on the
> freelist we don't need to dereference those, right?
>
> I suspect both would also make the block initialization a bit cheaper.
>
> That should also accelerate SlabBlockGetChunk(), which currently shows up
> as an imul, which isn't exactly fast either (and uses a lot of execution
> ports).

Oh - I just saw that effectively the allocation size is already at least
the size of a uintptr_t. I had only seen

    /* Make sure the linked list node fits inside a freed chunk */
    if (chunkSize < sizeof(int))
        chunkSize = sizeof(int);

but it's followed by

    /* chunk, including SLAB header (both addresses nicely aligned) */
    fullChunkSize = sizeof(SlabChunk) + MAXALIGN(chunkSize);

which means we are reserving enough space for a pointer on just about any
platform already? Seems we can just make that official and reserve space
for a pointer as part of the chunk size rounding, instead of in
fullChunkSize?

Greetings,

Andres Freund
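[Editorial note: in code, making that official might look roughly like this.
This is a sketch of the idea only, not an actual patch.]

    /* Sketch: ensure a freed chunk's payload can hold the next-free
     * pointer, instead of only an int index. */
    if (chunkSize < sizeof(SlabChunk *))
        chunkSize = sizeof(SlabChunk *);

    /* chunk, including SLAB header (both addresses nicely aligned) */
    fullChunkSize = sizeof(SlabChunk) + MAXALIGN(chunkSize);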
corruption of WAL page header is never reported
Hello,

I found that any corruption of a WAL page header found during recovery is
never reported in log messages. If a WAL page header is broken, it is
detected in XLogReaderValidatePageHeader called from XLogPageRead, but the
error messages are always reset and never reported.

    if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
    {
        /* reset any error XLogReaderValidatePageHeader() might have set */
        xlogreader->errormsg_buf[0] = '\0';
        goto next_record_is_invalid;
    }

Since commit 06687198018, we call XLogReaderValidatePageHeader here so that
we can check a page header and retry immediately if it's invalid, but the
error message is reset immediately and not reported. I guess the reason the
error message is reset is that we might get the right WAL after some
retries. However, I think it is better to report the error for each check
in order to let users know the actual issues found in the WAL.

I attached a patch to fix it in this way. Or, if we wouldn't like to report
an error for each check, and what we want to check here is really just
about old recycled WAL rather than header corruption itself, I wonder
whether we could check just xlp_pageaddr instead of calling
XLogReaderValidatePageHeader.

Regards,
Yugo Nagata

-- 
Yugo NAGATA

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index edb15fe58d..4fdb23f061 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -12297,8 +12297,13 @@ retry:
 	 */
 	if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
 	{
-		/* reset any error XLogReaderValidatePageHeader() might have set */
-		xlogreader->errormsg_buf[0] = '\0';
+		if (StandbyMode && xlogreader->errormsg_buf[0] != '\0')
+		{
+			ereport(emode_for_corrupt_record(emode, EndRecPtr),
+					(errmsg_internal("%s", xlogreader->errormsg_buf) /* already translated */ ));
+			/* reset any error XLogReaderValidatePageHeader() might have set */
+			xlogreader->errormsg_buf[0] = '\0';
+		}
 		goto next_record_is_invalid;
 	}
Re: slab allocator performance issues
Hi, On 7/17/21 9:43 PM, Andres Freund wrote: Hi, I just tried to use the slab allocator for a case where aset.c was bloating memory usage substantially. First: It worked wonders for memory usage, nearly eliminating overhead. But it turned out to cause a *substantial* slowdown. With aset the allocator is barely in the profile. With slab the profile is dominated by allocator performance. slab: NOTICE: 0: 1 ordered insertions in 5.216287 seconds, 19170724/sec LOCATION: bfm_test_insert_bulk, radix.c:2880 Overhead Command Shared Object Symbol + 28.27% postgres postgres [.] SlabAlloc +9.64% postgres bdbench.so[.] bfm_delete +9.03% postgres bdbench.so[.] bfm_set +8.39% postgres bdbench.so[.] bfm_lookup +8.36% postgres bdbench.so[.] bfm_set_leaf.constprop.0 +8.16% postgres libc-2.31.so [.] __memmove_avx_unaligned_erms +6.88% postgres bdbench.so[.] bfm_delete_leaf +3.24% postgres libc-2.31.so [.] _int_malloc +2.58% postgres bdbench.so[.] bfm_tests +2.33% postgres postgres [.] SlabFree +1.29% postgres libc-2.31.so [.] _int_free +1.09% postgres libc-2.31.so [.] unlink_chunk.constprop.0 aset: NOTICE: 0: 1 ordered insertions in 2.082602 seconds, 48016848/sec LOCATION: bfm_test_insert_bulk, radix.c:2880 + 16.43% postgres bdbench.so[.] bfm_lookup + 15.38% postgres bdbench.so[.] bfm_delete + 12.82% postgres libc-2.31.so [.] __memmove_avx_unaligned_erms + 12.65% postgres bdbench.so[.] bfm_set + 12.15% postgres bdbench.so[.] bfm_set_leaf.constprop.0 + 10.57% postgres bdbench.so[.] bfm_delete_leaf +4.05% postgres bdbench.so[.] bfm_tests +2.93% postgres [kernel.vmlinux] [k] clear_page_erms +1.59% postgres postgres [.] AllocSetAlloc +1.15% postgres bdbench.so[.] memmove@plt +1.06% postgres bdbench.so[.] bfm_grow_leaf_16 OS: NOTICE: 0: 1 ordered insertions in 2.089790 seconds, 47851690/sec LOCATION: bfm_test_insert_bulk, radix.c:2880 That is somewhat surprising - part of the promise of a slab allocator is that it's fast... This is caused by multiple issues, I think. Some of which seems fairly easy to fix. 1) If allocations are short-lived slab.c, can end up constantly freeing/initializing blocks. Which requires fairly expensively iterating over all potential chunks in the block and initializing it. Just to then free that memory again after a small number of allocations. The extreme case of this is when there are phases of alloc/free of a single allocation. I "fixed" this by adding a few && slab->nblocks > 1 in SlabFree() and the problem vastly reduced. Instead of a 0.4x slowdown it's 0.88x. Of course that only works if the problem is with the only, so it's not a great approach. Perhaps just keeping the last allocated block around would work? +1 I think it makes perfect sense to not free the blocks immediately, and keep one (or a small number) as a cache. I'm not sure why we decided not to have a "keeper" block, but I suspect memory consumption was my main concern at that point. But I never expected the cost to be this high. 2) SlabChunkIndex() in SlabFree() is slow. It requires a 64bit division, taking up ~50% of the cycles in SlabFree(). A 64bit div, according to [1] , has a latency of 35-88 cycles on skylake-x (and a reverse throughput of 21-83, i.e. no parallelism). While it's getting a bit faster on icelake / zen 3, it's still slow enough there to be very worrisome. I don't see a way to get around the division while keeping the freelist structure as is. But: ISTM that we only need the index because of the free-chunk list, right? Why don't we make the chunk list use actual pointers? 
Is it concern that that'd increase the minimum allocation size? If so, I see two ways around that: First, we could make the index just the offset from the start of the block, that's much cheaper to calculate. Second, we could store the next pointer in SlabChunk->slab/block instead (making it a union) - while on the freelist we don't need to dereference those, right? I suspect both would also make the block initialization a bit cheaper. That should also accelerate SlabBlockGetChunk(), which currently shows up as an imul, which isn't exactly fast either (and uses a lot of execution ports). Hmm, I think you're right we could simply use the pointers, but I have not tried that. 3) Every free/alloc needing to unlink from slab->freelist[i] and then relink shows up prominently in profiles. That's ~tripling the number of cachelines touched in the happy path, with unpredictable accesses to boot. Perhaps we could reduce the precision of slab->freelist indexing to amortize that cost? I.e. not have one slab->freelist entry for each nfree, but instead have an upper limit on the number
Re: slab allocator performance issues
Hi, On 2021-07-17 22:35:07 +0200, Tomas Vondra wrote: > On 7/17/21 9:43 PM, Andres Freund wrote: > > 1) If allocations are short-lived slab.c, can end up constantly > > freeing/initializing blocks. Which requires fairly expensively iterating > > over > > all potential chunks in the block and initializing it. Just to then free > > that > > memory again after a small number of allocations. The extreme case of this > > is > > when there are phases of alloc/free of a single allocation. > > > > I "fixed" this by adding a few && slab->nblocks > 1 in SlabFree() and the > > problem vastly reduced. Instead of a 0.4x slowdown it's 0.88x. Of course > > that > > only works if the problem is with the only, so it's not a great > > approach. Perhaps just keeping the last allocated block around would work? > > > > +1 > > I think it makes perfect sense to not free the blocks immediately, and keep > one (or a small number) as a cache. I'm not sure why we decided not to have > a "keeper" block, but I suspect memory consumption was my main concern at > that point. But I never expected the cost to be this high. I think one free block might be too low in some cases. It's pretty common to have workloads where the number of allocations is "bursty", and it's imo one case where one might justifiably want to use a slab allocator... Perhaps a portion of a high watermark? Or a portion of the in use blocks? Hm. I wonder if we should just not populate the freelist eagerly, to drive down the initialization cost. I.e. have a separate allocation path for chunks that have never been allocated, by having a SlabBlock->free_offset or such. Sure, it adds a branch to the allocation happy path, but it also makes the first allocation for a chunk cheaper, because there's no need to get the next element from the freelist (adding a likely cache miss). And it should make the allocation of a new block faster by a lot. > > 2) SlabChunkIndex() in SlabFree() is slow. It requires a 64bit division, > > taking > > up ~50% of the cycles in SlabFree(). A 64bit div, according to [1] , has a > > latency of 35-88 cycles on skylake-x (and a reverse throughput of 21-83, > > i.e. no parallelism). While it's getting a bit faster on icelake / zen 3, > > it's > > still slow enough there to be very worrisome. > > > > I don't see a way to get around the division while keeping the freelist > > structure as is. But: > > > > ISTM that we only need the index because of the free-chunk list, right? Why > > don't we make the chunk list use actual pointers? Is it concern that that'd > > increase the minimum allocation size? If so, I see two ways around that: > > First, we could make the index just the offset from the start of the block, > > that's much cheaper to calculate. Second, we could store the next pointer in > > SlabChunk->slab/block instead (making it a union) - while on the freelist we > > don't need to dereference those, right? > > > > I suspect both would also make the block initialization a bit cheaper. > > > > That should also accelerate SlabBlockGetChunk(), which currently shows up as > > an imul, which isn't exactly fast either (and uses a lot of execution > > ports). > > > > Hmm, I think you're right we could simply use the pointers, but I have not > tried that. I quickly tried that, and it does seem to improve matters considerably. The block initialization still shows up as expensive, but not as bad. The div and imul are gone (exept in an assertion build right now). The list manipulation still is visible. 
> > 3) Every free/alloc needing to unlink from slab->freelist[i] and then relink > > shows up prominently in profiles. That's ~tripling the number of cachelines > > touched in the happy path, with unpredictable accesses to boot. > > > > Perhaps we could reduce the precision of slab->freelist indexing to amortize > > that cost? I.e. not have one slab->freelist entry for each nfree, but > > instead > > have an upper limit on the number of freelists? > > > > Yeah. The purpose of organizing the freelists like this is to prioritize the > "more full" blocks" when allocating new chunks, in the hope that the "less > full" blocks will end up empty and freed faster. > > But this is naturally imprecise, and strongly depends on the workload, of > course, and I bet for most cases a less precise approach would work just as > fine. > > I'm not sure how exactly would the upper limit you propose work, but perhaps > we could group the blocks for nfree ranges, say [0-15], [16-31] and so on. > So after the alloc/free we'd calculate the new freelist index as (nfree/16) > and only moved the block if the index changed. This would reduce the > overhead to 1/16 and probably even more in practice. > Of course, we could also say we have e.g. 8 freelists and work the ranges > backwards from that, I guess that's what you propose. Yea, I was thinking something along those lines. As you say, ow about there's always at most 8 freelists or such. During init
Re: corruption of WAL page header is never reported
On Sat, Jul 17, 2021 at 4:57 PM Yugo NAGATA wrote:
> Hello,
>
> I found that any corruption of a WAL page header found during recovery is
> never reported in log messages. If a WAL page header is broken, it is
> detected in XLogReaderValidatePageHeader called from XLogPageRead, but the
> error messages are always reset and never reported.
>
>     if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
>     {
>         /* reset any error XLogReaderValidatePageHeader() might have set */
>         xlogreader->errormsg_buf[0] = '\0';
>         goto next_record_is_invalid;
>     }
>
> Since commit 06687198018, we call XLogReaderValidatePageHeader here so that
> we can check a page header and retry immediately if it's invalid, but the
> error message is reset immediately and not reported. I guess the reason the
> error message is reset is that we might get the right WAL after some
> retries. However, I think it is better to report the error for each check
> in order to let users know the actual issues found in the WAL.
>
> I attached a patch to fix it in this way.

I think that to keep the same behavior as before, it is necessary to always
run:

    /* reset any error XLogReaderValidatePageHeader() might have set */
    xlogreader->errormsg_buf[0] = '\0';

no?

regards,
Ranier Vilela

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index edb15fe58d..4fdb23f061 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -12297,8 +12297,13 @@ retry:
 	 */
 	if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
 	{
+		if (StandbyMode && xlogreader->errormsg_buf[0] != '\0')
+		{
+			ereport(emode_for_corrupt_record(emode, EndRecPtr),
+					(errmsg_internal("%s", xlogreader->errormsg_buf) /* already translated */ ));
+		}
 		/* reset any error XLogReaderValidatePageHeader() might have set */
 		xlogreader->errormsg_buf[0] = '\0';
 		goto next_record_is_invalid;
 	}
Re: slab allocator performance issues
On 7/17/21 11:14 PM, Andres Freund wrote: Hi, On 2021-07-17 22:35:07 +0200, Tomas Vondra wrote: On 7/17/21 9:43 PM, Andres Freund wrote: 1) If allocations are short-lived slab.c, can end up constantly freeing/initializing blocks. Which requires fairly expensively iterating over all potential chunks in the block and initializing it. Just to then free that memory again after a small number of allocations. The extreme case of this is when there are phases of alloc/free of a single allocation. I "fixed" this by adding a few && slab->nblocks > 1 in SlabFree() and the problem vastly reduced. Instead of a 0.4x slowdown it's 0.88x. Of course that only works if the problem is with the only, so it's not a great approach. Perhaps just keeping the last allocated block around would work? +1 I think it makes perfect sense to not free the blocks immediately, and keep one (or a small number) as a cache. I'm not sure why we decided not to have a "keeper" block, but I suspect memory consumption was my main concern at that point. But I never expected the cost to be this high. I think one free block might be too low in some cases. It's pretty common to have workloads where the number of allocations is "bursty", and it's imo one case where one might justifiably want to use a slab allocator... Perhaps a portion of a high watermark? Or a portion of the in use blocks? I think the portion of watermark would be problematic for cases with one huge transaction - that'll set a high watermark, and we'll keep way too many free blocks. But the portion of in use blocks might work, I think. Hm. I wonder if we should just not populate the freelist eagerly, to drive down the initialization cost. I.e. have a separate allocation path for chunks that have never been allocated, by having a SlabBlock->free_offset or such. Sure, it adds a branch to the allocation happy path, but it also makes the first allocation for a chunk cheaper, because there's no need to get the next element from the freelist (adding a likely cache miss). And it should make the allocation of a new block faster by a lot. Not sure what you mean by 'not populate eagerly' so can't comment :-( 2) SlabChunkIndex() in SlabFree() is slow. It requires a 64bit division, taking up ~50% of the cycles in SlabFree(). A 64bit div, according to [1] , has a latency of 35-88 cycles on skylake-x (and a reverse throughput of 21-83, i.e. no parallelism). While it's getting a bit faster on icelake / zen 3, it's still slow enough there to be very worrisome. I don't see a way to get around the division while keeping the freelist structure as is. But: ISTM that we only need the index because of the free-chunk list, right? Why don't we make the chunk list use actual pointers? Is it concern that that'd increase the minimum allocation size? If so, I see two ways around that: First, we could make the index just the offset from the start of the block, that's much cheaper to calculate. Second, we could store the next pointer in SlabChunk->slab/block instead (making it a union) - while on the freelist we don't need to dereference those, right? I suspect both would also make the block initialization a bit cheaper. That should also accelerate SlabBlockGetChunk(), which currently shows up as an imul, which isn't exactly fast either (and uses a lot of execution ports). Hmm, I think you're right we could simply use the pointers, but I have not tried that. I quickly tried that, and it does seem to improve matters considerably. The block initialization still shows up as expensive, but not as bad. 
The div and imul are gone (exept in an assertion build right now). The list manipulation still is visible. Understood. I didn't expect this to be a full solution. 3) Every free/alloc needing to unlink from slab->freelist[i] and then relink shows up prominently in profiles. That's ~tripling the number of cachelines touched in the happy path, with unpredictable accesses to boot. Perhaps we could reduce the precision of slab->freelist indexing to amortize that cost? I.e. not have one slab->freelist entry for each nfree, but instead have an upper limit on the number of freelists? Yeah. The purpose of organizing the freelists like this is to prioritize the "more full" blocks" when allocating new chunks, in the hope that the "less full" blocks will end up empty and freed faster. But this is naturally imprecise, and strongly depends on the workload, of course, and I bet for most cases a less precise approach would work just as fine. I'm not sure how exactly would the upper limit you propose work, but perhaps we could group the blocks for nfree ranges, say [0-15], [16-31] and so on. So after the alloc/free we'd calculate the new freelist index as (nfree/16) and only moved the block if the index changed. This would reduce the overhead to 1/16 and probably even more in practice. Of course, we could also say we have e.g. 8 freelists and work the ranges backwards from that, I
Re: slab allocator performance issues
Hi,

On 2021-07-18 00:46:09 +0200, Tomas Vondra wrote:
> On 7/17/21 11:14 PM, Andres Freund wrote:
> > Hm. I wonder if we should just not populate the freelist eagerly, to
> > drive down the initialization cost. I.e. have a separate allocation path
> > for chunks that have never been allocated, by having a
> > SlabBlock->free_offset or such.
> >
> > Sure, it adds a branch to the allocation happy path, but it also makes
> > the first allocation for a chunk cheaper, because there's no need to get
> > the next element from the freelist (adding a likely cache miss). And it
> > should make the allocation of a new block faster by a lot.
>
> Not sure what you mean by 'not populate eagerly' so can't comment :-(

Instead of populating a linked list with all chunks upon creation of a
block - which requires touching a fair bit of memory - keep a per-block
pointer (or an offset) into the "unused" area of the block. When allocating
from the block and there's still "unused" memory left, use that, instead of
bothering with the freelist.

I tried that, and it nearly got slab up to the allocation/freeing
performance of aset.c (while winning after allocation, due to the higher
memory density).

> > > > 4) Less of a performance issue, and more of a usability one: The
> > > > constant block size strikes me as problematic. Most users of an
> > > > allocator sometimes deal with a small amount of data, and sometimes
> > > > with a large amount.
> > >
> > > I doubt this is worth the effort, really. The constant block size makes
> > > various places much simpler (both to code and reason about), so this
> > > should not make a huge difference in performance. And IMHO the block
> > > size is mostly an implementation detail, so I don't see that as a
> > > usability issue.
> >
> > Hm? It's something the user has to specify, so it's not really an
> > implementation detail. It needs to be specified without sufficient
> > information, as well, since externally one doesn't know how much memory
> > the block header and chunk headers + rounding up will use, so computing a
> > good block size isn't easy. I've wondered whether it should just be a
> > count...
>
> I think this is mixing two problems - how to specify the block size, and
> whether the block size is constant (as in slab) or grows over time (as in
> allocset). That was in response to the "implementation detail" bit solely.
>
> But growing the block size seems problematic for long-lived contexts with
> workloads that change a lot over time - imagine e.g. decoding many small
> transactions, with one huge transaction mixed in. The one huge transaction
> will grow the block size, and we'll keep using it forever. But in that case
> we might just as well have allocated the large blocks from the start, I
> guess.

I was thinking of capping the growth fairly low. I don't think after a 16x
growth or so you're likely to still see allocation performance gains with
slab. And I don't think that'd be too bad for decoding - we'd start with a
small initial block size, and in many workloads that will be enough, and
just the workloads where that doesn't suffice will adapt performance-wise.

And: Medium term I wouldn't expect reorderbuffer.c to stay the only slab.c
user...

> > Why do you not think it's relevant for performance? Either one causes too
> > much memory usage by using a too large block size, wasting memory, or one
> > ends up losing perf through frequent allocations?
>
> True. I simply would not expect this to make a huge difference - I may be
> wrong, and I'm sure there are workloads where it matters. But I still think
> it's easier to just use larger blocks than to make the slab code more
> complex.

IDK. I'm looking at using slab as part of a radix tree implementation right
now, which I'd hope to be used in various different situations. So it's
hard to choose the right block size - and it does seem to matter for
performance.

Greetings,

Andres Freund
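[Editorial note: a minimal sketch of the "don't populate the freelist
eagerly" idea described above. The nunused counter and firstfree pointer
are assumed fields for illustration, not current slab.c code, and the
sketch builds on the pointer-based freelist shown earlier in the thread.]

/*
 * Sketch of lazy chunk initialization.  A new block starts with an empty
 * freelist and nunused == chunksPerBlock; allocations carve chunks off the
 * untouched tail first, and only chunks that have actually been freed ever
 * go onto the freelist, so a fresh block needs no per-chunk setup at all.
 */
static void *
SlabAllocFromBlock(SlabContext *slab, SlabBlock *block)
{
	SlabChunk  *chunk;

	if (block->firstfree != NULL)
	{
		/* reuse a previously freed chunk */
		chunk = block->firstfree;
		block->firstfree = chunk->hdr.nextfree;
	}
	else
	{
		/* carve the next never-used chunk off the unused tail of the block */
		Assert(block->nunused > 0);
		chunk = (SlabChunk *) ((char *) block + sizeof(SlabBlock) +
							   (slab->chunksPerBlock - block->nunused) *
							   slab->fullChunkSize);
		block->nunused--;
	}

	block->nfree--;				/* assumes nfree counts unused chunks too */
	return SlabChunkGetPointer(chunk);
}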
Re: slab allocator performance issues
Hi, On 2021-07-17 16:10:19 -0700, Andres Freund wrote: > Instead of populating a linked list with all chunks upon creation of a block - > which requires touching a fair bit of memory - keep a per-block pointer (or an > offset) into "unused" area of the block. When allocating from the block and > theres still "unused" memory left, use that, instead of bothering with the > freelist. > > I tried that, and it nearly got slab up to the allocation/freeing performance > of aset.c (while winning after allocation, due to the higher memory density). Combining that with limiting the number of freelists, and some microoptimizations, allocation performance is now on par. Freeing still seems to be a tad slower, mostly because SlabFree() practically is immediately stalled on fetching the block, whereas AllocSetFree() can happily speculate ahead and do work like computing the freelist index. And then aset only needs to access memory inside the context - which is much more likely to be in cache than a freelist inside a block (there are many more). But that's ok, I think. It's close and it's only a small share of the overall runtime of my workload... Greetings, Andres Freund
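[Editorial note: for reference, the "limiting the number of freelists" part
could look roughly like this. The names and the constant are assumptions
for illustration, not actual slab.c code.]

/*
 * Sketch only: map the exact free-chunk count onto a small, fixed number of
 * freelists.  A block is relinked only when this coarse index changes, so
 * most alloc/free operations avoid touching the freelist headers at all.
 */
#define SLAB_FREELIST_COUNT 8

static inline int
SlabFreelistIndex(SlabContext *slab, int nfree)
{
	/* nfree ranges over 0 .. chunksPerBlock; result is 0 .. 7 */
	return (nfree * SLAB_FREELIST_COUNT) / (slab->chunksPerBlock + 1);
}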
Failure with 004_logrotate in prairiedog
Hi all,

prairiedog has failed in a way that seems a bit obscure to me:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2021-07-18%2000%3A23%3A29

Here are the details of the failure:

server signaled to rotate log file
could not read "/Users/buildfarm/bf-data/HEAD/pgsql.build/src/bin/pg_ctl/tmp_check/t_004_logrotate_primary_data/pgdata/current_logfiles": No such file or directory at t/004_logrotate.pl line 78
### Stopping node "primary" using mode immediate

update_metainfo_datafile() creates a temporary file renamed to
current_logfiles with rename(). It should be atomic, though this error
points out that this is not the case? The previous steps of this test
ensure that current_logfiles should exist.

We could use some eval blocks in this area, but a non-atomic rename()
would cause problems in more areas. Thoughts?
-- 
Michael

signature.asc
Description: PGP signature
Re: Added documentation for cascade and restrict option of drop statistics
On Fri, Jul 16, 2021 at 08:12:56PM +0530, vignesh C wrote: > Cascade and restrict options are supported for drop statistics syntax: > drop statistics stat1 cascade; > drop statistics stat2 restrict; > > The documentation for this was missing, attached a patch which > includes the documentation for these options. Indeed, good catch. The other commands document that, so let's fix it. -- Michael signature.asc Description: PGP signature
Re: Failure with 004_logrotate in prairiedog
Michael Paquier writes: > prairiedog has failed in a way that seems a bit obscure to me: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2021-07-18%2000%3A23%3A29 > ... > We could use some eval blocks in this area, but a non-atomic rename() > would cause problems in more areas. Thoughts? Awhile back we discovered that old macOS versions have non-atomic rename [1]. I eventually shut down dromedary because that was causing failures often enough to be annoying. I'd not seen such a failure before on prairiedog, but it sure seems plausible that this is one. regards, tom lane [1] https://www.postgresql.org/message-id/2636.1569016...@sss.pgh.pa.us