Re: UniqueKey on Partitioned table.

2021-07-17 Thread Andy Fan
On Tue, Apr 6, 2021 at 6:55 PM David Rowley  wrote:
>
> On Tue, 6 Apr 2021 at 22:31, Andy Fan  wrote:
> > On Wed, Mar 31, 2021 at 9:12 PM Ashutosh Bapat wrote:
> >> I think the reason we add ECs for sort expressions is to use
> >> transitive relationship. The EC may start with a single member but
> >> later in the planning that member might find partners which are all
> >> equivalent. Result ordered by one is also ordered by the other. The
> >> same logic applies to UniqueKey as well, doesn't it? In a result, if a
> >> set of columns makes a row unique, the set of columns represented by
> >> the other EC member should be unique. Though a key will start as a
> >> singleton, it might find EC partners later and thus the unique key will
> >> transition to all the members. With that logic UniqueKey should use
> >> just ECs instead of bare expressions.
> >
> >
> > TBH, I haven't thought about this too hard, but I think when we build the
> > UniqueKey, all the ECs have been built already.  So can you think of a
> > case where we start with an EC with a single member at the beginning and
> > have more members later for UniqueKey cases?
>
> I don't really know if it matters which order things happen. We still
> end up with a single EC containing {a,b} whether we process ORDER BY b
> or WHERE a=b first.
>

I think it is time to talk about this again.  Take the below query as an example:

SELECT * FROM t1, t2 WHERE t1.pk = t2.pk;

Then when I run populate_baserel_uniquekeys for t1, we already have
EC{Members={t1.pk, t2.pk}} in root->eq_classes. Then I use
this EC directly for t1's UniqueKey.  The result is:

T1's UniqueKey : [ EC{Members={t1.pk, t2.pk}} ].

*Would this be OK, given that at the baserel level the "t1.pk = t2.pk" is not
executed yet?*

I tried the below example to test how PathKey is maintained.
CREATE TABLE t1 (a INT, b INT);
CREATE TABLE t2 (a INT, b INT);
CREATE INDEX ON t1(b);

SELECT * FROM t1, t2 WHERE t1.b = t2.b and t1.b > 3;

then we can get t1's Path:

Index Scan on (b), whose PathKey.pk_class includes 2 members {t1.b, t2.b},
even before the join.

So it looks like the answer to my question should be "yes"? I hope I have
made myself clear.

-- 
Best Regards
Andy Fan (https://www.aliyun.com/)




Re: UniqueKey on Partitioned table.

2021-07-17 Thread David Rowley
On Sat, 17 Jul 2021 at 19:32, Andy Fan  wrote:
> SELECT * FROM t1, t2 WHERE t1.pk = t2.pk;
>
> Then when I populate_baserel_uniquekeys for t1, we already have
> EC{Members={t1.pk, t2.pk}} in root->eq_classes already. Then I use
> this EC directly for t1's UniqueKey.  The result is:
>
> T1's UniqueKey : [ EC{Members={t1.pk, t2.pk}} ].
>
> *Would this be OK since at the baserel level,  the "t1.pk = t2.pk" is not
> executed yet?*
>
> I tried the below example to test how PathKey is maintained.
> CREATE TABLE t1 (a INT, b INT);
> CREATE TABLE t2 (a INT, b INT);
> CREATE INDEX ON t1(b);
>
> SELECT * FROM t1, t2 WHERE t1.b = t2.b and t1.b > 3;
>
> then we can get t1's Path:
>
> Index Scan on (b),  PathKey.pk_class include 2 members (t1.b, t2.b}
> even before the Join.
>
> So looks the answer for my question should be "yes"? Hope I have
> made myself clear.

I don't see the problem. The reason PathKeys use EquivalenceClasses is
so that queries like: SELECT * FROM tab WHERE a=b ORDER BY b; can see
that they're also ordered by a.  This is useful because if there
happens to be an index on tab(a) then we can use it to provide the
required ordering for this query.

We'll want the same with UniqueKeys.  The same thing there looks like:

CREATE TABLE tab (a int primary key, b int not null);

select distinct b from tab where a=b;

Since we have the EquivalenceClass with {a,b} stored in the UniqueKey,
then we should be able to execute this without doing any distinct
operation.

David




Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-17 Thread David Rowley
 On Sat, 17 Jul 2021 at 01:14, James Coleman  wrote:
> The only remaining question I have is whether or not costing needs to
> change, given the very significant speedup for datum sort.

I'm looking at cost_tuplesort and the only thing that I think might
make sense would be to adjust how the input_bytes value is calculated.
For now, that's done with the following function that's used in quite
a number of places.

static double
relation_byte_size(double tuples, int width)
{
	return tuples * (MAXALIGN(width) + MAXALIGN(SizeofHeapTupleHeader));
}

It seems, at least in the case of Sort, that using SizeofHeapTupleHeader
is just always wrong as it should be SizeofMinimalTupleHeader. I know
that's also the case for Memoize too. I've not checked the other
locations.

The only thing I can really see that we might do would be not add the
MAXALIGN(SizeofHeapTupleHeader) when there's just a single column.
We'd need to pass down the number of attributes from
create_sort_path() so we'd know when and when not to add that. I'm not
saying that we should do this. I'm just saying that I don't really see
what else we might do.
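
Just to illustrate the shape of that (this is not in the patch; the nattrs
parameter is assumed to have been passed down from create_sort_path(), and
it uses SizeofMinimalTupleHeader per the above):

/*
 * Hypothetical sketch only: like relation_byte_size(), but skip the
 * per-tuple header overhead when the input has a single column and the
 * datum sort path can be used.
 */
static double
sort_input_byte_size(double tuples, int width, int nattrs)
{
	double		per_tuple = MAXALIGN(width);

	if (nattrs > 1)
		per_tuple += MAXALIGN(SizeofMinimalTupleHeader);

	return tuples * per_tuple;
}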

I can imagine another patch might just want to do a complete overhaul
of all locations that use relation_byte_size().  There are various
things that function just does not account for, e.g., the fact that we
allocate chunks in powers of 2 and that there's a chunk header added
on.  Of course, "width" is just an estimate, so maybe trying to
calculate something too precisely wouldn't be too wise. However,
there's a bit of a chicken and the egg problem there as there'd be
little incentive to improve "width" unless we started making more
accurate use of the value.

Anyway, none of the above takes into account that the Datum sort is
just a little faster. The only thing that exists in the existing cost
model that we could use to adjust the cost of an in-memory sort is the
comparison_cost.  The problem there is that the comparison is exactly
the same in both Datum and Tuple sorts. The only thing that really
changes between Datum and Tuple sort is the fact that we don't make a
MinimalTuple when doing a Datum sort.  The cost model, unfortunately,
does not account for that.  That kinda makes me think that we should
do nothing, because if we start to account for making MinimalTuples then
we'll just penalise Tuple sorts and that might cause someone to be
upset.

David




Re: pg_dump new feature: exporting functions only. Bad or good idea ?

2021-07-17 Thread Ryan Lambert
> On Sat, Jul 10, 2021 at 5:06 AM Tomas Vondra <tomas.von...@enterprisedb.com> wrote:
> On 7/10/21 1:43 AM, Tom Lane wrote:
>> Tomas Vondra  writes:
>>> The main question I have is whether this should include procedures.
>>
>> I feel a bit uncomfortable about sticking this sort of limited-purpose
>> selectivity mechanism into pg_dump.  I'd rather see a general filter
>> method that can select object(s) of any type.  Pavel was doing some
>> work towards that awhile ago, though I think he got frustrated about
>> the lack of consensus on details.  Which is a problem, but I don't
>> think the solution is to accrue more and more independently-designed-
>> and-implemented features that each solve some subset of the big problem.
>>
>
> I'm not against introducing such a general filter mechanism, but why
> should it block this patch? I'd understand it if the patch was adding a lot
> of code, but that's not the case - it's tiny. And we already have
> multiple filter options (to pick tables, schemas, extensions, ...).

> And if there's no consensus on details of Pavel's patch after multiple
> commitfests, how likely is it it'll start moving forward?

It seems to me that pg_dump already has plenty of limited-purpose options
for selectivity, so adding this patch seems to fit in with the norm. I'm in
favor of this patch because it works in the same way the community is
already used to and meets the need. The other patch referenced upstream
appears to be intended to solve a specific problem encountered with very
long commands.  It is also introducing a new way of working with pg_dump
via a config file, and honestly I've never wanted that type of feature. Not
saying that wouldn't be useful, but that has not been a pain point for me
and seems overly complex. For the use cases I imagine this patch will help
with, being required to go through a config file seems excessive vs pg_dump
--functions-only.

> On Fri, Jul 9, 2021 at 4:43 PM Tomas Vondra wrote:
>
> The main question I have is whether this should include procedures. I'd
> probably argue procedures should be considered different from functions
> (i.e. requiring a separate --procedures-only option), because it pretty
> much is meant to be a separate object type. We don't allow calling DROP
> FUNCTION on a procedure, etc. It'd be silly to introduce an unnecessary
> ambiguity in pg_dump and have to deal with it sometime later.

+1

*Ryan Lambert*
RustProof Labs


Re: postgres_fdw - make cached connection functions tests meaningful

2021-07-17 Thread Bharath Rupireddy
On Thu, Jul 15, 2021 at 3:48 AM Tom Lane  wrote:
>
> Bharath Rupireddy  writes:
> > While working on [1], I got to know that there is a new GUC
> > debug_invalidate_system_caches_always that has been introduced in v14.
> > It can be used to switch off cache invalidation in
> > CLOBBER_CACHE_ALWAYS builds which makes cache sensitive tests stable.
> > Using this GUC, it is quite possible to make cached connection
> > management function tests more meaningful by returning original
> > values(true/false, all the output columns) instead of SELECT 1.
>
> Note that this needs an update in the wake of d68a00391.
>
> More generally, though, I am not sure that I believe the premise of
> this patch.  AFAICS it's assuming that forcing debug_discard_caches
> off guarantees zero cache flushes, which it does not.  (If it could,
> we wouldn't need the whole thing; the point of that variable is to
> deterministically force flushes which would otherwise be
> nondeterministic, not nonexistent.)

Can extra flushes/discards (beyond the regular cache flushes/discards that
happen because of alters or changes in the cached elements) still occur with
debug_discard_caches = 0? My understanding was that debug_discard_caches = 0
disables all the extra flushes in clobber-cache builds. If my understanding
isn't right, wouldn't it be good to mention that somewhere in the
documentation or in the source code?

> Even in a contrib test that
> seemingly has nothing else running, background activity such as
> autovacuum could result in surprises.  So I fear that what you have
> got here is a patch that will work 99% of the time; which is not
> good enough for the buildfarm.

If extra cache flushes can still happen even with debug_discard_caches = 0,
I don't mind withdrawing this patch.

Regards,
Bharath Rupireddy.




Re: What are exactly bootstrap processes, auxiliary processes, standalone backends, normal backends(user sessions)?

2021-07-17 Thread Alvaro Herrera
On 2021-Jul-17, Bharath Rupireddy wrote:

> On Thu, Jul 15, 2021 at 8:17 PM Justin Pryzby  wrote:
> >
> > On Fri, Jul 09, 2021 at 09:24:19PM +0530, Bharath Rupireddy wrote:
> > > I've always had a hard time distinguishing various types of
> > > processes/terms used in postgres. I look at the source code every time
> > > to understand them, yet I don't feel satisfied with my understanding.
> > > I request any hacker (having a better idea than me) to help me with
> > > what each different process does and how they are different from each
> > > other? Of course, I'm clear with normal backends (user sessions), bg
> > > workers, but the others need a bit more understanding.
> >
> > It sounds like something that should be in the glossary, which currently
> > refers to but doesn't define "auxiliary processes".
> 
> Thanks. I strongly feel that it should be documented somewhere. I will
> be happy if someone with a clear idea about these various processes
> does it.

Auxiliary process

    Process of an instance in charge of some specific, hardcoded
    background task.  Examples are the startup process, the WAL receiver
    (but not the WAL senders), the WAL writer, the archiver, the
    background writer, the checkpointer, the statistics collector, and
    the logger.

We should probably include individual glossary entries for those that
don't already have one.  Maybe revise the entries for the ones that do,
so that they start with "An auxiliary process that ..."

I just realized that the autovac launcher is not nominally an auxiliary
process (per SubPostmasterMain), which is odd since notionally it
clearly is.  Maybe we should include it in the list anyway, with
something like

"and the autovacuum launcher (but not the autovacuum workers)".

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"I am amazed at [the pgsql-sql] mailing list for the wonderful support, and
lack of hesitasion in answering a lost soul's question, I just wished the rest
of the mailing list could be like this."   (Fotis)
   (http://archives.postgresql.org/pgsql-sql/2006-06/msg00265.php)




Re: What are exactly bootstrap processes, auxiliary processes, standalone backends, normal backends(user sessions)?

2021-07-17 Thread Justin Pryzby
On Sat, Jul 17, 2021 at 10:45:52AM -0400, Alvaro Herrera wrote:
> On 2021-Jul-17, Bharath Rupireddy wrote:
> 
> > On Thu, Jul 15, 2021 at 8:17 PM Justin Pryzby  wrote:
> > >
> > > On Fri, Jul 09, 2021 at 09:24:19PM +0530, Bharath Rupireddy wrote:
> > > > I've always had a hard time distinguishing various types of
> > > > processes/terms used in postgres. I look at the source code every time
> > > > to understand them, yet I don't feel satisfied with my understanding.
> > > > I request any hacker (having a better idea than me) to help me with
> > > > what each different process does and how they are different from each
> > > > other? Of course, I'm clear with normal backends (user sessions), bg
> > > > workers, but the others need a bit more understanding.
> > >
> > > It sounds like something that should be in the glossary, which currently
> > > refers to but doesn't define "auxiliary processes".
> > 
> > Thanks. I strongly feel that it should be documented somewhere. I will
> > be happy if someone with a clear idea about these various processes
> > does it.
> 
>   Auxiliary process
> 
>   Process of an instance in

I think "of an instance" is a distraction here, even if technically accurate.

>   charge of some specific, hardcoded background task.  Examples are

And I think "hardcoded" doesn't mean anything beyond what "specific" means.

Maybe you'd say: .. process which handles a specific, central task for the
cluster instance.

Thanks,
-- 
Justin




Re: What are exactly bootstrap processes, auxiliary processes, standalone backends, normal backends(user sessions)?

2021-07-17 Thread Tom Lane
Justin Pryzby  writes:
> On Sat, Jul 17, 2021 at 10:45:52AM -0400, Alvaro Herrera wrote:
>> Process of an instance in
>> charge of some specific, hardcoded background task.  Examples are

> And I think "hardcoded" doesn't mean anything beyond what "specific" means.

> Maybe you'd say: .. process which handles a specific, central task for the
> cluster instance.

Meh.  "Specific" and "background" both seem to be useful terms here.
I do not think "central" is a useful adjective.

regards, tom lane




Re: postgres_fdw - make cached connection functions tests meaningful

2021-07-17 Thread Tom Lane
Bharath Rupireddy  writes:
> On Thu, Jul 15, 2021 at 3:48 AM Tom Lane  wrote:
>> More generally, though, I am not sure that I believe the premise of
>> this patch.  AFAICS it's assuming that forcing debug_discard_caches
>> off guarantees zero cache flushes, which it does not.

> Can the setting debug_discard_caches = 0 still make extra
> flushes/discards (not the regular cache flushes/discards that happen
> because of alters or changes in the cached elements)? My understanding
> was that debug_discard_caches = 0, disables all the extra flushes with
> clobber cache builds. If my understanding wasn't right, isn't it good
> to mention it somewhere in the documentation or in the source code?

The reason for this mechanism is that cache flushes can be triggered
at any time by sinval messages from other processes (e.g., background
autovacuums).  Setting debug_discard_caches allows us to exercise
that possibility exhaustively and make sure that the code is proof
against cache entries disappearing unexpectedly.  Not setting
debug_discard_caches doesn't mean that that can't happen, only that
you can't predict when it will happen.

regards, tom lane




Re: A micro-optimisation for ProcSendSignal()

2021-07-17 Thread Soumyadeep Chakraborty
Hi Thomas,

You might have missed a spot to initialize SERIALIZABLEXACT->pgprocno in
InitPredicateLocks(), so:

+ PredXact->OldCommittedSxact->pgprocno = INVALID_PGPROCNO;

Slightly tangential: we should add a comment to PGPROC.pgprocno, for more
immediate understandability:

+ int pgprocno; /* index of this PGPROC in ProcGlobal->allProcs */

Also, why don't we take the opportunity to get rid of SERIALIZABLEXACT->pid? We
took a stab. Attached is v2 of your patch with these changes.

Regards,
Ashwin and Deep
From 08d5ab9f498140e1c3a8f6f00d1acd25b3fb8ba0 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 11 Mar 2021 23:09:11 +1300
Subject: [PATCH v2 1/1] Optimize ProcSendSignal().

Instead of referring to target processes by pid, use pgprocno so that we
don't have to scan the ProcArray and keep track of the startup process.

Discussion: https://postgr.es/m/CA%2BhUKGLYRyDaneEwz5Uya_OgFLMx5BgJfkQSD%3Dq9HmwsfRRb-w%40mail.gmail.com
---
 src/backend/access/transam/xlog.c |  1 -
 src/backend/storage/buffer/buf_init.c |  3 +-
 src/backend/storage/buffer/bufmgr.c   | 10 ++---
 src/backend/storage/lmgr/predicate.c  | 23 ++-
 src/backend/storage/lmgr/proc.c   | 49 ++-
 src/backend/utils/adt/lockfuncs.c |  8 +++-
 src/include/storage/buf_internals.h   |  2 +-
 src/include/storage/predicate_internals.h |  2 +-
 src/include/storage/proc.h|  8 +---
 9 files changed, 34 insertions(+), 72 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c7c928f50bd..2bbeaaf1b90 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7309,7 +7309,6 @@ StartupXLOG(void)
 		 */
 		if (ArchiveRecoveryRequested && IsUnderPostmaster)
 		{
-			PublishStartupProcessInformation();
 			EnableSyncRequestForwarding();
 			SendPostmasterSignal(PMSIGNAL_RECOVERY_STARTED);
 			bgwriterLaunched = true;
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index a299be10430..b9a83c0e9b9 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -16,6 +16,7 @@
 
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
+#include "storage/proc.h"
 
 BufferDescPadded *BufferDescriptors;
 char	   *BufferBlocks;
@@ -118,7 +119,7 @@ InitBufferPool(void)
 			CLEAR_BUFFERTAG(buf->tag);
 
 			pg_atomic_init_u32(&buf->state, 0);
-			buf->wait_backend_pid = 0;
+			buf->wait_backend_pgprocno = INVALID_PGPROCNO;
 
 			buf->buf_id = i;
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 86ef607ff38..26122418faf 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1890,11 +1890,11 @@ UnpinBuffer(BufferDesc *buf, bool fixOwner)
 BUF_STATE_GET_REFCOUNT(buf_state) == 1)
 			{
 /* we just released the last pin other than the waiter's */
-int			wait_backend_pid = buf->wait_backend_pid;
+int			wait_backend_pgprocno = buf->wait_backend_pgprocno;
 
 buf_state &= ~BM_PIN_COUNT_WAITER;
 UnlockBufHdr(buf, buf_state);
-ProcSendSignal(wait_backend_pid);
+ProcSendSignal(wait_backend_pgprocno);
 			}
 			else
 UnlockBufHdr(buf, buf_state);
@@ -3995,7 +3995,7 @@ UnlockBuffers(void)
 		 * got a cancel/die interrupt before getting the signal.
 		 */
 		if ((buf_state & BM_PIN_COUNT_WAITER) != 0 &&
-			buf->wait_backend_pid == MyProcPid)
+			buf->wait_backend_pgprocno == MyProc->pgprocno)
 			buf_state &= ~BM_PIN_COUNT_WAITER;
 
 		UnlockBufHdr(buf, buf_state);
@@ -4131,7 +4131,7 @@ LockBufferForCleanup(Buffer buffer)
 			LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 			elog(ERROR, "multiple backends attempting to wait for pincount 1");
 		}
-		bufHdr->wait_backend_pid = MyProcPid;
+		bufHdr->wait_backend_pgprocno = MyProc->pgprocno;
 		PinCountWaitBuf = bufHdr;
 		buf_state |= BM_PIN_COUNT_WAITER;
 		UnlockBufHdr(bufHdr, buf_state);
@@ -4202,7 +4202,7 @@ LockBufferForCleanup(Buffer buffer)
 		 */
 		buf_state = LockBufHdr(bufHdr);
 		if ((buf_state & BM_PIN_COUNT_WAITER) != 0 &&
-			bufHdr->wait_backend_pid == MyProcPid)
+			bufHdr->wait_backend_pgprocno == MyProc->pgprocno)
 			buf_state &= ~BM_PIN_COUNT_WAITER;
 		UnlockBufHdr(bufHdr, buf_state);
 
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index d493aeef0fc..d76632bb56a 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -1276,7 +1276,7 @@ InitPredicateLocks(void)
 		PredXact->OldCommittedSxact->finishedBefore = InvalidTransactionId;
 		PredXact->OldCommittedSxact->xmin = InvalidTransactionId;
 		PredXact->OldCommittedSxact->flags = SXACT_FLAG_COMMITTED;
-		PredXact->OldCommittedSxact->pid = 0;
+		PredXact->OldCommittedSxact->pgprocno = INVALID_PGPROCNO;
 	}
 	/* This never changes, so let's keep a local copy. */
 	OldCommittedSxact = PredXact->OldCommittedSxact;
@@ -1635,8 +16

Re: UniqueKey on Partitioned table.

2021-07-17 Thread Andy Fan
On Sat, Jul 17, 2021 at 3:45 PM David Rowley  wrote:
>
> On Sat, 17 Jul 2021 at 19:32, Andy Fan  wrote:
> > SELECT * FROM t1, t2 WHERE t1.pk = t2.pk;
> >
> > Then when I populate_baserel_uniquekeys for t1, we already have
> > EC{Members={t1.pk, t2.pk}} in root->eq_classes already. Then I use
> > this EC directly for t1's UniqueKey.  The result is:
> >
> > T1's UniqueKey : [ EC{Members={t1.pk, t2.pk}} ].
> >
> > *Would this be OK since at the baserel level,  the "t1.pk = t2.pk" is not
> > executed yet?*
> >
> > I tried the below example to test how PathKey is maintained.
> > CREATE TABLE t1 (a INT, b INT);
> > CREATE TABLE t2 (a INT, b INT);
> > CREATE INDEX ON t1(b);
> >
> > SELECT * FROM t1, t2 WHERE t1.b = t2.b and t1.b > 3;
> >
> > then we can get t1's Path:
> >
> > Index Scan on (b),  PathKey.pk_class include 2 members (t1.b, t2.b}
> > even before the Join.
> >
> > So looks the answer for my question should be "yes"? Hope I have
> > made myself clear.
>
> I don't see the problem.

Thanks for the double check; that removes a big blocker for my development.
I'll submit a new patch very soon.

> The reason PathKeys use EquivalenceClasses is
> so that queries like: SELECT * FROM tab WHERE a=b ORDER BY b; can see
> that they're also ordered by a.  This is useful because if there
> happens to be an index on tab(a) then we can use it to provide the
> required ordering for this query.
>
> We'll want the same with UniqueKeys.  The same thing there looks like:
>
> CREATE TABLE tab (a int primary key, b int not null);
>
> select distinct b from tab where a=b;
>
> Since we have the EquivalenceClass with {a,b} stored in the UniqueKey,
> then we should be able to execute this without doing any distinct
> operation.
>
> David



-- 
Best Regards
Andy Fan (https://www.aliyun.com/)




slab allocator performance issues

2021-07-17 Thread Andres Freund
Hi,

I just tried to use the slab allocator for a case where aset.c was
bloating memory usage substantially. First: It worked wonders for memory
usage, nearly eliminating overhead.

But it turned out to cause a *substantial* slowdown. With aset the
allocator is barely in the profile. With slab the profile is dominated
by allocator performance.

slab:
NOTICE:  0: 1 ordered insertions in 5.216287 seconds, 19170724/sec
LOCATION:  bfm_test_insert_bulk, radix.c:2880
  Overhead  Command   Shared Object     Symbol

+   28.27%  postgres  postgres          [.] SlabAlloc
+    9.64%  postgres  bdbench.so        [.] bfm_delete
+    9.03%  postgres  bdbench.so        [.] bfm_set
+    8.39%  postgres  bdbench.so        [.] bfm_lookup
+    8.36%  postgres  bdbench.so        [.] bfm_set_leaf.constprop.0
+    8.16%  postgres  libc-2.31.so      [.] __memmove_avx_unaligned_erms
+    6.88%  postgres  bdbench.so        [.] bfm_delete_leaf
+    3.24%  postgres  libc-2.31.so      [.] _int_malloc
+    2.58%  postgres  bdbench.so        [.] bfm_tests
+    2.33%  postgres  postgres          [.] SlabFree
+    1.29%  postgres  libc-2.31.so      [.] _int_free
+    1.09%  postgres  libc-2.31.so      [.] unlink_chunk.constprop.0

aset:

NOTICE:  0: 1 ordered insertions in 2.082602 seconds, 48016848/sec
LOCATION:  bfm_test_insert_bulk, radix.c:2880

+   16.43%  postgres  bdbench.so        [.] bfm_lookup
+   15.38%  postgres  bdbench.so        [.] bfm_delete
+   12.82%  postgres  libc-2.31.so      [.] __memmove_avx_unaligned_erms
+   12.65%  postgres  bdbench.so        [.] bfm_set
+   12.15%  postgres  bdbench.so        [.] bfm_set_leaf.constprop.0
+   10.57%  postgres  bdbench.so        [.] bfm_delete_leaf
+    4.05%  postgres  bdbench.so        [.] bfm_tests
+    2.93%  postgres  [kernel.vmlinux]  [k] clear_page_erms
+    1.59%  postgres  postgres          [.] AllocSetAlloc
+    1.15%  postgres  bdbench.so        [.] memmove@plt
+    1.06%  postgres  bdbench.so        [.] bfm_grow_leaf_16

OS:
NOTICE:  0: 1 ordered insertions in 2.089790 seconds, 47851690/sec
LOCATION:  bfm_test_insert_bulk, radix.c:2880


That is somewhat surprising - part of the promise of a slab allocator is
that it's fast...


This is caused by multiple issues, I think, some of which seem fairly easy to
fix.

1) If allocations are short-lived, slab.c can end up constantly
freeing/initializing blocks, which requires fairly expensive iteration over
all potential chunks in the block to initialize them, just to then free that
memory again after a small number of allocations. The extreme case of this is
when there are phases of alloc/free of a single allocation.

I "fixed" this by adding a few && slab->nblocks > 1 in SlabFree() and the
problem was vastly reduced. Instead of a 0.4x slowdown it's 0.88x. Of course
that only helps when a single block is involved, so it's not a great
approach. Perhaps just keeping the last allocated block around would work?


2) SlabChunkIndex() in SlabFree() is slow. It requires a 64bit division, taking
up ~50% of the cycles in SlabFree(). A 64bit div, according to [1], has a
latency of 35-88 cycles on skylake-x (and a reciprocal throughput of 21-83,
i.e. no parallelism). While it's getting a bit faster on icelake / zen 3, it's
still slow enough there to be very worrisome.

I don't see a way to get around the division while keeping the freelist
structure as is. But:

ISTM that we only need the index because of the free-chunk list, right? Why
don't we make the chunk list use actual pointers? Is the concern that that'd
increase the minimum allocation size?  If so, I see two ways around that:
First, we could make the index just the offset from the start of the block,
that's much cheaper to calculate. Second, we could store the next pointer in
SlabChunk->slab/block instead (making it a union) - while on the freelist we
don't need to dereference those, right?

I suspect both would also make the block initialization a bit cheaper.

That should also accelerate SlabBlockGetChunk(), which currently shows up as
an imul, which isn't exactly fast either (and uses a lot of execution ports).
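
To sketch what I mean, with simplified stand-in structures rather than the
actual slab.c definitions (all names invented):

#include <stddef.h>

typedef struct SlabBlockHdr SlabBlockHdr;
typedef struct SlabChunkHdr SlabChunkHdr;

struct SlabChunkHdr
{
	union
	{
		SlabBlockHdr *block;		/* valid while the chunk is allocated */
		SlabChunkHdr *next_free;	/* valid while the chunk is on the freelist */
	}			u;
};

struct SlabBlockHdr
{
	SlabChunkHdr *free_head;	/* singly linked list of free chunks */
	int			nfree;
};

/* freeing / allocating become pure pointer manipulation, no 64bit div */
static inline void
slab_push_free(SlabBlockHdr *block, SlabChunkHdr *chunk)
{
	chunk->u.next_free = block->free_head;
	block->free_head = chunk;
	block->nfree++;
}

static inline SlabChunkHdr *
slab_pop_free(SlabBlockHdr *block)
{
	SlabChunkHdr *chunk = block->free_head;

	if (chunk != NULL)
	{
		block->free_head = chunk->u.next_free;
		block->nfree--;
	}
	return chunk;
}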


3) Every free/alloc needing to unlink from slab->freelist[i] and then relink
shows up prominently in profiles. That's ~tripling the number of cachelines
touched in the happy path, with unpredictable accesses to boot.

Perhaps we could reduce the precision of slab->freelist indexing to amortize
that cost? I.e. not have one slab->freelist entry for each nfree, but instead
have an upper limit on the number of freelists?
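
For illustration only (the bucket count and the mapping are made up), the kind
of thing I'm imagining is:

/* Sketch: cap the number of freelists at a small constant. */
#define SLAB_FREELIST_BUCKETS 8

static inline int
slab_freelist_bucket(int nfree, int chunks_per_block)
{
	/*
	 * Map nfree (0 .. chunks_per_block) onto 0 .. SLAB_FREELIST_BUCKETS - 1,
	 * so a block only needs to be relinked when it crosses a bucket
	 * boundary, not on every alloc/free.
	 */
	if (nfree == 0)
		return 0;				/* completely full blocks keep their own list */
	return 1 + (nfree - 1) * (SLAB_FREELIST_BUCKETS - 1) / chunks_per_block;
}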


4) Less of a performance issue, and more of a usability issue: the constant
block size strikes me as problematic. Most users of an allocator sometimes
deal with a small amount of data and sometimes with a large amount.

Greetings,

Andres Freund

[1] https://www.agner.org/optimize/instruction_tables.pdf




Re: slab allocator performance issues

2021-07-17 Thread Andres Freund
Hi,

On 2021-07-17 12:43:33 -0700, Andres Freund wrote:
> 2) SlabChunkIndex() in SlabFree() is slow. It requires a 64bit division, 
> taking
> up ~50% of the cycles in SlabFree(). A 64bit div, according to [1] , has a
> latency of 35-88 cycles on skylake-x (and a reverse throughput of 21-83,
> i.e. no parallelism). While it's getting a bit faster on icelake / zen 3, it's
> still slow enough there to be very worrisome.
> 
> I don't see a way to get around the division while keeping the freelist
> structure as is. But:
> 
> ISTM that we only need the index because of the free-chunk list, right? Why
> don't we make the chunk list use actual pointers? Is it concern that that'd
> increase the minimum allocation size?  If so, I see two ways around that:
> First, we could make the index just the offset from the start of the block,
> that's much cheaper to calculate. Second, we could store the next pointer in
> SlabChunk->slab/block instead (making it a union) - while on the freelist we
> don't need to dereference those, right?
> 
> I suspect both would also make the block initialization a bit cheaper.
> 
> That should also accelerate SlabBlockGetChunk(), which currently shows up as
> an imul, which isn't exactly fast either (and uses a lot of execution ports).

Oh - I just saw that effectively the allocation size already is a
uintptr_t at minimum. I had only seen

	/* Make sure the linked list node fits inside a freed chunk */
	if (chunkSize < sizeof(int))
		chunkSize = sizeof(int);

but it's followed by

	/* chunk, including SLAB header (both addresses nicely aligned) */
	fullChunkSize = sizeof(SlabChunk) + MAXALIGN(chunkSize);

which means we are reserving enough space for a pointer on just about
any platform already? Seems we can just make that official and reserve
space for a pointer as part of the chunk size rounding up, instead of
fullChunkSize?

Greetings,

Andres Freund




corruption of WAL page header is never reported

2021-07-17 Thread Yugo NAGATA
Hello,

I found that corruption of a WAL page header detected during recovery is never
reported in log messages. If a WAL page header is broken, it is detected in
XLogReaderValidatePageHeader() called from XLogPageRead(), but the error
message is always reset and never reported.

    if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
    {
        /* reset any error XLogReaderValidatePageHeader() might have set */
        xlogreader->errormsg_buf[0] = '\0';
        goto next_record_is_invalid;
    }

Since commit 06687198018, we call XLogReaderValidatePageHeader here so that
we can check a page header and retry immediately if it's invalid, but the
error message is reset immediately and not reported. I guess the reason the
error message is reset is that we might get the right WAL after some retries.
However, I think it is better to report the error for each check in order to
let users know the actual issues found in the WAL.

I attached a patch to fix it in this way.

Or, if we don't want to report an error for each check, and what we want to
catch here is really just old recycled WAL rather than header corruption
itself, I wonder whether we could check just xlp_pageaddr instead of calling
XLogReaderValidatePageHeader.
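
Inside XLogPageRead(), that alternative might look roughly like this (just a
sketch, I have not tested it):

	/*
	 * Rough sketch: only detect old recycled WAL by checking that the page
	 * really is the one we asked for, instead of doing (and then silently
	 * discarding) the full header validation.
	 */
	XLogPageHeader hdr = (XLogPageHeader) readBuf;

	if (hdr->xlp_pageaddr != targetPagePtr)
		goto next_record_is_invalid;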

Regards,
Yugo Nagata

-- 
Yugo NAGATA 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index edb15fe58d..4fdb23f061 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -12297,8 +12297,13 @@ retry:
 	 */
 	if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
 	{
-		/* reset any error XLogReaderValidatePageHeader() might have set */
-		xlogreader->errormsg_buf[0] = '\0';
+		if (StandbyMode && xlogreader->errormsg_buf[0] != '\0')
+		{
+			ereport(emode_for_corrupt_record(emode, EndRecPtr),
+	(errmsg_internal("%s", xlogreader->errormsg_buf) /* already translated */ ));
+			/* reset any error XLogReaderValidatePageHeader() might have set */
+			xlogreader->errormsg_buf[0] = '\0';
+		}
 		goto next_record_is_invalid;
 	}
 


Re: slab allocator performance issues

2021-07-17 Thread Tomas Vondra

Hi,

On 7/17/21 9:43 PM, Andres Freund wrote:

> Hi,
>
> I just tried to use the slab allocator for a case where aset.c was
> bloating memory usage substantially. First: It worked wonders for memory
> usage, nearly eliminating overhead.
>
> But it turned out to cause a *substantial* slowdown. With aset the
> allocator is barely in the profile. With slab the profile is dominated
> by allocator performance.
>
> slab:
> NOTICE:  0: 1 ordered insertions in 5.216287 seconds, 19170724/sec
> LOCATION:  bfm_test_insert_bulk, radix.c:2880
>    Overhead  Command   Shared Object Symbol
>
> +   28.27%  postgres  postgres  [.] SlabAlloc
> +9.64%  postgres  bdbench.so[.] bfm_delete
> +9.03%  postgres  bdbench.so[.] bfm_set
> +8.39%  postgres  bdbench.so[.] bfm_lookup
> +8.36%  postgres  bdbench.so[.] bfm_set_leaf.constprop.0
> +8.16%  postgres  libc-2.31.so  [.] __memmove_avx_unaligned_erms
> +6.88%  postgres  bdbench.so[.] bfm_delete_leaf
> +3.24%  postgres  libc-2.31.so  [.] _int_malloc
> +2.58%  postgres  bdbench.so[.] bfm_tests
> +2.33%  postgres  postgres  [.] SlabFree
> +1.29%  postgres  libc-2.31.so  [.] _int_free
> +1.09%  postgres  libc-2.31.so  [.] unlink_chunk.constprop.0
>
> aset:
>
> NOTICE:  0: 1 ordered insertions in 2.082602 seconds, 48016848/sec
> LOCATION:  bfm_test_insert_bulk, radix.c:2880
>
> +   16.43%  postgres  bdbench.so[.] bfm_lookup
> +   15.38%  postgres  bdbench.so[.] bfm_delete
> +   12.82%  postgres  libc-2.31.so  [.] __memmove_avx_unaligned_erms
> +   12.65%  postgres  bdbench.so[.] bfm_set
> +   12.15%  postgres  bdbench.so[.] bfm_set_leaf.constprop.0
> +   10.57%  postgres  bdbench.so[.] bfm_delete_leaf
> +4.05%  postgres  bdbench.so[.] bfm_tests
> +2.93%  postgres  [kernel.vmlinux]  [k] clear_page_erms
> +1.59%  postgres  postgres  [.] AllocSetAlloc
> +1.15%  postgres  bdbench.so[.] memmove@plt
> +1.06%  postgres  bdbench.so[.] bfm_grow_leaf_16
>
> OS:
> NOTICE:  0: 1 ordered insertions in 2.089790 seconds, 47851690/sec
> LOCATION:  bfm_test_insert_bulk, radix.c:2880
>
>
> That is somewhat surprising - part of the promise of a slab allocator is
> that it's fast...
>
>
> This is caused by multiple issues, I think. Some of which seems fairly easy to
> fix.
>
> 1) If allocations are short-lived slab.c, can end up constantly
> freeing/initializing blocks. Which requires fairly expensively iterating over
> all potential chunks in the block and initializing it. Just to then free that
> memory again after a small number of allocations. The extreme case of this is
> when there are phases of alloc/free of a single allocation.
>
> I "fixed" this by adding a few && slab->nblocks > 1 in SlabFree() and the
> problem vastly reduced. Instead of a 0.4x slowdown it's 0.88x. Of course that
> only works if the problem is with the only, so it's not a great
> approach. Perhaps just keeping the last allocated block around would work?



+1

I think it makes perfect sense to not free the blocks immediately, and 
keep one (or a small number) as a cache. I'm not sure why we decided not 
to have a "keeper" block, but I suspect memory consumption was my main 
concern at that point. But I never expected the cost to be this high.
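
Even something as simple as this might already help (made-up names, just to
illustrate the idea, not the actual slab.c structures):

#include <stdlib.h>

typedef struct KeeperBlock
{
	char		data[1];		/* stand-in for the real block contents */
} KeeperBlock;

typedef struct KeeperSlab
{
	KeeperBlock *keeper;		/* one cached empty block, or NULL */
	size_t		block_size;
} KeeperSlab;

/* called when the last chunk of a block has been freed */
static void
block_became_empty(KeeperSlab *slab, KeeperBlock *block)
{
	if (slab->keeper == NULL)
		slab->keeper = block;	/* keep it around for the next burst */
	else
		free(block);
}

/* called when a new block is needed */
static KeeperBlock *
get_block(KeeperSlab *slab)
{
	KeeperBlock *block = slab->keeper;

	if (block != NULL)
	{
		slab->keeper = NULL;	/* reuse the cached block, no malloc */
		return block;
	}
	return (KeeperBlock *) malloc(slab->block_size);
}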




> 2) SlabChunkIndex() in SlabFree() is slow. It requires a 64bit division, taking
> up ~50% of the cycles in SlabFree(). A 64bit div, according to [1] , has a
> latency of 35-88 cycles on skylake-x (and a reverse throughput of 21-83,
> i.e. no parallelism). While it's getting a bit faster on icelake / zen 3, it's
> still slow enough there to be very worrisome.
>
> I don't see a way to get around the division while keeping the freelist
> structure as is. But:
>
> ISTM that we only need the index because of the free-chunk list, right? Why
> don't we make the chunk list use actual pointers? Is it concern that that'd
> increase the minimum allocation size?  If so, I see two ways around that:
> First, we could make the index just the offset from the start of the block,
> that's much cheaper to calculate. Second, we could store the next pointer in
> SlabChunk->slab/block instead (making it a union) - while on the freelist we
> don't need to dereference those, right?
>
> I suspect both would also make the block initialization a bit cheaper.
>
> That should also accelerate SlabBlockGetChunk(), which currently shows up as
> an imul, which isn't exactly fast either (and uses a lot of execution ports).



Hmm, I think you're right we could simply use the pointers, but I have 
not tried that.




> 3) Every free/alloc needing to unlink from slab->freelist[i] and then relink
> shows up prominently in profiles. That's ~tripling the number of cachelines
> touched in the happy path, with unpredictable accesses to boot.
>
> Perhaps we could reduce the precision of slab->freelist indexing to amortize
> that cost? I.e. not have one slab->freelist entry for each nfree, but instead
> have an upper limit on the number

Re: slab allocator performance issues

2021-07-17 Thread Andres Freund
Hi,

On 2021-07-17 22:35:07 +0200, Tomas Vondra wrote:
> On 7/17/21 9:43 PM, Andres Freund wrote:
> > 1) If allocations are short-lived slab.c, can end up constantly
> > freeing/initializing blocks. Which requires fairly expensively iterating 
> > over
> > all potential chunks in the block and initializing it. Just to then free 
> > that
> > memory again after a small number of allocations. The extreme case of this 
> > is
> > when there are phases of alloc/free of a single allocation.
> > 
> > I "fixed" this by adding a few && slab->nblocks > 1 in SlabFree() and the
> > problem vastly reduced. Instead of a 0.4x slowdown it's 0.88x. Of course 
> > that
> > only works if the problem is with the only, so it's not a great
> > approach. Perhaps just keeping the last allocated block around would work?
> > 
> 
> +1
> 
> I think it makes perfect sense to not free the blocks immediately, and keep
> one (or a small number) as a cache. I'm not sure why we decided not to have
> a "keeper" block, but I suspect memory consumption was my main concern at
> that point. But I never expected the cost to be this high.

I think one free block might be too low in some cases. It's pretty
common to have workloads where the number of allocations is "bursty",
and it's imo one case where one might justifiably want to use a slab
allocator... Perhaps a portion of a high watermark? Or a portion of the
in use blocks?

Hm. I wonder if we should just not populate the freelist eagerly, to
drive down the initialization cost. I.e. have a separate allocation path
for chunks that have never been allocated, by having a
SlabBlock->free_offset or such.

Sure, it adds a branch to the allocation happy path, but it also makes the
first allocation for a chunk cheaper, because there's no need to get the next
element from the freelist (adding a likely cache miss). And it should make the
allocation of a new block faster by a lot.


> > 2) SlabChunkIndex() in SlabFree() is slow. It requires a 64bit division, 
> > taking
> > up ~50% of the cycles in SlabFree(). A 64bit div, according to [1] , has a
> > latency of 35-88 cycles on skylake-x (and a reverse throughput of 21-83,
> > i.e. no parallelism). While it's getting a bit faster on icelake / zen 3, 
> > it's
> > still slow enough there to be very worrisome.
> > 
> > I don't see a way to get around the division while keeping the freelist
> > structure as is. But:
> > 
> > ISTM that we only need the index because of the free-chunk list, right? Why
> > don't we make the chunk list use actual pointers? Is it concern that that'd
> > increase the minimum allocation size?  If so, I see two ways around that:
> > First, we could make the index just the offset from the start of the block,
> > that's much cheaper to calculate. Second, we could store the next pointer in
> > SlabChunk->slab/block instead (making it a union) - while on the freelist we
> > don't need to dereference those, right?
> > 
> > I suspect both would also make the block initialization a bit cheaper.
> > 
> > That should also accelerate SlabBlockGetChunk(), which currently shows up as
> > an imul, which isn't exactly fast either (and uses a lot of execution 
> > ports).
> > 
> 
> Hmm, I think you're right we could simply use the pointers, but I have not
> tried that.

I quickly tried that, and it does seem to improve matters considerably. The
block initialization still shows up as expensive, but not as bad. The div and
imul are gone (except in an assertion build right now). The list manipulation
still is visible.



> > 3) Every free/alloc needing to unlink from slab->freelist[i] and then relink
> > shows up prominently in profiles. That's ~tripling the number of cachelines
> > touched in the happy path, with unpredictable accesses to boot.
> > 
> > Perhaps we could reduce the precision of slab->freelist indexing to amortize
> > that cost? I.e. not have one slab->freelist entry for each nfree, but 
> > instead
> > have an upper limit on the number of freelists?
> > 
> 
> Yeah. The purpose of organizing the freelists like this is to prioritize the
> "more full" blocks" when allocating new chunks, in the hope that the "less
> full" blocks will end up empty and freed faster.
> 
> But this is naturally imprecise, and strongly depends on the workload, of
> course, and I bet for most cases a less precise approach would work just as
> fine.
> 
> I'm not sure how exactly would the upper limit you propose work, but perhaps
> we could group the blocks for nfree ranges, say [0-15], [16-31] and so on.
> So after the alloc/free we'd calculate the new freelist index as (nfree/16)
> and only moved the block if the index changed. This would reduce the
> overhead to 1/16 and probably even more in practice.

> Of course, we could also say we have e.g. 8 freelists and work the ranges
> backwards from that, I guess that's what you propose.

Yea, I was thinking something along those lines. As you say, how about there's
always at most 8 freelists or such. During init

Re: corruption of WAL page header is never reported

2021-07-17 Thread Ranier Vilela
On Sat, Jul 17, 2021 at 16:57, Yugo NAGATA wrote:

> Hello,
>
> I found that any corruption of WAL page header found during recovery is
> never
> reported in log messages. If wal page header is broken, it is detected in
> XLogReaderValidatePageHeader called from  XLogPageRead, but the error
> messages
> are always reset and never reported.
>
> if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr,
> readBuf))
> {
>/* reset any error XLogReaderValidatePageHeader() might
> have set */
>xlogreader->errormsg_buf[0] = '\0';
>goto next_record_is_invalid;
> }
>
> Since the commit 06687198018, we call XLogReaderValidatePageHeader here so
> that
> we can check a page header and retry immediately if it's invalid, but the
> error
> message is reset immediately and not reported. I guess the reason why the
> error
> message is reset is because we might get the right WAL after some retries.
> However, I think it is better to report the error for each check in order
> to let
> users know the actual issues founded in the WAL.
>
> I attached a patch to fix it in this way.
>
I think that, to keep the same behavior as before, it is necessary to always run:

/* reset any error XLogReaderValidatePageHeader() might have set */
xlogreader->errormsg_buf[0] = '\0';

no?

regards,
Ranier Vilela
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index edb15fe58d..4fdb23f061 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -12297,8 +12297,13 @@ retry:
 	 */
 	if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
 	{
+		if (StandbyMode && xlogreader->errormsg_buf[0] != '\0')
+		{
+			ereport(emode_for_corrupt_record(emode, EndRecPtr),
+	(errmsg_internal("%s", xlogreader->errormsg_buf) /* already translated */ ));
+		}
		/* reset any error XLogReaderValidatePageHeader() might have set */
		xlogreader->errormsg_buf[0] = '\0';
 		goto next_record_is_invalid;
 	}
 


Re: slab allocator performance issues

2021-07-17 Thread Tomas Vondra




On 7/17/21 11:14 PM, Andres Freund wrote:

> Hi,
>
> On 2021-07-17 22:35:07 +0200, Tomas Vondra wrote:
> > On 7/17/21 9:43 PM, Andres Freund wrote:
> > > 1) If allocations are short-lived slab.c, can end up constantly
> > > freeing/initializing blocks. Which requires fairly expensively iterating over
> > > all potential chunks in the block and initializing it. Just to then free that
> > > memory again after a small number of allocations. The extreme case of this is
> > > when there are phases of alloc/free of a single allocation.
> > >
> > > I "fixed" this by adding a few && slab->nblocks > 1 in SlabFree() and the
> > > problem vastly reduced. Instead of a 0.4x slowdown it's 0.88x. Of course that
> > > only works if the problem is with the only, so it's not a great
> > > approach. Perhaps just keeping the last allocated block around would work?
> >
> > +1
> >
> > I think it makes perfect sense to not free the blocks immediately, and keep
> > one (or a small number) as a cache. I'm not sure why we decided not to have
> > a "keeper" block, but I suspect memory consumption was my main concern at
> > that point. But I never expected the cost to be this high.
>
> I think one free block might be too low in some cases. It's pretty
> common to have workloads where the number of allocations is "bursty",
> and it's imo one case where one might justifiably want to use a slab
> allocator... Perhaps a portion of a high watermark? Or a portion of the
> in use blocks?



I think the portion of watermark would be problematic for cases with one 
huge transaction - that'll set a high watermark, and we'll keep way too 
many free blocks. But the portion of in use blocks might work, I think.



> Hm. I wonder if we should just not populate the freelist eagerly, to
> drive down the initialization cost. I.e. have a separate allocation path
> for chunks that have never been allocated, by having a
> SlabBlock->free_offset or such.
>
> Sure, it adds a branch to the allocation happy path, but it also makes the
> first allocation for a chunk cheaper, because there's no need to get the next
> element from the freelist (adding a likely cache miss). And it should make the
> allocation of a new block faster by a lot.



Not sure what you mean by 'not populate eagerly' so can't comment :-(




> > > 2) SlabChunkIndex() in SlabFree() is slow. It requires a 64bit division, taking
> > > up ~50% of the cycles in SlabFree(). A 64bit div, according to [1] , has a
> > > latency of 35-88 cycles on skylake-x (and a reverse throughput of 21-83,
> > > i.e. no parallelism). While it's getting a bit faster on icelake / zen 3, it's
> > > still slow enough there to be very worrisome.
> > >
> > > I don't see a way to get around the division while keeping the freelist
> > > structure as is. But:
> > >
> > > ISTM that we only need the index because of the free-chunk list, right? Why
> > > don't we make the chunk list use actual pointers? Is it concern that that'd
> > > increase the minimum allocation size?  If so, I see two ways around that:
> > > First, we could make the index just the offset from the start of the block,
> > > that's much cheaper to calculate. Second, we could store the next pointer in
> > > SlabChunk->slab/block instead (making it a union) - while on the freelist we
> > > don't need to dereference those, right?
> > >
> > > I suspect both would also make the block initialization a bit cheaper.
> > >
> > > That should also accelerate SlabBlockGetChunk(), which currently shows up as
> > > an imul, which isn't exactly fast either (and uses a lot of execution ports).
> >
> > Hmm, I think you're right we could simply use the pointers, but I have not
> > tried that.
>
> I quickly tried that, and it does seem to improve matters considerably. The
> block initialization still shows up as expensive, but not as bad. The div and
> imul are gone (exept in an assertion build right now). The list manipulation
> still is visible.



Understood. I didn't expect this to be a full solution.




> > > 3) Every free/alloc needing to unlink from slab->freelist[i] and then relink
> > > shows up prominently in profiles. That's ~tripling the number of cachelines
> > > touched in the happy path, with unpredictable accesses to boot.
> > >
> > > Perhaps we could reduce the precision of slab->freelist indexing to amortize
> > > that cost? I.e. not have one slab->freelist entry for each nfree, but instead
> > > have an upper limit on the number of freelists?
> >
> > Yeah. The purpose of organizing the freelists like this is to prioritize the
> > "more full" blocks" when allocating new chunks, in the hope that the "less
> > full" blocks will end up empty and freed faster.
> >
> > But this is naturally imprecise, and strongly depends on the workload, of
> > course, and I bet for most cases a less precise approach would work just as
> > fine.
> >
> > I'm not sure how exactly would the upper limit you propose work, but perhaps
> > we could group the blocks for nfree ranges, say [0-15], [16-31] and so on.
> > So after the alloc/free we'd calculate the new freelist index as (nfree/16)
> > and only moved the block if the index changed. This would reduce the
> > overhead to 1/16 and probably even more in practice.
> >
> > Of course, we could also say we have e.g. 8 freelists and work the ranges
> > backwards from that, I 

Re: slab allocator performance issues

2021-07-17 Thread Andres Freund
Hi,

On 2021-07-18 00:46:09 +0200, Tomas Vondra wrote:
> On 7/17/21 11:14 PM, Andres Freund wrote:
> > Hm. I wonder if we should just not populate the freelist eagerly, to
> > drive down the initialization cost. I.e. have a separate allocation path
> > for chunks that have never been allocated, by having a
> > SlabBlock->free_offset or such.
> > 
> > Sure, it adds a branch to the allocation happy path, but it also makes the
> > first allocation for a chunk cheaper, because there's no need to get the 
> > next
> > element from the freelist (adding a likely cache miss). And it should make 
> > the
> > allocation of a new block faster by a lot.
> > 
> 
> Not sure what you mean by 'not populate eagerly' so can't comment :-(

Instead of populating a linked list with all chunks upon creation of a block -
which requires touching a fair bit of memory - keep a per-block pointer (or an
offset) into the "unused" area of the block. When allocating from the block and
there's still "unused" memory left, use that, instead of bothering with the
freelist.

I tried that, and it nearly got slab up to the allocation/freeing performance
of aset.c (while winning after allocation, due to the higher memory density).
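
Very roughly, with invented, simplified structures (the real slab.c layout
differs), the per-block fast path I tried looks like:

#include <stddef.h>

typedef struct LazyChunk
{
	struct LazyChunk *next_free;	/* only used once the chunk has been freed */
} LazyChunk;

typedef struct LazyBlock
{
	LazyChunk  *free_head;		/* chunks freed earlier, available for reuse */
	size_t		free_offset;	/* start of the never-yet-used tail of the block */
} LazyBlock;

static void *
block_alloc_chunk(LazyBlock *block, size_t chunk_size, size_t block_size)
{
	/* 1. prefer chunks that have been freed before */
	if (block->free_head != NULL)
	{
		LazyChunk  *chunk = block->free_head;

		block->free_head = chunk->next_free;
		return chunk;
	}

	/* 2. otherwise carve a chunk off the untouched tail, no initialization */
	if (block->free_offset + chunk_size <= block_size)
	{
		char	   *chunk = (char *) block + block->free_offset;

		block->free_offset += chunk_size;
		return chunk;
	}

	return NULL;				/* block is full */
}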


> > > > 4) Less of a performance, and more of a usability issue: The constant
> > > > block size strikes me as problematic. Most users of an allocator can
> > > > sometimes be used with a small amount of data, and sometimes with a
> > > > large amount.
> > > > 
> > > 
> > > I doubt this is worth the effort, really. The constant block size makes
> > > various places much simpler (both to code and reason about), so this 
> > > should
> > > not make a huge difference in performance. And IMHO the block size is 
> > > mostly
> > > an implementation detail, so I don't see that as a usability issue.
> > 
> > Hm? It's something the user has to specify, so I it's not really an
> > implementation detail. It needs to be specified without sufficient
> > information, as well, since externally one doesn't know how much memory the
> > block header and chunk headers + rounding up will use, so computing a good
> > block size isn't easy. I've wondered whether it should just be a count...
> > 
> 
> I think this is mixing two problems - how to specify the block size, and
> whether the block size is constant (as in slab) or grows over time (as in
> allocset).

That was in response to the "implementation detail" bit solely.


> But growing the block size seems problematic for long-lived contexts with
> workloads that change a lot over time - imagine e.g. decoding many small
> transactions, with one huge transaction mixed in. The one huge transaction
> will grow the block size, and we'll keep using it forever. But in that case
> we might have just as well allocate the large blocks from the start, I
> guess.

I was thinking of capping the growth fairly low. I don't think after a 16x
growth or so you're likely to still see allocation performance gains with
slab. And I don't think that'd be too bad for decoding - we'd start with a
small initial block size, and in many workloads that will be enough, and just
workloads where that doesn't suffice will adapt performance wise.  And: Medium
term I wouldn't expect reorderbuffer.c to stay the only slab.c user...


> > Why do you not think it's relevant for performance? Either one causes too 
> > much
> > memory usage by using a too large block size, wasting memory, or one ends up
> > loosing perf through frequent allocations?
>
> True. I simply would not expect this to make a huge difference - I may be
> wrong, and I'm sure there are workloads where it matters. But I still think
> it's easier to just use larger blocks than to make the slab code more
> complex.

IDK. I'm looking at using slab as part of a radix tree implementation right
now. Which I'd hope to be used in various different situations. So it's hard
to choose the right block size - and it does seem to matter for performance.

Greetings,

Andres Freund




Re: slab allocator performance issues

2021-07-17 Thread Andres Freund
Hi,

On 2021-07-17 16:10:19 -0700, Andres Freund wrote:
> Instead of populating a linked list with all chunks upon creation of a block -
> which requires touching a fair bit of memory - keep a per-block pointer (or an
> offset) into "unused" area of the block. When allocating from the block and
> theres still "unused" memory left, use that, instead of bothering with the
> freelist.
> 
> I tried that, and it nearly got slab up to the allocation/freeing performance
> of aset.c (while winning after allocation, due to the higher memory density).

Combining that with limiting the number of freelists, and some
microoptimizations, allocation performance is now on par.

Freeing still seems to be a tad slower, mostly because SlabFree()
practically is immediately stalled on fetching the block, whereas
AllocSetFree() can happily speculate ahead and do work like computing
the freelist index. And then aset only needs to access memory inside the
context - which is much more likely to be in cache than a freelist
inside a block (there are many more).

But that's ok, I think. It's close and it's only a small share of the
overall runtime of my workload...

Greetings,

Andres Freund




Failure with 004_logrotate in prairiedog

2021-07-17 Thread Michael Paquier
Hi all,

prairiedog has failed in a way that seems a bit obscure to me:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2021-07-18%2000%3A23%3A29

Here are the details of the failure:
server signaled to rotate log file
could not read
"/Users/buildfarm/bf-data/HEAD/pgsql.build/src/bin/pg_ctl/tmp_check/t_004_logrotate_primary_data/pgdata/current_logfiles":
No such file or directory at t/004_logrotate.pl line 78
### Stopping node "primary" using mode immediate

update_metainfo_datafile() creates a temporary file renamed to
current_logfiles with rename().  It should be atomic, though this
error points out that this is not the case?  The previous steps of
this test ensure that current_logfiles should exist.
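
For reference, the pattern in question is the usual write-a-temp-file-then-
rename() one, roughly like this (simplified, not the actual
update_metainfo_datafile() code):

#include <stdio.h>

/*
 * Simplified version of the write-then-rename pattern: on a POSIX
 * filesystem rename() replaces the target atomically, so a reader should
 * see either the old file or the new one, never "No such file".
 */
static int
replace_file_atomically(const char *tmp_path, const char *final_path,
						const char *contents)
{
	FILE	   *fh = fopen(tmp_path, "w");

	if (fh == NULL)
		return -1;
	if (fputs(contents, fh) < 0)
	{
		fclose(fh);
		return -1;
	}
	if (fclose(fh) != 0)
		return -1;
	return rename(tmp_path, final_path);
}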

We could use some eval blocks in this area, but a non-atomic rename()
would cause problems in more areas.  Thoughts?
--
Michael




Re: Added documentation for cascade and restrict option of drop statistics

2021-07-17 Thread Michael Paquier
On Fri, Jul 16, 2021 at 08:12:56PM +0530, vignesh C wrote:
> Cascade and restrict options are supported for drop statistics syntax:
> drop statistics stat1 cascade;
> drop statistics stat2 restrict;
> 
> The documentation for this was missing, attached a patch which
> includes the documentation for these options.

Indeed, good catch.  The other commands document that, so let's fix
it.
--
Michael




Re: Failure with 004_logrotate in prairiedog

2021-07-17 Thread Tom Lane
Michael Paquier  writes:
> prairiedog has failed in a way that seems a bit obscure to me:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2021-07-18%2000%3A23%3A29
> ...
> We could use some eval blocks in this area, but a non-atomic rename()
> would cause problems in more areas.  Thoughts?

Awhile back we discovered that old macOS versions have non-atomic rename
[1].  I eventually shut down dromedary because that was causing failures
often enough to be annoying.  I'd not seen such a failure before on
prairiedog, but it sure seems plausible that this is one.

regards, tom lane

[1] https://www.postgresql.org/message-id/2636.1569016...@sss.pgh.pa.us