Re: Replication slot stats misgivings

2021-04-28 Thread Masahiko Sawada
On Fri, Apr 16, 2021 at 2:58 PM Amit Kapila  wrote:
>
> On Thu, Apr 15, 2021 at 4:35 PM Masahiko Sawada  wrote:
> >
> > Thank you for the update! The patch looks good to me.
> >

BTW regarding the commit f5fc2f5b23 that added total_txns and
total_bytes, we add the reorder buffer size (i.e., rb->size) to
rb->totalBytes but I think we should use the transaction size (i.e.,
txn->size) instead:

@@ -1363,6 +1365,11 @@ ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state)
dlist_delete(&change->node);
dlist_push_tail(&state->old_change, &change->node);

+   /*
+* Update the total bytes processed before releasing the current set
+* of changes and restoring the new set of changes.
+*/
+   rb->totalBytes += rb->size;
if (ReorderBufferRestoreChanges(rb, entry->txn, &entry->file,
&state->entries[off].segno))
{
@@ -2363,6 +2370,20 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
ReorderBufferIterTXNFinish(rb, iterstate);
iterstate = NULL;

+   /*
+* Update total transaction count and total transaction bytes
+* processed. Ensure to not count the streamed transaction multiple
+* times.
+*
+* Note that the statistics computation has to be done after
+* ReorderBufferIterTXNFinish as it releases the serialized change
+* which we have already accounted in ReorderBufferIterTXNNext.
+*/
+   if (!rbtxn_is_streamed(txn))
+   rb->totalTxns++;
+
+   rb->totalBytes += rb->size;
+

IIUC rb->size could include multiple decoded transactions, so it's not
appropriate to add that value to the counter as the transaction size
passed to the logical decoding plugin. If the reorder buffer processes a
transaction while a large transaction is still being decoded, we could
end up increasing txn_bytes more than necessary.

Please review the attached patch.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index c27f710053..30d9ffa0da 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1369,7 +1369,7 @@ ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state)
 		 * Update the total bytes processed before releasing the current set
 		 * of changes and restoring the new set of changes.
 		 */
-		rb->totalBytes += rb->size;
+		rb->totalBytes += entry->txn->size;
 		if (ReorderBufferRestoreChanges(rb, entry->txn, &entry->file,
 		&state->entries[off].segno))
 		{
@@ -2382,7 +2382,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		if (!rbtxn_is_streamed(txn))
 			rb->totalTxns++;
 
-		rb->totalBytes += rb->size;
+		rb->totalBytes += txn->size;
 
 		/*
 		 * Done with current changes, send the last message for this set of


Re: [BUG] "FailedAssertion" reported when streaming in logical replication

2021-04-28 Thread Dilip Kumar
On Wed, Apr 28, 2021 at 12:25 PM tanghy.f...@fujitsu.com
 wrote:
>
> > I have modified the patch based on the above comments.
>
> Thanks for your patch.
> I tested again after applying your patch and the problem is fixed.

Thanks for confirming.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Result Cache node shows per-worker info even for workers not launched

2021-04-28 Thread David Rowley
On Wed, 28 Apr 2021 at 00:39, Amit Khandekar  wrote:
> If planned parallel workers do not get launched, the Result Cache plan
> node shows all-0 stats for each of those workers:

Thanks for reporting this and for the patch.

You're right that there is a problem here. I did in fact have code to
skip workers that didn't have any cache misses right up until v18 of
the patch [1], but I removed it because I was getting some test
instability in the partition_prune regression tests. That was
highlighted by the CFbot machines. I mentioned that in the final
paragraph of [2]. I didn't mention the exact test there, but I was
talking about the test in partition_prune.sql.

By the time it came to b6002a796, I did end up changing the
partition_prune tests. However, I had to back that version out again
because of some problems with force_parallel_mode = regress buildfarm
animals.  By the time I re-committed Result Cache in 9eacee2e6, I had
changed the partition_prune tests so they did SET enable_resultcache =
0 before running that parallel test. I'd basically decided that the
test was never going to be stable on the buildfarm.

The problem there was that the results would vary depending on if the
parallel worker managed to do anything before the main process did all
the work. Because the tests are pretty small scale, many machines
wouldn't manage to get their worker(s) in gear and running before the
main process had finished the test. This was the reason I removed the
cache_misses == 0 test in explain.c. I'd thought that I could make
that test stable by just always outputting the cache stats line from
workers, even if they didn't assist during execution.

So, given that I removed the parallel test in partition_prune.sql, and
don't have any EXPLAIN ANALYZE output for parallel tests in
resultcache.sql, it should be safe enough to put that cache_misses ==
0 test back into explain.c.

I've attached a patch to do this. The explain.c part is pretty similar
to your patch; I just took my original code and comment.
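For reference, the shape of the check being put back is roughly the
following sketch; the struct and field names are approximations of the
Result Cache code rather than a copy of the committed patch:

/* In show_resultcache_info(), when per-worker instrumentation exists */
for (n = 0; n < rcstate->shared_info->num_workers; n++)
{
    ResultCacheInstrumentation *si = &rcstate->shared_info->sinstrument[n];

    /*
     * Skip workers that did no work.  Checking cache_misses alone is
     * enough, since a worker cannot record a hit before its first miss.
     */
    if (si->cache_misses == 0)
        continue;

    ExplainOpenWorker(n, es);
    /* ... emit the per-worker Hits/Misses/Evictions lines as before ... */
    ExplainCloseWorker(n, es);
}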

The patch also removes the SET force_parallel_mode = off in
resultcache.sql.  That's no longer needed due to adding this check in
explain.c again.  I also removed the changes I made to the
explain_parallel_append function in partition_prune.sql.  I shouldn't
have included those in 9eacee2e6.

I plan to push this in the next 24 hours or so.

David

[1] 
https://postgr.es/m/caaphdvoomttnpof-+q1daoma8vwfsfbyqb_a0iukts5nf2d...@mail.gmail.com
[2] 
https://postgr.es/m/CAApHDvrz4f+i1wu-8hyqJ=pxydroga5okgo5rwpoj47rz6q...@mail.gmail.com


resultcache_explain_fix.patch
Description: Binary data


Some doubious error messages and comments

2021-04-28 Thread Kyotaro Horiguchi
Hello.

0001: I found some typos in an error message and a comment.

multirangetypes.c: 1420
> errmsg("range_intersect_agg must be called with a multirange")));

This "range_intersect_agg" looks like a typo of "multirange_..".

operatorcmds.c:303
> * Look up a join estimator function ny name, and verify that it has the

"ny" looks like a typo of "by".



0002: The following messages are substantially the same and are needlessly
split into separate messages. I'm not sure whether any compiler complains
about using %zu for an int; explicit casting would work in that case.

be-secure-gssapi.c:351
>   (errmsg("oversize GSSAPI packet sent by the client (%zu > %zu)",
>   (size_t) input.length,
>   PQ_GSS_RECV_BUFFER_SIZE - sizeof(uint32))));
be-secure-gssapi.c:570
>   (errmsg("oversize GSSAPI packet sent by the client (%zu > %d)",
>   (size_t) input.length,
>   PQ_GSS_RECV_BUFFER_SIZE)));



0003: The messages below seem to be a bit unclear.  I'm not sure they are
worth changing.

conversioncmds.c: 130
 errmsg("encoding conversion function %s returned incorrect result for empty 
input",

This is not wrong at all, but another message just above is saying
that "encoding conversion function %s must return type %s".  Why
aren't we explicit here, like this?

"encoding conversion function %s must return zero for empty input"


typecmds.c:4294
>   if (requireSuper)
>   if (!superuser())
>   ereport(ERROR,
>errmsg("must be superuser to alter a 
> type")));

Here, requireSuper varies depending on the set of operations, but the
message reads as if it describes the general behavior. I'm not sure, but
something like the following might be better?

+errmsg("must be superuser to perform all operations")));
+errmsg("some of the operations require superuser privilege")));

Any opinions or suggestions?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 1634d8dd87b1474f60cfd6689c1e58acc87c1cfc Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 28 Apr 2021 17:23:52 +0900
Subject: [PATCH 1/3] Fix typos

---
 src/backend/commands/operatorcmds.c | 2 +-
 src/backend/utils/adt/multirangetypes.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/commands/operatorcmds.c b/src/backend/commands/operatorcmds.c
index 809043c5d1..ba456c4dd1 100644
--- a/src/backend/commands/operatorcmds.c
+++ b/src/backend/commands/operatorcmds.c
@@ -300,7 +300,7 @@ ValidateRestrictionEstimator(List *restrictionName)
 }
 
 /*
- * Look up a join estimator function ny name, and verify that it has the
+ * Look up a join estimator function by name, and verify that it has the
  * correct signature and we have the permissions to attach it to an
  * operator.
  */
diff --git a/src/backend/utils/adt/multirangetypes.c b/src/backend/utils/adt/multirangetypes.c
index 0b81649779..2583ddeedf 100644
--- a/src/backend/utils/adt/multirangetypes.c
+++ b/src/backend/utils/adt/multirangetypes.c
@@ -1417,7 +1417,7 @@ multirange_intersect_agg_transfn(PG_FUNCTION_ARGS)
 	if (!type_is_multirange(mltrngtypoid))
 		ereport(ERROR,
 (errcode(ERRCODE_DATATYPE_MISMATCH),
- errmsg("range_intersect_agg must be called with a multirange")));
+ errmsg("multirange_intersect_agg must be called with a multirange")));
 
 	typcache = multirange_get_typcache(fcinfo, mltrngtypoid);
 
-- 
2.27.0

>From 14a6b4b81d3b4d35bf690e1c240975bd888531f3 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 28 Apr 2021 17:24:31 +0900
Subject: [PATCH 2/3] Consolidate substantially same messages

---
 src/backend/libpq/be-secure-gssapi.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/libpq/be-secure-gssapi.c b/src/backend/libpq/be-secure-gssapi.c
index 316ca65db5..a335f78d47 100644
--- a/src/backend/libpq/be-secure-gssapi.c
+++ b/src/backend/libpq/be-secure-gssapi.c
@@ -567,9 +567,9 @@ secure_open_gssapi(Port *port)
 		if (input.length > PQ_GSS_RECV_BUFFER_SIZE)
 		{
 			ereport(COMMERROR,
-	(errmsg("oversize GSSAPI packet sent by the client (%zu > %d)",
+	(errmsg("oversize GSSAPI packet sent by the client (%zu > %zu)",
 			(size_t) input.length,
-			PQ_GSS_RECV_BUFFER_SIZE)));
+			(size_t) PQ_GSS_RECV_BUFFER_SIZE)));
 			return -1;
 		}
 
-- 
2.27.0

>From 0db041334e0fc4df46f1d98e75e984b04777befe Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 28 Apr 2021 17:25:36 +0900
Subject: [PATCH 3/3] Clarify error messages

---
 src/backend/commands/conversioncmds.c | 2 +-
 src/backend/commands/typecmds.c   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/commands/conversioncmds.c b/src/backend/commands/conversioncmds.c
index 5fed97a2f9..307afb909d 100644
--- a/src/backend/commands/conversioncmds.c
+++ b/src/backend/commands/conversioncmds.c
@@ -127,7 +127,7 @@ Create

Re: Better sanity checking for message length words

2021-04-28 Thread Aleksander Alekseev
Hi Tom,

> [ scratches head ... ]  The patch still applies perfectly cleanly
> for me, using either "patch" or "git apply".

Sorry, my bad. It was a line-ending difference between platforms. The
patch is fine and passes installcheck-world on macOS.

-- 
Best regards,
Aleksander Alekseev


"Multiple table synchronizations are processed serially" still happens

2021-04-28 Thread Guillaume Lelarge
Hi,

One of my customers has an issue with logical replication. As $SUBJECT
says, multiple table synchronization happens serially. To be honest, it
doesn't do this every time. It happens when the tables are big enough.

This issue was already described on this thread (from 2017):
https://www.postgresql.org/message-id/flat/CAD21AoC2KJdavS7MFffmSsRc1dn3Vg_0xmuc=upbrz-_mux...@mail.gmail.com

This thread was closed by a commit (
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=6c2003f8a1bbc7c192a2e83ec51581c018aa162f)
which apparently fixed the issue for the OP.

Attached is a small test case where it still happens for me on 12.6, 11.11,
and 10.16. I can't make it happen on 13.2. I don't know why. It may imply
bigger tables for 13.2, but why? I simply don't know.

Anyway, the issue at the end of the test case is that synchronizations
sometimes happen serially. You can see on the process list that one
walsender process is waiting in "idle in transaction" state:

guillau+  486822    2227  0 10:44 ?        00:00:00 /opt/postgresql/12/bin/postgres
guillau+  486824  486822  0 10:44 ?        00:00:01 postgres: testcase: checkpointer
guillau+  486825  486822  0 10:44 ?        00:00:04 postgres: testcase: background writer
guillau+  486826  486822  1 10:44 ?        00:00:06 postgres: testcase: walwriter
guillau+  486827  486822  0 10:44 ?        00:00:00 postgres: testcase: autovacuum launcher
guillau+  486828  486822  0 10:44 ?        00:00:00 postgres: testcase: stats collector
guillau+  486829  486822  0 10:44 ?        00:00:00 postgres: testcase: logical replication launcher
guillau+  489822  486822  0 10:55 ?        00:00:00 postgres: testcase: logical replication worker for subscription 16436
guillau+  489824  486822 10 10:55 ?        00:00:01 postgres: testcase: walsender repuser ::1(38770) idle
guillau+  489825  486822 22 10:55 ?        00:00:02 postgres: testcase: logical replication worker for subscription 16436 sync 16416
guillau+  489826  486822  8 10:55 ?        00:00:00 postgres: testcase: walsender repuser ::1(38772) COPY
guillau+  489827  486822  0 10:55 ?        00:00:00 postgres: testcase: logical replication worker for subscription 16436 sync 16427
guillau+  489828  486822  0 10:55 ?        00:00:00 postgres: testcase: walsender repuser ::1(38774) idle in transaction waiting

And the log says (from the start of the subscription):

2021-04-28 10:55:32.337 CEST [489822] LOG:  logical replication apply worker for subscription "sub" has started
2021-04-28 10:55:32.342 CEST [489824] LOG:  duration: 0.426 ms  statement: SELECT pg_catalog.set_config('search_path', '', false);
2021-04-28 10:55:32.342 CEST [489824] LOG:  received replication command: IDENTIFY_SYSTEM
2021-04-28 10:55:32.342 CEST [489824] LOG:  received replication command: START_REPLICATION SLOT "sub" LOGICAL 0/0 (proto_version '1', publication_names '"pub"')
2021-04-28 10:55:32.342 CEST [489824] LOG:  starting logical decoding for slot "sub"
2021-04-28 10:55:32.342 CEST [489824] DETAIL:  Streaming transactions committing after 1/FF5D8130, reading WAL from 1/FF5D80F8.
2021-04-28 10:55:32.342 CEST [489824] LOG:  logical decoding found consistent point at 1/FF5D80F8
2021-04-28 10:55:32.342 CEST [489824] DETAIL:  There are no running transactions.
2021-04-28 10:55:32.345 CEST [489825] LOG:  logical replication table synchronization worker for subscription "sub", table "foo" has started
2021-04-28 10:55:32.348 CEST [489826] LOG:  duration: 0.315 ms  statement: SELECT pg_catalog.set_config('search_path', '', false);
2021-04-28 10:55:32.349 CEST [489826] LOG:  duration: 0.041 ms  statement: BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ
2021-04-28 10:55:32.349 CEST [489826] LOG:  received replication command: CREATE_REPLICATION_SLOT "sub_16436_sync_16416" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT
2021-04-28 10:55:32.355 CEST [489827] LOG:  logical replication table synchronization worker for subscription "sub", table "bar" has started
2021-04-28 10:55:32.359 CEST [489828] LOG:  duration: 0.431 ms  statement: SELECT pg_catalog.set_config('search_path', '', false);
2021-04-28 10:55:32.359 CEST [489828] LOG:  duration: 0.048 ms  statement: BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ
2021-04-28 10:55:32.360 CEST [489828] LOG:  received replication command: CREATE_REPLICATION_SLOT "sub_16436_sync_16427" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT
2021-04-28 10:55:32.407 CEST [489826] LOG:  logical decoding found consistent point at 1/FF602880
2021-04-28 10:55:32.407 CEST [489826] DETAIL:  There are no running transactions.
2021-04-28 10:55:32.409 CEST [489826] LOG:  duration: 1.262 ms  statement: SELECT c.oid, c.relreplident  FROM pg_catalog.pg_class c  INNER JOIN pg_catalog.pg_namespace n ON (c.relnamespace = n.oid) WHERE n.nspname = 's01'   AND c.relname = 'foo'   AND c.relkind = 'r'
2021-04-28 10:55:32.410 CEST [489826] LOG:  duration: 1.347 ms  statement: SELECT a.attname,   a.atttypid,   a.atttypmod,   a.a

Re: "Multiple table synchronizations are processed serially" still happens

2021-04-28 Thread Guillaume Lelarge
On Wed, Apr 28, 2021 at 11:12 AM, Guillaume Lelarge  wrote:

> Hi,
>
> One of my customers has an issue with logical replication. As $SUBJECT
> says, multiple table synchronization happens serially. To be honest, it
> doesn't do this every time. It happens when the tables are big enough.
>
> This issue was already described on this thread (from 2017):
> https://www.postgresql.org/message-id/flat/CAD21AoC2KJdavS7MFffmSsRc1dn3Vg_0xmuc=upbrz-_mux...@mail.gmail.com
>
> This thread was closed by a commit (
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=6c2003f8a1bbc7c192a2e83ec51581c018aa162f)
> which apparently fixed the issue for the OP.
>
> Attached is a small test case where it still happens for me on 12.6,
> 11.11, and 10.16. I can't make it happen on 13.2. I don't know why. It may
> imply bigger tables for 13.2, but why? I simply don't know.
>
>
Actually, it also happens on 13.2.


-- 
Guillaume.


Re: Result Cache node shows per-worker info even for workers not launched

2021-04-28 Thread Bharath Rupireddy
On Wed, Apr 28, 2021 at 1:54 PM David Rowley  wrote:
> I plan to push this in the next 24 hours or so.

I happen to see explain_resultcache in resultcache.sql, seems like two
of the tests still have numbers for cache hits and misses - Hits: 980
Misses: 20, won't these make tests unstable? Will these numbers be
same across machines? Or is it that no buildfarm had caught these? The
comment below says that, the hits and misses are not same across
machines:
-- Ensure we get some evictions.  We're unable to validate the hits and misses
-- here as the number of entries that fit in the cache at once will vary
-- between different machines.

Should we remove the hide_hitmiss parameter in explain_resultcache and
always print N for non-zero and Zero for 0 hits and misses? This
clearly shows that we have 0 or non-zero hits or misses.

Am I missing something?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Replication slot stats misgivings

2021-04-28 Thread Amit Kapila
On Wed, Apr 28, 2021 at 12:49 PM Masahiko Sawada  wrote:
>
>
> BTW regarding the commit f5fc2f5b23 that added total_txns and
> total_bytes, we add the reorder buffer size (i.e., rb->size) to
> rb->totalBytes but I think we should use the transaction size (i.e.,
> txn->size) instead:
>

You are right about the problem but I think your proposed fix also
won't work because txn->size always has current transaction size which
will be top-transaction in the case when a transaction has multiple
subtransactions. It won't include the subtxn->size. For example, you
can try to decode with below kind of transaction:
Begin;
insert into t1 values(1);
savepoint s1;
insert into t1 values(2);
savepoint s2;
insert into t1 values(3);
commit;

I think we can fix it by keeping track of total_size in toptxn as we
are doing for the streaming case in ReorderBufferChangeMemoryUpdate.
We can probably do it for non-streaming cases as well.
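Not tested, but to illustrate, the accounting inside
ReorderBufferChangeMemoryUpdate() could end up looking roughly like the
sketch below (the field names are what I recall, so treat them as
assumptions rather than a reviewed patch):

ReorderBufferTXN *txn = change->txn;
ReorderBufferTXN *toptxn = (txn->toptxn != NULL) ? txn->toptxn : txn;
Size        sz = ReorderBufferChangeSize(change);

if (addition)
{
    txn->size += sz;
    rb->size += sz;
    toptxn->total_size += sz;   /* whole tree, including subxacts */
}
else
{
    Assert((rb->size >= sz) && (txn->size >= sz));
    txn->size -= sz;
    rb->size -= sz;
    toptxn->total_size -= sz;
}

Then ReorderBufferProcessTXN() could add the top-level transaction's total
to rb->totalBytes instead of rb->size.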

-- 
With Regards,
Amit Kapila.




Re: [PATCH] We install pg_regress and isolationtester but not pg_isolation_regress

2021-04-28 Thread Aleksander Alekseev
Hi hackers,

> Any opinions on backpatching?

> Other than src/test/modules/brin, the ISOLATION users don't look
> much like real extensions (rather than test scaffolding), either.

> Out of core, the only thing I can see with isolation tests is rum, but
> it uses a workaround to have an access to the necessary binaries.

I just wanted to let you know that TimescaleDB uses
pg_isolation_regress and occasionally there are reports from some
suffering/puzzled users/developers, e.g. [1]. I'm not 100% sure that this
alone makes backpatching worth the time. However, if someone could do it,
it would be nice.

[1]: https://github.com/timescale/timescaledb/issues/1655

-- 
Best regards,
Aleksander Alekseev




Re: Built-in connection pooler

2021-04-28 Thread Antonin Houska
Konstantin Knizhnik  wrote:

> People asked me to resubmit built-in connection pooler patch to commitfest.
> Rebased version of connection pooler is attached.

I've reviewed the patch but haven't read the entire thread thoroughly. I hope
that I don't duplicate many comments posted earlier by others.

(Please note that the patch does not apply to the current master, I had to
reset the head of my repository to the appropriate commit.)


Documentation / user interface
--

* session_pool_size (config.sgml)

I wonder if

"The default value is 10, so up to 10 backends will serve each database,"

should rather be

"The default value is 10, so up to 10 backends will serve each 
database/user combination."

However, when I read the code, I think that each proxy counts the size of the
pool separately, so the total number of backends used for particular
database/user combination seems to be

session_pool_size * connection_proxies

Since the feature uses shared memory statistics anyway, shouldn't it only
count the total number of backends per database/user? It would need some
locking, but the actual pools (hash tables) could still be local to the proxy
processes.
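Just to illustrate what I mean, a very rough sketch under my own naming
(none of this is from the patch): one shared entry per database/user pair
with a spinlock-protected counter would be enough, while each proxy keeps
its pool hash table in local memory.

typedef struct PoolerSharedEntry
{
    Oid         dboid;          /* database */
    Oid         useroid;        /* role */
    int         n_backends;     /* total backends across all proxies */
    slock_t     mutex;
} PoolerSharedEntry;

/* Called by a proxy before it starts a new backend for this pool. */
static bool
pooler_try_reserve_backend(PoolerSharedEntry *entry, int session_pool_size)
{
    bool        ok = false;

    SpinLockAcquire(&entry->mutex);
    if (entry->n_backends < session_pool_size)
    {
        entry->n_backends++;
        ok = true;
    }
    SpinLockRelease(&entry->mutex);

    return ok;
}

That would make session_pool_size a per-database/user limit rather than a
per-proxy one.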

* connection_proxies

(I've noticed that Ryan Lambert questioned this variable upthread.)

I think this variable makes the configuration less straightforward from the
user perspective. Cannot the server launch additional proxies dynamically, as
needed, e.g. based on the shared memory statistics that the patch introduces?
I see that postmaster would have to send the sockets in a different way. How
about adding a "proxy launcher" process that would take care of the scheduling
and launching new proxies?

* multitenant_proxy

I thought the purpose of this setting was to reduce the number of backends
needed, but I could not find evidence of that in the code. In particular,
client_attach() always retrieves the backend from the appropriate pool, and
backend_reschedule() does so as well. Thus the role of both client and backend
should always match. What piece of information do I miss?

* typo (2 occurrences in config.sgml)

  "stanalone" -> "standalone"


Design / coding
---

* proxy.c:backend_start() does not change the value of the "host" parameter to
  the socket directory, so I assume the proxy connects to the backend via TCP
  protocol. I think the unix socket should be preferred for this connection if
  the platform has it, however:

* is libpq necessary for the proxy to connect to backend at all? Maybe
  postgres.c:ReadCommand() can be adjusted so that the backend can communicate
  with the proxy just via the plain socket.

  I don't like the idea of server components communicating via libpq as such
  (do we need anything from the libpq connection other than the socket?), but
  especially these includes in proxy.c look weird:

  #include "../interfaces/libpq/libpq-fe.h"
  #include "../interfaces/libpq/libpq-int.h"

* How does the proxy recognize connections to the walsender? I haven't tested
  that, but it's obvious that these connections should not be proxied.

* ConnectionProxyState is in shared memory, so access to its fields should be
  synchronized.

* StartConnectionProxies() is only called from PostmasterMain(), so I'm not
  sure the proxies get restarted after crash. Perhaps PostmasterStateMachine()
  needs to call it too after calling StartupDataBase().

* Why do you need the Channel.magic integer field? Wouldn't a boolean field
  "active" be sufficient?

  ** In proxy_loop(), I've noticed (chan->magic == ACTIVE_CHANNEL_MAGIC)
     tests inside the branch

 else if (chan->magic == ACTIVE_CHANNEL_MAGIC)

 Since neither channel_write() nor channel_read() seem to change the
 value, I think those tests are not necessary.

* Comment lines are often too long.

* pgindent should be applied to the patch at some point.


I can spend more time reviewing the patch during the next CF.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: Some oversights in query_id calculation

2021-04-28 Thread Aleksander Alekseev
Hi Julien,

> I'm attaching a patch that fixes those, with regression tests to reproduce 
> each
> problem.

I believe something could be not quite right with the patch. Here is what I did:

$ git apply ...
# revert the changes in the code but keep the new tests
$ git checkout src/backend/utils/misc/queryjumble.c
$ ./full-build.sh && single-install.sh && make installcheck-world

... where named .sh scripts are something I use to quickly check a patch [1].

I was expecting that several tests will fail but they didn't. Maybe I
missed something?

[1]: https://github.com/afiskon/pgscripts

-- 
Best regards,
Aleksander Alekseev




Re: Parallel INSERT SELECT take 2

2021-04-28 Thread Greg Nancarrow
On Wed, Apr 28, 2021 at 12:44 PM houzj.f...@fujitsu.com
 wrote:
>
> 0003:
> 1) Temporarily, add the check of built-in function by adding a member for 
> proparallel in FmgrBuiltin.
> To avoid enlarging FmgrBuiltin struct, change the existing bool members
> strict and retset into
> one member of type char, and represent the original values with some bit 
> flags.
>

I was thinking that it would be better to replace the two bool members
with one "unsigned char" member for the bitflags for strict and
retset, and another "char" member for parallel.
The struct would still remain the same size as it originally was, and
you wouldn't need to convert between bit settings and char
('u'/'r'/'s') each time a built-in function was checked for
parallel-safety in fmgr_info().
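Something along these lines is what I have in mind (a sketch only; the
member names and flag values are mine, not taken from the posted patch):

typedef struct
{
    Oid         foid;           /* OID of the function */
    short       nargs;          /* 0..FUNC_MAX_ARGS, or -1 if variable count */
    unsigned char flags;        /* FMGR_BUILTIN_STRICT | FMGR_BUILTIN_RETSET */
    char        parallel;       /* 'u', 'r' or 's', like pg_proc.proparallel */
    const char *funcName;       /* C name of the function */
    PGFunction  func;           /* pointer to compiled function */
} FmgrBuiltin;

#define FMGR_BUILTIN_STRICT 0x01
#define FMGR_BUILTIN_RETSET 0x02

The two one-byte bools become one byte of flags plus one char, so the
struct stays the same size, and fmgr_info() can copy the parallel-safety
character out directly without any decoding.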

> Note: this will lock down the parallel property of built-in function, but, I 
> think the parallel safety of built-in function
> is related to the C code, user should not change the property of it unless 
> they change its code. So, I think it might be
> better to disallow changing parallel safety for built-in functions,  Thoughts 
> ?
>

I'd vote for disallowing it (unless someone can justify why it
currently is allowed).

> I have not added the parallel safety check in ALTER/CREATE table PARALLEL DML 
> SAFE command.
> I think it seems better to add it after some more discussion.
>

I'd vote for not adding such a check (as this is a declaration).


Some additional comments:

1) In patch 0002 comment, it says:
This property is recorded in pg_class's relparallel column as 'u', 'r', or 's',
just like pg_proc's proparallel.  The default is UNSAFE.
It should say "relparalleldml" column.

2) With the patches applied, I seem to be getting a couple of errors
when running "make installcheck-world" with
force_parallel_mode=regress in effect.
Can you please try it?


Regards,
Greg Nancarrow
Fujitsu Australia




Re: Some oversights in query_id calculation

2021-04-28 Thread Julien Rouhaud
Hi Aleksander,

On Wed, Apr 28, 2021 at 01:19:36PM +0300, Aleksander Alekseev wrote:
> Hi Julien,
> 
> > I'm attaching a patch that fixes those, with regression tests to reproduce 
> > each
> > problem.
> 
> I believe something could be not quite right with the patch. Here is what I 
> did:
> 
> $ git apply ...
> # revert the changes in the code but keep the new tests
> $ git checkout src/backend/utils/misc/queryjumble.c
> $ ./full-build.sh && single-install.sh && make installcheck-world
> 
> ... where named .sh scripts are something I use to quickly check a patch [1].
> 
> I was expecting that several tests will fail but they didn't. Maybe I
> missed something?

I think it's because installcheck-* don't run pg_stat_statements' tests, see
its Makefile:

> # Disabled because these tests require 
> "shared_preload_libraries=pg_stat_statements",
> # which typical installcheck users do not have (e.g. buildfarm clients).
> NO_INSTALLCHECK = 1

You should see failures doing a check-world or simply a make -C
contrib/pg_stat_statements check




Re: Result Cache node shows per-worker info even for workers not launched

2021-04-28 Thread Amit Khandekar
On Wed, 28 Apr 2021 at 15:08, Bharath Rupireddy
 wrote:
>
> On Wed, Apr 28, 2021 at 1:54 PM David Rowley  wrote:
> > I plan to push this in the next 24 hours or so.
>
> I happen to see explain_resultcache in resultcache.sql, seems like two
> of the tests still have numbers for cache hits and misses - Hits: 980
> Misses: 20, won't these make tests unstable? Will these numbers be
> same across machines? Or is it that no buildfarm had caught these? The
> comment below says that, the hits and misses are not same across
> machines:
> -- Ensure we get some evictions.  We're unable to validate the hits and misses
> -- here as the number of entries that fit in the cache at once will vary
> -- between different machines.
>
> Should we remove the hide_hitmiss parameter in explain_resultcache and
> always print N for non-zero and Zero for 0 hits and misses? This
> clearly shows that we have 0 or non-zero hits or misses.
>
> Am I missing something?

I believe the assumption here is that, with no workers involved, the
cache misses and hits are guaranteed to be exactly the same everywhere.




Re: Result Cache node shows per-worker info even for workers not launched

2021-04-28 Thread David Rowley
On Wed, 28 Apr 2021 at 21:38, Bharath Rupireddy
 wrote:
>
> On Wed, Apr 28, 2021 at 1:54 PM David Rowley  wrote:
> > I plan to push this in the next 24 hours or so.
>
> I happen to see explain_resultcache in resultcache.sql, seems like two
> of the tests still have numbers for cache hits and misses - Hits: 980
> Misses: 20, won't these make tests unstable? Will these numbers be
> same across machines? Or is it that no buildfarm had caught these? The
> comment below says that, the hits and misses are not same across
> machines:
> -- Ensure we get some evictions.  We're unable to validate the hits and misses
> -- here as the number of entries that fit in the cache at once will vary
> -- between different machines.

The only reason it would be unstable is if there are cache evictions.
Evictions will only happen if the cache fills up and we need to make
way for new entries.  A 32-bit machine, for example, will use slightly
less memory for caching items, so the number of evictions is going to
be a bit less on those machine.  Having an unstable number of
evictions will cause the hits and misses to be unstable too.
Otherwise, the number of misses is predictable, it'll be the number of
distinct sets of parameters that we lookup in the cache.  Any repeats
will be a hit.  So hits plus misses should just add up to the number
of times that a normal parameterized nested loop would execute the
inner side, and that's predictable too. It would only change if you
change the query or the data in the table.

> Should we remove the hide_hitmiss parameter in explain_resultcache and
> always print N for non-zero and Zero for 0 hits and misses? This
> clearly shows that we have 0 or non-zero hits or misses.

I added that because if there are no evictions then the hits and
misses should be perfectly stable, providing the test is small enough
not to exceed work_mem and fill the cache.   If someone was to run the
tests with a small work_mem, then there would be no shortage of other
tests that would fail due to plan changes.  These tests were designed
to be small enough so there's no danger of getting close to work_mem
and filling the cache.

However, I did add 1 test that sets work_mem down to 64kB to ensure
the eviction code does get some exercise. You'll notice that I pass
"true" to explain_resultcache() to hide the hits and misses there.  We
can't test the exact number of hits/misses/evictions there, but we can
at least tell apart the zero and non-zero by how I coded
explain_resultcache() to replace with Zero or N depending on if the
number was zero or above zero.

David




Re: Result Cache node shows per-worker info even for workers not launched

2021-04-28 Thread Amit Khandekar
On Wed, 28 Apr 2021 at 16:14, David Rowley  wrote:
> However, I did add 1 test that sets work_mem down to 64kB to ensure
> the eviction code does get some exercise. You'll notice that I pass
> "true" to explain_resultcache() to hide the hits and misses there.  We
> can't test the exact number of hits/misses/evictions there, but we can
> at least tell apart the zero and non-zero by how I coded
> explain_resultcache() to replace with Zero or N depending on if the
> number was zero or above zero.

Thanks for the explanation. I did realize, after replying to Bharath
upthread, that I was wrong in assuming that the cache misses and cache
hits are always stable for non-parallel scans.




Re: Result Cache node shows per-worker info even for workers not launched

2021-04-28 Thread Amit Khandekar
On Wed, 28 Apr 2021 at 13:54, David Rowley  wrote:

> So, given that I removed the parallel test in partition_prune.sql, and
> don't have any EXPLAIN ANALYZE output for parallel tests in
> resultcache.sql, it should be safe enough to put that cache_misses ==
> 0 test back into explain.c
>
> I've attached a patch to do this. The explain.c part is pretty similar
> to your patch, I just took my original code and comment.

Sounds good. And thanks for the cleanup patch, and the brief history.
Patch looks ok to me.




Re: Replication slot stats misgivings

2021-04-28 Thread Masahiko Sawada
On Wed, Apr 28, 2021 at 6:39 PM Amit Kapila  wrote:
>
> On Wed, Apr 28, 2021 at 12:49 PM Masahiko Sawada  
> wrote:
> >
> >
> > BTW regarding the commit f5fc2f5b23 that added total_txns and
> > total_bytes, we add the reorder buffer size (i.e., rb->size) to
> > rb->totalBytes but I think we should use the transaction size (i.e.,
> > txn->size) instead:
> >
>
> You are right about the problem but I think your proposed fix also
> won't work because txn->size always has current transaction size which
> will be top-transaction in the case when a transaction has multiple
> subtransactions. It won't include the subtxn->size.

Right. I missed the point that ReorderBufferProcessTXN() also
processes subtransactions.

> I think we can fix it by keeping track of total_size in toptxn as we
> are doing for the streaming case in ReorderBufferChangeMemoryUpdate.
> We can probably do it for non-streaming cases as well.

Agreed.

I've updated the patch. What do you think?

Regards,


--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


use_total_size_v2.patch
Description: Binary data


Re: Replication slot stats misgivings

2021-04-28 Thread Amit Kapila
On Wed, Apr 28, 2021 at 4:51 PM Masahiko Sawada  wrote:
>
> On Wed, Apr 28, 2021 at 6:39 PM Amit Kapila  wrote:
> >
>
> > I think we can fix it by keeping track of total_size in toptxn as we
> > are doing for the streaming case in ReorderBufferChangeMemoryUpdate.
> > We can probably do it for non-streaming cases as well.
>
> Agreed.
>
> I've updated the patch. What do you think?
>

@@ -1369,7 +1369,7 @@ ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state)
  * Update the total bytes processed before releasing the current set
  * of changes and restoring the new set of changes.
  */
- rb->totalBytes += rb->size;
+ rb->totalBytes += entry->txn->total_size;
  if (ReorderBufferRestoreChanges(rb, entry->txn, &entry->file,
  &state->entries[off].segno))

I have not tested this, but in the above change won't you need to check
txn->toptxn for subtxns?

-- 
With Regards,
Amit Kapila.




Re: Error in libpq docs for target_session_attrs, prefer-standby mode

2021-04-28 Thread Laurenz Albe
On Wed, 2021-04-28 at 12:55 +1000, Greg Nancarrow wrote:
> I spotted an error in the development version documentation for
> libpq's connection parameter "target_session_attrs" (34.1.2 Parameter
> Key Words).
> In the description for the "prefer-standby" mode, it says "... but if
> none of the listed hosts is a standby server, try again in all mode".
> There is no such "all" mode. It should instead say "any" mode.
> Patch is attached.

You are right, and the patch is good.

Yours,
Laurenz Albe





RE: Forget close an open relation in ReorderBufferProcessTXN()

2021-04-28 Thread osumi.takami...@fujitsu.com
On Monday, April 26, 2021 2:05 PM Amit Kapila  wrote:
> On Fri, Apr 23, 2021 at 8:03 PM osumi.takami...@fujitsu.com
>  wrote:
> >
> > On Saturday, April 17, 2021 4:13 PM Amit Kapila
>  wrote:
> > > > I also don't find a test for this.  It is introduced in
> > > > 5dfd1e5a6696, wrote by Simon Riggs, Marco Nenciarini and Peter
> > > > Eisentraut.  Maybe they can explain when we can enter this condition?
> > >
> > > My guess is that this has been copied from the code a few lines
> > > above to handle insert/update/delete where it is required to handle
> > > some DDL ops like Alter Table but I think we don't need it here (for
> > > Truncate op). If that understanding turns out to be true then we
> > > should either have an Assert for this or an elog message.
> > In this thread, we are discussing 3 topics below...
> >
> > (1) necessity of the check for REORDER_BUFFER_CHANGE_TRUNCATE
> in
> > ReorderBufferProcessTXN()
> > (2) discussion of whether we disallow decoding of operations on user
> > catalog tables or not
> > (3) memory leak of maybe_send_schema() (patch already provided)
> >
> > Let's address those one by one.
> > In terms of (1), which was close to the motivation of this thread,
> >
> 
> I think (1) and (2) are related because if we need (2) then the check removed
> by (1) needs to be replaced with another check. So, I am not sure how to
> make this decision.
Yeah, you are right.


On Monday, April 19, 2021 9:23 PM Amit Kapila  wrote:
> On Sat, Apr 17, 2021 at 12:01 PM Amit Kapila 
> wrote:
> > On Fri, Apr 16, 2021 at 11:24 PM Andres Freund 
> wrote:
> > >
> > > > I think it is also important to *not* acquire any lock on relation
> > > > otherwise it can lead to some sort of deadlock or infinite wait in
> > > > the decoding process. Consider a case for operations like Truncate
> > > > (or if the user has acquired an exclusive lock on the relation in
> > > > some other way say via Lock command) which acquires an exclusive
> > > > lock on relation, it won't get replicated in synchronous mode
> > > > (when synchronous_standby_name is configured). The truncate
> > > > operation will wait for the transaction to be replicated to the
> > > > subscriber and the decoding process will wait for the Truncate
> operation to finish.
> > >
> > > However, this cannot be really relied upon for catalog tables. An
> > > output function might acquire locks or such. But for those we do not
> > > need to decode contents...
> > >
> >
> > I see that if we define a user_catalog_table (create table t1_cat(c1
> > int) WITH(user_catalog_table = true);), we are able to decode
> > operations like (insert, truncate) on such a table. What do you mean
> > by "But for those we do not need to decode contents"?
> >
> 
> I think we are allowed to decode the operations on user catalog tables
> because we are using RelationIsLogicallyLogged() instead of
> RelationIsAccessibleInLogicalDecoding() in ReorderBufferProcessTXN().
> Based on this discussion, I think we should not be allowing decoding of
> operations on user catalog tables, so we should use
> RelationIsAccessibleInLogicalDecoding to skip such ops in
> ReorderBufferProcessTXN(). Am, I missing something?
> 
> Can you please clarify?
I don't understand that point, either.

I read the context where user_catalog_table was introduced - [1].
There, I couldn't find any discussion of whether we should skip decoding
operations on that kind of table or not. Accordingly, we just never
settled it, I suppose.

What surprised me a bit is that decoding operations on system catalog
tables was somehow considered, as in [2], at the time. I cannot find any
concrete description of such use cases in the thread, though.

Anyway, I felt that disallowing decoding of operations on user catalog tables
doesn't spoil the feature's purpose, so I'm OK with doing so. What do you think?


[1] - 
https://www.postgresql.org/message-id/flat/20130914204913.GA4071%40awork2.anarazel.de

Note that in this discussion, user_catalog_table was renamed from
treat_as_catalog_table in the middle of the thread. Searching for that name
might help you shorten the time needed to look through it.

[2] - 
https://www.postgresql.org/message-id/CA%2BTgmobhDCHuckL_86wRDWJ31Gw3Y1HrQ4yUKEn7U1_hTbeVqQ%40mail.gmail.com


Best Regards,
Takamichi Osumi



Re: Some oversights in query_id calculation

2021-04-28 Thread Aleksander Alekseev
Hi Julien,

> You should see failures doing a check-world or simply a make -C
> contrib/pg_stat_statements check

Sorry, my bad. I was running make check-world, but did it with the -j4
flag, which was a mistake.

The patch is OK.


On Wed, Apr 28, 2021 at 1:27 PM Julien Rouhaud  wrote:

> Hi Aleksander,
>
> On Wed, Apr 28, 2021 at 01:19:36PM +0300, Aleksander Alekseev wrote:
> > Hi Julien,
> >
> > > I'm attaching a patch that fixes those, with regression tests to
> reproduce each
> > > problem.
> >
> > I believe something could be not quite right with the patch. Here is
> what I did:
> >
> > $ git apply ...
> > # revert the changes in the code but keep the new tests
> > $ git checkout src/backend/utils/misc/queryjumble.c
> > $ ./full-build.sh && single-install.sh && make installcheck-world
> >
> > ... where named .sh scripts are something I use to quickly check a patch
> [1].
> >
> > I was expecting that several tests will fail but they didn't. Maybe I
> > missed something?
>
> I think it's because installcheck-* don't run pg_stat_statements' tests,
> see
> its Makefile:
>
> > # Disabled because these tests require
> "shared_preload_libraries=pg_stat_statements",
> > # which typical installcheck users do not have (e.g. buildfarm clients).
> > NO_INSTALLCHECK = 1
>
> You should see failures doing a check-world or simply a make -C
> contrib/pg_stat_statements check
>


-- 
Best regards,
Aleksander Alekseev


Re: Use simplehash.h instead of dynahash in SMgr

2021-04-28 Thread Yura Sokolov

David Rowley wrote on 2021-04-26 09:43:

I tried that and it got a median result of 113.795 seconds over 14
runs with this recovery benchmark test.

LOG:  size: 4096, members: 2032, filled: 0.496094, total chain: 1014,
max chain: 6, avg chain: 0.499016, total_collisions: 428,
max_collisions: 3, avg_collisions: 0.210630

I also tried the following hash function just to see how much
performance might be left from speeding it up:

static inline uint32
relfilenodebackend_hash(RelFileNodeBackend *rnode)
{
uint32 h;

h = pg_rotate_right32((uint32) rnode->node.relNode, 16) ^ ((uint32)
rnode->node.dbNode);
return murmurhash32(h);
}

I got a median of 112.685 seconds over 14 runs with:

LOG:  size: 4096, members: 2032, filled: 0.496094, total chain: 1044,
max chain: 7, avg chain: 0.513780, total_collisions: 438,
max_collisions: 3, avg_collisions: 0.215551


The best result is with just:

   return (uint32) rnode->node.relNode;

i.e., relNode could be taken without any mixing at all.
relNode is unique within a single database, and almost unique across the
whole cluster, since it is an Oid.


I'd like to see the benchmark code. It's quite interesting that this place
became measurable at all.


Sure.

$ cat recoverybench_insert_hash.sh


David.


So, I've repeated the benchmark with different numbers of partitions (I
tried to catch a higher fillfactor) and a smaller amount of inserted data
(since I don't want to stress my SSD).

partitions/ | dynahash | dynahash +  | simplehash | simplehash + |
fillfactor  |          | simple func |            | simple func  |
------------+----------+-------------+------------+--------------+
 3500/0.43  |   3.73s  |   3.54s     |   3.58s    |   3.34s      |
 3200/0.78  |   3.64s  |   3.46s     |   3.47s    |   3.25s      |
 1500/0.74  |   3.18s  |   2.97s     |   3.03s    |   2.79s      |

Fillfactor is the effective fillfactor in simplehash with that number of
partitions.

I wasn't able to measure with a fillfactor close to 0.9 since it looks like
simplehash tends to grow much earlier due to SH_GROW_MAX_MOVE.

The simple function is a hash function that returns only rnode->node.relNode.
I've tested it with both simplehash and dynahash.
For dynahash, a custom match function was also made, as sketched below.
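For reference, plugging the trivial function into dynahash looks roughly
like this (a reconstruction for illustration, not the exact code that was
benchmarked):

static uint32
relfilenode_hash_trivial(const void *key, Size keysize)
{
    const RelFileNodeBackend *rnode = (const RelFileNodeBackend *) key;

    /* relNode is an Oid and already well distributed, so skip any mixing */
    return (uint32) rnode->node.relNode;
}

static int
relfilenode_match(const void *key1, const void *key2, Size keysize)
{
    return memcmp(key1, key2, sizeof(RelFileNodeBackend)) == 0 ? 0 : 1;
}

/* ... in smgropen(), when the hash table is created ... */
HASHCTL     ctl;

ctl.keysize = sizeof(RelFileNodeBackend);
ctl.entrysize = sizeof(SMgrRelationData);
ctl.hash = relfilenode_hash_trivial;
ctl.match = relfilenode_match;
SMgrRelationHash = hash_create("smgr relation table", 400, &ctl,
                               HASH_ELEM | HASH_FUNCTION | HASH_COMPARE);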

Conclusion:
- the trivial hash function gives better results for both simplehash and
  dynahash,
- simplehash improves performance for both the complex and the trivial hash
  function,
- simplehash + trivial function performs best.

I'd like to hear other people's comments on the trivial hash function. But
since the generation of relation Oids is not subject to human intervention,
I'd recommend sticking with the trivial one.

Since the patch is simple, harmless, and gives a measurable improvement,
I think it is ready for the commitfest.

regards,
Yura Sokolov.
Postgres Professional https://www.postgrespro.com

PS. David, please send the patch once again, since my mail client reattached
the files in previous messages and the commitfest robot could think I'm the
author.




Re: Some doubious error messages and comments

2021-04-28 Thread Justin Pryzby
On Wed, Apr 28, 2021 at 05:36:33PM +0900, Kyotaro Horiguchi wrote:
> 0001: I found some typos in a error message and a comment.
> 
> multirangetypes.c: 1420
> > errmsg("range_intersect_agg must be called with a multirange")));
> 
> This "range_intersect_agg" looks like a typo of "multirange_..".
> 
> operatorcmds.c:303
> > * Look up a join estimator function ny name, and verify that it has the
> 
> "ny" looks like a typo of "by".

"ny name" shows up a 2nd time.

I have another "comment typos" patch - maybe someone will want to push them
together.

commit 32e979c652c68ca5e3a7f308d677058e0c08547b
Author: Kyotaro Horiguchi 
Date:   Wed Apr 28 17:23:52 2021 +0900

Fix typos

ny name: 321eed5f0f7563a0cabb3d7a98132856287c1ad1
multirange: 6df7a9698bb036610c1e8c6d375e1be38cb26d5f

diff --git a/src/backend/commands/operatorcmds.c b/src/backend/commands/operatorcmds.c
index 809043c5d1..fbd7d8d062 100644
--- a/src/backend/commands/operatorcmds.c
+++ b/src/backend/commands/operatorcmds.c
@@ -265,7 +265,7 @@ DefineOperator(List *names, List *parameters)
 }
 
 /*
- * Look up a restriction estimator function ny name, and verify that it has
+ * Look up a restriction estimator function by name, and verify that it has
  * the correct signature and we have the permissions to attach it to an
  * operator.
  */
@@ -300,7 +300,7 @@ ValidateRestrictionEstimator(List *restrictionName)
 }
 
 /*
- * Look up a join estimator function ny name, and verify that it has the
+ * Look up a join estimator function by name, and verify that it has the
  * correct signature and we have the permissions to attach it to an
  * operator.
  */
diff --git a/src/backend/utils/adt/multirangetypes.c b/src/backend/utils/adt/multirangetypes.c
index 0b81649779..2583ddeedf 100644
--- a/src/backend/utils/adt/multirangetypes.c
+++ b/src/backend/utils/adt/multirangetypes.c
@@ -1417,7 +1417,7 @@ multirange_intersect_agg_transfn(PG_FUNCTION_ARGS)
if (!type_is_multirange(mltrngtypoid))
ereport(ERROR,
(errcode(ERRCODE_DATATYPE_MISMATCH),
-errmsg("range_intersect_agg must be called 
with a multirange")));
+errmsg("multirange_intersect_agg must be 
called with a multirange")));
 
typcache = multirange_get_typcache(fcinfo, mltrngtypoid);
 

commit 8247b4034ed4c68241be9fbdec249bc967ceafd4
Author: Justin Pryzby 
Date:   Tue Apr 27 07:57:50 2021 -0500

Comment typos: extended stats a4d75c86b and 518442c7f

diff --git a/src/backend/parser/parse_utilcmd.c b/src/backend/parser/parse_utilcmd.c
index 9dd30370da..eb9e63f4a8 100644
--- a/src/backend/parser/parse_utilcmd.c
+++ b/src/backend/parser/parse_utilcmd.c
@@ -1943,7 +1943,7 @@ generateClonedExtStatsStmt(RangeVar *heapRel, Oid heapRelid,
 * simply append them after simple column references.
 *
 * XXX Some places during build/estimation treat expressions as if they
-* are before atttibutes, but for the CREATE command that's entirely
+* are before attributes, but for the CREATE command that's entirely
 * irrelevant.
 */
datum = SysCacheGetAttr(STATEXTOID, ht_stats,
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 7e11cb9d5f..5e53783ea6 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -1796,7 +1796,7 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
continue;
 
/*
-* Now we know the clause is compatible (we have either atttnums
+* Now we know the clause is compatible (we have either attnums
* or expressions extracted from it), and was not estimated yet.
 */
 




Re: Since '2001-09-09 01:46:40'::timestamp microseconds are lost when extracting epoch

2021-04-28 Thread Tom Lane
Thomas Munro  writes:
> On Wed, Apr 28, 2021 at 3:56 AM Tom Lane  wrote:
>> Of course Wikipedia has been known to contain errors, but now
>> I'm inclined to think I blew this.  Anyone want to check my work?

> I tried a couple of examples not from Wikipedia.  ...

Thanks for checking!  I'll go adjust the documentation.

regards, tom lane




Re: Unresolved repliaction hang and stop problem.

2021-04-28 Thread Alvaro Herrera
On 2021-Apr-28, Amit Kapila wrote:

> On Wed, Apr 28, 2021 at 6:48 AM Alvaro Herrera  
> wrote:

> > Hmm ... On what does it depend (other than plain git conflicts, which
> > are aplenty)?  On a quick look to the commit, it's clear that we need to
> > be careful in order not to cause an ABI break, but that doesn't seem
> > impossible to solve, but I'm wondering if there is more to it than that.
> 
> As mentioned in the commit message, we need another commit [1] change
> to make this work.
> 
> [1] - 
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c55040ccd0

Oh, yeah, that looks tougher.  (Still not impossible: it adds a new WAL
message type, but we have added such on a minor release before.)

... It's strange that replication worked for them on pg10 though and
broke on 13.  What did we change to make it so?  (I don't have
any fish to fry on this topic at present, but it seems a bit
concerning.)

-- 
Álvaro Herrera   Valdivia, Chile




Re: pg_hba.conf.sample wording improvement

2021-04-28 Thread Alvaro Herrera
On 2021-Apr-28, Peter Eisentraut wrote:

> I propose the attached patch to shake up the wording in the connection type
> section of pg_hba.conf.sample a bit.  After the hostgssenc part was added
> on, the whole thing became a bit wordy, and it's also a bit inaccurate for
> example in that the current wording for "host" appears to say that it does
> not apply to GSS-encrypted connections.

Yeah, that's a clear improvement.

Looking at it now, I wonder how well the "hostno" options work.  If I
say "hostnogssenc", is an SSL-encrypted socket good?  If I say
"hostnossl", is a GSS-encrypted socket good?  If so, how does that make
sense?

-- 
Álvaro Herrera39°49'30"S 73°17'W
"Pensar que el espectro que vemos es ilusorio no lo despoja de espanto,
sólo le suma el nuevo terror de la locura" (Perelandra, C.S. Lewis)




Re: Replication slot stats misgivings

2021-04-28 Thread Masahiko Sawada
On Wed, Apr 28, 2021 at 3:25 PM Amit Kapila  wrote:
>
> On Wed, Apr 28, 2021 at 9:37 AM Masahiko Sawada  wrote:
> >
> > On Wed, Apr 28, 2021 at 12:29 PM Amit Kapila  
> > wrote:
> > >
> > > On Tue, Apr 27, 2021 at 11:02 AM vignesh C  wrote:
> > > >
> > > > On Tue, Apr 27, 2021 at 9:48 AM vignesh C  wrote:
> > > > >
> > > >
> > > > Attached patch has the changes to update statistics during
> > > > spill/stream which prevents the statistics from being lost during
> > > > interrupt.
> > > >
> > >
> > >  void
> > > -UpdateDecodingStats(LogicalDecodingContext *ctx)
> > > +UpdateDecodingStats(ReorderBuffer *rb)
> > >
> > > I don't think you need to change this interface because
> > > reorderbuffer->private_data points to LogicalDecodingContext. See
> > > StartupDecodingContext.
> >
> > +1
> >
> > With this approach, we could still miss the totalTxns and totalBytes
> > updates if decoding a transaction that is large but still below
> > logical_decoding_work_mem is interrupted, right?
> >
>
> Right, but is there some simple way to avoid that? I see two
> possibilities (a) store stats in ReplicationSlot and then send them at
> ReplicationSlotRelease but that will lead to an increase in shared
> memory usage and as per the discussion above, we don't want that, (b)
> send intermediate stats after decoding say N changes but for that, we
> need to additionally compute the size of each change which might add
> some overhead.

Right.

> I am not sure if any of these alternatives are a good idea. What do
> you think? Do you have any other ideas for this?

I've been considering some ideas but haven't come up with a good one
yet. It's just an idea and not tested, but how about having
CreateDecodingContext() register a before_shmem_exit() callback with the
decoding context, to ensure that we send slot stats even on
interruption, and having FreeDecodingContext() cancel the callback?
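Roughly like the following sketch (untested; the callback name is made up):

static void
send_slot_stats_on_exit(int code, Datum arg)
{
    LogicalDecodingContext *ctx = (LogicalDecodingContext *) DatumGetPointer(arg);

    /* Flush whatever has been accumulated but not yet reported. */
    UpdateDecodingStats(ctx);
}

/* In CreateDecodingContext(), once the context is fully set up: */
before_shmem_exit(send_slot_stats_on_exit, PointerGetDatum(ctx));

/* In FreeDecodingContext(), before freeing the context: */
cancel_before_shmem_exit(send_slot_stats_on_exit, PointerGetDatum(ctx));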

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Skip temporary table schema name from explain-verbose output.

2021-04-28 Thread Greg Stark
> On Tue, Apr 27, 2021 at 7:08 PM Bharath Rupireddy
>  Make sense, we would lose the ability to differentiate temporary
> tables from the auto_explain logs.

There's no useful differentiation that can be done with the temp
schema name. They're assigned on connection start randomly from the
pool of temp schemas. The names you find in the log won't be useful
and as new connections are made the same schema names will be reused
for different connections.

I would say it makes sense to remove them -- except perhaps it makes
it harder to parse explain output. If explain verbose always includes
the schema then it's easier for a parser to make sense of the explain
plan output without having to be prepared to sometimes see a schema
and sometimes not. That's probably a pretty hypothetical concern
however since all the explain plan parsers that actually exist are
prepared to deal with non-verbose plans anyways. And we have actual
machine-readable formats too anyways.


-- 
greg




Re: pg_hba.conf.sample wording improvement

2021-04-28 Thread Tom Lane
Peter Eisentraut  writes:
> I propose the attached patch to shake up the wording in the connection 
> type section of pg_hba.conf.sample a bit.  After the hostgssenc part was 
> added on, the whole thing became a bit wordy, and it's also a bit 
> inaccurate for example in that the current wording for "host" appears to 
> say that it does not apply to GSS-encrypted connections.

+1 for revising it in this general way.  I notice you omitted "TCP/IP"
from the last line though:

+# - "hostnogssenc" is a not GSSAPI-encrypted socket

which doesn't seem consistent.

Another thought is to switch the phrase order:

+# - "local" is a Unix-domain socket
+# - "host" is a TCP/IP socket (encrypted or not)
+# - "hostssl" is a TCP/IP socket that is SSL-encrypted
+# - "hostnossl" is a TCP/IP socket that is not SSL-encrypted
+# - "hostgssenc" is a TCP/IP socket that is GSSAPI-encrypted
+# - "hostnogssenc" is a TCP/IP socket that is not GSSAPI-encrypted

I'm not wedded to that idea, but it seems to help reduce random
variations between the wordings of these lines.

regards, tom lane




Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

2021-04-28 Thread Amit Langote
On Wed, Apr 28, 2021 at 8:32 AM Alvaro Herrera  wrote:
> On 2021-Apr-27, Alvaro Herrera wrote:
>
> > This v3 handles things as you suggested and works correctly AFAICT.  I'm
> > going to add some more tests cases to verify the behavior in the
> > scenarios you showed, and get them to run under cache-clobber options to
> > make sure it's good.
>
> Yep, it seems to work.  Strangely, the new isolation case doesn't
> actually crash before the fix -- it merely throws a memory allocation
> error.

Thanks.  Yeah, it does seem to work.

I noticed that rd_partdesc_nodetached_xmin can sometimes end up with
value 0. While you seem to be already aware of that (because otherwise
you wouldn't have added the TransactionIdIsValid(...) check in
RelationGetPartitionDesc()), the comments nearby don't mention why such
a thing might happen.  Also, I guess it can't be helped that the
partdesc_nodetached will have to be leaked when the xmin is 0, but
that shouldn't be as problematic as the case we discussed earlier.

+   /*
+* But first, a kluge: if there's an old context for this type of
+* descriptor, it contains an old partition descriptor that may still be
+* referenced somewhere.  Preserve it, while not leaking it, by
+* reattaching it as a child context of the new one.  Eventually it will
+* get dropped by either RelationClose or RelationClearRelation.
+*
+* We keep the regular partdesc in rd_pdcxt, and the partdesc-excluding-
+* detached-partitions in rd_pddcxt.
+*/
+   context = is_omit ? &rel->rd_pddcxt : &rel->rd_pdcxt;
+   if (*context != NULL)
+   MemoryContextSetParent(*context, new_pdcxt);
+   *context = new_pdcxt;

Would it be a bit more readable to just duplicate this stanza in the
blocks that assign to rd_partdesc_nodetached and rd_partdesc,
respectively?  That's not much code to duplicate and it'd be easier to
see which context is for which partdesc.

+   TransactionId rd_partdesc_nodetached_xmin;  /* xmin for the above */

Could you please expand this description a bit?

-- 
Amit Langote
EDB: http://www.enterprisedb.com




Re: Skip temporary table schema name from explain-verbose output.

2021-04-28 Thread Tom Lane
Greg Stark  writes:
>> On Tue, Apr 27, 2021 at 7:08 PM Bharath Rupireddy
>> > Make sense, we would lose the ability to differentiate temporary
>> tables from the auto_explain logs.

> There's no useful differentiation that can be done with the temp
> schema name.

Agreed.

> I would say it makes sense to remove them -- except perhaps it makes
> it harder to parse explain output.

I don't think we should remove them.  However, it could make sense to
print the "pg_temp" alias instead of the real schema name when we
are talking about myTempNamespace.  Basically try to make that alias
a bit less leaky.

regards, tom lane
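
(A minimal illustration of the difference, assuming a session whose temp
schema happens to be pg_temp_3; the second plan line is hypothetical,
showing what the proposed output would look like:)

    CREATE TEMP TABLE tt (x int);
    EXPLAIN (VERBOSE, COSTS OFF) SELECT x FROM tt;
    -- current verbose output includes the session-specific schema:
    --   Seq Scan on pg_temp_3.tt
    -- with the stable alias printed instead, it would read:
    --   Seq Scan on pg_temp.tt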




Re: pg_hba.conf.sample wording improvement

2021-04-28 Thread Tom Lane
Alvaro Herrera  writes:
> Looking at it now, I wonder how well do the "hostno" options work.  If I
> say "hostnogssenc", is an SSL-encrypted socket good?  If I say
> "hostnossl", is a GSS-encrypted socket good?  If so, how does that make
> sense?

Kind of off-topic for this thread, but I wonder if we should introduce
"hostenc" and "hostnoenc" to mean "encrypted (or not), and I don't care
by which method".  The addition of GSS has made it painful to express
those concepts.

regards, tom lane




Re: pg_upgrade fails to detect unsupported arrays and ranges

2021-04-28 Thread Tom Lane
[ blast-from-the-past department ]

I wrote:
> Daniel Gustafsson  writes:
>> I can see the appeal of
>> including it before the wrap, even though I personally would've held off.

> Nah, I'm not gonna risk it at this stage.  I concur with your point
> that this is an ancient bug, and one that is unlikely to bite many
> people.  I'll push it Wednesday or so.

I happened across a couple of further pg_upgrade oversights in the
same vein as 29aeda6e4 et al:

* Those commits fixed the bugs in pg_upgrade/version.c about not
checking the contents of arrays/ranges/etc, but there are two
similar functions in pg_upgrade/check.c that I failed to notice
(probably due to the haste with which that patch was prepared).

* We really need to also reject user tables that contain instances
of system-defined composite types (i.e. catalog rowtypes), because
except for a few bootstrap catalogs, those type OIDs are assigned by
genbki.pl not by hand, so they aren't stable across major versions.
For example, in HEAD I get

regression=# select 'pg_enum'::regtype::oid;
  oid  
---
 13045
(1 row)

but the same OID was 12022 in v13, 11551 in v11, etc.  So if you
had a column of type pg_enum, you'd likely get no-such-type-OID
failures when reading the values after an upgrade.  I don't see
much use-case for doing such a thing, so it seems saner to just
block off the possibility rather than trying to support it.
(We'd have little choice in the back branches anyway, as their
OID values are locked down now.)
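
(For illustration, a hand-rolled query along these lines shows whether a
cluster contains such columns; this is a simplified sketch, not the check
the patch adds, and it ignores the domain/array/range wrapping that the
real check recurses through:)

    SELECT n.nspname, c.relname, a.attname, t.typname
    FROM pg_catalog.pg_attribute a
         JOIN pg_catalog.pg_class c ON c.oid = a.attrelid
         JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
         JOIN pg_catalog.pg_type t ON t.oid = a.atttypid
    WHERE c.relkind IN ('r', 'm')
      AND a.attnum > 0 AND NOT a.attisdropped
      AND t.typtype = 'c' AND t.oid < 16384   -- system-defined composites
      AND n.nspname NOT IN ('pg_catalog', 'information_schema');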

The attached proposed patch fixes these cases too.  I generalized
the recursive query a little more so that it could start from an
arbitrary query yielding pg_type OIDs, rather than just one type
name; otherwise it's pretty straightforward.

Barring objections I'll apply and back-patch this soon.

regards, tom lane

diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 1c1c908664..ce821a46b3 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -24,6 +24,7 @@ static void check_for_prepared_transactions(ClusterInfo *cluster);
 static void check_for_isn_and_int8_passing_mismatch(ClusterInfo *cluster);
 static void check_for_user_defined_postfix_ops(ClusterInfo *cluster);
 static void check_for_tables_with_oids(ClusterInfo *cluster);
+static void check_for_composite_data_type_usage(ClusterInfo *cluster);
 static void check_for_reg_data_type_usage(ClusterInfo *cluster);
 static void check_for_jsonb_9_4_usage(ClusterInfo *cluster);
 static void check_for_pg_role_prefix(ClusterInfo *cluster);
@@ -100,6 +101,7 @@ check_and_dump_old_cluster(bool live_check)
 	check_is_install_user(&old_cluster);
 	check_proper_datallowconn(&old_cluster);
 	check_for_prepared_transactions(&old_cluster);
+	check_for_composite_data_type_usage(&old_cluster);
 	check_for_reg_data_type_usage(&old_cluster);
 	check_for_isn_and_int8_passing_mismatch(&old_cluster);
 
@@ -1043,6 +1045,63 @@ check_for_tables_with_oids(ClusterInfo *cluster)
 }
 
 
+/*
+ * check_for_composite_data_type_usage()
+ *	Check for system-defined composite types used in user tables.
+ *
+ *	The OIDs of rowtypes of system catalogs and information_schema views
+ *	can change across major versions; unlike user-defined types, we have
+ *	no mechanism for forcing them to be the same in the new cluster.
+ *	Hence, if any user table uses one, that's problematic for pg_upgrade.
+ */
+static void
+check_for_composite_data_type_usage(ClusterInfo *cluster)
+{
+	bool		found;
+	Oid			firstUserOid;
+	char		output_path[MAXPGPATH];
+	char	   *base_query;
+
+	prep_status("Checking for system-defined composite types in user tables");
+
+	snprintf(output_path, sizeof(output_path), "tables_using_composite.txt");
+
+	/*
+	 * Look for composite types that were made during initdb *or* belong to
+	 * information_schema; that's important in case information_schema was
+	 * dropped and reloaded.
+	 *
+	 * The cutoff OID here should match the source cluster's value of
+	 * FirstNormalObjectId.  We hardcode it rather than using that C #define
+	 * because, if that #define is ever changed, our own version's value is
+	 * NOT what to use.  Eventually we may need a test on the source cluster's
+	 * version to select the correct value.
+	 */
+	firstUserOid = 16384;
+
+	base_query = psprintf("SELECT t.oid FROM pg_catalog.pg_type t "
+		  "LEFT JOIN pg_catalog.pg_namespace n ON t.typnamespace = n.oid "
+		  " WHERE typtype = 'c' AND (t.oid < %u OR nspname = 'information_schema')",
+		  firstUserOid);
+
+	found = check_for_data_types_usage(cluster, base_query, output_path);
+
+	free(base_query);
+
+	if (found)
+	{
+		pg_log(PG_REPORT, "fatal\n");
+		pg_fatal("Your installation contains system-defined composite type(s) in user tables.\n"
+ "These type OIDs are not stable across PostgreSQL versions,\n"
+ "so this cluster cannot currently be upgraded.  You can\n"
+ "remove the problem tables and restart the upgrade.\n"
+ "A list of the p

Re: Attach to shared memory after fork()

2021-04-28 Thread 邱宇航(烛远)
> Windows has CreateProcess, which isn't available elsewhere.
Yes, we still need fork() on *nix. So the solution is to reduce the
overhead of fork(). Attaching to shared memory after fork() might be a
form of "better shared memory management".

> This is one of the reasons for using a connection pooler like pgbouncer,
> which can vastly reduce the number of new process creations Postgres has
to do.
Yes, that’s another approach I forgot to mention. But I think there should be
a cleaner way that doesn’t require an extra component.

> This proposal seems moderately insane.  In the first place, it
> introduces failure modes we could do without, and in the second place,
> how is it not strictly *more* expensive than what happens now?  You
> still have to end up with all those TLB entries mapped in the child.
Yes, the idea is radical. But it’s practical.
1. I don’t quite catch that. Can you explain it?
2. Yes, the overall cost is still the same, but it can be spread across
multiple processes and thus CPUs, instead of falling 100% on the postmaster.

> (If your kernel is unable to pass down shared-memory TLBs effectively,
> ISTM that's a kernel shortcoming not a Postgres architectural problem.)
Indeed, it’s a kernel/CPU-architecture shortcoming. But it is also a Postgres
architectural problem. MySQL and Oracle have no such problem.
IMHO Postgres should manage these things itself (e.g. I/O, buffer pool,
connections, ...) and not rely so much on the OS kernel...

fork() used to be a genius hack, but now it’s a burden and it will get
worse and worse. All I want to do is remove fork() or reduce its overhead.
Maybe *nix will get CreateProcess someday (and I think it will); should we
just wait for it?
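
(A minimal sketch of the attach-after-fork idea, using plain POSIX shared
memory rather than Postgres's own shared-memory machinery; error handling
omitted:)

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int
    main(void)
    {
        /* parent creates and sizes the segment but never maps it */
        int     fd = shm_open("/demo_seg", O_CREAT | O_RDWR, 0600);

        ftruncate(fd, 1 << 20);

        if (fork() == 0)
        {
            /* child maps the segment only after fork(), so the parent
             * never carries the mapping or its page-table entries */
            char   *base = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                                MAP_SHARED, fd, 0);

            base[0] = 1;        /* use the shared segment */
            munmap(base, 1 << 20);
            _exit(0);
        }
        shm_unlink("/demo_seg");    /* the child's fd keeps it alive */
        return 0;
    }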



Re: [PATCH] expand the units that pg_size_pretty supports on output

2021-04-28 Thread David Christensen
A second patch to teach the same units to pg_size_bytes().

Best,

David

On Wed, Apr 14, 2021 at 11:13 AM David Christensen <
david.christen...@crunchydata.com> wrote:

> Hi folks,
>
> Enclosed is a patch that expands the unit output for
> pg_size_pretty(numeric) going up to Yottabytes; I reworked the existing
> numeric output code to account for the larger number of units we're using
> rather than just adding nesting levels.
>
> There are also a few other places that could benefit from expanded units,
> but this is a standalone starting point.
>
> Best,
>
> David
>


0001-Expand-the-supported-units-in-pg_size_bytes-to-cover.patch
Description: Binary data
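
(A hedged illustration of what the two patches are meant to allow; the unit
names beyond TB are assumptions based on the patch descriptions, not
something current releases accept:)

    -- accepted today:
    SELECT pg_size_bytes('1 TB'), pg_size_pretty(10::numeric ^ 18);
    -- with both patches applied, something like this would presumably work
    -- as well (hypothetical until committed):
    SELECT pg_size_bytes('1 PB'), pg_size_pretty(10::numeric ^ 24);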


Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

2021-04-28 Thread Álvaro Herrera
Thanks for re-reviewing! This one I hope is the last version.

On Wed, Apr 28, 2021, at 10:21 AM, Amit Langote wrote:
> I noticed that rd_partdesc_nodetached_xmin can sometimes end up with
> value 0. While you seem to be already aware of that, because otherwise
> you wouldn't have added TransactionIdIsValid(...) in condition in
> RelationGetPartitionDesc(), the comments nearby don't mention why such
> a thing might happen.  Also, I guess it can't be helped that the
> partdesc_nodetached will have to be leaked when the xmin is 0, but
> that shouldn't be as problematic as the case we discussed earlier.

The only case I am aware of where that can happen is if the pg_inherits tuple
is frozen.  (That's exactly what the affected test case was testing; note the
"VACUUM FREEZE pg_inherits" there.)  So that test case blew up immediately; but
I think the real-world chances that people are going to be doing that are
pretty low, so I'm not really concerned about the leak.

> Would it be a bit more readable to just duplicate this stanza in the
> blocks that assign to rd_partdesc_nodetached and rd_partdesc,
> respectively?  That's not much code to duplicate and it'd be easier to
> see which context is for which partdesc.

Sure .. that's how I first wrote this code.  We don't use that style much, so 
I'm OK with backing out of it.

> +   TransactionId rd_partdesc_nodetached_xmin;  /* xmin for the above */
> 
> Could you please expand this description a bit?

Done.From 8b1489a17cb5cfcd30d4dfd18020c64e606c5042 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 27 Apr 2021 19:04:37 -0400
Subject: [PATCH v5] Allow a partdesc-omitting-partitions to be cached

Makes partition descriptor acquisition faster during the transient
period in which a partition is in the process of being detached.

(This incidentally fixes a bug in 8aba9322511 whereby a memory context
holding a transient partdesc is reparented to NULL, leading to permanent
leak of that memory.  Reported by Amit Langote.)

Per gripe from Amit Langote
Discussion: https://postgr.es/m/ca+hiwqfgpp1lxjzobygt9rpvtjxxkg5qg2+xch2z1q7krqz...@mail.gmail.com
---
 src/backend/catalog/pg_inherits.c |  52 -
 src/backend/commands/tablecmds.c  |  66 ++-
 src/backend/commands/trigger.c|   3 +-
 src/backend/partitioning/partdesc.c   | 101 +++-
 src/backend/utils/cache/relcache.c|  33 +-
 src/include/catalog/pg_inherits.h |   6 +-
 src/include/utils/rel.h   |  15 +++
 .../detach-partition-concurrently-3.out   | 110 +-
 .../detach-partition-concurrently-3.spec  |  14 ++-
 9 files changed, 327 insertions(+), 73 deletions(-)

diff --git a/src/backend/catalog/pg_inherits.c b/src/backend/catalog/pg_inherits.c
index 6447b52854..c373faf2d6 100644
--- a/src/backend/catalog/pg_inherits.c
+++ b/src/backend/catalog/pg_inherits.c
@@ -52,6 +52,22 @@ typedef struct SeenRelsEntry
  * then no locks are acquired, but caller must beware of race conditions
  * against possible DROPs of child relations.
  *
+ * Partitions marked as being detached are omitted; see
+ * find_inheritance_children_extended for details.
+ */
+List *
+find_inheritance_children(Oid parentrelId, LOCKMODE lockmode)
+{
+	return find_inheritance_children_extended(parentrelId, true, lockmode,
+			  NULL, NULL);
+}
+
+/*
+ * find_inheritance_children_extended
+ *
+ * As find_inheritance_children, with more options regarding detached
+ * partitions.
+ *
  * If a partition's pg_inherits row is marked "detach pending",
  * *detached_exist (if not null) is set true.
  *
@@ -60,11 +76,13 @@ typedef struct SeenRelsEntry
  * marked "detach pending" is visible to that snapshot, then that partition is
  * omitted from the output list.  This makes partitions invisible depending on
  * whether the transaction that marked those partitions as detached appears
- * committed to the active snapshot.
+ * committed to the active snapshot.  In addition, *detached_xmin (if not null)
+ * is set to the xmin of the row of the detached partition.
  */
 List *
-find_inheritance_children(Oid parentrelId, bool omit_detached,
-		  LOCKMODE lockmode, bool *detached_exist)
+find_inheritance_children_extended(Oid parentrelId, bool omit_detached,
+   LOCKMODE lockmode, bool *detached_exist,
+   TransactionId *detached_xmin)
 {
 	List	   *list = NIL;
 	Relation	relation;
@@ -132,7 +150,32 @@ find_inheritance_children(Oid parentrelId, bool omit_detached,
 snap = GetActiveSnapshot();
 
 if (!XidInMVCCSnapshot(xmin, snap))
+{
+	if (detached_xmin)
+	{
+		/*
+		 * Two detached partitions should not occur (see
+		 * checks in MarkInheritDetached), but if they do,
+		 * track the newer of the two.  Make sure to warn the
+		 * user, so that they can clean up.  Since this is
+		 * just a cross-check against potentially corrupt
+		 * catalogs, we don't 

Re: WIP: WAL prefetch (another approach)

2021-04-28 Thread Andres Freund
Hi,

On 2021-04-22 13:59:58 +1200, Thomas Munro wrote:
> On Thu, Apr 22, 2021 at 1:21 PM Tom Lane  wrote:
> > I've also tried to reproduce on 32-bit and 64-bit Intel, without
> > success.  So if this is real, maybe it's related to being big-endian
> > hardware?  But it's also quite sensitive to $dunno-what, maybe the
> > history of WAL records that have already been replayed.
> 
> Ah, that's interesting.  There are a couple of sparc64 failures and a
> ppc64 failure in the build farm, but I couldn't immediately spot what
> was wrong with them or whether it might be related to this stuff.
> 
> Thanks for the clues.  I'll see what unusual systems I can find to try
> this on

FWIW, I've run 32 and 64 bit x86 through several hundred regression
cycles, without hitting an issue. For a lot of them I set
checkpoint_timeout to a lower value as I thought that might make it more
likely to reproduce an issue.

Tom, any chance you could check if your machine repros the issue before
these commits?

Greetings,

Andres Freund




Re: WIP: WAL prefetch (another approach)

2021-04-28 Thread Tom Lane
Andres Freund  writes:
> Tom, any chance you could check if your machine repros the issue before
> these commits?

Wilco, but it'll likely take a little while to get results ...

regards, tom lane




strange case of "if ((a & b))"

2021-04-28 Thread Justin Pryzby
These look strange to me - the inner parens don't do anything.
I wouldn't write it with 2x parens for the same reason I wouldn't write it with
8x parens.

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index 9f159eb3db..3bbc13c443 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -693,7 +693,7 @@ check_tuple_header(HeapCheckContext *ctx)
report_corruption(ctx,
  psprintf("tuple data 
should begin at byte %u, but actually begins at byte %u (1 attribute, has 
nulls)",
   
expected_hoff, ctx->tuphdr->t_hoff));
-   else if ((infomask & HEAP_HASNULL))
+   else if ((infomask & HEAP_HASNULL) != 0)
report_corruption(ctx,
  psprintf("tuple data 
should begin at byte %u, but actually begins at byte %u (%u attributes, has 
nulls)",
   
expected_hoff, ctx->tuphdr->t_hoff, ctx->natts));
diff --git a/contrib/ltree/ltree_io.c b/contrib/ltree/ltree_io.c
index 15115cb29f..0dd2838f8b 100644
--- a/contrib/ltree/ltree_io.c
+++ b/contrib/ltree/ltree_io.c
@@ -661,17 +661,17 @@ deparse_lquery(const lquery *in)
}
memcpy(ptr, curtlevel->name, curtlevel->len);
ptr += curtlevel->len;
-   if ((curtlevel->flag & LVAR_SUBLEXEME))
+   if ((curtlevel->flag & LVAR_SUBLEXEME) != 0)
{
*ptr = '%';
ptr++;
}
-   if ((curtlevel->flag & LVAR_INCASE))
+   if ((curtlevel->flag & LVAR_INCASE) != 0)
{
*ptr = '@';
ptr++;
}
-   if ((curtlevel->flag & LVAR_ANYEND))
+   if ((curtlevel->flag & LVAR_ANYEND) != 0)
{
*ptr = '*';
ptr++;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 13396eb7f2..f5a4db5c57 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2107,7 +2107,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId 
cid,
vmstatus = visibilitymap_get_status(relation,
 
BufferGetBlockNumber(buffer), &vmbuffer);
 
-   if ((starting_with_empty_page || vmstatus & 
VISIBILITYMAP_ALL_FROZEN))
+   if (starting_with_empty_page ||
+   (vmstatus & VISIBILITYMAP_ALL_FROZEN) != 0)
all_frozen_set = true;
}
 
diff --git a/src/backend/access/transam/xact.c 
b/src/backend/access/transam/xact.c
index 441445927e..28fdd2943b 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2417,7 +2417,7 @@ PrepareTransaction(void)
 * cases, such as a temp table created and dropped all within the
 * transaction.  That seems to require much more bookkeeping though.
 */
-   if ((MyXactFlags & XACT_FLAGS_ACCESSEDTEMPNAMESPACE))
+   if ((MyXactFlags & XACT_FLAGS_ACCESSEDTEMPNAMESPACE) != 0)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 errmsg("cannot PREPARE a transaction that has 
operated on temporary objects")));
@@ -5530,7 +5530,7 @@ XactLogCommitRecord(TimestampTz commit_time,
xl_xinfo.xinfo |= XACT_COMPLETION_UPDATE_RELCACHE_FILE;
if (forceSyncCommit)
xl_xinfo.xinfo |= XACT_COMPLETION_FORCE_SYNC_COMMIT;
-   if ((xactflags & XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK))
+   if ((xactflags & XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK) != 0)
xl_xinfo.xinfo |= XACT_XINFO_HAS_AE_LOCKS;
 
/*
@@ -5681,7 +5681,7 @@ XactLogAbortRecord(TimestampTz abort_time,
 
xlrec.xact_time = abort_time;
 
-   if ((xactflags & XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK))
+   if ((xactflags & XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK) != 0)
xl_xinfo.xinfo |= XACT_XINFO_HAS_AE_LOCKS;
 
if (nsubxacts > 0)
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 8e717ada28..f341e6d143 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -16029,7 +16029,7 @@ PreCommit_on_commit_actions(void)

Re: strange case of "if ((a & b))"

2021-04-28 Thread Tom Lane
Justin Pryzby  writes:
> These look strange to me - the inner parens don't do anything.
> I wouldn't write it with 2x parens for the same reason I wouldn't write it 
> with
> 8x parens.

Agreed, but shouldn't we just drop the excess parens rather than
doubling down on useless notation?

regards, tom lane




Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

2021-04-28 Thread Álvaro Herrera
Pushed that now, with a one-line addition to the docs that only one
partition can be marked detached.

-- 
Álvaro Herrera               39°49'30"S 73°17'W
"That sort of implies that there are Emacs keystrokes which aren't obscure.
I've been using it daily for 2 years now and have yet to discover any key
sequence which makes any sense."(Paul Thomas)




Re: Better sanity checking for message length words

2021-04-28 Thread Tom Lane
Aleksander Alekseev  writes:
> Sorry, my bad. It was about lines separating on different platforms. The
> patch is fine and passes installcheck-world on MacOS.

Pushed, thanks for looking at it!

regards, tom lane




Re: pg_upgrade fails to detect unsupported arrays and ranges

2021-04-28 Thread Daniel Gustafsson
> On 28 Apr 2021, at 17:09, Tom Lane  wrote:
> 
> [ blast-from-the-past department ]
> 
> I wrote:
>> Daniel Gustafsson  writes:
>>> I can see the appeal of
>>> including it before the wrap, even though I personally would've held off.
> 
>> Nah, I'm not gonna risk it at this stage.  I concur with your point
>> that this is an ancient bug, and one that is unlikely to bite many
>> people.  I'll push it Wednesday or so.
> 
> I happened across a couple of further pg_upgrade oversights in the
> same vein as 29aeda6e4 et al:

Nice find, this makes a lot of sense.

> ..the same OID was 12022 in v13, 11551 in v11, etc.  So if you
> had a column of type pg_enum, you'd likely get no-such-type-OID
> failures when reading the values after an upgrade.  I don't see
> much use-case for doing such a thing, so it seems saner to just
> block off the possibility rather than trying to support it.

Agreed.  Having implemented basically this for Greenplum, I think it’s wise to
avoid it unless we really have to; it gets very complicated once the layers of
worms are peeled back.

> The attached proposed patch fixes these cases too.  I generalized
> the recursive query a little more so that it could start from an
> arbitrary query yielding pg_type OIDs, rather than just one type
> name; otherwise it's pretty straightforward.
> 
> Barring objections I'll apply and back-patch this soon.

Patch LGTM on reading, +1 on applying.  Being on parental leave I don’t have my
dev env ready to go so I didn’t perform testing; sorry about that.

> + pg_fatal("Your installation contains system-defined composite 
> type(s) in user tables.\n"
> +  "These type OIDs are not stable across 
> PostgreSQL versions,\n"
> +  "so this cluster cannot currently be upgraded. 
>  You can\n"
> +  "remove the problem tables and restart the 
> upgrade.\n"
> +  "A list of the problem columns is in the 
> file:\n"

Would it be helpful to inform the user that they can alter/drop just the
problematic columns as a potentially less scary alternative to dropping the
entire table?

> -  * The type of interest might be wrapped in a domain, array,
> +  * The types of interest might be wrapped in a domain, array,

Shouldn't this be "type(s)” as in the other changes here?

--
Daniel Gustafsson   https://vmware.com/





Re: Replication slot stats misgivings

2021-04-28 Thread Tom Lane
It seems that the test case added by f5fc2f5b2 is still a bit
unstable, even after c64dcc7fe:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=peripatus&dt=2021-04-23%2006%3A20%3A12

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=peripatus&dt=2021-04-24%2018%3A20%3A10

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snapper&dt=2021-04-28%2017%3A53%3A14

(The snapper run fails to show regression.diffs, so it's not certain
that it's the same failure as peripatus, but ...)

regards, tom lane




Re: pg_upgrade fails to detect unsupported arrays and ranges

2021-04-28 Thread Tom Lane
Daniel Gustafsson  writes:
>> On 28 Apr 2021, at 17:09, Tom Lane  wrote:
>> +pg_fatal("Your installation contains system-defined composite 
>> type(s) in user tables.\n"
>> + "These type OIDs are not stable across 
>> PostgreSQL versions,\n"
>> + "so this cluster cannot currently be upgraded. 
>>  You can\n"
>> + "remove the problem tables and restart the 
>> upgrade.\n"
>> + "A list of the problem columns is in the 
>> file:\n"

> Would it be helpful to inform the user that they can alter/drop just the
> problematic columns as a potentially less scary alternative to dropping the
> entire table?

This wording is copied-and-pasted from the other similar tests.  I agree
that it's advocating a solution that might be overkill, but if we change
it we should also change the existing messages.  I don't mind doing
that in HEAD; less sure about the back branches, as (I think) these
are translatable strings.

Thoughts?

>> - * The type of interest might be wrapped in a domain, array,
>> + * The types of interest might be wrapped in a domain, array,

> Shouldn't this be "type(s)” as in the other changes here?

Fair enough.

regards, tom lane




Re: pg_upgrade fails to detect unsupported arrays and ranges

2021-04-28 Thread Daniel Gustafsson
> On 28 Apr 2021, at 22:47, Tom Lane  wrote:
> 
> Daniel Gustafsson  writes:
>>> On 28 Apr 2021, at 17:09, Tom Lane  wrote:
>>> +   pg_fatal("Your installation contains system-defined composite 
>>> type(s) in user tables.\n"
>>> +"These type OIDs are not stable across 
>>> PostgreSQL versions,\n"
>>> +"so this cluster cannot currently be upgraded. 
>>>  You can\n"
>>> +"remove the problem tables and restart the 
>>> upgrade.\n"
>>> +"A list of the problem columns is in the 
>>> file:\n"
> 
>> Would it be helpful to inform the user that they can alter/drop just the
>> problematic columns as a potentially less scary alternative to dropping the
>> entire table?
> 
> This wording is copied-and-pasted from the other similar tests.  I agree
> that it's advocating a solution that might be overkill, but if we change
> it we should also change the existing messages.  

Good point.

> I don't mind doing that in HEAD; less sure about the back branches, as

I think it would be helpful for users to try and give slightly more expanded
advice while (obviously) still always being safe.  I’m happy to take a crack at
that once back unless someone beats me to it.

> (I think) these are translatable strings.


If they aren't, I think we should try to make them so, in order to reduce
language-barrier problems in such important messages as far as we can.

--
Daniel Gustafsson   https://vmware.com/





Re: pg_upgrade fails to detect unsupported arrays and ranges

2021-04-28 Thread Tom Lane
Daniel Gustafsson  writes:
> On 28 Apr 2021, at 22:47, Tom Lane  wrote:
>> This wording is copied-and-pasted from the other similar tests.  I agree
>> that it's advocating a solution that might be overkill, but if we change
>> it we should also change the existing messages.  

> Good point.

>> I don't mind doing that in HEAD; less sure about the back branches, as

> I think it would be helpful for users to try and give slightly more expanded
> advice while (obviously) still always being safe.  I’m happy to take a crack 
> at
> that once back unless someone beats me to it.

Seems like s/remove the problem tables/drop the problem columns/
is easy and sufficient.

>> (I think) these are translatable strings.

> If they aren't, I think we should try to make them so, in order to reduce
> language-barrier problems in such important messages as far as we can.

Checking, I see they do appear in pg_upgrade's po files.  So I propose
that we change the existing messages in HEAD but not the back branches.

regards, tom lane




Re: SQL/JSON: functions

2021-04-28 Thread Andrew Dunstan
On Fri, Mar 26, 2021 at 9:14 PM Nikita Glukhov 
wrote:

> Attached 54th version of the patches rebased onto current master.
>
>
> On 27.03.2021 01:30, Andrew Dunstan wrote:
>
> Specifically, patch 4 (SQL-JSON-query-functions) fails with this when
> built with LLVM:
>
>
> There is also a bug that results in a warning in gram.y, but fixing it
> doesn't affect this issue. Nikita, please look into this ASAP.
>
> LLVM issues and gram.y are fixed.
>
>
>
>
It's apparently bitrotted again. See <
http://cfbot.cputube.org/patch_33_2901.log>

cheers

andrew


gcc 11.1.0 warnings in llvmjit_expr.c

2021-04-28 Thread Erik Rijkers



Hello,

gcc 11.1.0 produces quite a litany of second thoughts when compiling 
llvmjit_expr.c; is there something in it?

(all 'warning' or 'note')


llvmjit_expr.c: In function ‘llvm_compile_expr’:
llvmjit_expr.c:69:9: warning: ‘build_EvalXFuncInt’ accessing 8 bytes in a 
region of size 0 [-Wstringop-overflow=]
   69 | build_EvalXFuncInt(b, mod, funcname, v_state, op, \
  | ^~~
   70 |
lengthof(((LLVMValueRef[]){__VA_ARGS__})), \
  |

   71 |
((LLVMValueRef[]){__VA_ARGS__}))
  |

llvmjit_expr.c:1553:33: note: in expansion of macro ‘build_EvalXFunc’
 1553 | build_EvalXFunc(b, mod, 
"ExecEvalSQLValueFunction",
  | ^~~
llvmjit_expr.c:69:9: note: referencing argument 7 of type ‘struct 
LLVMOpaqueValue **’
   69 | build_EvalXFuncInt(b, mod, funcname, v_state, op, \
  | ^~~
   70 |
lengthof(((LLVMValueRef[]){__VA_ARGS__})), \
  |

   71 |
((LLVMValueRef[]){__VA_ARGS__}))
  |

llvmjit_expr.c:1553:33: note: in expansion of macro ‘build_EvalXFunc’
 1553 | build_EvalXFunc(b, mod, 
"ExecEvalSQLValueFunction",
  | ^~~
llvmjit_expr.c:2460:1: note: in a call to function ‘build_EvalXFuncInt’
 2460 | build_EvalXFuncInt(LLVMBuilderRef b, LLVMModuleRef mod, const char 
*funcname,
  | ^~
llvmjit_expr.c:69:9: warning: ‘build_EvalXFuncInt’ accessing 8 bytes in a 
region of size 0 [-Wstringop-overflow=]
   69 | build_EvalXFuncInt(b, mod, funcname, v_state, op, \
  | ^~~
   70 |
lengthof(((LLVMValueRef[]){__VA_ARGS__})), \
  |

   71 |
((LLVMValueRef[]){__VA_ARGS__}))
  |

llvmjit_expr.c:1559:33: note: in expansion of macro ‘build_EvalXFunc’
 1559 | build_EvalXFunc(b, mod, 
"ExecEvalCurrentOfExpr",
  | ^~~
llvmjit_expr.c:69:9: note: referencing argument 7 of type ‘struct 
LLVMOpaqueValue **’
   69 | build_EvalXFuncInt(b, mod, funcname, v_state, op, \
  | ^~~
   70 |
lengthof(((LLVMValueRef[]){__VA_ARGS__})), \
  |

   71 |
((LLVMValueRef[]){__VA_ARGS__}))
  |

llvmjit_expr.c:1559:33: note: in expansion of macro ‘build_EvalXFunc’
 1559 | build_EvalXFunc(b, mod, 
"ExecEvalCurrentOfExpr",
  | ^~~
llvmjit_expr.c:2460:1: note: in a call to function ‘build_EvalXFuncInt’
 2460 | build_EvalXFuncInt(LLVMBuilderRef b, LLVMModuleRef mod, const char 
*funcname,
  | ^~
llvmjit_expr.c:69:9: warning: ‘build_EvalXFuncInt’ accessing 8 bytes in a 
region of size 0 [-Wstringop-overflow=]
   69 | build_EvalXFuncInt(b, mod, funcname, v_state, op, \
  | ^~~
   70 |
lengthof(((LLVMValueRef[]){__VA_ARGS__})), \
  |

   71 |
((LLVMValueRef[]){__VA_ARGS__}))
  |

llvmjit_expr.c:1565:33: note: in expansion of macro ‘build_EvalXFunc’
 1565 | build_EvalXFunc(b, mod, 
"ExecEvalNextValueExpr",
  | ^~~
llvmjit_expr.c:69:9: note: referencing argument 7 of type ‘struct 
LLVMOpaqueValue **’
   69 | build_EvalXFuncInt(b, mod, funcname, v_state, op, \
  | ^~~
   70 |
lengthof(((LLVMValueRef[]){__VA_ARGS__})), \
 

Re: WIP: WAL prefetch (another approach)

2021-04-28 Thread Thomas Munro
On Thu, Apr 29, 2021 at 4:45 AM Tom Lane  wrote:
> Andres Freund  writes:
> > Tom, any chance you could check if your machine repros the issue before
> > these commits?
>
> Wilco, but it'll likely take a little while to get results ...

FWIW I also chewed through many megawatts trying to reproduce this on
a PowerPC system in 64 bit big endian mode, with an emulator.  No
cigar.  However, it's so slow that I didn't make it to 10 runs...




Re: WIP: WAL prefetch (another approach)

2021-04-28 Thread Tom Lane
Thomas Munro  writes:
> FWIW I also chewed through many megawatts trying to reproduce this on
> a PowerPC system in 64 bit big endian mode, with an emulator.  No
> cigar.  However, it's so slow that I didn't make it to 10 runs...

Speaking of megawatts ... my G4 has now finished about ten cycles of
installcheck-parallel without a failure, which isn't really enough
to draw any conclusions yet.  But I happened to notice the
accumulated CPU time for the background processes:

USER   PID  %CPU %MEM  VSZRSS   TT  STAT STARTED  TIME COMMAND
tgl  19048   0.0  4.4   229952  92196   ??  Ss3:19PM  19:59.19 
postgres: startup recovering 000100140022
tgl  19051   0.0  0.1   229656   1696   ??  Ss3:19PM  27:09.14 
postgres: walreceiver streaming 14/227D8F14
tgl  19052   0.0  0.1   229904   2516   ??  Ss3:19PM  17:38.17 
postgres: walsender tgl [local] streaming 14/227D8F14 

IOW, we've spent over twice as many CPU cycles shipping data to the
standby as we did in applying the WAL on the standby.  Is this
expected?  I've got wal_consistency_checking = all, which is bloating
the WAL volume quite a bit, but still it seems like the walsender and
walreceiver have little excuse for spending more cycles per byte
than the startup process.

(This is testing b3ee4c503, so if Thomas' WAL changes improved
efficiency of the replay process at all, the discrepancy could be
even worse in HEAD.)

regards, tom lane




Re: Replication slot stats misgivings

2021-04-28 Thread Masahiko Sawada
On Thu, Apr 29, 2021 at 5:41 AM Tom Lane  wrote:
>
> It seems that the test case added by f5fc2f5b2 is still a bit
> unstable, even after c64dcc7fe:

Hmm, I don't see the exact cause yet but there are two possibilities:
some transactions were really spilled, and it showed the old stats due
to losing the drop (and create) slot messages. For the former case, it
seems better to create the slot just before the insertion and to set
logical_decoding_work_mem to the default (64MB). For the latter case,
maybe we can use a different slot name than the one used in other tests?

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/
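
(A sketch of the kind of rearrangement described above; the table and slot
names are placeholders, not the ones in the actual regression test, and the
final read-back may still need the usual wait-for-stats-collector logic:)

    CREATE TABLE stats_test (data text);
    SET logical_decoding_work_mem = '64MB';  -- the default, so one small row cannot spill
    SELECT 'init' FROM pg_create_logical_replication_slot('stats_check_slot', 'test_decoding');
    INSERT INTO stats_test VALUES ('a small row');
    SELECT count(*) > 0
      FROM pg_logical_slot_get_changes('stats_check_slot', NULL, NULL,
                                       'skip-empty-xacts', '1');
    SELECT slot_name, spill_txns, spill_count
      FROM pg_stat_replication_slots WHERE slot_name = 'stats_check_slot';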




Re: Use simplehash.h instead of dynahash in SMgr

2021-04-28 Thread David Rowley
On Thu, 29 Apr 2021 at 00:28, Yura Sokolov  wrote:
> The best result is with just:
>
> return (uint32)rnode->node.relNode;
>
> ie, relNode could be taken without mixing at all.
> relNode is unique inside single database, and almost unique among whole
> cluster
> since it is Oid.

I admit to having tried that too just to almost eliminate the cost of
hashing.  I just didn't consider it something we'd actually do.

The system catalogues are quite likely to all have the same
relfilenode in all databases, so for workloads that have a large
number of databases, looking up WAL records that touch the catalogues
is going to be pretty terrible.

The simplified hash function I wrote with just the relNode and dbNode
would be the least I'd be willing to entertain.  However, I just
wouldn't be surprised if there was a good reason for that being bad
too.
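
(For concreteness, a sketch of roughly what such a simplified function could
look like; this is only an illustration, not the exact function used in the
benchmarks below:)

    /* mix only dbNode and relNode; leave spcNode and backend out */
    static inline uint32
    relfilenodebackend_hash_simple(RelFileNodeBackend *rnode)
    {
        uint32      h = (uint32) rnode->node.dbNode;

        h *= 0x85ebca6b;                            /* cheap multiplicative mix */
        return h ^ (uint32) rnode->node.relNode;    /* keep relNode bits intact */
    }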


> So, I've repeated benchmark with different number of partitons (I tried
> to catch higher fillfactor) and less amount of inserted data (since I
> don't
> want to stress my SSD).
>
> partitions/ | dynahash | dynahash +  | simplehash | simplehash + |
> fillfactor  |  | simple func |   | simple func  |
> +--+-+--+
>   3500/0.43  |   3.73s  |   3.54s |   3.58s|   3.34s  |
>   3200/0.78  |   3.64s  |   3.46s |   3.47s|   3.25s  |
>   1500/0.74  |   3.18s  |   2.97s |   3.03s|   2.79s  |
>
> Fillfactor is effective fillfactor in simplehash with than number of
> partitions.
> I wasn't able to measure with fillfactor close to 0.9 since looks like
> simplehash tends to grow much earlier due to SH_GROW_MAX_MOVE.

Thanks for testing that.

I also ran some tests last night to test the 0.75 vs 0.9 fillfactor to
see if it made a difference.  The test was similar to last time, but I
trimmed the number of rows inserted from 100 million down to 25
million.  Last time I tested with 1000 partitions, this time with each
of: 100 200 300 400 500 600 700 800 900 1000 partitions.  There didn't
seem to be any point of testing lower than that as the minimum hash
table size is 512.

The averages over 10 runs were:

nparts  ff75    ff90
100     21.898  22.226
200     23.105  25.493
300     25.274  24.251
400     25.139  25.611
500     25.738  25.454
600     26.656  26.82
700     27.577  27.102
800     27.608  27.546
900     27.284  28.186
1000    29      28.153

Or to summarise a bit, we could just look at the sum of all the
results per fillfactor:

sum ff75 2592.79
sum ff90 2608.42 100.6%

fillfactor 75 did come out slightly faster, but only just. It seems
close enough that it might be better just to keep the 90 to save a
little memory and improve caching elsewhere.  Also, from below, you
can see that for 300, 500, 600, 700, 1000 tables tests, the hash
tables ended up the same size, yet there's a bit of variability in the
timing result.  So the 0.6% gain might just be noise.

I don't think it's worth making the fillfactor 75.

drowley@amd3990x:~/recoverylogs$ grep -rH -m 1 "collisions"
ff75_tb100.log:LOG:  size: 1024, members: 231, filled: 0.225586, total
chain: 33, max chain: 2, avg chain: 0.142857, total_collisions: 20,
max_collisions: 2, avg_collisions: 0.086580
ff90_tb100.log:LOG:  size: 512, members: 231, filled: 0.451172, total
chain: 66, max chain: 2, avg chain: 0.285714, total_collisions: 36,
max_collisions: 2, avg_collisions: 0.155844

ff75_tb200.log:LOG:  size: 1024, members: 431, filled: 0.420898, total
chain: 160, max chain: 4, avg chain: 0.371230, total_collisions: 81,
max_collisions: 3, avg_collisions: 0.187935
ff90_tb200.log:LOG:  size: 512, members: 431, filled: 0.841797, total
chain: 942, max chain: 9, avg chain: 2.185615, total_collisions: 134,
max_collisions: 3, avg_collisions: 0.310905

ff90_tb300.log:LOG:  size: 1024, members: 631, filled: 0.616211, total
chain: 568, max chain: 9, avg chain: 0.900158, total_collisions: 158,
max_collisions: 4, avg_collisions: 0.250396
ff75_tb300.log:LOG:  size: 1024, members: 631, filled: 0.616211, total
chain: 568, max chain: 9, avg chain: 0.900158, total_collisions: 158,
max_collisions: 4, avg_collisions: 0.250396

ff75_tb400.log:LOG:  size: 2048, members: 831, filled: 0.405762, total
chain: 341, max chain: 4, avg chain: 0.410349, total_collisions: 162,
max_collisions: 3, avg_collisions: 0.194946
ff90_tb400.log:LOG:  size: 1024, members: 831, filled: 0.811523, total
chain: 1747, max chain: 15, avg chain: 2.102286, total_collisions:
269, max_collisions: 3, avg_collisions: 0.323706

ff75_tb500.log:LOG:  size: 2048, members: 1031, filled: 0.503418,
total chain: 568, max chain: 5, avg chain: 0.550921, total_collisions:
219, max_collisions: 4, avg_collisions: 0.212415
ff90_tb500.log:LOG:  size: 2048, members: 1031, filled: 0.503418,
total chain: 568, max chain: 5, avg chain: 0.550921, total_collisions:
219, max_collisions: 4, avg_collisions: 0.212415

ff75_tb600.log:LOG:  size: 2048, members: 1231, filled: 0.601074,
total chain: 9

Re: WIP: WAL prefetch (another approach)

2021-04-28 Thread Andres Freund
Hi,

On 2021-04-28 19:24:53 -0400, Tom Lane wrote:
> But I happened to notice the accumulated CPU time for the background
> processes:
> 
> USER   PID  %CPU %MEM  VSZRSS   TT  STAT STARTED  TIME COMMAND
> tgl  19048   0.0  4.4   229952  92196   ??  Ss3:19PM  19:59.19 
> postgres: startup recovering 000100140022
> tgl  19051   0.0  0.1   229656   1696   ??  Ss3:19PM  27:09.14 
> postgres: walreceiver streaming 14/227D8F14
> tgl  19052   0.0  0.1   229904   2516   ??  Ss3:19PM  17:38.17 
> postgres: walsender tgl [local] streaming 14/227D8F14 
> 
> IOW, we've spent over twice as many CPU cycles shipping data to the
> standby as we did in applying the WAL on the standby.  Is this
> expected?  I've got wal_consistency_checking = all, which is bloating
> the WAL volume quite a bit, but still it seems like the walsender and
> walreceiver have little excuse for spending more cycles per byte
> than the startup process.

I don't really know how the time calculation works on mac. Is there a
chance it includes time spent doing IO? On the primary the WAL IO is
done by a lot of backends, but on the standby it's all going to be the
walreceiver. And the walreceiver does fsyncs in a not particularly
efficient manner.

FWIW, on my linux workstation no such difference is visible:
USER PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
andres   2910540  9.4  0.0 2237252 126680 ?  Ss   16:55   0:20 postgres: 
dev assert standby: startup recovering 00010002003F
andres   2910544  5.2  0.0 2236724 9260 ?Ss   16:55   0:11 postgres: 
dev assert standby: walreceiver streaming 2/3FDCF118
andres   2910545  2.1  0.0 2237036 10672 ?   Ss   16:55   0:04 postgres: 
dev assert: walsender andres [local] streaming 2/3FDCF118



> (This is testing b3ee4c503, so if Thomas' WAL changes improved
> efficiency of the replay process at all, the discrepancy could be
> even worse in HEAD.)

The prefetching isn't enabled by default, so I'd not expect meaningful
differences... And even with the prefetching enabled, our normal
regression tests largely are resident in s_b, so there shouldn't be much
prefetching.


Oh! I was about to ask how much shared buffers your primary / standby
have. And I think I may actually have reproduce a variant of the issue!

I previously had played around with different settings that I thought
might increase the likelihood of reproducing the problem. But this time
I set shared_buffers lower than before, and got:

2021-04-28 17:03:22.174 PDT [2913840][] LOG:  database system was shut down in 
recovery at 2021-04-28 17:03:11 PDT
2021-04-28 17:03:22.174 PDT [2913840][] LOG:  entering standby mode
2021-04-28 17:03:22.178 PDT [2913840][1/0] LOG:  redo starts at 2/416C6278
2021-04-28 17:03:37.628 PDT [2913840][1/0] LOG:  consistent recovery state 
reached at 4/7F5C3200
2021-04-28 17:03:37.628 PDT [2913840][1/0] FATAL:  invalid memory alloc request 
size 3053455757
2021-04-28 17:03:37.628 PDT [2913839][] LOG:  database system is ready to 
accept read only connections
2021-04-28 17:03:37.636 PDT [2913839][] LOG:  startup process (PID 2913840) 
exited with exit code 1

This reproduces across restarts. Yay, I guess.

Isn't it off that we get a "database system is ready to accept read only
connections"?

Greetings,

Andres Freund




Re: WIP: WAL prefetch (another approach)

2021-04-28 Thread Tom Lane
Andres Freund  writes:
> On 2021-04-28 19:24:53 -0400, Tom Lane wrote:
>> IOW, we've spent over twice as many CPU cycles shipping data to the
>> standby as we did in applying the WAL on the standby.

> I don't really know how the time calculation works on mac. Is there a
> chance it includes time spent doing IO?

I'd be pretty astonished if it did.  This is basically a NetBSD system
remember (in fact, this ancient macOS release is a good deal closer
to those roots than modern versions).  BSDen have never accounted for
time that way AFAIK.  Also, the "ps" man page says specifically that
that column is CPU time.

> Oh! I was about to ask how much shared buffers your primary / standby
> have. And I think I may actually have reproduced a variant of the issue!

Default configurations, so 128MB each.

regards, tom lane




Re: Use simplehash.h instead of dynahash in SMgr

2021-04-28 Thread Yura Sokolov

David Rowley писал 2021-04-29 02:51:
On Thu, 29 Apr 2021 at 00:28, Yura Sokolov  
wrote:

The best result is with just:

return (uint32)rnode->node.relNode;

ie, relNode could be taken without mixing at all.
relNode is unique inside single database, and almost unique among whole
cluster since it is Oid.


I admit to having tried that too just to almost eliminate the cost of
hashing.  I just didn't consider it something we'd actually do.

The system catalogues are quite likely to all have the same
relfilenode in all databases, so for workloads that have a large
number of databases, looking up WAL records that touch the catalogues
is going to be pretty terrible.


The only workload that could touch system catalogues in different
databases is recovery (and autovacuum?). Client backends can't be
connected to more than one database.

But nevertheless, you're right. I oversimplified it intentionally.
I wrote originally:

   hashcode = (uint32)rnode->node.dbNode * 0xc2b2ae35;
   hashcode ^= (uint32)rnode->node.relNode;
   return hashcode;

But before sending the mail I cut the dbNode part.
Still, the main point is that relNode can be put unmixed into the final
value. That way fewer collisions happen.



The simplified hash function I wrote with just the relNode and dbNode
would be the least I'd be willing to entertain.  However, I just
wouldn't be surprised if there was a good reason for that being bad
too.


So, I've repeated benchmark with different number of partitons (I tried
to catch higher fillfactor) and less amount of inserted data (since I
don't want to stress my SSD).

partitions/ | dynahash | dynahash +  | simplehash | simplehash + |
fillfactor  |  | simple func |   | simple func  |
+--+-+--+
  3500/0.43  |   3.73s  |   3.54s |   3.58s|   3.34s  |
  3200/0.78  |   3.64s  |   3.46s |   3.47s|   3.25s  |
  1500/0.74  |   3.18s  |   2.97s |   3.03s|   2.79s  |

Fillfactor is effective fillfactor in simplehash with than number of
partitions.
I wasn't able to measure with fillfactor close to 0.9 since looks like
simplehash tends to grow much earlier due to SH_GROW_MAX_MOVE.


Thanks for testing that.

I also ran some tests last night to test the 0.75 vs 0.9 fillfactor to
see if it made a difference.  The test was similar to last time, but I
trimmed the number of rows inserted from 100 million down to 25
million.  Last time I tested with 1000 partitions, this time with each
of: 100 200 300 400 500 600 700 800 900 1000 partitions.  There didn't
seem to be any point of testing lower than that as the minimum hash
table size is 512.

The averages over 10 runs were:

nparts  ff75    ff90
100     21.898  22.226
200     23.105  25.493
300     25.274  24.251
400     25.139  25.611
500     25.738  25.454
600     26.656  26.82
700     27.577  27.102
800     27.608  27.546
900     27.284  28.186
1000    29      28.153

Or to summarise a bit, we could just look at the sum of all the
results per fillfactor:

sum ff75 2592.79
sum ff90 2608.42 100.6%

fillfactor 75 did come out slightly faster, but only just. It seems
close enough that it might be better just to keep the 90 to save a
little memory and improve caching elsewhere.  Also, from below, you
can see that for 300, 500, 600, 700, 1000 tables tests, the hash
tables ended up the same size, yet there's a bit of variability in the
timing result.  So the 0.6% gain might just be noise.

I don't think it's worth making the fillfactor 75.


To be clear: I didn't change SH_FILLFACTOR. It was equal to 0.9.
I just was not able to catch a table with a fill factor of more than 0.78.
Looks like you've got it with 900 partitions :-)



Another line of thought for making it go faster would be to do
something like get rid of the hash status field from SMgrEntry.  That
could be either coded into a single bit we'd borrow from the hash
value, or it could be coded into the least significant bit of the data
field.  A pointer to palloc'd memory should always be MAXALIGNed,
which means at least the lower two bits are always zero.  We'd just
need to make sure and do something like "data & ~((uintptr_t) 3)" to
trim off the hash status bits before dereferencing the pointer.  That
would make the SMgrEntry 12 bytes on a 64-bit machine.  However, it
would also mean that some entries would span 2 cache lines, which
might affect performance a bit.


Then the data pointer itself will not be 8-byte aligned. While x86 is
mostly indifferent to this, it's doubtful this memory economy will pay
off.

regards,
Yura Sokolov.
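
(An illustration of the pointer-tagging idea quoted above; the struct and
helper names are invented for the example:)

    typedef struct SMgrEntryPacked
    {
        uint32      hash;
        uintptr_t   data;   /* palloc'd pointer with 2-bit status in the low bits */
    } SMgrEntryPacked;

    #define ENTRY_STATUS_MASK ((uintptr_t) 3)

    static inline void *
    entry_pointer(const SMgrEntryPacked *e)
    {
        /* MAXALIGNed palloc pointers always have zero low bits */
        return (void *) (e->data & ~ENTRY_STATUS_MASK);
    }

    static inline int
    entry_status(const SMgrEntryPacked *e)
    {
        return (int) (e->data & ENTRY_STATUS_MASK);
    }

    static inline void
    entry_set(SMgrEntryPacked *e, void *ptr, int status)
    {
        e->data = (uintptr_t) ptr | ((uintptr_t) status & ENTRY_STATUS_MASK);
    }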




Patch to allow pg_filedump to support reading of pg_filenode.map

2021-04-28 Thread Richard Yen
Hello Hackers,

I recently had to work on a case where some catalog files were
corrupt and/or missing.  One of the things we sought to inspect was
pg_filenode.map, but there was no tooling available to do so.

With the help of Álvaro H., I've put together a patch to allow pg_filedump
to do some rudimentary decoding of pg_filenode.map, so that it's at least
human-readable.  I had the idea to try to map the OIDs to relnames, but
that might get hairy, with TOAST table OIDs possibly changing (for pg_proc,
etc.)

It seems that Christoph Berg is the primary committer for the pg_filedump
project these days so he is cc'd here, but if there's some other place I
should be sending this kind of contribution to, please let me know.
Hopefully this will be helpful to others in the future.

Much appreciated,
--Richard


add_filenode_support.patch
Description: Binary data


Re: WIP: WAL prefetch (another approach)

2021-04-28 Thread Andres Freund
Hi,

On 2021-04-28 20:24:43 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > Oh! I was about to ask how much shared buffers your primary / standby
> > have.
> Default configurations, so 128MB each.

I thought that possibly initdb would detect less or something...


I assume this is 32bit? I did notice that a 32bit test took a lot longer
than a 64bit test. But didn't investigate so far.


> And I think I may actually have reproduced a variant of the issue!

Unfortunately I had not set up things in a way that the primary retains
the WAL, making it harder to compare whether it's the WAL that got
corrupted or whether it's a decoding bug.

I can however say that pg_waldump on the standby's pg_wal does also
fail. The failure as part of the backend is "invalid memory alloc
request size", whereas in pg_waldump I get the much more helpful:
pg_waldump: fatal: error in WAL record at 4/7F5C31C8: record with incorrect 
prev-link 416200FF/FF00 at 4/7F5C3200

In frontend code that allocation actually succeeds, because there is no
size check. But in backend code we run into the size check, and thus
don't even display a useful error.
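
(For reference, the backend-side check behind that error is roughly the
following guard from the memory-context code, which the frontend allocation
paths don't have:)

    #define MaxAllocSize    ((Size) 0x3fffffff)     /* 1 gigabyte - 1 */
    #define AllocSizeIsValid(size)  ((Size) (size) <= MaxAllocSize)

    if (!AllocSizeIsValid(size))
        elog(ERROR, "invalid memory alloc request size %zu", size);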

In 13 the header is validated before allocating space for the
record (except if the header is spread across pages); it seems
inadvisable to turn that around?


Greetings,

Andres Freund




RE: [bug?] Missed parallel safety checks, and wrong parallel safety

2021-04-28 Thread houzj.f...@fujitsu.com
> >>> I'm curious. The FmgrBuiltin struct includes the "strict" flag, so
> >>> that would "lock down the value" of the strict flag, wouldn't it?
> 
> >> It does, but that's much more directly a property of the function's C
> >> code than parallel-safety is.
> 
> > I'm not sure I agree with that, but I think having the "strict" flag
> > in FmgrBuiltin isn't that nice either.
> 
> Yeah, if we could readily do without it, we probably would.  But the function
> call mechanism itself is responsible for implementing strictness, so it *has* 
> to
> have that flag available.

So, if we do not want to lock down the parallel safety of built-in functions,
it seems we can try to fetch proparallel from pg_proc for built-in functions
in fmgr_info_cxt_security too. To avoid a recursive safety check when fetching
proparallel from pg_proc, we can add a global variable marking whether we are
in a recursive state, and skip the safety check while in that state. In this
approach, parallel safety is not locked down, and there are no new members in
FmgrBuiltin.

Attaching the patch about this approach [0001-approach-1].
Thoughts ?
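
(A rough sketch of the recursion-guard idea just described; the names here
are hypothetical and do not match the attached patches:)

    static bool in_builtin_parallel_check = false;

    static char
    builtin_function_parallel_safety(Oid funcid)
    {
        char        proparallel = PROPARALLEL_SAFE; /* assumed while recursing */

        if (!in_builtin_parallel_check)
        {
            in_builtin_parallel_check = true;
            PG_TRY();
            {
                /* this catalog lookup may itself reach fmgr_info and recurse */
                proparallel = func_parallel(funcid);
            }
            PG_FINALLY();
            {
                in_builtin_parallel_check = false;
            }
            PG_END_TRY();
        }
        return proparallel;
    }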

I also attached a patch for another approach [0001-approach-2] that adds
parallel safety to FmgrBuiltin, because this approach seems faster and
we can combine some bool members into a bit flag to avoid enlarging the
FmgrBuiltin array, though this approach does lock down the parallel safety
of built-in functions.

Best regards,
houzj


0002-fix-testcase-with-wrong-parallel-safety-flag.patch
Description: 0002-fix-testcase-with-wrong-parallel-safety-flag.patch


0001-approach-1-check-parallel-safety-in-fmgr_info_cxt_se.patch
Description:  0001-approach-1-check-parallel-safety-in-fmgr_info_cxt_se.patch


0001-approach-2-check-parallel-safety-in-fmgr_info_cxt_se.patch
Description:  0001-approach-2-check-parallel-safety-in-fmgr_info_cxt_se.patch


Re: [HACKERS] logical decoding of two-phase transactions

2021-04-28 Thread Ajin Cherian
I modified pgbench's "tpcb-like" builtin script as below to do two-phase
commits and then ran a 4-node cascaded replication setup.

"BEGIN;\n"
"UPDATE pgbench_accounts SET abalance = abalance + :delta
WHERE aid = :aid;\n"
"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
"UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE
tid = :tid;\n"
"UPDATE pgbench_branches SET bbalance = bbalance + :delta
WHERE bid = :bid;\n"
"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime)
VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
"PREPARE TRANSACTION ':aid:';\n"
"COMMIT PREPARED ':aid:';\n"

The tests ran fine and all 4 cascaded servers replicated the changes
correctly. All the subscriptions were configured with two_phase
enabled.

regards,
Ajin Cherian
Fujitsu Australia




Re: WIP: WAL prefetch (another approach)

2021-04-28 Thread Andres Freund
Hi,

On 2021-04-28 17:59:22 -0700, Andres Freund wrote:
> I can however say that pg_waldump on the standby's pg_wal does also
> fail. The failure as part of the backend is "invalid memory alloc
> request size", whereas in pg_waldump I get the much more helpful:
> pg_waldump: fatal: error in WAL record at 4/7F5C31C8: record with incorrect 
> prev-link 416200FF/FF00 at 4/7F5C3200

There's definitely something broken around continuation records, in
XLogFindNextRecord(). Which means that it's not the cause for the server
side issue, but obviously still not good.

The conversion of XLogFindNextRecord() to be state machine based
basically only works in a narrow set of circumstances. Whenever the end
of the first record read is on a different page than the start of the
record, we'll endlessly loop.

We'll go into XLogFindNextRecord(), and return until we've successfully
read the page header. Then we'll enter the second loop, which will try
to read until the end of the first record. But after returning, the first
loop will again ask for the page header.

Even if that's fixed, the second loop alone has the same problem: as
XLogBeginRead() is called unconditionally, we'll start reading the start of
the record, discover that it needs data on a second page, return, and
do the same thing again.

I think it needs something roughly like the attached.

Greetings,

Andres Freund
diff --git i/src/include/access/xlogreader.h w/src/include/access/xlogreader.h
index 3b8af31a8fe..82a80cf2bf5 100644
--- i/src/include/access/xlogreader.h
+++ w/src/include/access/xlogreader.h
@@ -297,6 +297,7 @@ struct XLogFindNextRecordState
 	XLogReaderState *reader_state;
 	XLogRecPtr		targetRecPtr;
 	XLogRecPtr		currRecPtr;
+	bool			found_start;
 };
 
 /* Report that data is available for decoding. */
diff --git i/src/backend/access/transam/xlogreader.c w/src/backend/access/transam/xlogreader.c
index 4277e92d7c9..935c841347f 100644
--- i/src/backend/access/transam/xlogreader.c
+++ w/src/backend/access/transam/xlogreader.c
@@ -868,7 +868,7 @@ XLogDecodeOneRecord(XLogReaderState *state, bool allow_oversized)
 /* validate record header if not yet */
 if (!state->record_verified && record_len >= SizeOfXLogRecord)
 {
-if (!ValidXLogRecordHeader(state, state->DecodeRecPtr,
+	if (!ValidXLogRecordHeader(state, state->DecodeRecPtr,
 			   state->PrevRecPtr, prec))
 		goto err;
 
@@ -1516,6 +1516,7 @@ InitXLogFindNextRecord(XLogReaderState *reader_state, XLogRecPtr start_ptr)
 	state->reader_state = reader_state;
 	state->targetRecPtr = start_ptr;
 	state->currRecPtr = start_ptr;
+	state->found_start = false;
 
 	return state;
 }
@@ -1545,7 +1546,7 @@ XLogFindNextRecord(XLogFindNextRecordState *state)
 	 * skip over potential continuation data, keeping in mind that it may span
 	 * multiple pages
 	 */
-	while (true)
+	while (!state->found_start)
 	{
 		XLogRecPtr	targetPagePtr;
 		int			targetRecOff;
@@ -1616,7 +1617,12 @@ XLogFindNextRecord(XLogFindNextRecordState *state)
 	 * because either we're at the first record after the beginning of a page
 	 * or we just jumped over the remaining data of a continuation.
 	 */
-	XLogBeginRead(state->reader_state, state->currRecPtr);
+	if (!state->found_start)
+	{
+		XLogBeginRead(state->reader_state, state->currRecPtr);
+		state->found_start = true;
+	}
+
 	while ((result = XLogReadRecord(state->reader_state, &record, &errormsg)) !=
 		   XLREAD_FAIL)
 	{


Re: [PATCH] We install pg_regress and isolationtester but not pg_isolation_regress

2021-04-28 Thread Michael Paquier
On Wed, Apr 28, 2021 at 12:44:45PM +0300, Aleksander Alekseev wrote:
> I just wanted to let you know that TimescaleDB uses
> pg_isolation_regress and occasionally there are reports from some
> suffering/puzzled users/developers, e.g. [1]. Not 100% sure if it
> makes investing the time into backpatching worth it. However if
> someone could do it, it would be nice.

FWIW, I am not really sure that this is important enough to justify a
back-patch, even it is true that there have been cases in the past
where extra binaries got added in minor releases.
--
Michael


signature.asc
Description: PGP signature


[BUG]"FailedAssertion" reported in lazy_scan_heap() when running logical replication

2021-04-28 Thread tanghy.f...@fujitsu.com
Hi

I met an assertion failure at the publisher in lazy_scan_heap() when running
logical replication in synchronous mode. Could someone please take a look at it?

Here's what I did to produce the problem.

First, use './configure --enable-cassert' to build the PG.
Then, I created multiple publications at the publisher and multiple
subscriptions at the subscriber.
Then I set the value of synchronous_standby_names and reloaded, to put them in
synchronous commit mode. After that, an assertion failed at the publisher when
I ran COMMIT and ROLLBACK transactions concurrently:

>TRAP: FailedAssertion("!all_visible_according_to_vm || 
>prunestate.all_visible", File: "vacuumlazy.c", Line: 1347, PID: 1274675)

BTW, in asynchronous mode the same problem can also happen, but at a low
frequency. (I tried many times, but it happened only twice.)
As for synchronous mode, it seems easier to reproduce the problem with
"autovacuum_naptime = 1".
Even then it can't be reproduced 100% of the time. (I tested it 5 times and
3 of them reproduced it.)

The script and the log are attached. The script takes about 6 minutes to run
(without hitting the problem) on my machine, and less if the server crashes.

Regards
Tang


pub.log
Description: pub.log


Re: [PATCH] We install pg_regress and isolationtester but not pg_isolation_regress

2021-04-28 Thread Tom Lane
Michael Paquier  writes:
> On Wed, Apr 28, 2021 at 12:44:45PM +0300, Aleksander Alekseev wrote:
>> I just wanted to let you know that TimescaleDB uses
>> pg_isolation_regress and occasionally there are reports from some
>> suffering/puzzled users/developers, e.g. [1]. Not 100% sure if it
>> makes investing the time into backpatching worth it. However if
>> someone could do it, it would be nice.

> FWIW, I am not really sure that this is important enough to justify a
> back-patch, even it is true that there have been cases in the past
> where extra binaries got added in minor releases.

Yeah, I think adding a binary in a minor release is a Big Deal to
packagers.  I doubt that the case here justifies that.

regards, tom lane




Re: [BUG]"FailedAssertion" reported in lazy_scan_heap() when running logical replication

2021-04-28 Thread Michael Paquier
On Thu, Apr 29, 2021 at 02:34:21AM +, tanghy.f...@fujitsu.com wrote:
> I met an assertion failure at the publisher in lazy_scan_heap() when
> synchronous running logical replication. Could someone please take a
> look at it?

This assertion is new as of 7136bf34.  Peter?
--
Michael


signature.asc
Description: PGP signature


Re: Error in libpq docs for target_session_attrs, prefer-standby mode

2021-04-28 Thread Michael Paquier
On Wed, Apr 28, 2021 at 01:32:05PM +0200, Laurenz Albe wrote:
> You are right, and the patch is good.

Thanks, fixed.
--
Michael


signature.asc
Description: PGP signature


Re: Replication slot stats misgivings

2021-04-28 Thread Amit Kapila
On Thu, Apr 29, 2021 at 4:58 AM Masahiko Sawada  wrote:
>
> On Thu, Apr 29, 2021 at 5:41 AM Tom Lane  wrote:
> >
> > It seems that the test case added by f5fc2f5b2 is still a bit
> > unstable, even after c64dcc7fe:
>
> Hmm, I don't see the exact cause yet but there are two possibilities:
> some transactions were really spilled,
>

This is the first test and it inserts just one small record, so how can it
lead to a spill of data? Do you mean to say that maybe some background
process has written some transaction which leads to a spill of data?

> and it showed the old stats due
> to losing the drop (and create) slot messages.
>

Yeah, something like this could happen. Another possibility here could
be that before the stats collector has processed the drop and create
messages, we have enquired about the stats, which leads to it giving us
the old stats. Note that we don't wait for the 'drop' or 'create' message
to be delivered, so there is a possibility of that. What do you think?

> For the former case, it
> seems to better to create the slot just before the insertion and
> setting logical_decoding_work_mem to the default (64MB). For the
> latter case, maybe we can use a different name slot than the name used
> in other tests?
>

How about doing both of the above suggestions? Alternatively, we can
wait for both the 'drop' and 'create' messages to be delivered, but that
might be overkill.


-- 
With Regards,
Amit Kapila.




Re: WIP: WAL prefetch (another approach)

2021-04-28 Thread Andres Freund
Hi,

On 2021-04-28 17:59:22 -0700, Andres Freund wrote:
> I can however say that pg_waldump on the standby's pg_wal does also
> fail. The failure as part of the backend is "invalid memory alloc
> request size", whereas in pg_waldump I get the much more helpful:
> pg_waldump: fatal: error in WAL record at 4/7F5C31C8: record with incorrect 
> prev-link 416200FF/FF00 at 4/7F5C3200
> 
> In frontend code that allocation actually succeeds, because there is no
> size check. But in backend code we run into the size check, and thus
> don't even display a useful error.
> 
> In 13 the header is validated before allocating space for the
> record(except if header is spread across pages) - it seems inadvisable
> to turn that around?

I was now able to reproduce the problem again, and I'm afraid that the
bug I hit is likely separate from Tom's. The allocation thing above is
the issue in my case:

The walsender connection ended (I restarted the primary), thus the
startup process switches to replaying locally. For some reason the end of the
WAL contains non-zero data (I think it's because walreceiver doesn't
zero out pages - that's bad!). Because the allocation happens before the
header is validated, we reproducibly end up in the mcxt.c ERROR path,
failing recovery.

To me it looks like a smaller version of the problem is present in < 14,
albeit only when the page header is at a record boundary. In that case
we don't validate the page header immediately, only once it's completely
read. But we do believe the total size, and try to allocate
that.

There's a really crufty escape hatch (from 70b4f82a4b) to that:

/*
 * Note that in much unlucky circumstances, the random data read from a
 * recycled segment can cause this routine to be called with a size
 * causing a hard failure at allocation.  For a standby, this would cause
 * the instance to stop suddenly with a hard failure, preventing it to
 * retry fetching WAL from one of its sources which could allow it to move
 * on with replay without a manual restart. If the data comes from a past
 * recycled segment and is still valid, then the allocation may succeed
 * but record checks are going to fail so this would be short-lived.  If
 * the allocation fails because of a memory shortage, then this is not a
 * hard failure either per the guarantee given by MCXT_ALLOC_NO_OOM.
 */
if (!AllocSizeIsValid(newSize))
return false;

but it looks to me like that's pretty much the wrong fix, at least in
the case where we've not yet validated the rest of the header. We don't
need to allocate all that data before we've read the rest of the
*fixed-size* header.
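
As a rough sketch of that ordering (illustrative only, not a patch: the
function names are borrowed from xlogreader.c, but record_start is just a
placeholder for wherever the record begins on the page, and the surrounding
buffer handling is elided):

XLogRecord	hdr;		/* fixed-size header, fits on the stack */

/* copy just SizeOfXLogRecord bytes from the page into a local header */
memcpy(&hdr, record_start, SizeOfXLogRecord);

/* validate the fixed-size header before trusting xl_tot_len at all */
if (!ValidXLogRecordHeader(state, state->DecodeRecPtr, state->PrevRecPtr, &hdr))
	goto err;

/* only now is xl_tot_len sane enough to size the full record buffer */
if (!allocate_recordbuf(state, hdr.xl_tot_len))
	goto err;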

It also seems to me that 70b4f82a4b should also have changed walsender
to pad out the received data to an 8KB boundary?

Greetings,

Andres Freund




Re: Replication slot stats misgivings

2021-04-28 Thread Tom Lane
Amit Kapila  writes:
> This is the first test and inserts just one small record, so how it
> can lead to spill of data. Do you mean to say that may be some
> background process has written some transaction which leads to a spill
> of data?

autovacuum, say?

> Yeah, something like this could happen. Another possibility here could
> be that before the stats collector has processed drop and create
> messages, we have enquired about the stats which lead to it giving us
> the old stats. Note, that we don't wait for 'drop' or 'create' message
> to be delivered. So, there is a possibility of the same. What do you
> think?

You should take a close look at the stats test in the main regression
tests.  We had to jump through *high* hoops to get that to be stable,
and yet it still fails semi-regularly.  This looks like pretty much the
same thing, and so I'm pessimistically inclined to guess that it will
never be entirely stable.

(At least not before the fabled stats collector rewrite, which may well
introduce some entirely new set of failure modes.)

Do we really need this test in this form?  Perhaps it could be converted
to a TAP test that's a bit more forgiving.

regards, tom lane




Re: WIP: WAL prefetch (another approach)

2021-04-28 Thread Tom Lane
Andres Freund  writes:
> I was now able to reproduce the problem again, and I'm afraid that the
> bug I hit is likely separate from Tom's.

Yeah, I think so --- the symptoms seem quite distinct.

My score so far today on the G4 is:

12 error-free regression test cycles on b3ee4c503

(plus one more with shared_buffers set to 16MB, on the strength
of your previous hunch --- didn't fail for me though)

HEAD failed on the second run with the same symptom as before:

2021-04-28 22:57:17.048 EDT [50479] FATAL:  inconsistent page found, rel 1663/58183/69545, forknum 0, blkno 696
2021-04-28 22:57:17.048 EDT [50479] CONTEXT:  WAL redo at 4/B72D408 for Heap/INSERT: off 77 flags 0x00; blkref #0: rel 1663/58183/69545, blk 696 FPW

This seems to me to be pretty strong evidence that I'm seeing *something*
real.  I'm currently trying to isolate a specific commit to pin it on.
A straight "git bisect" isn't going to work because so many people had
broken so many different things right around that date :-(, so it may
take awhile to get a good answer.

regards, tom lane




Re: Replication slot stats misgivings

2021-04-28 Thread Andres Freund
On 2021-04-28 23:20:00 -0400, Tom Lane wrote:
> (At least not before the fabled stats collector rewrite, which may well
> introduce some entirely new set of failure modes.)

FWIW, I added a function that forces a flush there. That can be done
synchronously and the underlying functionality needs to exist anyway to
deal with backend exit. Makes it a *lot* easier to write tests for stats
related things...




Re: Small issues with CREATE TABLE COMPRESSION

2021-04-28 Thread Justin Pryzby
On Tue, Apr 27, 2021 at 03:22:25PM +0900, Michael Paquier wrote:
> Hi all,
> 
> I have been looking at and testing the patch set for CREATE TABLE
> COMPRESSION, and spotted a couple of things in parallel of some work
> done by Jacob (added in CC).
> 
> The behavior around CREATE TABLE AS and matviews is a bit confusing,
> and not documented.  First, at the grammar level, it is not possible
> to specify which compression option is used per column when creating
> the relation.  So, all the relation columns would just set a column's
> compression to be default_toast_compression for all the toastable
> columns of the relation.  That's not enforceable at column level when
> the relation is created, except with a follow-up ALTER TABLE.  That's
> similar to STORAGE when it comes to matviews, but these are at least
> documented.
> 
> And so, ALTER MATERIALIZED VIEW supports SET COMPRESSION but this is
> not mentioned in its docs:
> https://www.postgresql.org/docs/devel/sql-altermaterializedview.html
>
> psql could have tab completion support for that.

Actually ALTER matview ALTER col has no tab completion at all, right ?

> Now, we don't really document any of that, and the per-column
> compression value would be set to default_toast_compression while the
> stored values may use a mix of the compression methods, depending on
> where the toasted values come from.  If this behavior is intended, this
> makes me wonder in what the possibility to set the compression for a
> materialized view column is useful for except for a logical
> dump/restore?  As of HEAD we'd just insert the toasted value from the
> origin as-is so the compression of the column has no effect at all.

That may be true if the mat view is trivial, but not true if it has
expressions.  The mat view column may be built on multiple table columns, or be
of a different type than the columns it's built on top of, so the relationship
may not be so direct.

> Another thing here is the inconsistency that this brings with pg_dump.
> For example, as the dumped values are decompressed, we could have
> values compressed with pglz at the origin, with a column using lz4
> within its definition that would make everything compressed with lz4
> once the values are restored.  This choice may be fine, but I think
> that it would be good to document all that.  That would be less
> surprising to the user.

Can you suggest what we'd say, or where?  I think the behavior is simply that
pg_dump shouldn't lose the user's compression setting.

The setting itself applies to "future" data, and the only way to guarantee
which compression types are in use is a VACUUM FULL/CLUSTER or a pg_dump
restore.

> Similarly, we may want to document that COMPRESSION does not trigger a
> table rewrite, but that it is effective only for the new toast values
> inserted if a tuple is rebuilt and rewritten?

Good point.  I started with this.

diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index 39927be41e..8cceea41d0 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -391,7 +391,21 @@ WITH ( MODULUS numeric_literal, REM
 
 
  
-  This sets the compression method for a column.  The supported compression
+  This sets the compression method to be used for data inserted into a column.
+
+  This does not cause the table to be rewritten, so existing data may still
+  be compressed with other compression methods.  If the table is rewritten with
+  VACUUM FULL or CLUSTER, or restored
+  with pg_restore, then all tuples are rewritten
+  with the configured compression methods.
+
+  Also, note that when data is inserted from another relation (for example,
+  by INSERT ... SELECT), tuples from the source data are
+  not necessarily detoasted, and any previously compressed data is retained
+  with its existing compression method, rather than recompressing with the
+  compression methods of the target columns.
+
+  The supported compression
   methods are pglz and lz4.
   lz4 is available only if --with-lz4
   was used when building PostgreSQL.




Re: Use simplehash.h instead of dynahash in SMgr

2021-04-28 Thread David Rowley
On Thu, 29 Apr 2021 at 12:30, Yura Sokolov  wrote:
>
> David Rowley писал 2021-04-29 02:51:
> > Another line of thought for making it go faster would be to do
> > something like get rid of the hash status field from SMgrEntry.  That
> > could be either coded into a single bit we'd borrow from the hash
> > value, or it could be coded into the least significant bit of the data
> > field.  A pointer to palloc'd memory should always be MAXALIGNed,
> > which means at least the lower two bits are always zero.  We'd just
> > need to make sure and do something like "data & ~((uintptr_t) 3)" to
> > trim off the hash status bits before dereferencing the pointer.  That
> > would make the SMgrEntry 12 bytes on a 64-bit machine.  However, it
> > would also mean that some entries would span 2 cache lines, which
> > might affect performance a bit.
>
> Then data pointer will be itself unaligned to 8 bytes. While x86 is
> mostly indifferent to this, doubtfully this memory economy will pay
> off.

Actually, I didn't think very hard about that. The struct would still
be 16 bytes and just have padding so the data pointer was aligned to 8
bytes (assuming a 64-bit machine).
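
For what it's worth, here's a small compilable sketch of the bit-stealing
mechanics being discussed (the struct and helper names are made up for
illustration and are not from the actual SMgr patch):

#include <stdint.h>

#define ENTRY_STATUS_MASK	((uintptr_t) 3)

typedef struct TaggedEntry
{
	uint32_t	hash;	/* cached hash value */
	uintptr_t	data;	/* MAXALIGNed pointer | 2-bit status */
} TaggedEntry;

static inline void
entry_set(TaggedEntry *entry, void *ptr, unsigned status)
{
	/* palloc'd memory is at least 8-byte aligned, so the low bits are free */
	entry->data = (uintptr_t) ptr | ((uintptr_t) status & ENTRY_STATUS_MASK);
}

static inline void *
entry_pointer(const TaggedEntry *entry)
{
	/* trim off the status bits before dereferencing */
	return (void *) (entry->data & ~ENTRY_STATUS_MASK);
}

static inline unsigned
entry_status(const TaggedEntry *entry)
{
	return (unsigned) (entry->data & ENTRY_STATUS_MASK);
}

And as noted above, padding keeps sizeof(TaggedEntry) at 16 bytes on a
64-bit machine, so the gain is in freeing the status byte rather than in
shrinking the struct.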

David




Re: [BUG]"FailedAssertion" reported in lazy_scan_heap() when running logical replication

2021-04-28 Thread Peter Geoghegan
On Wed, Apr 28, 2021 at 7:34 PM tanghy.f...@fujitsu.com
 wrote:
> >TRAP: FailedAssertion("!all_visible_according_to_vm || 
> >prunestate.all_visible", File: "vacuumlazy.c", Line: 1347, PID: 1274675)
>
> BTW, in asynchronous mode, the same problem can also happen but in a low 
> frequency.(I tried many times, but the problem happened only 2 times)
> As for synchronous mode, I found it seems easier to reproduce the problem 
> with setting "autovacuum_naptime = 1".
> But it still can't be 100% to reproduced it. (I tested it 5 times, 3 of them 
> reproduced it.)

Is setting all_visible_according_to_vm false as below enough to avoid
the assertion failure?

diff --git a/src/backend/access/heap/vacuumlazy.c
b/src/backend/access/heap/vacuumlazy.c
index c3fc12d76c..76c17e063e 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -1146,6 +1146,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams
*params, bool aggressive)
 {
 ReleaseBuffer(vmbuffer);
 vmbuffer = InvalidBuffer;
+all_visible_according_to_vm = false;
 }

 /* Remove the collected garbage tuples from table and indexes */


-- 
Peter Geoghegan




Re: [PATCH] force_parallel_mode and GUC categories

2021-04-28 Thread Justin Pryzby
On Fri, Apr 09, 2021 at 10:50:53AM +0900, Michael Paquier wrote:
> -   {"track_commit_timestamp", PGC_POSTMASTER, REPLICATION,
> +   {"track_commit_timestamp", PGC_POSTMASTER, REPLICATION_SENDING,
> I can get behind this change for clarity where it gets actively used.

I'm not sure what you meant?

...but, I realized just now that *zero* other GUCs use "REPLICATION".
And the documentation puts it in 20.6.1. Sending Servers,
so it still seems correct to me to move this one, too.

https://www.postgresql.org/docs/devel/runtime-config-replication.html

Then, I wonder if REPLICATION should be removed from guc_tables.h...

-- 
Justin




Re: Replication slot stats misgivings

2021-04-28 Thread Amit Kapila
On Thu, Apr 29, 2021 at 8:50 AM Tom Lane  wrote:
>
> Amit Kapila  writes:
> > This is the first test and inserts just one small record, so how it
> > can lead to spill of data. Do you mean to say that may be some
> > background process has written some transaction which leads to a spill
> > of data?
>
> autovacuum, say?
>
> > Yeah, something like this could happen. Another possibility here could
> > be that before the stats collector has processed drop and create
> > messages, we have enquired about the stats which lead to it giving us
> > the old stats. Note, that we don't wait for 'drop' or 'create' message
> > to be delivered. So, there is a possibility of the same. What do you
> > think?
>
> You should take a close look at the stats test in the main regression
> tests.  We had to jump through *high* hoops to get that to be stable,
> and yet it still fails semi-regularly.  This looks like pretty much the
> same thing, and so I'm pessimistically inclined to guess that it will
> never be entirely stable.
>

True, it is possible that we can't make it entirely stable, but I would
like to try some more before giving up on this. Otherwise, I guess the
other possibility is to remove some of the recently added tests, or
perhaps change them to be more forgiving. For example, we can change
the currently failing test to not check the 'spill*' counts and rely on
just the 'total*' counts, which would work even in the scenarios we
discussed for this failure, but it would reduce the completeness of the
test case.

> (At least not before the fabled stats collector rewrite, which may well
> introduce some entirely new set of failure modes.)
>
> Do we really need this test in this form?  Perhaps it could be converted
> to a TAP test that's a bit more forgiving.
>

We have a TAP test for slot stats, but there we are checking some
scenarios across a restart. We can surely move these tests there as
well, but it is not apparent to me what difference that would make.

-- 
With Regards,
Amit Kapila.




Re: WIP: WAL prefetch (another approach)

2021-04-28 Thread Thomas Munro
On Thu, Apr 29, 2021 at 3:14 PM Andres Freund  wrote:
> To me it looks like a smaller version of the problem is present in < 14,
> albeit only when the page header is at a record boundary. In that case
> we don't validate the page header immediately, only once it's completely
> read. But we do believe the total size, and try to allocate
> that.
>
> There's a really crufty escape hatch (from 70b4f82a4b) to that:

Right, I made that problem worse, and that could probably be changed
to be no worse than 13 by reordering those operations.

PS Sorry for my intermittent/slow responses on this thread this week,
as I'm mostly away from the keyboard due to personal commitments.
I'll be back in the saddle next week to tidy this up, most likely by
reverting.  The main thought I've been having about this whole area is
that, aside from the lack of general testing of recovery, which we
should definitely address[1], what it really needs is a decent test
harness to drive it through all interesting scenarios and states at a
lower level, independently.

[1] 
https://www.postgresql.org/message-id/flat/CA%2BhUKGKpRWQ9SxdxxDmTBCJoR0YnFpMBe7kyzY8SUQk%2BHeskxg%40mail.gmail.com




Re: Patch to allow pg_filedump to support reading of pg_filenode.map

2021-04-28 Thread Justin Pryzby
This is separate from the postgresql server repo.
https://git.postgresql.org/gitweb/?p=pg_filedump.git

+#define RELMAPPER_FILEMAGIC   0x592717
+char magic_buffer[8];

...

+  if ( (int) magic_buffer & RELMAPPER_FILEMAGIC ) {

This is doing bitwise arithmetic on a pointer, which seems badly wrong.
I think it breaks normal use of pg_filedump - unless you happen to get a
magic_buffer without those bits set.  The segfault seems to confirm that, as
does gcc:

pg_filedump.c:2041:8: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
 2041 |   if ( (int) magic_buffer & RELMAPPER_FILEMAGIC ) {

I think it probably means to do a memcmp() instead?
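
Something along these lines, perhaps (an untested sketch; it assumes the
magic number is stored as a 4-byte integer in host byte order at the start
of the file, which is how the relmapper writes it):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define RELMAPPER_FILEMAGIC 0x592717

/* does the buffer start with the relmapper magic number? */
static bool
is_relmapper_file(const char *magic_buffer)
{
	int32_t		expected = RELMAPPER_FILEMAGIC;

	return memcmp(magic_buffer, &expected, sizeof(expected)) == 0;
}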

-- 
Justin




Re: pg_hba.conf.sample wording improvement

2021-04-28 Thread Peter Eisentraut

On 28.04.21 16:20, Tom Lane wrote:

Another thought is to switch the phrase order:

+# - "local" is a Unix-domain socket
+# - "host" is a TCP/IP socket (encrypted or not)
+# - "hostssl" is a TCP/IP socket that is SSL-encrypted
+# - "hostnossl" is a TCP/IP socket that is not SSL-encrypted
+# - "hostgssenc" is a TCP/IP socket that is GSSAPI-encrypted
+# - "hostnogssenc" is a TCP/IP socket that is not GSSAPI-encrypted

I'm not wedded to that idea, but it seems to help reduce random
variations between the wordings of these lines.


done that way




Re: Replication slot stats misgivings

2021-04-28 Thread Amit Kapila
On Wed, Apr 28, 2021 at 7:43 PM Masahiko Sawada  wrote:
>
> On Wed, Apr 28, 2021 at 3:25 PM Amit Kapila  wrote:
> >
>
> > I am not sure if any of these alternatives are a good idea. What do
> > you think? Do you have any other ideas for this?
>
> I've been considering some ideas but don't come up with a good one
> yet. It’s just an idea and not tested but how about having
> CreateDecodingContext() register before_shmem_exit() callback with the
> decoding context to ensure that we send slot stats even on
> interruption. And FreeDecodingContext() cancels the callback.
>

Is it a good idea to send stats while exiting and rely on that? I
think before_shmem_exit is mostly used for cleanup purposes, so I'm not
sure we can rely on it here. I think we can't be sure that we will send
all stats in all cases, so maybe Vignesh's patch is sufficient: it avoids
losing the stats in the cases where we would have sent a large amount of
data.

-- 
With Regards,
Amit Kapila.




Re: pg_hba.conf.sample wording improvement

2021-04-28 Thread Peter Eisentraut

On 28.04.21 16:09, Alvaro Herrera wrote:

Looking at it now, I wonder how well do the "hostno" options work.  If I
say "hostnogssenc", is an SSL-encrypted socket good?  If I say
"hostnossl", is a GSS-encrypted socket good?  If so, how does that make
sense?


I think for example if you want to enforce SSL connections, then writing 
"hostnossl ... reject" would be sensible.  That would also reject 
GSS-encrypted connections, but that would be what you want in that scenario.






Re: Unresolved repliaction hang and stop problem.

2021-04-28 Thread Amit Kapila
On Wed, Apr 28, 2021 at 7:36 PM Alvaro Herrera  wrote:
>
> On 2021-Apr-28, Amit Kapila wrote:
>
> > On Wed, Apr 28, 2021 at 6:48 AM Alvaro Herrera  
> > wrote:
>
> > > Hmm ... On what does it depend (other than plain git conflicts, which
> > > are aplenty)?  On a quick look to the commit, it's clear that we need to
> > > be careful in order not to cause an ABI break, but that doesn't seem
> > > impossible to solve, but I'm wondering if there is more to it than that.
> >
> > As mentioned in the commit message, we need another commit [1] change
> > to make this work.
> >
> > [1] - 
> > https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c55040ccd0
>
> Oh, yeah, that looks tougher.  (Still not impossible: it adds a new WAL
> message type, but we have added such on a minor release before.)
>

Yeah, we can try to make it possible if it is really a pressing issue
but I guess even in that case it is better to do it after we release
PG14 so that it can get some more testing.

> ... It's strange that replication worked for them on pg10 though and
> broke on 13.  What did we change anything to make it so?
>

No idea, but if the other person can share the exact test case that he
sees working fine on PG10 but not on PG13, it might be a bit easier to
investigate.

-- 
With Regards,
Amit Kapila.




Re: Forget close an open relation in ReorderBufferProcessTXN()

2021-04-28 Thread Amit Kapila
On Wed, Apr 28, 2021 at 5:36 PM osumi.takami...@fujitsu.com
 wrote:
>
> On Monday, April 26, 2021 2:05 PM Amit Kapila  wrote:
> > On Fri, Apr 23, 2021 at 8:03 PM osumi.takami...@fujitsu.com
> >  wrote:
> > I think we are allowed to decode the operations on user catalog tables
> > because we are using RelationIsLogicallyLogged() instead of
> > RelationIsAccessibleInLogicalDecoding() in ReorderBufferProcessTXN().
> > Based on this discussion, I think we should not be allowing decoding of
> > operations on user catalog tables, so we should use
> > RelationIsAccessibleInLogicalDecoding to skip such ops in
> > ReorderBufferProcessTXN(). Am, I missing something?
> >
> > Can you please clarify?
> I don't understand that point, either.
>
> I read the context where the user_catalog_table was introduced - [1].
> There, I couldn't find any discussion if we should skip decode operations
> on that kind of tables or not. Accordingly, we just did not conclude it, I 
> suppose.
>
> What surprised me a bit is to decode operations of system catalog table are 
> considered like [2]
> somehow at the time. I cannot find any concrete description of such use cases 
> in the thread, though.
>
> Anyway, I felt disallowing decoding of operations on user catalog tables
> doesn't spoil the feature's purpose. So, I'm OK to do so. What do you think ?
>

I am not so sure about it because I think we don't have any example of
user_catalog_tables in the core code. This is the reason I was
looking towards Andres to clarify this. Right now, if the user
performs TRUNCATE on a user_catalog_table in synchronous mode, then it
will hang if the decoding plugin takes even a share lock on it. The
main reason is that we allow decoding of the TRUNCATE operation for
user_catalog_tables. I think even if we want to allow decoding of
other operations on a user_catalog_table, the decoding of TRUNCATE
should be prohibited; but maybe we shouldn't allow decoding of any
operation on such tables, as we don't do it for system catalog tables.

-- 
With Regards,
Amit Kapila.




Re: Replication slot stats misgivings

2021-04-28 Thread Masahiko Sawada
On Thu, Apr 29, 2021 at 11:55 AM Amit Kapila  wrote:
>
> On Thu, Apr 29, 2021 at 4:58 AM Masahiko Sawada  wrote:
> >
> > On Thu, Apr 29, 2021 at 5:41 AM Tom Lane  wrote:
> > >
> > > It seems that the test case added by f5fc2f5b2 is still a bit
> > > unstable, even after c64dcc7fe:
> >
> > Hmm, I don't see the exact cause yet but there are two possibilities:
> > some transactions were really spilled,
> >
>
> This is the first test and inserts just one small record, so how it
> can lead to spill of data. Do you mean to say that may be some
> background process has written some transaction which leads to a spill
> of data?

Not sure, but I thought that the logical decoding started decoding
from a relatively old point for some reason and decoded incomplete
transactions that weren't shown in the result.

>
> > and it showed the old stats due
> > to losing the drop (and create) slot messages.
> >
>
> Yeah, something like this could happen. Another possibility here could
> be that before the stats collector has processed drop and create
> messages, we have enquired about the stats which lead to it giving us
> the old stats. Note, that we don't wait for 'drop' or 'create' message
> to be delivered. So, there is a possibility of the same. What do you
> think?

Yeah, that could happen even if no message got dropped.

>
> > For the former case, it
> > seems to better to create the slot just before the insertion and
> > setting logical_decoding_work_mem to the default (64MB). For the
> > latter case, maybe we can use a different name slot than the name used
> > in other tests?
> >
>
> How about doing both of the above suggestions? Alternatively, we can
> wait for both 'drop' and 'create' message to be delivered but that
> might be overkill.

Agreed. Attached is a patch doing both things.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


fix_stats_test.patch
Description: Binary data


RE: [BUG]"FailedAssertion" reported in lazy_scan_heap() when running logical replication

2021-04-28 Thread tanghy.f...@fujitsu.com
On Thursday, April 29, 2021 1:22 PM, Peter Geoghegan  wrote

>Is setting all_visible_according_to_vm false as below enough to avoid
>the assertion failure?
>
>diff --git a/src/backend/access/heap/vacuumlazy.c
>b/src/backend/access/heap/vacuumlazy.c
>index c3fc12d76c..76c17e063e 100644
>--- a/src/backend/access/heap/vacuumlazy.c
>+++ b/src/backend/access/heap/vacuumlazy.c
>@@ -1146,6 +1146,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams
>*params, bool aggressive)
> {
> ReleaseBuffer(vmbuffer);
> vmbuffer = InvalidBuffer;
>+all_visible_according_to_vm = false;
> }
>
> /* Remove the collected garbage tuples from table and indexes */

Thanks for your reply.
I tried your patch but the problem doesn't seem to be fixed.

Regards,
Tang


Re: [BUG] "FailedAssertion" reported when streaming in logical replication

2021-04-28 Thread Dilip Kumar
On Wed, Apr 28, 2021 at 1:02 PM Dilip Kumar  wrote:
>
> On Wed, Apr 28, 2021 at 12:25 PM tanghy.f...@fujitsu.com
>  wrote:
> >
> > > I have modified the patch based on the above comments.
> >
> > Thanks for your patch.
> > I tested again after applying your patch and the problem is fixed.
>
> Thanks for confirming.

I tried to think about how to write a test case for this scenario, but
I think it will not be possible to generate an automated test case for
this.  Basically, we need 2 concurrent transactions: one transaction
that has processed only one change, i.e. XLOG_HEAP2_NEW_CID, and
another transaction that overflows the logical decoding work mem, so
that we select the wrong transaction, the one that doesn't have the
base snapshot.  But it is hard to see how to control things so that the
transaction performing the DDL writes just the XLOG_HEAP2_NEW_CID WAL
and, before it writes any other WAL, we get the WAL from the other
transaction that overflows the buffer.
-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Replication slot stats misgivings

2021-04-28 Thread Amit Kapila
On Thu, Apr 29, 2021 at 11:14 AM Masahiko Sawada  wrote:
>
> >
> > How about doing both of the above suggestions? Alternatively, we can
> > wait for both 'drop' and 'create' message to be delivered but that
> > might be overkill.
>
> Agreed. Attached the patch doing both things.
>

Thanks, the patch LGTM. I'll wait for a day before committing to see
if anyone has better ideas.

-- 
With Regards,
Amit Kapila.




RE: [BUG] "FailedAssertion" reported when streaming in logical replication

2021-04-28 Thread tanghy.f...@fujitsu.com

On Thursday, April 29, 2021 3:18 PM, Dilip Kumar  wrote

>I tried to think about how to write a test case for this scenario, but
>I think it will not be possible to generate an automated test case for this.  
>Basically, we need 2 concurrent transactions and out of that,
>we need one transaction which just has processed only one change i.e
>XLOG_HEAP2_NEW_CID and another transaction should overflow the logical
>decoding work mem, so that we select the wrong transaction which
>doesn't have the base snapshot.  But how to control that the
>transaction which is performing the DDL just write the
>XLOG_HEAP2_NEW_CID wal and before it writes any other WAL we should
>get the WAl from other transaction which overflows the buffer.

Thanks for your update.
Actually, I tried to make an automated test for the problem, too, but made no
progress on it.
I agree with your opinion that it will "not be possible to generate an
automated test case for this".

If anyone figures out a good solution for automating a test of this case,
please be kind enough to share it with us. Thanks.

Regards,
Tang