Re: Testing autovacuum wraparound (including failsafe)

2021-05-18 Thread Masahiko Sawada
On Tue, May 18, 2021 at 2:46 PM Masahiko Sawada  wrote:
>
> On Tue, May 18, 2021 at 2:42 PM Peter Geoghegan  wrote:
> >
> > On Mon, May 17, 2021 at 10:29 PM Masahiko Sawada  
> > wrote:
> > > +1 to fix this. Are you already working on fixing this? If not, I'll
> > > post a patch.
> >
> > I posted a patch recently (last Thursday my time). Perhaps you can review 
> > it?
>
> Oh, I missed that the patch includes that fix. I'll review the patch.
>

I've reviewed the patch. Here is one comment:

if (vacrel->num_index_scans == 0 &&
-   vacrel->rel_pages <= FAILSAFE_MIN_PAGES)
+   vacrel->rel_pages <= FAILSAFE_EVERY_PAGES)
return false;

Because of the "vacrel->num_index_scans == 0" condition, we could
enter failsafe mode even for a table smaller than 4GB if
lazy_check_wraparound_failsafe() is reached after one or more index
scans have already been executed. By contrast, a vacuum of a table
that is smaller than 4GB and has no indexes never enters failsafe
mode. I think we can remove this condition, since I don't see a
reason why such tables should be exempt from failsafe mode only until
the first index scan has run. What do you think?
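
To make that concrete, the check would then reduce to something like
this (just a sketch of the suggestion, not a committed change):

/*
 * Sketch only: drop the num_index_scans condition, so tables at or
 * below the page threshold bypass the failsafe check regardless of
 * how many index scans have already been performed.
 */
if (vacrel->rel_pages <= FAILSAFE_EVERY_PAGES)
    return false;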

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




RE: Forget close an open relation in ReorderBufferProcessTXN()

2021-05-18 Thread osumi.takami...@fujitsu.com
On Monday, May 17, 2021 6:45 PM Amit Kapila  wrote:
> On Fri, May 14, 2021 at 2:20 PM osumi.takami...@fujitsu.com
>  wrote:
> >
> > On Thursday, May 13, 2021 7:21 PM Amit Kapila
>  wrote:
> > > I don't think we can reproduce it with core plugins as they don't
> > > lock user catalog tables.
> > OK. My current understanding about how the deadlock happens is below.
> >
> > 1. TRUNCATE command is performed on user_catalog_table.
> > 2. TRUNCATE command locks the table and index with ACCESS
> EXCLUSIVE LOCK.
> > 3. TRUNCATE waits for the subscriber's synchronization
> > when synchronous_standby_names is set.
> > 4. Here, the walsender stops, *if* it tries to acquire a lock on the
> user_catalog_table
> > because the table where it wants to see is locked by the
> TRUNCATE already.
> >
> > If this is right,
> >
> 
> Yeah, the above steps are correct, so if we take a lock on user_catalog_table
> when walsender is processing the WAL, it would lead to a problem.
> 
> > we need to go back to a little bit higher level discussion, since
> > whether we should hack any plugin to simulate the deadlock caused by
> > user_catalog_table reference with locking depends on the assumption if
> the plugin takes a lock on the user_catalog_table or not.
> > In other words, if the plugin does read only access to that type of
> > table with no lock (by RelationIdGetRelation for example ?), the
> > deadlock concern disappears and we don't need to add anything to plugin
> sides, IIUC.
> >
> 
> True, if the plugin doesn't acquire any lock on user_catalog_table, then it is
> fine but we don't prohibit plugins to acquire locks on user_catalog_tables.
> This is similar to system catalogs, the plugins and decoding code do acquire
> lock on those.
Thanks for sharing this. I'll take into account the idea
that a plugin can take a lock on a user_catalog_table.


> > Here, we haven't gotten any response about whether output plugin takes
> > (should take) the lock on the user_catalog_table. But, I would like to
> > make a consensus about this point before the implementation.
> >
> > By the way, Amit-san already mentioned the main reason of this is that
> > we allow decoding of TRUNCATE operation for user_catalog_table in
> synchronous mode.
> > The choices are provided by Amit-san already in the past email in [1].
> > (1) disallow decoding of TRUNCATE operation for user_catalog_tables
> > (2) disallow decoding of any operation for user_catalog_tables like
> > system catalog tables
> >
> > Yet, I'm not sure if either option solves the deadlock concern completely.
> > If application takes an ACCESS EXCLUSIVE lock by LOCK command (not
> by
> > TRUNCATE !) on the user_catalog_table in a transaction, and if the
> > plugin tries to take a lock on it, I think the deadlock happens. Of
> > course, having a consensus that the plugin takes no lock at all would
> remove this concern, though.
> >
> 
> This is true for system catalogs as well. See the similar report [1]
> 
> > Like this, I'd like to discuss those two items in question together at 
> > first.
> > * the plugin should take a lock on user_catalog_table or not
> > * the range of decoding related to user_catalog_table
> >
> > To me, taking no lock on the user_catalog_table from plugin is fine
> >
> 
> We allow taking locks on system catalogs, so why prohibit
> user_catalog_tables? However, I agree that if we want plugins to acquire the
> lock on user_catalog_tables then we should either prohibit decoding of such
> relations or do something else to avoid deadlock hazards.
OK.

Although we have not yet concluded what range of logical decoding to
disallow for user_catalog_tables (only the TRUNCATE command, or all
operations on that type of table), I'm worried that disallowing
logical decoding of user_catalog_tables could still produce the
deadlock, because disabling the decoding by itself does not affect the
lock taken by the TRUNCATE command. What I have in mind is the example
below.

(1) plugin (e.g. pgoutput) is designed to take a lock on user_catalog_table.
(2) logical replication is set up in synchronous mode.
(3) TRUNCATE command takes an access exclusive lock on the user_catalog_table.
(4) This time, we don't do anything for the TRUNCATE decoding.
(5) The plugin tries to take a lock on the truncated table,
but it can't because of the lock held by the TRUNCATE command.

I'm not sure whether the place where the plugin takes the lock would be
in truncate_cb or somewhere else not directly related to decoding of
the user_catalog_table itself, so I might be wrong. However, in this
case, the solution would be not to disable the decoding of
user_catalog_tables but to prohibit the TRUNCATE command on
user_catalog_tables in synchronous mode. If this is true, I need to
extend an output plugin to simulate the deadlock first, and then
remove it by fixing the TRUNCATE side. Thoughts?
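
As a concrete illustration of step (5), the "extend an output plugin"
part could look roughly like the following. This is a hypothetical
sketch, not code from any existing plugin: truncate_cb is used here
only as an example of a decoding callback, and test_user_catalog_oid
is an assumed placeholder for the OID of a table created with
user_catalog_table = true.

#include "postgres.h"
#include "access/table.h"
#include "replication/logical.h"

/* Hypothetical: OID of the user_catalog_table, set at plugin startup. */
static Oid  test_user_catalog_oid = InvalidOid;

/*
 * Sketch of a callback that deliberately takes a lock on a
 * user_catalog_table.  With synchronous replication set up, this call
 * blocks if TRUNCATE already holds AccessExclusiveLock on the table,
 * reproducing the deadlock described above.
 */
static void
test_truncate_cb(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
                 int nrelations, Relation relations[],
                 ReorderBufferChange *change)
{
    Relation    rel = table_open(test_user_catalog_oid, AccessShareLock);

    /* ... read whatever metadata the plugin needs from the table ... */

    table_close(rel, AccessShareLock);
}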


Best Regards,
Takamichi Osumi



Re: [PATCH v3 1/1] Fix detection of preadv/pwritev support for OSX.

2021-05-18 Thread Sandeep Thakkar
Hi,

I see this issue persists when I compile PG v14 beta1 on an Apple M1 Mac
using the macOS 11.1 SDK. Even though the build didn't fail, executing
initdb on macOS 10.15 failed with the same error. Here is a snippet of
the build log:

--

> gcc -Wall -Wmissing-prototypes -Wpointer-arith
> -Wdeclaration-after-statement -Werror=vla -Wendif-labels
> -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv
> -Wno-unused-command-line-argument -g -isysroot
> /Library/Developer/CommandLineTools/SDKs/MacOSX11.1.sdk
> -mmacosx-version-min=10.14 -arch x86_64 -arch arm64 -O2 -I. -I.
> -I../../../src/include -I/opt/local/Current/include -isysroot
> /Library/Developer/CommandLineTools/SDKs/MacOSX11.1.sdk
>  -I/opt/local/20210406/include/libxml2
>  -I/opt/local/Current/include/libxml2 -I/opt/local/Current/include
> -I/opt/local/Current/include/openssl/  -c -o backup_manifest.o
> backup_manifest.c
> fd.c:3740:10: warning: 'pwritev' is only available on macOS 11.0 or newer
> [-Wunguarded-availability-new]
> part = pg_pwritev(fd, iov, iovcnt, offset);
>^~
> ../../../../src/include/port/pg_iovec.h:49:20: note: expanded from macro
> 'pg_pwritev'
> #define pg_pwritev pwritev
>^~~
> /Library/Developer/CommandLineTools/SDKs/MacOSX11.1.sdk/usr/include/sys/uio.h:104:9:
> note: 'pwritev' has been marked as being introduced in macOS 11.0 here, but
> the deployment target is macOS 10.14.0
> ssize_t pwritev(int, const struct iovec *, int, off_t)
> __DARWIN_NOCANCEL(pwritev) __API_AVAILABLE(macos(11.0), ios(14.0),
> watchos(7.0), tvos(14.0));
> ^
> fd.c:3740:10: note: enclose 'pwritev' in a __builtin_available check to
> silence this warning
> part = pg_pwritev(fd, iov, iovcnt, offset);
>^~
> ../../../../src/include/port/pg_iovec.h:49:20: note: expanded from macro
> 'pg_pwritev'
> #define pg_pwritev pwritev

--

initdb failure:
--

> The database cluster will be initialized with locales
>   COLLATE:  C
>   CTYPE:UTF-8
>   MESSAGES: C
>   MONETARY: C
>   NUMERIC:  C
>   TIME: C
> The default database encoding has accordingly been set to "UTF8".
> initdb: could not find suitable text search configuration for locale
> "UTF-8"
> The default text search configuration will be set to "simple".
> Data page checksums are disabled.
> creating directory /tmp/data ... ok
> creating subdirectories ... ok
> selecting dynamic shared memory implementation ... posix
> selecting default max_connections ... 100
> selecting default shared_buffers ... 128MB
> selecting default time zone ... Asia/Kolkata
> creating configuration files ... ok
> running bootstrap script ... dyld: lazy symbol binding failed: Symbol not
> found: _pwritev
>   Referenced from: /Library/PostgreSQL/14/bin/postgres
>   Expected in: /usr/lib/libSystem.B.dylib
> dyld: Symbol not found: _pwritev
>   Referenced from: /Library/PostgreSQL/14/bin/postgres
>   Expected in: /usr/lib/libSystem.B.dylib
> child process was terminated by signal 6: Abort trap: 6
> initdb: removing data directory "/tmp/data"

--
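
For reference, the runtime guard that the clang note points at looks
roughly like the following. This is only an illustration of the
__builtin_available mechanism the warning mentions, not the
configure-side fix being discussed in this thread, and the fallback
path is a simplified sketch:

#include <sys/uio.h>
#include <unistd.h>

/*
 * Illustrative only: call pwritev() only when the running OS actually
 * provides it (macOS 11.0+), falling back to plain pwrite() calls
 * otherwise.  This is what silences -Wunguarded-availability-new.
 */
static ssize_t
pwritev_with_fallback(int fd, const struct iovec *iov, int iovcnt, off_t offset)
{
    if (__builtin_available(macOS 11.0, *))
        return pwritev(fd, iov, iovcnt, offset);

    /* Fallback: write each iovec element with pwrite(). */
    ssize_t     total = 0;

    for (int i = 0; i < iovcnt; i++)
    {
        ssize_t     n = pwrite(fd, iov[i].iov_base, iov[i].iov_len,
                               offset + total);

        if (n < 0)
            return total > 0 ? total : n;
        total += n;
        if ((size_t) n < iov[i].iov_len)
            break;              /* short write */
    }
    return total;
}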

On Wed, Mar 31, 2021 at 7:49 AM James Hilliard 
wrote:

> On Tue, Mar 30, 2021 at 7:51 PM Tom Lane  wrote:
> >
> > Thomas Munro  writes:
> > > Personally I'm mostly concerned about making it easy for new
> > > contributors to get a working dev system going on a super common
> > > platform without dealing with hard-to-diagnose errors, than people who
> > > actually want a different target as a deliberate choice.  Do I
> > > understand correctly that there a period of time each year when major
> > > upgrades come out of sync and lots of people finish up running a
> > > toolchain and OS with this problem for a while due to the default
> > > target not matching?   If so I wonder if other projects are running
> > > into this with AC_REPLACE_FUNCS and what they're doing.
> >
> > Yeah, we've seen this happen at least a couple of times, though
> > it was only during this past cycle that we (I anyway) entirely
> > understood what was happening.
> >
> > The patches we committed in January (4823621db, 9d23c15a0, 50bebc1ae)
> > to improve our PG_SYSROOT selection heuristics should theoretically
> > improve the situation ... though I admit I won't have a lot of
> > confidence in them till we've been through a couple more rounds of
> > asynchronous-XCode-and-macOS releases.  Still, I feel that we
> > ought to leave that code alone until we see how it does.
>
> I mean, we know that it will still break under a number of common
> circumstances, so I think we should be fixing the root cause (improper
> detection of target deployment API availability in probes) in some
> way, as this will probably continue to be an issue otherwise. We already
> know that improving PG_SYSROOT selection cannot fix the root issue
> but rather tries to work around it in a way that is pretty much
> guaranteed to be brittle.
>
> Is there any approach to fixing the root cause of

Re: pg_get_wal_replay_pause_state() should not return 'paused' while a promotion is ongoing.

2021-05-18 Thread Fujii Masao




On 2021/05/18 14:53, Dilip Kumar wrote:

On Mon, May 17, 2021 at 7:59 PM Fujii Masao  wrote:


If a promotion is triggered while recovery is paused, the paused state ends
and promotion continues. But currently pg_get_wal_replay_pause_state()
returns 'paused' in that case. Isn't this a bug?

Attached patch fixes this issue by resetting the recovery pause state to
'not paused' when standby promotion is triggered.

Thought?



I think that prior to commit 496ee647ecd2917369ffcf1eaa0b2cdca07c8730
(Prefer standby promotion over recovery pause.) this behavior was fine,
because the pause was continued. But after this commit we now give
preference to promotion, so this is a bug and needs to be fixed.

The fix looks fine but I think along with this we should also return
immediately from the pause loop if promotion is requested.  Because if
we recheck the recovery pause then someone can pause again and we will
be in loop so better to exit as soon as promotion is requested, see
attached patch.  Should be applied along with your patch.


But this change can cause recovery to continue with insufficient parameter
settings if a promotion is requested while the server is in the paused state
because of such invalid settings. This behavior does not seem safe.
If my understanding is right, recovery should abort immediately
(i.e., the FATAL error "recovery aborted because of insufficient parameter
settings" should be thrown) if a promotion is requested in that case,
as happens when pg_wal_replay_resume() is executed in that case. Thought?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Re: Parallel scan with SubTransGetTopmostTransaction assert coredump

2021-05-18 Thread Pavel Borisov
I've also seen reports of the same Assert(TransactionIdFollowsOrEquals(xid,
TransactionXmin)) with a subsequent crash in a parallel worker in a
PostgreSQL v11-based build, though I was unable to investigate more deeply
or reproduce the issue. The details above in the thread make me think it is
a real and long-standing error that is surely worth fixing.

-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


Re: pg_get_wal_replay_pause_state() should not return 'paused' while a promotion is ongoing.

2021-05-18 Thread Dilip Kumar
On Tue, May 18, 2021 at 1:43 PM Fujii Masao  wrote:
> > The fix looks fine but I think along with this we should also return
> > immediately from the pause loop if promotion is requested.  Because if
> > we recheck the recovery pause then someone can pause again and we will
> > be in loop so better to exit as soon as promotion is requested, see
> > attached patch.  Should be applied along with your patch.
>
> But this change can cause recovery to continue with insufficient parameter
> settings if a promotion is requested while the server is in the paused state
> because of such invalid settings. This behavior does not seem safe.
> If my understanding is right, recovery should abort immediately
> (i.e., the FATAL error "recovery aborted because of insufficient parameter
> settings" should be thrown) if a promotion is requested in that case,
> as happens when pg_wal_replay_resume() is executed in that case. Thought?

Yeah, you are right, I missed that.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] logical decoding of two-phase transactions

2021-05-18 Thread Peter Smith
On Sun, May 16, 2021 at 12:07 AM Ajin Cherian  wrote:
>
> On Thu, May 13, 2021 at 7:50 PM Peter Smith  wrote:
> >
> > Please find attached the latest patch set v75*
> >
> > Differences from v74* are:
> >
> > * Rebased to HEAD @ today.
> >
> > * v75 also addresses some of the feedback comments from Vignesh [1].
>
> Adding a patch to this patch-set that prevents empty transactions from
> being sent to the subscriber/replica. This patch is based on the
> logic that was proposed for empty transactions in the thread [1]. This
> patch uses that patch and handles empty prepared transactions
> as well. So, this will prevent empty prepared transactions from being
> sent to the subscriber/replica. This patch also avoids sending
> COMMIT PREPARED /ROLLBACK PREPARED if the prepared transaction was
> skipped provided the COMMIT /ROLLBACK happens
> prior to a restart of the walsender. If the COMMIT/ROLLBACK PREPARED
> happens after a restart, it will not be able to know that the
> prepared transaction prior to the restart was not sent; in this case
> the apply worker of the subscription will check if a prepare of the
> same type exists
> and if it does not, it will silently ignore the COMMIT PREPARED
> (ROLLBACK PREPARED logic was already doing this).
> Do have a look and let me know if you have any comments.
>
> [1] - 
> https://www.postgresql.org/message-id/CAFPTHDYegcoS3xjGBj0XHfcdZr6Y35%2BYG1jq79TBD1VCkK7v3A%40mail.gmail.com
>

Hi Ajin.

I have applied the latest patch set v76*.

The patches applied cleanly.

All of the make, make check, and TAP subscriptions tests worked OK.

Below are my REVIEW COMMENTS for the v76-0003 part.

==

1.  File: doc/src/sgml/logicaldecoding.sgml

1.1

@@ -862,11 +862,19 @@ typedef void (*LogicalDecodePrepareCB) (struct
LogicalDecodingContext *ctx,
   The required commit_prepared_cb callback is called
   whenever a transaction COMMIT PREPARED has
been decoded.
   The gid field, which is part of the
-  txn parameter, can be used in this callback.
+  txn parameter, can be used in this callback. The
+  parameters prepare_end_lsn and
+  prepare_time can be used to check if the plugin
+  has received this PREPARE TRANSACTION in which case
+  it can apply the rollback, otherwise, it can skip the rollback
operation. The
+  gid alone is not sufficient because the downstream
+  node can have a prepared transaction with same identifier.

This is in the commit prepared section, but that new text refers
to "it can apply the rollback" etc.
Is this deliberate text, or maybe a cut/paste error?

==

2. File: src/backend/replication/pgoutput/pgoutput.c

2.1

@@ -76,6 +78,7 @@ static void
pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,

 static bool publications_valid;
 static bool in_streaming;
+static bool in_prepared_txn;

Wondering why this is a module static flag. That makes it look like
it somehow applies globally to all the functions in this scope, but
really I think this is just a txn property, right?
- e.g. why not use another member of the private TXN data instead? or
- e.g. why not use rbtxn_prepared(txn) macro?
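
Just to illustrate the second idea, a minimal sketch (inside
pgoutput.c, where the needed headers are already included;
pgoutput_send_begin_if_needed is a hypothetical helper name, and the
sketch assumes the prepared flag is already set on the ReorderBufferTXN
by the time the change callbacks run):

static void
pgoutput_send_begin_if_needed(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
    /* Ask the transaction itself instead of a module-level flag. */
    if (rbtxn_prepared(txn))
        pgoutput_begin_prepare(ctx, txn);
    else
        pgoutput_begin(ctx, txn);
}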

--

2.2

@@ -404,10 +410,32 @@ pgoutput_startup(LogicalDecodingContext *ctx,
OutputPluginOptions *opt,
 static void
 pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+ PGOutputTxnData*data = MemoryContextAllocZero(ctx->context,
+ sizeof(PGOutputTxnData));
+
+ (void)txn; /* keep compiler quiet */

I guess that since the arg "txn" is now being used, the statement added to
"keep compiler quiet" is redundant and should be removed.

--

2.3

+static void
+pgoutput_begin(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
  bool send_replication_origin = txn->origin_id != InvalidRepOriginId;
+ PGOutputTxnData *data = (PGOutputTxnData *) txn->output_plugin_private;

  OutputPluginPrepareWrite(ctx, !send_replication_origin);
  logicalrep_write_begin(ctx->out, txn);
+ data->sent_begin_txn = true;


I wondered is it worth adding Assert(data); here?

--

2.4

@@ -422,8 +450,14 @@ static void
 pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  XLogRecPtr commit_lsn)
 {
+ PGOutputTxnData *data = (PGOutputTxnData *) txn->output_plugin_private;
+
  OutputPluginUpdateProgress(ctx);

I wondered is it worthwhile to add Assert(data); here also?

--

2.5
@@ -422,8 +450,14 @@ static void
 pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  XLogRecPtr commit_lsn)
 {
+ PGOutputTxnData *data = (PGOutputTxnData *) txn->output_plugin_private;
+
  OutputPluginUpdateProgress(ctx);

+ /* skip COMMIT message if nothing was sent */
+ if (!data->sent_begin_txn)
+ return;

Shouldn't this code also be freeing that allocated data? I think you
do free it in similar functions later in this patch.

--

2.6

@@ -435,10 +469,31 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
 static void
 pgoutput_begin_prepare_txn(

Re: allow specifying direct role membership in pg_hba.conf

2021-05-18 Thread Magnus Hagander
On Mon, May 17, 2021 at 11:18 PM Chapman Flack  wrote:
>
> On 05/17/21 16:35, Magnus Hagander wrote:
> > So you're saying that some entries int he parameter section would
> > depend on the db/user/ip combo and some would depend just on the ip?
>
> I don't *think* that's what I was saying. What I was thinking was this:
>
> The pg_hba.conf file is an ordered list of entries. Each entry can specify
> a (broad or narrow) set of IPs it applies to, a (broad or narrow) set of
> databases it applies to, and a (broad or narrow) set of users it applies to.
>
> Also, in this hypothetical, it can specify a min protocol version.
>
> Right now, we're doing something like this:
>
> 1. accept an incoming connection, learning the client IP
> 2. SSLRequest message leads to negotiating TLS
> 3. StartupMessage supplies the desired database and user name
> 4. pg_hba entries are consulted once and filtered down to the first one
> applicable to the client IP, database, and username (and SSLness)
> 5. that entry is used for authentication
>
>
> I suggested only:
>
> Insert step 1½, filter the pg_hba entries down to only those that could
> possibly accept a connection from this IP address. This is an improper
> subset of the whole list, and an improper superset of the singleton to be
> generated later in step 4.
>
> Step 2 still negotiates TLS, but can fail early if the protocol would
> be older than the oldest allowed in the pre-filtered list.

Nope, this is *exactly* what I'm referring to as being a bad idea.

Step 1 1/2 in this *ignores* the fact that you may have specified a
restriction on username and database name in pg_hba.conf, because it
hasn't seen them yet. Thus, a parameter such as min_tls_version would
not respect the username/database name fields, whereas other parameters
would. That is a massive risk of misconfiguration.

I mean, if you have
hostssl somedatabase someuser 10.0.0.0/24 gss
hostssl somedatabase superuser 10.0.0.0/24 gss tls_min_version=1.3

One would reasonably expect that "someuser" can connect with whatever
the default value is for tls_min_version, whereas "superuser" would
require a minimum of 1.3. But that's *not* what would happen --
superuser would also be allowed to connect with a lower version if
that's allowed in the global set.


> Step 4 takes that pre-filtered list and completes the restriction down to
> first entry matching the IP, database, and username. This should be the
> same singleton it would generate now. But it can fail-fast if that entry
> would require a higher protocol version than what's been negotiated,
> before sending the corresponding authentication request message, so no
> authentication data will be exchanged over a less-secure channel than
> intended. However, the user, database name, and options in the Startup
> message might have been exposed over a lower TLS version than intended.
> Maybe that's not the end of the world?

That is exactly the problem. And while that may hold true of current
auth methods, it may not hold true of all. And it could still trigger
things like an ident callback if that is allowed etc.

So I stand by thinking this is the wrong place to solve the problem. I
agree it would be good to be able to do it, but I don't agree on
overloading it on pg_hba.conf, which is complicated enough already.
And for security config, simplicity is pretty much always better.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




How to launch parallel aggregations ?

2021-05-18 Thread Esteban Zimanyi
Dear all

In MobilityDB we have defined parallel aggregations with a combine
function, e.g.,

CREATE AGGREGATE extent(tbox) (
  SFUNC = tbox_extent_transfn,
  STYPE = tbox,
  COMBINEFUNC = tbox_extent_combinefn,
  PARALLEL = safe
);

We would like to trigger the combine functions in the coverage tests but
for this it is required that the tables are VERY big. In particular for the
above aggregation, the combine function only is triggered when the table
has more than 300K rows.

As it is not very effective to have such a big table in the test database
used for the regression and the coverage tests I wonder whether it is
possible to set some parameters to launch the combine functions with tables
of, e.g., 10K rows, which are the bigger tables in our regression test
database.

Many thanks for your insights !

Esteban


Re: Re: Parallel scan with SubTransGetTopmostTransaction assert coredump

2021-05-18 Thread Greg Nancarrow
On Tue, May 18, 2021 at 11:27 AM Pengchengliu  wrote:
>
> Hi Greg,
>
>Actually I am very confused about ActiveSnapshot and TransactionSnapshot. 
> I don't know why main process send ActiveSnapshot and TransactionSnapshot 
> separately.  And what is exact difference between them?
>If you know that, could you explain that for me? It will be very 
> appreciated.

In the context of a parallel-worker, I am a little confused too, so I
can't explain it either.
It is not really explained in the file
"src/backend/access/transam/README.parallel"; it only mentions the
following as part of the state that needs to be copied to each worker:

 - The transaction snapshot.
 - The active snapshot, which might be different from the transaction snapshot.

So they might be different, but exactly when and why?

When I debugged a typical parallel-SELECT case, I found that prior to
plan execution, GetTransactionSnapshot() was called and its return
value was stored in both the QueryDesc and the estate (es_snapshot),
which was then pushed on the ActiveSnapshot stack. So by the time
InitializeParallelDSM() was called, the (top) ActiveSnapshot was the
last snapshot returned from GetTransactionSnapshot().
So why InitializeParallelDSM() calls GetTransactionSnapshot() again is
not clear to me (because isn't then the ActiveSnapshot a potentially
earlier snapshot? - which it shouldn't be, AFAIK. And also, it's then
different to the non-parallel case).

>Before we know them exactly, I think we should not modify the 
> TransactionSnapshot to ActiveSnapshot in main process. If it is, the main 
> process should send ActiveSnapshot only.

I think it would be worth you trying my suggested change (if you have
a development environment, which I assume you have). Sure, IF it was
deemed a proper solution, you'd only send the one snapshot, and adjust
accordingly in ParallelWorkerMain(), but we need not worry about that
in order to test it.


Regards,
Greg Nancarrow
Fujitsu Australia




Re: How to launch parallel aggregations ?

2021-05-18 Thread Bharath Rupireddy
On Tue, May 18, 2021 at 2:32 PM Esteban Zimanyi  wrote:
>
> Dear all
>
> In MobilityDB we have defined parallel aggregations with a combine function, 
> e.g.,
>
> CREATE AGGREGATE extent(tbox) (
>   SFUNC = tbox_extent_transfn,
>   STYPE = tbox,
>   COMBINEFUNC = tbox_extent_combinefn,
>   PARALLEL = safe
> );
>
> We would like to trigger the combine functions in the coverage tests but for 
> this it is required that the tables are VERY big. In particular for the above 
> aggregation, the combine function only is triggered when the table has more 
> than 300K rows.
>
> As it is not very effective to have such a big table in the test database 
> used for the regression and the coverage tests I wonder whether it is 
> possible to set some parameters to launch the combine functions with tables 
> of, e.g., 10K rows, which are the bigger tables in our regression test 
> database.
>
> Many thanks for your insights !

You could do something like below, just before your test:

-- encourage use of parallel plans
set parallel_setup_cost=0;
set parallel_tuple_cost=0;
set min_parallel_table_scan_size=0;
set max_parallel_workers_per_gather=2;

And after the test you can reset all of the above parameters.

Hope that helps!

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: How to launch parallel aggregations ?

2021-05-18 Thread Esteban Zimanyi
Thanks a lot! It works!

On Tue, May 18, 2021 at 11:15 AM Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:

> On Tue, May 18, 2021 at 2:32 PM Esteban Zimanyi 
> wrote:
> >
> > Dear all
> >
> > In MobilityDB we have defined parallel aggregations with a combine
> function, e.g.,
> >
> > CREATE AGGREGATE extent(tbox) (
> >   SFUNC = tbox_extent_transfn,
> >   STYPE = tbox,
> >   COMBINEFUNC = tbox_extent_combinefn,
> >   PARALLEL = safe
> > );
> >
> > We would like to trigger the combine functions in the coverage tests but
> for this it is required that the tables are VERY big. In particular for the
> above aggregation, the combine function only is triggered when the table
> has more than 300K rows.
> >
> > As it is not very effective to have such a big table in the test
> database used for the regression and the coverage tests I wonder whether it
> is possible to set some parameters to launch the combine functions with
> tables of, e.g., 10K rows, which are the bigger tables in our regression
> test database.
> >
> > Many thanks for your insights !
>
> You could do something like below, just before your test:
>
> -- encourage use of parallel plans
> set parallel_setup_cost=0;
> set parallel_tuple_cost=0;
> set min_parallel_table_scan_size=0;
> set max_parallel_workers_per_gather=2;
>
> And after the test you can reset all of the above parameters.
>
> Hope that helps!
>
> With Regards,
> Bharath Rupireddy.
> EnterpriseDB: http://www.enterprisedb.com
>


Re: PG 14 release notes, first draft

2021-05-18 Thread Masahiko Sawada
On Mon, May 10, 2021 at 3:03 PM Bruce Momjian  wrote:
>
> I have committed the first draft of the PG 14 release notes.  You can
> see the most current  build of them here:
>
> https://momjian.us/pgsql_docs/release-14.html
>

I think we need to mention in the release note that
vacuum_cleanup_index_scale_factor GUC parameter has been removed and
vacuum_cleanup_index_scale_factor storage parameter has been
deprecated (please refer to commit 9f3665fb and effdd3f3b63).

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




RE: [HACKERS] logical decoding of two-phase transactions

2021-05-18 Thread tanghy.f...@fujitsu.com
Hi Ajin

>The above patch had some changes missing which resulted in some tap
>tests failing. Sending an updated patchset. Keeping the patchset
>version the same.

Thanks for your patch.  I see a segmentation fault problem when using it.
Please take a look at this.
The steps to reproduce the problem are as follows.

--publisher--
create table test (a int primary key, b varchar);
create publication pub for table test;

--subscriber--
create table test (a int primary key, b varchar);
create subscription sub connection 'dbname=postgres' publication pub 
with(two_phase=on);

Then I prepare, commit, and roll back transactions and TRUNCATE the table
in a SQL script as follows:
-
BEGIN;
INSERT INTO test SELECT i, md5(i::text) FROM generate_series(1, 10000) s(i);
PREPARE TRANSACTION 't1';
COMMIT PREPARED 't1';

BEGIN;
INSERT INTO test SELECT i, md5(i::text) FROM generate_series(10001, 20000) s(i);
PREPARE TRANSACTION 't2';
ROLLBACK PREPARED 't2';

TRUNCATE test;
-

To make the problem easy to reproduce, I looped the above operations in my SQL
file about 10 times; then I could reproduce it 100% of the time and got a
segmentation fault in the publisher log as follows:
-
2021-05-18 16:30:56.952 CST [548189] postmaster LOG:  server process (PID 
548222) was terminated by signal 11: Segmentation fault
2021-05-18 16:30:56.952 CST [548189] postmaster DETAIL:  Failed process was 
running: START_REPLICATION SLOT "sub" LOGICAL 0/0 (proto_version '3', two_phase 
'on', publication_names '"pub"')
-

Here is the core dump information :
-
#0  0x0090afe4 in pq_sendstring (buf=buf@entry=0x251ca80, str=0x0) at 
pqformat.c:199
#1  0x00ab0a2b in logicalrep_write_begin_prepare (out=0x251ca80, 
txn=txn@entry=0x25346e8) at proto.c:124
#2  0x7f9528842dd6 in pgoutput_begin_prepare (ctx=ctx@entry=0x2514700, 
txn=txn@entry=0x25346e8) at pgoutput.c:495
#3  0x7f9528843f70 in pgoutput_truncate (ctx=0x2514700, txn=0x25346e8, 
nrelations=1, relations=0x262f678, change=0x25370b8) at pgoutput.c:905
#4  0x00aa57cb in truncate_cb_wrapper (cache=, 
txn=, nrelations=, relations=, 
change=)
at logical.c:1103
#5  0x00abf333 in ReorderBufferApplyTruncate (streaming=false, 
change=0x25370b8, relations=0x262f678, nrelations=1, txn=0x25346e8, 
rb=0x2516710)
at reorderbuffer.c:1918
#6  ReorderBufferProcessTXN (rb=rb@entry=0x2516710, txn=0x25346e8, 
commit_lsn=commit_lsn@entry=27517176, snapshot_now=, 
command_id=command_id@entry=0,
streaming=streaming@entry=false) at reorderbuffer.c:2278
#7  0x00ac0b14 in ReorderBufferReplay (txn=, 
rb=rb@entry=0x2516710, xid=xid@entry=738, commit_lsn=commit_lsn@entry=27517176,
end_lsn=end_lsn@entry=27517544, 
commit_time=commit_time@entry=674644388404356, origin_id=0, origin_lsn=0) at 
reorderbuffer.c:2591
#8  0x00ac1713 in ReorderBufferCommit (rb=0x2516710, xid=xid@entry=738, 
commit_lsn=27517176, end_lsn=27517544, 
commit_time=commit_time@entry=674644388404356,
origin_id=origin_id@entry=0, origin_lsn=0) at reorderbuffer.c:2615
#9  0x00a9f702 in DecodeCommit (ctx=ctx@entry=0x2514700, 
buf=buf@entry=0x7ffdd027c2b0, parsed=parsed@entry=0x7ffdd027c140, 
xid=xid@entry=738,
two_phase=) at decode.c:742
#10 0x00a9fc6c in DecodeXactOp (ctx=ctx@entry=0x2514700, 
buf=buf@entry=0x7ffdd027c2b0) at decode.c:278
#11 0x00aa1b75 in LogicalDecodingProcessRecord (ctx=0x2514700, 
record=0x2514ac0) at decode.c:142
#12 0x00af6db1 in XLogSendLogical () at walsender.c:2876
#13 0x00afb6aa in WalSndLoop (send_data=send_data@entry=0xaf6d49 
) at walsender.c:2306
#14 0x00afbdac in StartLogicalReplication (cmd=cmd@entry=0x24da288) at 
walsender.c:1206
#15 0x00afd646 in exec_replication_command (
cmd_string=cmd_string@entry=0x2452570 "START_REPLICATION SLOT \"sub\" 
LOGICAL 0/0 (proto_version '3', two_phase 'on', publication_names '\"pub\"')") 
at walsender.c:1646
#16 0x00ba3514 in PostgresMain (argc=argc@entry=1, 
argv=argv@entry=0x7ffdd027c560, dbname=, username=) at postgres.c:4482
#17 0x00a7284a in BackendRun (port=port@entry=0x2477b60) at 
postmaster.c:4491
#18 0x00a78bba in BackendStartup (port=port@entry=0x2477b60) at 
postmaster.c:4213
#19 0x00a78ff9 in ServerLoop () at postmaster.c:1745
#20 0x00a7bbdf in PostmasterMain (argc=argc@entry=3, 
argv=argv@entry=0x244bae0) at postmaster.c:1417
#21 0x0090dc80 in main (argc=3, argv=0x244bae0) at main.c:209
-

I noticed that it called the pgoutput_truncate function and the
pgoutput_begin_prepare function. That seems odd, because the TRUNCATE is not
in a prepared transaction in my case.

I tried to debug this to learn more and found that in the pgoutput_truncate
function the value of in_prepared_txn was true. Later, it got a segmentation
fault when it tried to get the gid in the logicalrep_write_begin_prepare
function - the transaction has no gid, so we got the segmentation fault.

FYI:
I also

Re: [HACKERS] logical decoding of two-phase transactions

2021-05-18 Thread Amit Kapila
On Thu, May 13, 2021 at 3:20 PM Peter Smith  wrote:
>
> Please find attached the latest patch set v75*
>

Review comments for v75-0001-Add-support-for-prepared-transactions-to-built-i:
===
1.
-   CREATE_REPLICATION_SLOT slot_name [ TEMPORARY ]
        { PHYSICAL [ RESERVE_WAL ]
        | LOGICAL output_plugin [ EXPORT_SNAPSHOT | NOEXPORT_SNAPSHOT | USE_SNAPSHOT ] }
+   CREATE_REPLICATION_SLOT slot_name [ TEMPORARY ] [ TWO_PHASE ]
        { PHYSICAL [ RESERVE_WAL ]
        | LOGICAL output_plugin [ EXPORT_SNAPSHOT | NOEXPORT_SNAPSHOT | USE_SNAPSHOT ] }


Can we do some testing of the code related to this in some way? One
random idea could be to change the current subscriber-side code just
for testing purposes to see if this works. Can we enhance and use
pg_recvlogical to test this? It is possible that if you address
comment number 13 below, this can be tested with Create Subscription
command.

2.
-   belong to the same transaction. It also sends changes of large in-progress
-   transactions between a pair of Stream Start and Stream Stop messages. The
-   last stream of such a transaction contains Stream Commit or Stream Abort
-   message.
+   belong to the same transaction. Similarly, all messages between a pair of
+   Begin Prepare and Commit Prepared messages belong to the same transaction.

I think here we need to write Prepare instead of Commit Prepared
because Commit Prepared for a transaction can come at a later point of
time and all the messages in-between won't belong to the same
transaction.

3.
+
+
+
+The following messages (Begin Prepare, Prepare, Commit Prepared,
Rollback Prepared)
+are available since protocol version 3.
+

I am not sure here marker like "TWO_PHASE Messages" is required. We
don't have any such marker for streaming messages.

4.
+
+Int64
+
+Timestamp of the prepare transaction.

Isn't it better to write this description as "Prepare timestamp of the
transaction" to match the similar description of the Commit
timestamp? Also, there are similar occurrences in the patch at other
places; change those as well.

5.
+Begin Prepare
+
+
...
+
+Int32
+
+Xid of the subtransaction (will be same as xid of the
transaction for top-level
+transactions).

The above description seems wrong to me. It should be Xid of the
transaction as we won't receive Xid of subtransaction in Begin
message. The same applies to the prepare/commit prepared/rollback
prepared transaction messages as well, so change that as well
accordingly.

6.
+Byte1('P')
+
+Identifies this message as a two-phase prepare
transaction message.
+

In all the similar messages, we are using "Identifies the message as
...". I feel it is better to be consistent in this and similar
messages in the patch.

7.
+
+
+Rollback Prepared
+
..
+
+Int64
+
+The LSN of the prepare.
+

This should be end LSN of the prepared transaction.

8.
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+ TimestampTz origin_prepare_timestamp)
..
..
+ /*
+ * We are neither expecting the collisions of GXACTs (same gid)
+ * between publisher and subscribers nor the apply worker restarts
+ * after prepared xacts,

The second part of the comment ".. nor the apply worker restarts after
prepared xacts .." is no longer true after commit 8bdb1332eb[1]. So,
we can remove it.

9.
+ /*
+ * Does the subscription have tables?
+ *
+ * If there were not-READY relations found then we know it does. But if
+ * table_state_no_ready was empty we still need to check again to see
+ * if there are 0 tables.
+ */
+ has_subrels = (list_length(table_states_not_ready) > 0) ||

Typo in comments. /table_state_no_ready/table_state_not_ready

10.
+ if (!twophase)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("unrecognized subscription parameter: \"%s\"", defel->defname)));

errmsg is not aligned properly. Can we make the error message clear,
something like: "cannot change two_phase option"

11.
@@ -69,7 +69,8 @@ parse_subscription_options(List *options,
 char **synchronous_commit,
 bool *refresh,
 bool *binary_given, bool *binary,
-bool *streaming_given, bool *streaming)
+bool *streaming_given, bool *streaming,
+bool *twophase_given, bool *twophase)

This function already has 14 parameters and this patch adds 2 new
ones. Isn't it better to have a struct (ParseSubOptions) for these
parameters? I think that might lead to some code churn, but we can have
that as a separate patch on top of which we can create the two_pc patch.
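
Purely as an illustration of that idea (a hypothetical sketch listing
only the parameters visible in the hunk above; the struct and field
names are not from the patch):

typedef struct ParseSubOptions
{
    char       *synchronous_commit;
    bool        refresh;
    bool        binary_given;
    bool        binary;
    bool        streaming_given;
    bool        streaming;
    bool        twophase_given;
    bool        twophase;
    /* ... plus the function's other existing output parameters ... */
} ParseSubOptions;

/* parse_subscription_options(List *options, ParseSubOptions *opts); */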

12.
* The subscription two_phase commit implementation requires
+ * that replication has passed the initial table
+ * synchronization phase before the two_phase becomes properly
+ * enabled.

Can we slightly modify the starting of this sentence as:"The
subscription option 'two_phase' requires that ..."

13.
@@ -507,7 +558,16 @@ CreateSubscription(CreateSubscriptionStmt *stmt,
bool isTopLevel)
  {
  Assert(slotn

Re: Parallel scan with SubTransGetTopmostTransaction assert coredump

2021-05-18 Thread houzj.f...@fujitsu.com
> To: Pengchengliu 
> Cc: Greg Nancarrow ; Andres Freund ; 
> PostgreSQL-development 
> Subject: Re: Re: Parallel scan with SubTransGetTopmostTransaction assert 
> coredump

> I've also seen reports of the same
> Assert(TransactionIdFollowsOrEquals(xid, TransactionXmin)) with a subsequent
> crash in a parallel worker in a PostgreSQL v11-based build, though I was
> unable to investigate more deeply or reproduce the issue. The details above
> in the thread make me think it is a real and long-standing error that is
> surely worth fixing.

I followed Liu's reproduction steps and successfully reproduced it after about
half an hour of running.
My configure options are: "./configure --enable-cassert --prefix=/home/pgsql".

After applying Greg-san's change, the core dump has not happened in two hours
(it is still running).
Note that I have not taken a deep look into the change; I'm just providing
some test information in advance.

Best regards,
houzj


Re: Forget close an open relation in ReorderBufferProcessTXN()

2021-05-18 Thread Amit Kapila
On Tue, May 18, 2021 at 1:29 PM osumi.takami...@fujitsu.com
 wrote:
>
> On Monday, May 17, 2021 6:45 PM Amit Kapila  wrote:
> >
> > We allow taking locks on system catalogs, so why prohibit
> > user_catalog_tables? However, I agree that if we want plugins to acquire the
> > lock on user_catalog_tables then we should either prohibit decoding of such
> > relations or do something else to avoid deadlock hazards.
> OK.
>
> Although we have not concluded the range of logical decoding of 
> user_catalog_table
> (like we should exclude TRUNCATE command only or all operations on that type 
> of table),
> I'm worried that disallowing the logical decoding of user_catalog_table 
> produces
> the deadlock still. It's because disabling it by itself does not affect the
> lock taken by TRUNCATE command. What I have in mind is an example below.
>
> (1) plugin (e.g. pgoutput) is designed to take a lock on user_catalog_table.
> (2) logical replication is set up in synchronous mode.
> (3) TRUNCATE command takes an access exclusive lock on the user_catalog_table.
> (4) This time, we don't do anything for the TRUNCATE decoding.
> (5) the plugin tries to take a lock on the truncated table
> but, it can't due to the lock by TRUNCATE command.
>

If you skip decoding of the truncate, then we won't invoke the plugin API,
so step 5 will be skipped.

> I was not sure that the place where the plugin takes the lock is in 
> truncate_cb
> or somewhere else not directly related to decoding of the user_catalog_table 
> itself,
> so I might be wrong. However, in this case,
> the solution would be not disabling the decoding of user_catalog_table
> but prohibiting TRUNCATE command on user_catalog_table in synchronous_mode.
> If this is true, I need to extend an output plugin and simulate the deadlock 
> first
> and remove it by fixing the TRUNCATE side. Thoughts ?
>

I suggest not spending too much time reproducing this because it is
quite clear that it will lead to deadlock if the plugin acquires lock
on user_catalog_table and we allow decoding of truncate. But if you
want to see how that happens you can try as well.

-- 
With Regards,
Amit Kapila.




Re: allow specifying direct role membership in pg_hba.conf

2021-05-18 Thread Chapman Flack
On 05/18/21 04:54, Magnus Hagander wrote:

> I mean, if you have
> hostssl somedatabase someuser 10.0.0.0/24 gss
> hostssl somedatabase superuser 10.0.0.0/24 gss tls_min_version=1.3
> 
> One would reasonably expect that "someuser" can connect with whatever
> the default value is for tls_min_version, whereas "superuser" would
> require a minimum of 1.3. But that's *not* what would happen --
> superuser would also be allowed to connect with a lower version if
> that's allowed in the global set.

Negatory. "superuser" would be allowed to send a StartupMessage
containing the strings "somedatabase" and "superuser" (and possibly
some settings of options) over a lower version if that's allowed
in the global set ... and would then have the connection rejected
because the negotiated protocol was lower than 1.3, without seeing
any authentication message or having a chance to send any sensitive
authentication credentials.

So the risk of any information exposure over a too-low TLS version
is limited to the name of a database, the name of a user, and possibly
the settings of some options, and no sensitive authentication data.

Regards,
-Chap




Re: Parallel scan with SubTransGetTopmostTransaction assert coredump

2021-05-18 Thread Greg Nancarrow
On Tue, May 18, 2021 at 9:41 PM houzj.f...@fujitsu.com
 wrote:
>
> > To: Pengchengliu 
> > Cc: Greg Nancarrow ; Andres Freund 
> > ; PostgreSQL-development 
> > Subject: Re: Re: Parallel scan with SubTransGetTopmostTransaction assert 
> > coredump
>
> > I've also seen reports of the same
> > Assert(TransactionIdFollowsOrEquals(xid, TransactionXmin)) with a
> > subsequent crash in a parallel worker in a PostgreSQL v11-based build,
> > though I was unable to investigate more deeply or reproduce the issue.
> > The details above in the thread make me think it is a real and
> > long-standing error that is surely worth fixing.
>
> I followed Liu's reproduction steps and successfully reproduced it after
> about half an hour of running.
> My configure options are: "./configure --enable-cassert --prefix=/home/pgsql".
>
> After applying Greg-san's change, the core dump has not happened in two hours
> (it is still running).
> Note that I have not taken a deep look into the change; I'm just providing
> some test information in advance.
>

+1
Thanks for doing that.
I'm unsure if that "fix" is the right approach, so please investigate it too.

Regards,
Greg Nancarrow
Fujitsu Australia




Re: postgres_fdw - should we tighten up batch_size, fetch_size options against non-numeric values?

2021-05-18 Thread Ashutosh Bapat
On Mon, May 17, 2021 at 7:50 PM Bharath Rupireddy
 wrote:
> If we do that, then the options that are only accepting unquoted
> integers (i.e. 123, 456 etc.) and throwing errors for the quoted
> integers ('123', '456' etc.) will then start to accept the quoted
> integers. Other hackers might not agree to this change.

I guess the options which weren't accepting quoted strings, will catch
these errors at the time of parsing itself. Even if that's not true, I
would see that as an improvement. Anyway, I won't stretch this
further.

>
> Another way is to have new API like
> defGetInt32_2/defGetInt64_2/defGetNumeric_2 (or some other better
> names) which basically accept both quoted and unquoted strings, see
> [1] for a rough sketch of the function. These API can be useful if an
> option needs to be parsed in both quoted and unquoted form. Or we can
> also have these functions as [2] for only parsing quoted options. I
> prefer [2] so that these API can be used without any code duplication.
> Thoughts?

I am not sure whether we want to maintain two copies. In that case
your first patch is fine.

> Note
> that I have not added any test case as this change is a trivial thing.
> If required, I can add one.

That will help to make sure that we preserve the behaviour even
through code changes.

-- 
Best Wishes,
Ashutosh Bapat




Re: postgres_fdw - should we tighten up batch_size, fetch_size options against non-numeric values?

2021-05-18 Thread Fujii Masao




On 2021/05/17 18:58, Bharath Rupireddy wrote:

Hi,

It looks like the values such as '123.456', '789.123' '100$%$#$#',
'9,223,372,' are accepted and treated as valid integers for
postgres_fdw options batch_size and fetch_size. Whereas this is not
the case with fdw_startup_cost and fdw_tuple_cost options for which an
error is thrown. Attaching a patch to fix that.


This looks an improvement. But one issue is that the restore of
dump file taken by pg_dump from v13 may fail for v14 with this patch
if it contains invalid setting of fetch_size, e.g., "fetch_size '123.456'".
OTOH, since batch_size was added in v14, it has no such issue.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: postgres_fdw - should we tighten up batch_size, fetch_size options against non-numeric values?

2021-05-18 Thread Tom Lane
Fujii Masao  writes:
> On 2021/05/17 18:58, Bharath Rupireddy wrote:
>> It looks like the values such as '123.456', '789.123' '100$%$#$#',
>> '9,223,372,' are accepted and treated as valid integers for
>> postgres_fdw options batch_size and fetch_size. Whereas this is not
>> the case with fdw_startup_cost and fdw_tuple_cost options for which an
>> error is thrown. Attaching a patch to fix that.

> This looks an improvement. But one issue is that the restore of
> dump file taken by pg_dump from v13 may fail for v14 with this patch
> if it contains invalid setting of fetch_size, e.g., "fetch_size '123.456'".
> OTOH, since batch_size was added in v14, it has no such issue.

Maybe better to just silently round to integer?  I think that's
what we generally do with integer GUCs these days, eg

regression=# set work_mem = 102.9;
SET
regression=# show work_mem;
 work_mem 
--
 103kB
(1 row)

I agree with throwing an error for non-numeric junk though.
Allowing that on the grounds of backwards compatibility
seems like too much of a stretch.

regards, tom lane




Multiple pg_waldump --rmgr options

2021-05-18 Thread Heikki Linnakangas

I wanted to dump all heap WAL records with pg_waldump, so I did this:


$ pg_waldump --rmgr=heap --rmgr=heap2 data/pg_wal/00010001 --stat=record
Type                                  N      (%)  Record size      (%)  FPI size      (%)  Combined size      (%)
----                                  -      ---  -----------      ---  --------      ---  -------------      ---
Heap2/PRUNE                         268 (  8.74)        18192 (  2.73)         0 (  0.00)          18192 (  1.74)
Heap2/VACUUM                         55 (  1.79)         4940 (  0.74)         0 (  0.00)           4940 (  0.47)
Heap2/FREEZE_PAGE                   277 (  9.03)       186868 ( 28.03)         0 (  0.00)         186868 ( 17.86)
Heap2/VISIBLE                       467 ( 15.23)        27783 (  4.17)    376832 ( 99.34)         404615 ( 38.68)
Heap2/MULTI_INSERT                 1944 ( 63.38)       354800 ( 53.21)      2520 (  0.66)         357320 ( 34.16)
Heap2/MULTI_INSERT+INIT              56 (  1.83)        74152 ( 11.12)         0 (  0.00)          74152 (  7.09)
                                 ------               --------             ------                --------
Total                              3067               666735 [63.74%]    379352 [36.26%]        1046087 [100%]
pg_waldump: fatal: error in WAL record at 0/1680118: invalid record length at 0/1680150: wanted 24, got 0


That didn't do what I wanted. It only printed the Heap2 records, not 
Heap, even though I specified both. The reason is that if you specify 
multiple --rmgr options, only the last one takes effect.


I propose the attached to allow selecting multiple rmgrs, by giving 
multiple --rmgr options. With that, it works the way I expected:



$ pg_waldump --rmgr=heap --rmgr=heap2 data/pg_wal/00010001 --stat=record
Type                                  N      (%)  Record size      (%)  FPI size      (%)  Combined size      (%)
----                                  -      ---  -----------      ---  --------      ---  -------------      ---
Heap2/PRUNE                         268 (  1.77)        18192 (  0.71)         0 (  0.00)          18192 (  0.55)
Heap2/VACUUM                         55 (  0.36)         4940 (  0.19)         0 (  0.00)           4940 (  0.15)
Heap2/FREEZE_PAGE                   277 (  1.83)       186868 (  7.33)         0 (  0.00)         186868 (  5.67)
Heap2/VISIBLE                       467 (  3.09)        27783 (  1.09)    376832 ( 50.37)         404615 ( 12.27)
Heap2/MULTI_INSERT                 1944 ( 12.86)       354800 ( 13.91)      2520 (  0.34)         357320 ( 10.83)
Heap2/MULTI_INSERT+INIT              56 (  0.37)        74152 (  2.91)         0 (  0.00)          74152 (  2.25)
Heap/INSERT                        9948 ( 65.80)      1433891 ( 56.22)      8612 (  1.15)        1442503 ( 43.73)
Heap/DELETE                         942 (  6.23)        50868 (  1.99)         0 (  0.00)          50868 (  1.54)
Heap/UPDATE                         193 (  1.28)       101265 (  3.97)      9556 (  1.28)         110821 (  3.36)
Heap/HOT_UPDATE                     349 (  2.31)        36910 (  1.45)      1300 (  0.17)          38210 (  1.16)
Heap/LOCK                           209 (  1.38)        11481 (  0.45)    316828 ( 42.35)         328309 (  9.95)
Heap/INPLACE                        212 (  1.40)        44279 (  1.74)     32444 (  4.34)          76723 (  2.33)
Heap/INSERT+INIT                    184 (  1.22)       188803 (  7.40)         0 (  0.00)         188803 (  5.72)
Heap/UPDATE+INIT                     15 (  0.10)        16273 (  0.64)         0 (  0.00)          16273 (  0.49)
                                 ------              ---------             ------                ---------
Total                             15119              2550505 [77.32%]    748092 [22.68%]        3298597 [100%]
pg_waldump: fatal: error in WAL record at 0/1680150: invalid record length at 0/16801C8: wanted 24, got 0


- Heikki
>From 991296f690d79a12670f1dca34
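
The attached patch is truncated in this archive view.  Purely as a
sketch of the general approach (not the actual patch), repeated --rmgr
options can be accumulated into a per-rmgr flag array rather than a
single selected id:

/* Inside pg_waldump.c; access/rmgr.h provides RmgrId and RM_MAX_ID. */
static bool filter_by_rmgr[RM_MAX_ID + 1];
static bool filter_by_rmgr_enabled = false;

/* Called once for each --rmgr=NAME option found on the command line. */
static void
select_rmgr(RmgrId rmid)
{
    filter_by_rmgr[rmid] = true;
    filter_by_rmgr_enabled = true;
}

/* Used while iterating over records: keep only the selected rmgrs. */
static bool
rmgr_is_selected(RmgrId rmid)
{
    return !filter_by_rmgr_enabled || filter_by_rmgr[rmid];
}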

Re: PG 14 release notes, first draft

2021-05-18 Thread Bruce Momjian
On Tue, May 18, 2021 at 06:28:49PM +0900, Masahiko Sawada wrote:
> On Mon, May 10, 2021 at 3:03 PM Bruce Momjian  wrote:
> >
> > I have committed the first draft of the PG 14 release notes.  You can
> > see the most current  build of them here:
> >
> > https://momjian.us/pgsql_docs/release-14.html
> >
> 
> I think we need to mention in the release note that
> vacuum_cleanup_index_scale_factor GUC parameter has been removed and
> vacuum_cleanup_index_scale_factor storage parameter has been
> deprecated (please refer to commit 9f3665fb and effdd3f3b63).

Looking at the full commit message:

commit 9f3665fbfc
Author: Peter Geoghegan 
Date:   Wed Mar 10 16:27:01 2021 -0800

    Don't consider newly inserted tuples in nbtree VACUUM.

    Remove the entire idea of "stale stats" within nbtree VACUUM (stop
    caring about stats involving the number of inserted tuples).  Also
    remove the vacuum_cleanup_index_scale_factor GUC/param on the master
    branch (though just disable them on postgres 13).

    The vacuum_cleanup_index_scale_factor/stats interface made the nbtree
    AM partially responsible for deciding when pg_class.reltuples stats
    needed to be updated.  This seems contrary to the spirit of the index
    AM API, though -- it is not actually necessary for an index AM's bulk
    delete and cleanup callbacks to provide accurate stats when it happens
    to be inconvenient.  The core code owns that.  (Index AMs have the
    authority to perform or not perform certain kinds of deferred cleanup
    based on their own considerations, such as page deletion and
    recycling, but that has little to do with
    pg_class.reltuples/num_index_tuples.)

    This issue was fairly harmless until the introduction of the
    autovacuum_vacuum_insert_threshold feature by commit b07642db, which
    had an undesirable interaction with the
    vacuum_cleanup_index_scale_factor mechanism: it made insert-driven
    autovacuums perform full index scans, even though there is no real
    benefit to doing so.  This has been tied to a regression with an
    append-only insert benchmark [1].

    Also have remaining cases that perform a full scan of an index during
    a cleanup-only nbtree VACUUM indicate that the final tuple count is
    only an estimate.  This prevents vacuumlazy.c from setting the index's
    pg_class.reltuples in those cases (it will now only update pg_class
    when vacuumlazy.c had TIDs for nbtree to bulk delete).  This arguably
    fixes an oversight in deduplication-related bugfix commit 48e12913.

    [1] https://smalldatum.blogspot.com/2021/01/insert-benchmark-postgres-is-still.html

    Author: Peter Geoghegan 
    Reviewed-By: Masahiko Sawada 
    Discussion: https://postgr.es/m/cad21aoa4whthn5uu6+wscz7+j_rcejmcuh94qcoupub42sh...@mail.gmail.com
--> Backpatch: 13-, where autovacuum_vacuum_insert_threshold was added.

This was backpatched into PG 13.3, which was released last week:



 
  Disable the vacuum_cleanup_index_scale_factor
  parameter and storage option (Peter Geoghegan)

  The notion of tracking stale index statistics proved to interact
  badly with the autovacuum_vacuum_insert_threshold parameter,
  resulting in unnecessary full-index scans and consequent degradation
  of autovacuum performance.  The latter mechanism seems superior, so
  remove the stale-statistics logic.  The control parameter for that,
  vacuum_cleanup_index_scale_factor, will be removed entirely in v14.
  In v13, it remains present to avoid breaking existing configuration
  files, but it no longer does anything.
 


Therefore, it didn't show up in my src/tools/git_changelog output, and I
did not include it.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: postgres_fdw - should we tighten up batch_size, fetch_size options against non-numeric values?

2021-05-18 Thread Bharath Rupireddy
On Tue, May 18, 2021 at 7:19 PM Tom Lane  wrote:
>
> Fujii Masao  writes:
> > On 2021/05/17 18:58, Bharath Rupireddy wrote:
> >> It looks like the values such as '123.456', '789.123' '100$%$#$#',
> >> '9,223,372,' are accepted and treated as valid integers for
> >> postgres_fdw options batch_size and fetch_size. Whereas this is not
> >> the case with fdw_startup_cost and fdw_tuple_cost options for which an
> >> error is thrown. Attaching a patch to fix that.
>
> > This looks an improvement. But one issue is that the restore of
> > dump file taken by pg_dump from v13 may fail for v14 with this patch
> > if it contains invalid setting of fetch_size, e.g., "fetch_size '123.456'".
> > OTOH, since batch_size was added in v14, it has no such issue.
>
> Maybe better to just silently round to integer?  I think that's
> what we generally do with integer GUCs these days, eg
>
> regression=# set work_mem = 102.9;
> SET
> regression=# show work_mem;
>  work_mem
> --
>  103kB
> (1 row)

I think we can use parse_int to parse the fetch_size and batch_size as
integers, which also rounds off decimals to integers and returns false
for non-numeric junk. But, it accepts both quoted and unquoted
integers, something like batch_size 100 and batch_size '100', which I
feel is okay because the reloptions also accept both.

While on this, we can also use parse_real for fdw_startup_cost and
fdw_tuple_cost options but with that they will accept both quoted and
unquoted real values.

Thoughts?

> I agree with throwing an error for non-numeric junk though.
> Allowing that on the grounds of backwards compatibility
> seems like too much of a stretch.

+1.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Teaching users how they can get the most out of HOT in Postgres 14

2021-05-18 Thread Masahiko Sawada
(I had missed this discussion due to the mismatched thread subject..)

On Fri, May 14, 2021 at 11:14 AM Michael Paquier  wrote:
>
> On Thu, May 13, 2021 at 01:27:44PM -0700, Peter Geoghegan wrote:
> > Almost all of the benefit of the optimization is available with the
> > current BYPASS_THRESHOLD_PAGES threshold (2% of heap pages have
> > LP_DEAD items), which has less risk than a higher threshold. I don't
> > think it matters much if we have the occasional unnecessary round of
> > index vacuuming on account of not applying the optimization. The truly
> > important benefit of the optimization is to not do unnecessary index
> > vacuuming all the time in the extreme cases where it's really hard to
> > justify. There is currently zero evidence that anything higher than 2%
> > will ever help anybody to an appreciably degree. (There is also no
> > evidence that the optimization will ever need to be disabled, but I
> > accept the need to be conservative and offer an off switch -- the
> > precautionary principle applies when talking about new harms.)
> >
> > Not having to scan every index on every VACUUM, but only every 5th or
> > so VACUUM is a huge improvement. But going from every 5th VACUUM to
> > every 10th VACUUM? That's at best a tiny additional improvement in
> > exchange for what I'd guess is a roughly linear increase in risk
> > (maybe a greater-than-linear increase, even). That's an awful deal.
>
> Perhaps that's an awful deal, but based on which facts can you really
> say that this new behavior of needing at least 2% of relation pages
> with some dead items to clean up indexes is not a worse deal in some
> cases?  This may cause more problems for the in-core index AMs, as
> much as it could impact any out-of-core index AM, no?  What about
> other values like 1%, or even 5%?  My guess is that there would be an
> ask to have more control on that, though that stands as my opinion.

I'm concerned about how users could tune a scale-type parameter that is
configurable between 0.0 and 0.05. I think that users basically don't
pay attention to how many blocks are touched by UPDATE/DELETE. Unlike
the old vacuum_cleanup_index_scale_factor, increasing this parameter
would directly increase index bloat. If the user can accept more index
bloat to speed up (auto)vacuum, they can use vacuum_index_cleanup instead.

I prefer to have an on/off switch just in case. I remember I also
commented the same thing before. We’ve discussed a way to control
whether or not to enable the skipping optimization by adding a new
mode to INDEX_CLEANUP option, as Peter mentioned. For example, we can
use the new mode “auto” (or “smart”) mode by default, enabling all
skipping optimizations, and specifying “on” disables them. Or we can
add “force” mode to disable all skipping optimizations while leaving
the existing modes as they are. Anyway, I think it’s not a good idea
to add a new GUC parameter that we’re not sure how to tune.

IIUC skipping index vacuuming when less than 2% of relation pages have
at least one LP_DEAD item is a table-level optimization rather than a
btree index optimization. Since we're not likely to set many pages
all-visible or collect many dead tuples in that case, we can skip both
index vacuuming and table vacuuming. Fortunately, this case goes
well together with btree indexes' bottom-up deletion. If this skipping
behavior badly affects other index AMs, the optimization should be
considered within btree indexes, although we would then need a way for
index AMs to weigh in on the vacuum strategy. But I guess this works
well in most cases, so having an on/off switch should suffice.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: PG 14 release notes, first draft

2021-05-18 Thread Masahiko Sawada
On Tue, May 18, 2021 at 11:07 PM Bruce Momjian  wrote:
>
> On Tue, May 18, 2021 at 06:28:49PM +0900, Masahiko Sawada wrote:
> > On Mon, May 10, 2021 at 3:03 PM Bruce Momjian  wrote:
> > >
> > > I have committed the first draft of the PG 14 release notes.  You can
> > > see the most current  build of them here:
> > >
> > > https://momjian.us/pgsql_docs/release-14.html
> > >
> >
> > I think we need to mention in the release note that
> > vacuum_cleanup_index_scale_factor GUC parameter has been removed and
> > vacuum_cleanup_index_scale_factor storage parameter has been
> > deprecated (please refer to commit 9f3665fb and effdd3f3b63).
>
> Looking at the full commit message:
>
> commit 9f3665fbfc
> Author: Peter Geoghegan 
> Date:   Wed Mar 10 16:27:01 2021 -0800
>
> Don't consider newly inserted tuples in nbtree VACUUM.
>
> Remove the entire idea of "stale stats" within nbtree VACUUM (stop
> caring about stats involving the number of inserted tuples).  Also
> remove the vacuum_cleanup_index_scale_factor GUC/param on the 
> master
> branch (though just disable them on postgres 13).
>
> The vacuum_cleanup_index_scale_factor/stats interface made the 
> nbtree AM
> partially responsible for deciding when pg_class.reltuples stats 
> needed
> to be updated.  This seems contrary to the spirit of the index AM 
> API,
> though -- it is not actually necessary for an index AM's bulk 
> delete and
> cleanup callbacks to provide accurate stats when it happens to be
> inconvenient.  The core code owns that.  (Index AMs have the 
> authority
> to perform or not perform certain kinds of deferred cleanup based 
> on
> their own considerations, such as page deletion and recycling, 
> but that
> has little to do with pg_class.reltuples/num_index_tuples.)
>
> This issue was fairly harmless until the introduction of the
> autovacuum_vacuum_insert_threshold feature by commit b07642db, 
> which had
> an undesirable interaction with the 
> vacuum_cleanup_index_scale_factor
> mechanism: it made insert-driven autovacuums perform full index 
> scans,
> even though there is no real benefit to doing so.  This has been 
> tied to
> a regression with an append-only insert benchmark [1].
>
> Also have remaining cases that perform a full scan of an index 
> during a
> cleanup-only nbtree VACUUM indicate that the final tuple count is 
> only
> an estimate.  This prevents vacuumlazy.c from setting the index's
> pg_class.reltuples in those cases (it will now only update 
> pg_class when
> vacuumlazy.c had TIDs for nbtree to bulk delete).  This arguably 
> fixes
> an oversight in deduplication-related bugfix commit 48e12913.
>
> [1] 
> https://smalldatum.blogspot.com/2021/01/insert-benchmark-postgres-is-still.html
>
> Author: Peter Geoghegan 
> Reviewed-By: Masahiko Sawada 
> Discussion: 
> https://postgr.es/m/cad21aoa4whthn5uu6+wscz7+j_rcejmcuh94qcoupub42sh...@mail.gmail.com
> --> Backpatch: 13-, where autovacuum_vacuum_insert_threshold was 
> added.
>
> This was backpatched into PG 13.3, which was released last week:
>
> 
> 
>  
>   Disable the vacuum_cleanup_index_scale_factor
>   parameter and storage option (Peter Geoghegan)
>  
>
>  
>   The notion of tracking stale index statistics 
> proved
>   to interact badly with
>   the autovacuum_vacuum_insert_threshold 
> parameter,
>   resulting in unnecessary full-index scans and consequent 
> degradation
>   of autovacuum performance.  The latter mechanism seems 
> superior, so
>   remove the stale-statistics logic.  The control parameter for 
> that,
>   vacuum_cleanup_index_scale_factor, will be
>   removed entirely in v14.  In v13, it remains present to avoid
>   breaking existing configuration files, but it no longer does
>   anything.
>  
> 
>
> Therefore, it didn't show up in my src/tools/git_changelog output, and I
> did not include it.
>

Thanks for your explanation. I understood and agreed not to include it
in PG14 release note.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: postgres_fdw - should we tighten up batch_size, fetch_size options against non-numeric values?

2021-05-18 Thread Bharath Rupireddy
On Tue, May 18, 2021 at 7:46 PM Bharath Rupireddy
 wrote:
>
> On Tue, May 18, 2021 at 7:19 PM Tom Lane  wrote:
> >
> > Fujii Masao  writes:
> > > On 2021/05/17 18:58, Bharath Rupireddy wrote:
> > >> It looks like the values such as '123.456', '789.123' '100$%$#$#',
> > >> '9,223,372,' are accepted and treated as valid integers for
> > >> postgres_fdw options batch_size and fetch_size. Whereas this is not
> > >> the case with fdw_startup_cost and fdw_tuple_cost options for which an
> > >> error is thrown. Attaching a patch to fix that.
> >
> > > This looks an improvement. But one issue is that the restore of
> > > dump file taken by pg_dump from v13 may fail for v14 with this patch
> > > if it contains invalid setting of fetch_size, e.g., "fetch_size 
> > > '123.456'".
> > > OTOH, since batch_size was added in v14, it has no such issue.
> >
> > Maybe better to just silently round to integer?  I think that's
> > what we generally do with integer GUCs these days, eg
> >
> > regression=# set work_mem = 102.9;
> > SET
> > regression=# show work_mem;
> >  work_mem
> > --
> >  103kB
> > (1 row)
>
> I think we can use parse_int to parse the fetch_size and batch_size as
> integers, which also rounds off decimals to integers and returns false
> for non-numeric junk. But, it accepts both quoted and unquoted
> integers, something like batch_size 100 and batch_size '100', which I
> feel is okay because the reloptions also accept both.
>
> While on this, we can also use parse_real for fdw_startup_cost and
> fdw_tuple_cost options but with that they will accept both quoted and
> unquoted real values.

I'm sorry, I said earlier that unquoted integers are accepted for
batch_size and fetch_size, but the parser actually throws a syntax
error for them.

So, we can safely use parse_int for batch_size and fetch_size,
parse_real for fdw_tuple_cost and fdw_startup_cost without changing
any behaviour.
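
For illustration, the validator check could end up looking roughly like
the sketch below (just a sketch, assuming option.c's usual includes --
defGetString() from commands/defrem.h and parse_int() from utils/guc.h;
the helper name and error wording are placeholders, not the final patch):

/*
 * Hypothetical helper for postgres_fdw's option validator: accept only
 * values that parse_int() understands, so junk like '100$%$#$#' is
 * rejected while '123.456' is rounded the same way integer GUCs are.
 */
static int
get_integer_option_value(DefElem *def)
{
    int         intval;

    if (!parse_int(defGetString(def), &intval, 0, NULL))
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("invalid integer value for option \"%s\": \"%s\"",
                        def->defname, defGetString(def))));
    return intval;
}

fetch_size and batch_size would go through something like this, and an
equivalent parse_real()-based check would cover fdw_startup_cost and
fdw_tuple_cost.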

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: PG 14 release notes, first draft

2021-05-18 Thread Justin Pryzby
On Tue, May 18, 2021 at 10:07:25AM -0400, Bruce Momjian wrote:
> On Tue, May 18, 2021 at 06:28:49PM +0900, Masahiko Sawada wrote:
> > On Mon, May 10, 2021 at 3:03 PM Bruce Momjian  wrote:
> > >
> > > I have committed the first draft of the PG 14 release notes.  You can
> > > see the most current  build of them here:
> > >
> > > https://momjian.us/pgsql_docs/release-14.html
> > 
> > I think we need to mention in the release note that
> > vacuum_cleanup_index_scale_factor GUC parameter has been removed and
> > vacuum_cleanup_index_scale_factor storage parameter has been
> > deprecated (please refer to commit 9f3665fb and effdd3f3b63).
> 
> Looking at the full commit message:
> 
>   commit 9f3665fbfc
>   Author: Peter Geoghegan 
>   Date:   Wed Mar 10 16:27:01 2021 -0800
>   
>   Don't consider newly inserted tuples in nbtree VACUUM.
>   
>   Remove the entire idea of "stale stats" within nbtree VACUUM (stop
>   caring about stats involving the number of inserted tuples).  Also
>   remove the vacuum_cleanup_index_scale_factor GUC/param on the master
>   branch (though just disable them on postgres 13).

> This was backpatched into PG 13.3, which was released last week:

> remove the stale-statistics logic.  The control parameter for 
> that,
> vacuum_cleanup_index_scale_factor, will be
> removed entirely in v14.  In v13, it remains present to avoid
> breaking existing configuration files, but it no longer does
> anything.
>
>   
> 
> Therefore, it didn't show up in my src/tools/git_changelog output, and I
> did not include it.

Normally, stuff that was backpatched isn't included in major release notes,
since the change would/could normally happen during a minor -> minor+1 release.

As things stand, in this case I think it *should* be included, since the
backpatched change isn't the same as the change to HEAD (removing the GUC).
The git_changelog output might well be wrong in this case (or, arguably, the
"remove the GUC entirely" should've been a separate master-only commit than the
"make the GUC do nothing" commit).

However, Peter indicated an intent to add a reloption to disable the vacuum
optimization, so maybe the removal of the GUC could be documented at that time.

-- 
Justin




Re: PG 14 release notes, first draft

2021-05-18 Thread Peter Geoghegan
On Tue, May 18, 2021 at 7:44 AM Justin Pryzby  wrote:
> As things stand, in this case I think it *should* be included, since the
> backpatched change isn't the same as the change to HEAD (removing the GUC).
> The git_changelog output might well be wrong in this case (or, arguably, the
> "remove the GUC entirely" should've been a separate master-only commit than 
> the
> "make the GUC do nothing" commit).

I suppose that's true -- maybe it should be listed separately, because
the GUC is removed in 14 only.

> However, Peter indicated an intent to add a reloption to disable the vacuum
> optimization, so maybe the removal of the GUC could be documented at that 
> time.

This is unrelated to the bypass indexes in VACUUM thing.

-- 
Peter Geoghegan




Re: allow specifying direct role membership in pg_hba.conf

2021-05-18 Thread Andrew Dunstan


On 5/18/21 8:05 AM, Chapman Flack wrote:
> On 05/18/21 04:54, Magnus Hagander wrote:
>
>> I mean, if you have
>> hostssl somedatabase someuser 10.0.0.0/24 gss
>> hostssl somedatabase superuser 10.0.0.0/24 gss tls_min_version=1.3
>>
>> One would reasonably expect that "someuser" can connect with whatever
>> the default version is for tls_min_version, whereas "superuser" would
>> require a minimum of 1.3. But that's *not* what would happen --
>> superuser would also be allowed to connect with a lower version if
>> that's allowed in the global set.
> Negatory. "superuser" would be allowed to send a StartupMessage
> containing the strings "somedatabase" and "superuser" (and possibly
> some settings of options) over a lower version if that's allowed
> in the global set ... and would then have the connection rejected
> because the negotiated protocol was lower than 1.3, without seeing
> any authentication message or having a chance to send any sensitive
> authentication credentials.
>
> So the risk of any information exposure over a too-low TLS version
> is limited to the name of a database, the name of a user, and possibly
> the settings of some options, and no sensitive authentication data.
>


We are way off $subject. If we want to continue this discussion please
use an appropriate subject.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: PG 14 release notes, first draft

2021-05-18 Thread Bruce Momjian
On Tue, May 18, 2021 at 09:44:09AM -0500, Justin Pryzby wrote:
> As things stand, in this case I think it *should* be included, since the
> backpatched change isn't the same as the change to HEAD (removing the GUC).
> The git_changelog output might well be wrong in this case (or, arguably, the
> "remove the GUC entirely" should've been a separate master-only commit than 
> the
> "make the GUC do nothing" commit).

I think having the same commit message for different patches to
different branches is an unwise behavior, particularly if the commit is
release-note worthy.  (I think it is fine if the patch is purely
mechanical and hence not release-note worthy.)  The master patch is hash
9f3665fbfc and the PG 13 patch is hash 9663d12446.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: PG 14 release notes, first draft

2021-05-18 Thread Bruce Momjian
On Tue, May 18, 2021 at 07:51:00AM -0700, Peter Geoghegan wrote:
> On Tue, May 18, 2021 at 7:44 AM Justin Pryzby  wrote:
> > As things stand, in this case I think it *should* be included, since the
> > backpatched change isn't the same as the change to HEAD (removing the GUC).
> > The git_changelog output might well be wrong in this case (or, arguably, the
> > "remove the GUC entirely" should've been a separate master-only commit than 
> > the
> > "make the GUC do nothing" commit).
> 
> I suppose that's true -- maybe it should be listed separately, because
> the GUC is removed in 14 only.

OK, this is a mess then.  Would someone please give me the full text for
this, including the commit hash(es)?  Is the PG 13.3 release note text
accurate?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Multiple pg_waldump --rmgr options

2021-05-18 Thread Julien Rouhaud
On Tue, May 18, 2021 at 04:50:31PM +0300, Heikki Linnakangas wrote:
> I wanted to dump all heap WAL records with pg_waldump, so I did this:
> 
> > $ pg_waldump --rmgr=heap --rmgr=heap2 data/pg_wal/00010001 
> > --stat=record
> > Type   N  (%)  Record 
> > size  (%) FPI size  (%)Combined size  (%)
> >    -  ---  
> > ---  ---   ---- 
> >  ---
> > Heap2/PRUNE  268 (  8.74)
> > 18192 (  2.73)0 (  0.00)18192 (  1.74)
> > Heap2/VACUUM  55 (  1.79) 
> > 4940 (  0.74)0 (  0.00) 4940 (  0.47)
> > Heap2/FREEZE_PAGE277 (  9.03)   
> > 186868 ( 28.03)0 (  0.00)   186868 ( 17.86)
> > Heap2/VISIBLE467 ( 15.23)
> > 27783 (  4.17)   376832 ( 99.34)   404615 ( 38.68)
> > Heap2/MULTI_INSERT  1944 ( 63.38)   
> > 354800 ( 53.21) 2520 (  0.66)   357320 ( 34.16)
> > Heap2/MULTI_INSERT+INIT   56 (  1.83)
> > 74152 ( 11.12)0 (  0.00)74152 (  7.09)
> >   
> >     
> > Total   3067
> > 666735 [63.74%]   379352 [36.26%]  1046087 [100%]
> > pg_waldump: fatal: error in WAL record at 0/1680118: invalid record length 
> > at 0/1680150: wanted 24, got 0
> 
> That didn't do what I wanted. It only printed the Heap2 records, not Heap,
> even though I specified both. The reason is that if you specify multiple
> --rmgr options, only the last one takes effect.
> 
> I propose the attached to allow selecting multiple rmgrs, by giving multiple
> --rmgr options. With that, it works the way I expected:

The change and the patch look sensible to me.
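
The core of the idea is simply to track the selection per resource
manager instead of keeping a single value that each --rmgr option
overwrites. A rough sketch of that shape (RmgrFilter and the function
names below are made up for illustration; RM_MAX_ID comes from
access/rmgr.h, and the attached patch may structure this differently):

typedef struct RmgrFilter
{
    bool        selected[RM_MAX_ID + 1];    /* one flag per resource manager */
    bool        enabled;                    /* was any --rmgr option given? */
} RmgrFilter;

/* each --rmgr option just switches on its entry ... */
static void
rmgr_filter_add(RmgrFilter *filter, int rmid)
{
    filter->selected[rmid] = true;
    filter->enabled = true;
}

/* ... and the per-record filter check becomes: */
static bool
rmgr_filter_matches(const RmgrFilter *filter, uint8 rmid)
{
    if (!filter->enabled)
        return true;            /* no --rmgr given: show everything */
    return filter->selected[rmid];
}

With that, any number of --rmgr options simply accumulate, and omitting
the option still selects everything.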




Re: PG 14 release notes, first draft

2021-05-18 Thread Peter Geoghegan
On Tue, May 18, 2021 at 8:09 AM Bruce Momjian  wrote:
> > I suppose that's true -- maybe it should be listed separately, because
> > the GUC is removed in 14 only.
>
> OK, this is a mess then.  Would someone please give me the full text for
> this, including the commit hash(es)?  Is the PG 13.3 release note text
> accurate?

The 13.3 release notes say this:

"""
Disable the vacuum_cleanup_index_scale_factor parameter and storage
option (Peter Geoghegan)

The notion of tracking “stale” index statistics proved to interact
badly with the autovacuum_vacuum_insert_threshold parameter, resulting
in unnecessary full-index scans and consequent degradation of
autovacuum performance. The latter mechanism seems superior, so remove
the stale-statistics logic. The control parameter for that,
vacuum_cleanup_index_scale_factor, will be removed entirely in v14. In
v13, it remains present to avoid breaking existing configuration
files, but it no longer does anything.
"""

I think that this is slightly inaccurate, though that's probably of
little consequence. The vacuum_cleanup_index_scale_factor GUC was in
fact removed in v14, but the reloption was ultimately not removed from
HEAD/v14 for compatibility reasons. This is not apparent just from
commit 9f3665fb -- there was a clean-up commit (commit effdd3f3) that
added the reloption back following further discussion. So the
equivalent reloption remains in place though is disabled, just for
compatibility purposes -- on v13 and v14. The GUC is where v13 and v14
differ -- only v13 still has the GUC to avoid breaking users'
postgresql.conf files (though it's also delisted). Deprecating a
reloption is much harder than deprecating a GUC.

In my opinion this should be interpreted as already handled by the
backpatch to 13.3, and so not necessary to handle again now -- despite
the GUC thing. It's possible that no users set the GUC at all, because
it wasn't particularly well thought out. This entire situation is
probably unprecedented (we just don't deprecate reloptions very
often), so I defer to your judgement, Bruce.

-- 
Peter Geoghegan




Re: PG 14 release notes, first draft

2021-05-18 Thread Bruce Momjian
On Tue, May 18, 2021 at 08:34:55AM -0700, Peter Geoghegan wrote:
> In my opinion this should be interpreted as already handled by the
> backpatch to 13.3, and so not necessary to handle again now -- despite
> the GUC thing. It's possible that no users set the GUC at all, because
> it wasn't particularly well thought out. This entire situation is
> probably unprecedented (we just don't deprecate reloptions very
> often), so I defer to your judgement, Bruce.

I am thinking the presence of vacuum_cleanup_index_scale_factor in a PG 14
postgresql.conf will throw an error, unlike 13.x, so I do think we need
to mention this so people will know to remove it from their
postgresql.conf before upgrades, right?  I don't think the PG 13.3
release note mention really makes it clear it has to be removed.  In a
dump/restore, do we retain the reloption
vacuum_cleanup_index_scale_factor and just ignore it, or drop it on
restore?  I am hoping it is the latter.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Added missing tab completion for alter subscription set option

2021-05-18 Thread Alvaro Herrera
On 2021-May-14, vignesh C wrote:

> While I was reviewing one of the logical decoding features, I found
> Streaming and binary options were missing in tab completion for the
> alter subscription set option, the attached patch has the changes for
> the same.
> Thoughts?

I wish we didn't have to keep knowledge in the psql source on which
option names are to be used for each command.  If we had some function
 SELECT pg_completion_options('alter subscription set');
that returned the list of options usable for each command, we wouldn't
have to ... psql would just retrieve the list of options for the current
command.

Maintaining such a list does not seem hard -- for example we could just
have a function alongside parse_subscription_option() that returns the
names that are recognized by that one.  If we drive the implementation
of both off a single struct, it would never be outdated.
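
Purely as a sketch of what "driving both off a single struct" could look
like (all names below are hypothetical, not existing code):

typedef struct SubscriptionOptEntry
{
    const char *name;           /* option keyword, e.g. "binary" */
    bool        alterable;      /* accepted by ALTER SUBSCRIPTION ... SET? */
} SubscriptionOptEntry;

static const SubscriptionOptEntry subscription_opt_table[] =
{
    {"connect", false},
    {"slot_name", true},
    {"synchronous_commit", true},
    {"binary", true},
    {"streaming", true},
    {NULL, false}
};

parse_subscription_option() would look options up in this table, and a
SQL-callable pg_completion_options('alter subscription set') could walk
the same table and return the entries with alterable = true, so psql's
tab completion would never go stale.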

-- 
Álvaro Herrera39°49'30"S 73°17'W
"[PostgreSQL] is a great group; in my opinion it is THE best open source
development communities in existence anywhere."(Lamar Owen)




Re: PG 14 release notes, first draft

2021-05-18 Thread Peter Geoghegan
On Tue, May 18, 2021 at 8:48 AM Bruce Momjian  wrote:
> I am thinking the presence of vacuum_cleanup_index_scale_factor in a PG 14
> postgresql.conf will throw an error, unlike 13.x, so I do think we need
> to mention this so people will know to remove it from their
> postgresql.conf before upgrades, right?  I don't think the PG 13.3
> release note mention really makes it clear it has to be removed.  In a
> dump/restore, do we retain the reloption
> vacuum_cleanup_index_scale_factor and just ignore it, or drop it on
> restore?  I am hoping it is the latter.

There is no dump/restore hazard on upgrade to 14, since the
vacuum_cleanup_index_scale_factor reloption remains in place (it's
just not in psql tab completion anymore, nor is it documented, etc).

It is possible (though I would certainly say unlikely) that the
vacuum_cleanup_index_scale_factor GUC will be in somebody's
postgresql.conf from an earlier version, and won't be recognized on
upgrade to v14. So maybe you need to say something about that
particular issue -- which could be framed as finishing off the process
started by the 13.3 commit. But that's it.

-- 
Peter Geoghegan




Re: PG 14 release notes, first draft

2021-05-18 Thread Bruce Momjian
On Tue, May 18, 2021 at 08:54:56AM -0700, Peter Geoghegan wrote:
> On Tue, May 18, 2021 at 8:48 AM Bruce Momjian  wrote:
> > I am thinking the presence of vacuum_cleanup_index_scale_factor in a PG 14
> > postgresql.conf will throw an error, unlike 13.x, so I do think we need
> > to mention this so people will know to remove it from their
> > postgresql.conf before upgrades, right?  I don't think the PG 13.3
> > release note mention really makes it clear it has to be removed.  In a
> > dump/restore, do we retain the reloption
> > vacuum_cleanup_index_scale_factor and just ignore it, or drop it on
> > restore?  I am hoping it is the latter.
> 
> There is no dump/restore hazard on upgrade to 14, since the
> vacuum_cleanup_index_scale_factor reloption remains in place (it's
> just not in psql tab completion anymore, nor is it documented, etc).

So it is the former behavior --- "we retain the reloption
vacuum_cleanup_index_scale_factor and just ignore it"?

> It is possible (though I would certainly say unlikely) that the
> vacuum_cleanup_index_scale_factor GUC will be in somebody's
> postgresql.conf from an earlier version, and won't be recognized on
> upgrade to v14. So maybe you need to say something about that
> particular issue -- which could be framed as finishing off the process
> started by the 13.3 commit. But that's it.

Yes.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: PG 14 release notes, first draft

2021-05-18 Thread Peter Geoghegan
We do retain the reloption, but ignore it. Purely to avoid the dump and
reload hazard. I think that you could reasonably tell users that it's gone
completely, because it does nothing on either 13 or 14. It's hidden from
them to the extent that that's possible.

Peter Geoghegan
(Sent from my phone)


Re: seawasp failing, maybe in glibc allocator

2021-05-18 Thread Fabien COELHO



The issue is non-deterministically triggered in contrib checks, either in
int or ltree, but not elsewhere. This suggests issues specific to these
modules, or triggered by these modules. Hmmm…


Hmm, yeah.  A couple of different ways that ltreetest fails without crashing:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=seawasp&dt=2021-05-13%2001%3A17%3A15
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=seawasp&dt=2021-05-12%2017%3A17%3A15

Otherwise it's mostly free() blowing up, and one case of an assertion
failure in llvm::StringMapImpl::RemoveKey, I guess before free() is
reached:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=seawasp&dt=2021-05-14%2009%3A17%3A15


It seems that the upload of the valgrind run (many hours…) failed on "413
request entity too large", and everything seems to have been cleaned up
despite the "--keepall" option I think I passed when I started the run.


Not sure about the best way to proceed.

--
Fabien.

Re: Removed extra memory allocations from create_list_bounds

2021-05-18 Thread Nitin Jadhav
> The CFBOT had no issues with the patches, so I suspect an issue on your side.
> http://cfbot.cputube.org/nitin-jadhav.html

I am getting the following error when I try to apply in my machine.

$ git apply 
../patches/0001-Removed-extra-memory-allocations-from-create_list_bo.patch
../patches/0001-Removed-extra-memory-allocations-from-create_list_bo.patch:18:
trailing whitespace.
/*
../patches/0001-Removed-extra-memory-allocations-from-create_list_bo.patch:19:
trailing whitespace.
 * get_non_null_count_list_bounds
../patches/0001-Removed-extra-memory-allocations-from-create_list_bo.patch:20:
trailing whitespace.
 * Calculates the total number of non null bound values of
../patches/0001-Removed-extra-memory-allocations-from-create_list_bo.patch:21:
trailing whitespace.
 * all the partitions.
../patches/0001-Removed-extra-memory-allocations-from-create_list_bo.patch:22:
trailing whitespace.
 */
error: patch failed: src/backend/partitioning/partbounds.c:432
error: src/backend/partitioning/partbounds.c: patch does not apply

However I was able to apply it by adding '--reject --whitespace=fix'.

$ git apply --reject --whitespace=fix
../patches/0001-Removed-extra-memory-allocations-from-create_list_bo.patch
../patches/0001-Removed-extra-memory-allocations-from-create_list_bo.patch:18:
trailing whitespace.
/*
../patches/0001-Removed-extra-memory-allocations-from-create_list_bo.patch:19:
trailing whitespace.
 * get_non_null_count_list_bounds
../patches/0001-Removed-extra-memory-allocations-from-create_list_bo.patch:20:
trailing whitespace.
 * Calculates the total number of non null bound values of
../patches/0001-Removed-extra-memory-allocations-from-create_list_bo.patch:21:
trailing whitespace.
 * all the partitions.
../patches/0001-Removed-extra-memory-allocations-from-create_list_bo.patch:22:
trailing whitespace.
 */
Checking patch src/backend/partitioning/partbounds.c...
Applied patch src/backend/partitioning/partbounds.c cleanly.
warning: squelched 30 whitespace errors
warning: 35 lines add whitespace errors.

I have rebased all the patches on top of
'v2_0001-removed_extra_mem_alloc_from_create_list_bounds.patch'.
Attaching all the patches here.

--
Thanks & Regards,
Nitin Jadhav

On Mon, May 17, 2021 at 8:33 PM Justin Pryzby  wrote:
>
> On Mon, May 17, 2021 at 08:22:25PM +0530, Nitin Jadhav wrote:
> > I agree and thanks for creating those patches. I am not able to apply
> > the patch on the latest HEAD. Kindly check and upload the modified
> > patches.
>
> The CFBOT had no issues with the patches, so I suspect an issue on your side.
> http://cfbot.cputube.org/nitin-jadhav.html
>
> --
> Justin


v2_0001-removed_extra_mem_alloc_from_create_list_bounds.patch
Description: Binary data


v2_0003-Same-for-create_hash_bounds-PartitionHashBound.patch
Description: Binary data


v2_0004-Allocate-datum-arrays-in-bulk-to-avoid-palloc-overhead.patch
Description: Binary data


v2_0005-create_range_bounds-pfree-intermediate-results.patch
Description: Binary data


v2_0002-allocate-the-PartitionListValue-as-a-single-chunk.patch
Description: Binary data


Re: Removed extra memory allocations from create_list_bounds

2021-05-18 Thread Robert Haas
On Tue, May 18, 2021 at 1:29 PM Nitin Jadhav
 wrote:
> > The CFBOT had no issues with the patches, so I suspect an issue on your 
> > side.
> > http://cfbot.cputube.org/nitin-jadhav.html
>
> I am getting the following error when I try to apply in my machine.
>
> $ git apply 
> ../patches/0001-Removed-extra-memory-allocations-from-create_list_bo.patch
> ../patches/0001-Removed-extra-memory-allocations-from-create_list_bo.patch:18:
> trailing whitespace.

'git apply' is very picky. Use 'patch -p1' to apply your patches instead.

Also, use 'git diff --check' or 'git log --check' before generating
patches to send, and fix any whitespace errors before submitting.

I see that you have made a theoretical argument for why this should be
good for performance, but it would be better to have some test results
that show that it works out in practice. Sometimes things seem like
they ought to be more efficient but turn out to be less efficient when
they are actually tried.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Performance degradation of REFRESH MATERIALIZED VIEW

2021-05-18 Thread Andres Freund
Hi,

On 2021-05-18 11:20:07 +0900, Masahiko Sawada wrote:
> Yes. It depends on how much the matview refresh gets slower but I
> think the problem here is that users always are forced to pay the cost
> for freezing tuple during refreshing the matview. There is no way to
> disable it unlike FREEZE option of COPY command.
> 
> I’ve done benchmarks for matview refresh on my machine (FreeBSD 12.1,
> AMD Ryzen 5 PRO 3400GE, 24GB RAM) with four codes: HEAD, HEAD +
> Andres’s patch, one before 39b66a91b, and HEAD without
> TABLE_INSERT_FROZEN.
> 
> The workload is to refresh the matview that simply selects 50M tuples
> (about 1.7 GB). Here are the average execution times of three trials
> for each code:
> 
> 1) head: 42.263 sec
> 2) head w/ Andres’s patch: 40.194 sec
> 3) before 39b66a91b commit: 38.143 sec
> 4) head w/o freezing tuples: 32.413 sec

I don't see such a big difference between andres-freeze/non-freeze. Is
there any chance there's some noise in there? I found that I need to
disable autovacuum and ensure that there's a checkpoint just before the
REFRESH to get halfway meaningful numbers, as well as a min/max_wal_size
ensuring that only recycled WAL is used.


> I also observed 5% degradation by comparing 1 and 2 but am not sure
> where the overhead came from. I agree with Andres’s proposal. It’s a
> straightforward approach.

What degradation are you referencing here?


I compared your case 2 with 4 - as far as I can see the remaining
performance difference is from the difference in WAL records
emitted:

freeze-andres:

Type   N  (%)  Record size  
(%) FPI size  (%)Combined size  (%)
   -  ---  ---  
---   ----  ---
XLOG/CHECKPOINT_ONLINE 1 (  0.00)  114 
(  0.00)0 (  0.00)  114 (  0.00)
Transaction/COMMIT 1 (  0.00)  949 
(  0.00)0 (  0.00)  949 (  0.00)
Storage/CREATE 1 (  0.00)   42 
(  0.00)0 (  0.00)   42 (  0.00)
Standby/LOCK   3 (  0.00)  138 
(  0.00)0 (  0.00)  138 (  0.00)
Standby/RUNNING_XACTS  2 (  0.00)  104 
(  0.00)0 (  0.00)  104 (  0.00)
Heap2/VISIBLE  44248 (  0.44)  2610642 
(  0.44)16384 ( 14.44)  2627026 (  0.44)
Heap2/MULTI_INSERT 5 (  0.00) 1125 
(  0.00) 6696 (  5.90) 7821 (  0.00)
Heap/INSERT  9955755 ( 99.12)587389836 
( 99.12) 5128 (  4.52)587394964 ( 99.10)
Heap/DELETE   13 (  0.00)  702 
(  0.00)0 (  0.00)  702 (  0.00)
Heap/UPDATE2 (  0.00)  202 
(  0.00)0 (  0.00)  202 (  0.00)
Heap/HOT_UPDATE1 (  0.00)   65 
(  0.00) 4372 (  3.85) 4437 (  0.00)
Heap/INSERT+INIT   44248 (  0.44)  2610632 
(  0.44)0 (  0.00)  2610632 (  0.44)
Btree/INSERT_LEAF 33 (  0.00) 2030 
(  0.00)80864 ( 71.28)82894 (  0.01)
    
  
Total   10044313 592616581 
[99.98%]   113444 [0.02%] 592730025 [100%]

nofreeze:

Type   N  (%)  Record size  
(%) FPI size  (%)Combined size  (%)
   -  ---  ---  
---   ----  ---
XLOG/NEXTOID   1 (  0.00)   30 
(  0.00)0 (  0.00)   30 (  0.00)
Transaction/COMMIT 1 (  0.00)  949 
(  0.00)0 (  0.00)  949 (  0.00)
Storage/CREATE 1 (  0.00)   42 
(  0.00)0 (  0.00)   42 (  0.00)
Standby/LOCK   3 (  0.00)  138 
(  0.00)0 (  0.00)  138 (  0.00)
Standby/RUNNING_X

Re: Performance degradation of REFRESH MATERIALIZED VIEW

2021-05-18 Thread Tomas Vondra

On 5/18/21 4:20 AM, Masahiko Sawada wrote:
> ...



I think the changes for heap_multi_insert() are fine so we can revert
only heap_insert() part if we revert something from the v14 tree,
although we will end up not inserting frozen tuples into toast tables.



I'd be somewhat unhappy about reverting just this bit, because it'd mean
that we freeze rows in the main table but not rows in the TOAST tables
(that was kinda why we concluded we need the heap_insert part too).

I'm still a bit puzzled about where the extra overhead (in cases when
freeze is not requested) comes from, TBH.


Which cases do you mean? Doesn't matview refresh always request to
freeze tuples even after applying the patch proposed on this thread?



Oh, I didn't realize that! That'd make this much less of an issue, I'd
say, because if we're intentionally freezing the rows it's reasonable to
pay a bit of time (in exchange for not having to do it later). The
original +25% was a bit too much, of course, but +5% seems reasonable.


Yes. It depends on how much the matview refresh gets slower but I
think the problem here is that users always are forced to pay the cost
for freezing tuple during refreshing the matview. There is no way to
disable it unlike FREEZE option of COPY command.



Yeah, I see your point. I agree it's unfortunate there's no way to 
disable freezing during REFRESH MV. For most users that trade-off is 
probably fine, but for some cases (matviews refreshed often, or cases 
where it's fine to pay more but later) it may be an issue.


From this POV, however, it may not be enough to optimize the current 
freezing code - it's always going to be a bit slower than before. So the 
only *real* solution may be adding a FREEZE option to the REFRESH 
MATERIALIZED VIEW command.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Performance degradation of REFRESH MATERIALIZED VIEW

2021-05-18 Thread Tomas Vondra




On 5/18/21 8:08 PM, Andres Freund wrote:

Hi,

On 2021-05-18 11:20:07 +0900, Masahiko Sawada wrote:

Yes. It depends on how much the matview refresh gets slower but I
think the problem here is that users always are forced to pay the cost
for freezing tuple during refreshing the matview. There is no way to
disable it unlike FREEZE option of COPY command.

I’ve done benchmarks for matview refresh on my machine (FreeBSD 12.1,
AMD Ryzen 5 PRO 3400GE, 24GB RAM) with four codes: HEAD, HEAD +
Andres’s patch, one before 39b66a91b, and HEAD without
TABLE_INSERT_FROZEN.

The workload is to refresh the matview that simply selects 50M tuples
(about 1.7 GB). Here are the average execution times of three trials
for each code:

1) head: 42.263 sec
2) head w/ Andres’s patch: 40.194 sec
3) before 39b66a91b commit: 38.143 sec
4) head w/o freezing tuples: 32.413 sec


I don't see such a big difference between andres-freeze/non-freeze. Is
there any chance there's some noise in there? I found that I need to
disable autovacuum and ensure that there's a checkpoint just before the
REFRESH to get halfway meaningful numbers, as well as a min/max_wal_size
ensuring that only recycled WAL is used.



I also observed 5% degradation by comparing 1 and 2 but am not sure
where the overhead came from. I agree with Andres’s proposal. It’s a
straightforward approach.


What degradation are you referencing here?


I compared your case 2 with 4 - as far as I can see the remaining
performance difference is from the difference in WAL records
emitted:

freeze-andres:

Type   N  (%)  Record size  
(%) FPI size  (%)Combined size  (%)
   -  ---  ---  
---   ----  ---
XLOG/CHECKPOINT_ONLINE 1 (  0.00)  114 
(  0.00)0 (  0.00)  114 (  0.00)
Transaction/COMMIT 1 (  0.00)  949 
(  0.00)0 (  0.00)  949 (  0.00)
Storage/CREATE 1 (  0.00)   42 
(  0.00)0 (  0.00)   42 (  0.00)
Standby/LOCK   3 (  0.00)  138 
(  0.00)0 (  0.00)  138 (  0.00)
Standby/RUNNING_XACTS  2 (  0.00)  104 
(  0.00)0 (  0.00)  104 (  0.00)
Heap2/VISIBLE  44248 (  0.44)  2610642 
(  0.44)16384 ( 14.44)  2627026 (  0.44)
Heap2/MULTI_INSERT 5 (  0.00) 1125 
(  0.00) 6696 (  5.90) 7821 (  0.00)
Heap/INSERT  9955755 ( 99.12)587389836 
( 99.12) 5128 (  4.52)587394964 ( 99.10)
Heap/DELETE   13 (  0.00)  702 
(  0.00)0 (  0.00)  702 (  0.00)
Heap/UPDATE2 (  0.00)  202 
(  0.00)0 (  0.00)  202 (  0.00)
Heap/HOT_UPDATE1 (  0.00)   65 
(  0.00) 4372 (  3.85) 4437 (  0.00)
Heap/INSERT+INIT   44248 (  0.44)  2610632 
(  0.44)0 (  0.00)  2610632 (  0.44)
Btree/INSERT_LEAF 33 (  0.00) 2030 
(  0.00)80864 ( 71.28)82894 (  0.01)
    
   
Total   10044313 592616581 
[99.98%]   113444 [0.02%] 592730025 [100%]

nofreeze:

Type   N  (%)  Record size  
(%) FPI size  (%)Combined size  (%)
   -  ---  ---  
---   ----  ---
XLOG/NEXTOID   1 (  0.00)   30 
(  0.00)0 (  0.00)   30 (  0.00)
Transaction/COMMIT 1 (  0.00)  949 
(  0.00)0 (  0.00)  949 (  0.00)
Storage/CREATE 1 (  0.00)   42 
(  0.00)0 (  0.00)   42 (  0.00)
Standby/LOCK   3 (  0.00)  138 
(  0.00)0 (  0.00)  138 (  0.00)
Standby/

Re: Performance degradation of REFRESH MATERIALIZED VIEW

2021-05-18 Thread Andres Freund
Hi,

On 2021-05-18 20:34:08 +0200, Tomas Vondra wrote:
> Yeah, I see your point. I agree it's unfortunate there's no way to disable
> freezing during REFRESH MV. For most users that trade-off is probably fine,
> but for some cases (matviews refreshed often, or cases where it's fine to
> pay more but later) it may be an issue.
>
> From this POV, however, it may not be enough to optimize the current
> freezing code - it's always going to be a bit slower than before.

But the intrinsic overhead is *tiny*. Setting a few bits, with the other
costs amortized over a lot of pages. As far as I can tell the measurable
overhead is from the increased WAL logging - which is not necessary.

Greetings,

Andres Freund




Re: Performance degradation of REFRESH MATERIALIZED VIEW

2021-05-18 Thread Andres Freund
Hi,

On 2021-05-18 20:43:41 +0200, Tomas Vondra wrote:
> Yeah, emitting WAL is not exactly cheap, although it's just a little bit
> more (0.44%). I haven't looked into the details, but I wonder why it has
> such disproportionate impact (although, the 32 vs. 40 sec may be off).

I couldn't reproduce this large a performance difference - I saw more
like 10% instead of 25%.


> > I dimly remember that we explicitly discussed that we do *not* want to
> > emit WAL records here?

> Ummm, in which thread?

https://postgr.es/m/20190408010427.4l63qr7h2fjcyp77%40alap3.anarazel.de

On 2019-04-07 18:04:27 -0700, Andres Freund wrote:
> This avoids an extra WAL record for setting empty pages to all visible,
> by adding XLH_INSERT_ALL_VISIBLE_SET & XLH_INSERT_ALL_FROZEN_SET, and
> setting those when appropriate in heap_multi_insert.  Unfortunately
> currently visibilitymap_set() doesn't really properly allow to do this,
> as it has embedded WAL logging for heap.
>
> I think we should remove the WAL logging from visibilitymap_set(), and
> move it to a separate, heap specific, function.

It'd probably be sufficient for the current purpose to change
visibilitymap_set()'s documentation to say that recptr can also be
passed in if the action is already covered by a WAL record, and that
it's the callers responsibility to think through the correctness
issues. Here it's easy, because any error will just throw the relation
away.

We do need to to include all-visible / FSM change in the WAL, so
crash-recovery / standbys end up with the same result as a primary
running normally. We already have the information, via
XLH_INSERT_ALL_FROZEN_SET. I think all we need to do is to add a
visibilitymap_set() in the redo routines if XLH_INSERT_ALL_FROZEN_SET.
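
Something along these lines in the insert/multi-insert redo routines
ought to be enough (untested sketch only; rnode, blkno, lsn and xlrec
stand in for whatever the surrounding redo function already has, and it
assumes the usual access/xlogutils.h and access/visibilitymap.h includes):

if (xlrec->flags & XLH_INSERT_ALL_FROZEN_SET)
{
    Relation    reln = CreateFakeRelcacheEntry(rnode);
    Buffer      vmbuffer = InvalidBuffer;

    /* mirror the existing ALL_VISIBLE_CLEARED handling, but set the bits */
    visibilitymap_pin(reln, blkno, &vmbuffer);
    visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
                      InvalidTransactionId,
                      VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
    ReleaseBuffer(vmbuffer);
    FreeFakeRelcacheEntry(reln);
}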

Greetings,

Andres Freund




Re: PG 14 release notes, first draft

2021-05-18 Thread Bruce Momjian
On Tue, May 18, 2021 at 09:01:44AM -0700, Peter Geoghegan wrote:
> We do retain the reloption, but ignore it. Purely to avoid the dump and reload
> hazard. I think that you could reasonably tell users that it's gone 
> completely,
> because it does nothing on either 13 or 14. It's hidden from them to the 
> extent
> that that's possible. 

I went with this release note text:





Remove server variable vacuum_cleanup_index_scale_factor (Peter 
Geoghegan)



This setting was disabled in PostgreSQL version 13.3.



-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Lowering the ever-growing heap->pd_lower

2021-05-18 Thread Matthias van de Meent
On Mon, 3 May 2021 at 16:39, Matthias van de Meent
 wrote:
> I am planning on fixing this patch sometime
> before the next commit fest so that we can truncate the LP array
> during hot pruning as well, instead of only doing so in the 2nd VACUUM
> pass.

PFA the updated version of this patch. Apart from adding line pointer
truncation in PageRepairFragmentation (as in the earlier patches), I
also altered PageTruncateLinePointerArray to clean up all trailing
line pointers, even if it was the last item on the page.

This means that for 32-bit systems, pages that have once had tuples
(but have been cleared since) can now be used again for
MaxHeapTupleSize insertions. Without this patch, an emptied page would
always have at least one line pointer left, which equates to
MaxHeapTupleSize actual free space, but PageGetFreeSpace always
subtracts sizeof(ItemIdData), leaving the perceived free space as
reported to the FSM less than MaxHeapTupleSize if the page has any
line pointers.

For 64-bit systems, this is not as much of a problem, because
MaxHeapTupleSize is 4 bytes smaller on those systems, which leaves us
with 1 line pointer as margin for the FSM to recognise the page as
free enough for one MaxHeapTupleSize item.

With regards,

Matthias van de Meent
From 96e518ea22886702fa58eacdbf1cb1f8eedf290e Mon Sep 17 00:00:00 2001
From: Matthias van de Meent 
Date: Tue, 18 May 2021 20:18:12 +0200
Subject: [PATCH v6] Improve usage of line pointer array truncation in heapam.

This will allow reuse of what is effectively free space for data as well as
new line pointers, instead of keeping it reserved for line pointers only.

It further improves efficiency of HeapAM's local updates by not requiring a
2nd run of VACUUM to come by before we can reuse line pointers left over from
temporarily long HOT chains.

Additionally, it extends the current application of line pointer truncation
to truncate down to 0 line pointers; this improves page applicability for
PageIsEmpty()-based optimizations and makes empty heap pages usable for
MaxHeapTupleSize-sized tuple insertions on 32-bit systems.
---
 src/backend/storage/page/bufpage.c | 33 +-
 1 file changed, 23 insertions(+), 10 deletions(-)

diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 82ca91f597..3c9fc2bf34 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -688,10 +688,13 @@ compactify_tuples(itemIdCompact itemidbase, int nitems, Page page, bool presorte
  *
  * This routine is usable for heap pages only, but see PageIndexMultiDelete.
  *
- * Never removes unused line pointers.  PageTruncateLinePointerArray can
- * safely remove some unused line pointers.  It ought to be safe for this
- * routine to free unused line pointers in roughly the same way, but it's not
- * clear that that would be beneficial.
+ * Also removes line pointers, similar to PageTruncateLinePointerArray.
+ * The main benefit for also truncating the line pointer array here is that
+ * now HOT workloads can also benefit from this optimization. The remaining
+ * line pointers from a temporary increase in HOT chain length on this page
+ * will now be cleaned up as well, even if no logical tuple is ever deleted
+ * from the page.  This allows for more extended use of local updates,
+ * without needing to wait for a second run of VACUUM:
  *
  * PageTruncateLinePointerArray is only called during VACUUM's second pass
  * over the heap.  Any unused line pointers that it sees are likely to have
@@ -718,6 +721,7 @@ PageRepairFragmentation(Page page)
 	int			nline,
 nstorage,
 nunused;
+	OffsetNumber lastUsed = InvalidOffsetNumber;
 	int			i;
 	Size		totallen;
 	bool		presorted = true;	/* For now */
@@ -751,6 +755,7 @@ PageRepairFragmentation(Page page)
 		lp = PageGetItemId(page, i);
 		if (ItemIdIsUsed(lp))
 		{
+			lastUsed = i;
 			if (ItemIdHasStorage(lp))
 			{
 itemidptr->offsetindex = i - 1;
@@ -798,6 +803,19 @@ PageRepairFragmentation(Page page)
 		compactify_tuples(itemidbase, nstorage, page, presorted);
 	}
 
+	/* The last line pointer is not the last used line pointer */
+	if (lastUsed != nline)
+	{
+		int nunusedend = nline - lastUsed;
+
+		Assert(nunused >= nunusedend && nunusedend > 0);
+
+		/* remove trailing unused line pointers from the count */
+		nunused -= nunusedend;
+		/* truncate the line pointer array */
+		((PageHeader) page)->pd_lower -= (sizeof(ItemIdData) * nunusedend);
+	}
+
 	/* Set hint bit for PageAddItemExtended */
 	if (nunused > 0)
 		PageSetHasFreeLinePointers(page);
@@ -815,11 +833,6 @@ PageRepairFragmentation(Page page)
  * pointer on the page (if VACUUM didn't have an LP_DEAD item on the page that
  * it just set to LP_UNUSED then it should not call here).
  *
- * We avoid truncating the line pointer array to 0 items, if necessary by
- * leaving behind a single remaining LP_UNUSED item.  This is a little
- * arbitrary, but it seems like a good idea to avoid leavi

Re: Lowering the ever-growing heap->pd_lower

2021-05-18 Thread Peter Geoghegan
On Tue, May 18, 2021 at 12:29 PM Matthias van de Meent
 wrote:
> PFA the updated version of this patch. Apart from adding line pointer
> truncation in PageRepairFragmentation (as in the earlier patches), I
> also altered PageTruncateLinePointerArray to clean up all trailing
> line pointers, even if it was the last item on the page.

Can you show a practical benefit to this patch, such as an improvement
in throughout or in efficiency for a given workload?

It was easy to see that having something was better than having
nothing at all. But things are of course different now that we have
PageTruncateLinePointerArray().

-- 
Peter Geoghegan




pgbench test failing on 14beta1 on Debian/i386

2021-05-18 Thread Christoph Berg
Hi,

I can reproducibly get build failures in pgbench on 32-bit i386
Debian, both on sid and buster. (The older Debian stretch and Ubuntu
bionic are unaffected. Other architectures are also fine.)

https://pgdgbuild.dus.dg-i.net/view/Binaries/job/postgresql-14-binaries/635/

https://pgdgbuild.dus.dg-i.net/view/Binaries/job/postgresql-14-binaries/635/architecture=i386,distribution=sid/consoleFull

17:39:41 make[2]: Entering directory '/<>/build/src/bin/pgbench'
17:39:41 rm -rf '/<>/build/src/bin/pgbench'/tmp_check
17:39:41 /bin/mkdir -p '/<>/build/src/bin/pgbench'/tmp_check
17:39:41 cd /<>/build/../src/bin/pgbench && 
TESTDIR='/<>/build/src/bin/pgbench' 
PATH="/<>/build/tmp_install/usr/lib/postgresql/14/bin:$PATH" 
LD_LIBRARY_PATH="/<>/build/tmp_install/usr/lib/i386-linux-gnu"  
PGPORT='65432' 
PG_REGRESS='/<>/build/src/bin/pgbench/../../../src/test/regress/pg_regress'
 REGRESS_SHLIB='/<>/build/src/test/regress/regress.so' 
/usr/bin/prove -I /<>/build/../src/test/perl/ -I 
/<>/build/../src/bin/pgbench --verbose t/*.pl
17:39:50 
17:39:50 #   Failed test 'pgbench expressions stderr /(?^:command=113.: boolean 
true\b)/'
17:39:50 #   at t/001_pgbench_with_server.pl line 421.
17:39:50 #   'pgbench: setting random seed to 5432
17:39:50 # starting vacuum...end.
17:39:50 # debug(script=0,command=1): int 13
17:39:50 # debug(script=0,command=2): int 116
17:39:50 # debug(script=0,command=3): int 1498
17:39:50 # debug(script=0,command=4): int 4
17:39:50 # debug(script=0,command=5): int 5
17:39:50 # debug(script=0,command=6): int 6
17:39:50 # debug(script=0,command=7): int 7
17:39:50 # debug(script=0,command=8): int 8
17:39:50 # debug(script=0,command=9): int 9
17:39:50 # debug(script=0,command=10): int 10
17:39:50 # debug(script=0,command=11): int 11
17:39:50 # debug(script=0,command=12): int 12
17:39:50 # debug(script=0,command=13): double 13.856406460551
17:39:50 # debug(script=0,command=14): double 14.8514851485149
17:39:50 # debug(script=0,command=15): double 15.39380400259
17:39:50 # debug(script=0,command=16): double 16
17:39:50 # debug(script=0,command=17): double 17.094
17:39:50 # debug(script=0,command=20): int 1
17:39:50 # debug(script=0,command=21): double -27
17:39:50 # debug(script=0,command=22): double 1024
17:39:50 # debug(script=0,command=23): double 1
17:39:50 # debug(script=0,command=24): double 1
17:39:50 # debug(script=0,command=25): double -0.125
17:39:50 # debug(script=0,command=26): double -0.125
17:39:50 # debug(script=0,command=27): double -0.00032
17:39:50 # debug(script=0,command=28): double 8.50705917302346e+37
17:39:50 # debug(script=0,command=29): double 1e+30
17:39:50 # debug(script=0,command=30): boolean false
17:39:50 # debug(script=0,command=31): boolean true
17:39:50 # debug(script=0,command=32): int 32
17:39:50 # debug(script=0,command=33): int 33
17:39:50 # debug(script=0,command=34): double 34
17:39:50 # debug(script=0,command=35): int 35
17:39:50 # debug(script=0,command=36): int 36
17:39:50 # debug(script=0,command=37): double 37.002
17:39:50 # debug(script=0,command=38): int 38
17:39:50 # debug(script=0,command=39): int 39
17:39:50 # debug(script=0,command=40): boolean true
17:39:50 # debug(script=0,command=41): null
17:39:50 # debug(script=0,command=42): null
17:39:50 # debug(script=0,command=43): boolean true
17:39:50 # debug(script=0,command=44): boolean true
17:39:50 # debug(script=0,command=45): boolean true
17:39:50 # debug(script=0,command=46): int 46
17:39:50 # debug(script=0,command=47): boolean true
17:39:50 # debug(script=0,command=48): boolean true
17:39:50 # debug(script=0,command=49): int -5817877081768721676
17:39:50 # debug(script=0,command=50): boolean true
17:39:50 # debug(script=0,command=51): int -7793829335365542153
17:39:50 # debug(script=0,command=52): int -1464711246773187029
17:39:50 # debug(script=0,command=53): boolean true
17:39:50 # debug(script=0,command=55): int -1
17:39:50 # debug(script=0,command=56): int -1
17:39:50 # debug(script=0,command=57): int 1
17:39:50 # debug(script=0,command=65): int 65
17:39:50 # debug(script=0,command=74): int 74
17:39:50 # debug(script=0,command=83): int 83
17:39:50 # debug(script=0,command=86): int 86
17:39:50 # debug(script=0,command=93): int 93
17:39:50 # debug(script=0,command=95): int 0
17:39:50 # debug(script=0,command=96): int 1
17:39:50 # debug(script=0,command=97): int 0
17:39:50 # debug(script=0,command=98): int 5432
17:39:50 # debug(script=0,command=99): int -9223372036854775808
17:39:50 # debug(script=0,command=100): int 9223372036854775807
17:39:50 # debug(script=0,command=101): boolean true
17:39:50 # debug(script=0,command=102): boolean true
17:39:50 # debug(script=0,command=103): boolean true
17:39:50 # debug(script=0,command=104): boolean true
17:39:50 # debug(script=0,command=105): boolean true
17:39:50 # debug(script=0,command=109): boolean true
17:39:50 # debug(script=0,command=110): boolean true
17:39:50 # debug(script=0,command=111): boolean true
17:39:50 # debug(script=0,co

Re: Teaching users how they can get the most out of HOT in Postgres 14

2021-05-18 Thread Peter Geoghegan
On Tue, May 18, 2021 at 7:29 AM Masahiko Sawada  wrote:
> I prefer to have an on/off switch just in case. I remember I also
> commented the same thing before. We’ve discussed a way to control
> whether or not to enable the skipping optimization by adding a new
> mode to INDEX_CLEANUP option, as Peter mentioned. For example, we can
> use the new mode “auto” (or “smart”) mode by default, enabling all
> skipping optimizations, and specifying “on” disables them. Or we can
> add “force” mode to disable all skipping optimizations while leaving
> the existing modes as they are. Anyway, I think it’s not a good idea
> to add a new GUC parameter that we’re not sure how to tune.
>
> IIUC skipping index vacuuming when less than 2% of relation pages have
> at least one LP_DEAD item is a table-level optimization rather than a
> btree index optimization.

Right. There *is* an excellent way to tune this behavior: by adjusting
heap fillfactor to make HOT more effective. That was why I started
this thread!

If you leave heap fillfactor at the default of 100, and have lots of
updates (that don't modify indexed columns) and no deletes, then
you're almost certainly not going to have VACUUM skip indexes anyway
-- in practice you're bound to exceed having 2% of pages with an
LP_DEAD item before very long. Tuning heap fillfactor is practically
essential to see a real benefit, regardless of the exact
BYPASS_THRESHOLD_PAGES. (There may be some rare exceptions, but for
the most part this mechanism helps with tables that get many updates
that are expected to use HOT, and will use HOT barring a tiny number
of cases where the new tuple won't quite fit, etc.)

The idea of tuning the behavior directly (e.g. with a reloption that
lets the user specify a BYPASS_THRESHOLD_PAGES style threshold) is
exactly backwards. The point for the user should not be to skip
indexes during VACUUM. The point for the user is to get would-be
non-HOT updates to use HOT instead, and so *avoid heap fragmentation*,
guided by the new autovacuum instrumentation. That also means that
there will be much
less index vacuuming. But that's a pretty minor side-benefit. Why
should the user *expect* largely unnecessary index vacuuming to take
place?

To put it another way, the index bypass mechanism added to
vacuumlazy.c was not intended to add a new good behavior. It was
actually intended to subtract an old bad behavior. The patch is mostly
useful because it allows the user to make VACUUM *more* aggressive
with freezing and VM bit setting (not less aggressive with indexes).
The BYPASS_THRESHOLD_PAGES threshold of 0.02 is a little arbitrary --
but only a little.
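
To spell out what that threshold actually gates, the check is roughly
this (simplified sketch only; the helper name is just for illustration,
and the real test in lazy_vacuum() also requires that the accumulated
LP_DEAD items fit comfortably in memory before it will skip indexes):

#define BYPASS_THRESHOLD_PAGES  0.02    /* i.e. 2% of rel_pages */

static bool
consider_bypassing_indexes(const LVRelState *vacrel)
{
    double      threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;

    /* skip index vacuuming when very few pages have LP_DEAD items */
    return (vacrel->nindexes > 0 &&
            vacrel->lpdead_item_pages < threshold);
}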

> Since we're not likely to set many pages
> all-visible or collect many dead tuples in that case, we can skip both
> index vacuuming and table vacuuming. Fortunately, this case goes
> well together with btree indexes' bottom-up deletion.

It's true that bottom-up index deletion provides additional insurance
against problems, but I don't think that that insurance is strictly
necessary. It's nice to have insurance, though.

> If this skipping
> behavior badly affects other indexes AMs, this optimization should be
> considered within btree indexes, although we will need a way for index
> AMs to consider and tell the vacuum strategy. But I guess this works
> well in most cases so having an on/off switch suffice.

Right. I doubt that it will actually turn out to be necessary to have
such a switch. But I try to be modest when it comes to predicting what
will be important to some user workload -- it's way too complicated to
have total confidence about something like that. It is a risk to be
managed.

-- 
Peter Geoghegan




Re: pgbench test failing on 14beta1 on Debian/i386

2021-05-18 Thread Tom Lane
Christoph Berg  writes:
> I can reproducibly get build failures in pgbench on 32-bit i386
> Debian, both on sid and buster. (The older Debian stretch and Ubuntu
> bionic are unaffected. Other architectures are also fine.)

The test that's failing came in with

6b258e3d688db14aadb58dde2a72939362310684
Author: Dean Rasheed 
Date:   Tue Apr 6 11:50:42 2021 +0100

pgbench: Function to generate random permutations.


regards, tom lane




Supporting $n parameters in new syntax

2021-05-18 Thread Paul A Jungwirth
Hello hackers,

I'm wrapping up a patch that adds SQL:2011 FOR PORTION OF syntax and
then uses it to implement CASCADE in temporal foreign keys. The FKs
are implemented as triggers, like ordinary FKs, and the trigger
function makes a call through SPI that does `UPDATE %s FOR PORTION OF
%s FROM $%d TO $%d`. But I suspect I'm missing something in the
analyze/rewriting phase, because I get this error:

ERROR:  no value found for parameter 1

That's coming from ExecEvalParamExtern in executor/execExprInterp.c.

If I hardcode some dates instead, the query works (even if I use the
same parameters elsewhere). Does anyone have any hints what I may be
missing? Any suggestions for some other syntax to consult as an
example (e.g. ON CONFLICT DO UPDATE)?

In gram.y I parse the phrase like this:

FOR PORTION OF ColId FROM a_expr TO a_expr

Then in the analysis phase I do this:

/*
 * Build a range from the FROM ... TO  bounds.
 * This should give a constant result, so we accept functions like NOW()
 * but not column references, subqueries, etc.
 *
 * It also permits MINVALUE and MAXVALUE like declarative partitions.
 */
Node *target_start = transformForPortionOfBound(forPortionOf->target_start);
Node *target_end   = transformForPortionOfBound(forPortionOf->target_end);
FuncCall *fc = makeFuncCall(SystemFuncName(range_type_name),
list_make2(target_start, target_end),
COERCE_EXPLICIT_CALL,
forPortionOf->range_name_location);
result->targetRange = transformExpr(pstate, (Node *) fc,
EXPR_KIND_UPDATE_PORTION);

(transformForPortionOfBound just handles MIN/MAXVALUE, and for
targetRange I feed the bounds into a range type constructor to use
later.)

I was hoping that transformExpr would do everything I need re
identifying parameters, but maybe there is something else in a later
phase?

Thanks,
Paul




Re: PG 14 release notes, first draft

2021-05-18 Thread Peter Geoghegan
On Tue, May 18, 2021 at 12:17 PM Bruce Momjian  wrote:
> I went with this release note text:

That seems reasonable -- thanks!

-- 
Peter Geoghegan




Re: pgbench test failing on 14beta1 on Debian/i386

2021-05-18 Thread Thomas Munro
On Wed, May 19, 2021 at 9:51 AM Tom Lane  wrote:
> Christoph Berg  writes:
> > I can reproducibly get build failures in pgbench on 32-bit i386
> > Debian, both on sid and buster. (The older Debian stretch and Ubuntu
> > bionic are unaffected. Other architectures are also fine.)
>
> The test that's failing came in with
>
> 6b258e3d688db14aadb58dde2a72939362310684
> Author: Dean Rasheed 
> Date:   Tue Apr 6 11:50:42 2021 +0100
>
> pgbench: Function to generate random permutations.

FWIW this is reproducible on my local Debian/gcc box with -m32, but
not on my FreeBSD/clang box with -m32.  permute() produces different
values here:

\set t debug(permute(:size-1, :size, 5432) = 5301702756001087507 and \
 permute(:size-2, :size, 5432) = 8968485976055840695 and \
 permute(:size-3, :size, 5432) = 6708495591295582115 and \
 permute(:size-4, :size, 5432) = 2801794404574855121 and \
 permute(:size-5, :size, 5432) = 1489011409218895840 and \
 permute(:size-6, :size, 5432) = 2267749475878240183 and \
 permute(:size-7, :size, 5432) = 1300324176838786780)

I don't understand any of this stuff at all, but I added a bunch of
printfs and worked out that the first point its local variables
diverge is here:

/* Random offset */
r = (uint64) getrand(&random_state2, 0, size - 1);

... after 4 earlier getrand() produced matching values.  Hmm.




Re: pgbench test failing on 14beta1 on Debian/i386

2021-05-18 Thread Thomas Munro
On Wed, May 19, 2021 at 11:34 AM Thomas Munro  wrote:
> I don't understand any of this stuff at all, but I added a bunch of
> printfs and worked out that the first point its local variables
> diverge is here:
>
> /* Random offset */
> r = (uint64) getrand(&random_state2, 0, size - 1);

Forgot to post the actual values:

  r = 2563421694876090368
  r = 2563421694876090365

Smells a bit like a precision problem in the workings of pg_erand48(),
but as soon as I saw floating point numbers I closed my laptop and ran
for the door.
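
For what it's worth, here is a standalone sketch (not pgbench code; the
constants are arbitrary, and whether any particular value actually
diverges depends on compiler and flags) of where that sort of thing can
creep in: getrand() scales a pg_erand48() double and truncates, and on
i386 the x87 unit may keep the intermediate product in 80-bit extended
precision, so the truncated integer can come out slightly different
from a strict 64-bit-double build.

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    double  d = 0.2779504234343434;         /* stand-in for a pg_erand48() value */
    int64_t size = 9223372036854775000LL;   /* large range, as in the failing test */

    /*
     * With -mfpmath=387 the product may be held in extended precision,
     * so the truncation below can give a different integer than an
     * SSE2 (64-bit double) build produces.
     */
    int64_t r = (int64_t) (size * d);

    printf("%lld\n", (long long) r);
    return 0;
}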




Subscription tests fail under CLOBBER_CACHE_ALWAYS

2021-05-18 Thread Tom Lane
I discovered $SUBJECT after wondering why hyrax hadn't reported
in recently, and trying to run check-world under CCA to see if
anything got stuck.  Indeed it did --- although this doesn't
explain the radio silence from hyrax, because that animal doesn't
run any TAP tests.  (Neither does avocet, which I think is the
only other active CCA critter.  So this could have been broken
for a very long time.)

I count three distinct bugs that were exposed by this attempt:

1. In the part of 013_partition.pl that tests firing AFTER
triggers on partitioned tables, we have a case of continuing
to access a relcache entry that's already been closed.
(I'm not quite sure why prion's -DRELCACHE_FORCE_RELEASE
hasn't exposed this.)  It looks to me like instead we had
a relcache reference leak before f3b141c48, but now, the
only relcache reference count on a partition child table
is dropped by ExecCleanupTupleRouting, which logical/worker.c
invokes before it fires triggers on that table.  Kaboom.
This might go away if worker.c weren't so creatively different
from the other code paths concerned with executor shutdown.

2. Said bug causes a segfault in the apply worker process.
This causes the parent postmaster to give up and die.
I don't understand why we don't treat that like a crash
in a regular backend, considering that an apply worker
is running largely user-defined code.

3. Once the subscriber1 postmaster has exited, the TAP
test will eventually time out, and then this happens:

timed out waiting for catchup at t/013_partition.pl line 219.
### Stopping node "publisher" using mode immediate
# Running: pg_ctl -D 
/Users/tgl/pgsql/src/test/subscription/tmp_check/t_013_partition_publisher_data/pgdata
 -m immediate stop
waiting for server to shut down done
server stopped
# No postmaster PID for node "publisher"
### Stopping node "subscriber1" using mode immediate
# Running: pg_ctl -D 
/Users/tgl/pgsql/src/test/subscription/tmp_check/t_013_partition_subscriber1_data/pgdata
 -m immediate stop
pg_ctl: PID file 
"/Users/tgl/pgsql/src/test/subscription/tmp_check/t_013_partition_subscriber1_data/pgdata/postmaster.pid"
 does not exist
Is server running?
Bail out!  system pg_ctl failed

That is, because we failed to shut down subscriber1, the
test script neglects to shut down subscriber2, and now
things just sit indefinitely.  So that's a robustness
problem in the TAP infrastructure, rather than a bug in
PG proper; but I still say it's a bug that needs fixing.

regards, tom lane




Re: pgbench test failing on 14beta1 on Debian/i386

2021-05-18 Thread Tom Lane
Thomas Munro  writes:
> Forgot to post the actual values:
>   r = 2563421694876090368
>   r = 2563421694876090365
> Smells a bit like a precision problem in the workings of pg_erand48(),
> but as soon as I saw floating point numbers I closed my laptop and ran
> for the door.

Yup.  This test has a touching, but entirely unwarranted, faith in
pg_erand48() producing bit-for-bit the same values everywhere.

regards, tom lane




Re: Supporting $n parameters in new syntax

2021-05-18 Thread Paul A Jungwirth
On Tue, May 18, 2021 at 3:00 PM Paul A Jungwirth
 wrote:
>
> I suspect I'm missing something in the
> analyze/rewriting phase, because I get this error:
>
> ERROR:  no value found for parameter 1
> . . .
>
> I was hoping that transformExpr would do everything I need re
> identifying parameters, but maybe there is something else in a later
> phase?

Never mind, I think I figured it out. The problem was that I was
calling ExecEvalExpr with CreateStandaloneExprContext(), and I should
have been using the context from the query.
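
For the archives, the shape of the fix is roughly this (variable names
here are assumptions, not the actual patch): evaluate the expression
with an ExprContext that belongs to the query's EState, since
ExecEvalParamExtern reads the $n values from
econtext->ecxt_param_list_info, which a standalone context leaves NULL.

/* with the query's EState in hand, instead of CreateStandaloneExprContext() */
ExprContext *econtext = GetPerTupleExprContext(estate);
ExprState   *rangeState = ExecPrepareExpr((Expr *) targetRange, estate);
bool         isnull;
Datum        rangeDatum = ExecEvalExpr(rangeState, econtext, &isnull);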

Thanks!
Paul




Re: pg_dumpall misses --no-toast-compression

2021-05-18 Thread Michael Paquier
On Tue, May 18, 2021 at 09:48:59AM +0900, Michael Paquier wrote:
> Okay, thanks.  I don't mind taking care of that on HEAD once beta1 is
> shipped, then.

Beta1 just got tagged, so this one has been applied as of 694da19.
--
Michael


signature.asc
Description: PGP signature


Re: pg_get_wal_replay_pause_state() should not return 'paused' while a promotion is ongoing.

2021-05-18 Thread Kyotaro Horiguchi
At Tue, 18 May 2021 12:48:38 +0900, Fujii Masao  
wrote in 
> Currently a promotion causes all available WAL to be replayed before
> a standby becomes a primary whether it was in paused state or not.
> OTOH, something like immediate promotion (i.e., standby becomes
> a primary without replaying outstanding WAL) might be useful for
> some cases. I don't object to that.

Mmm. I was confused with recovery target + pause. Actually promotion
works that way, and it is documented.  Anyway, it is a matter for the next
version.

I forgot to mention the patch itself, but what the patch does looks
fine to me.  Disabling pause after setting SharedPromoteIsTriggered
prevents later re-pausing (from the SQL function).

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: MaxOffsetNumber for Table AMs

2021-05-18 Thread Peter Geoghegan
On Thu, May 6, 2021 at 4:10 AM Matthias van de Meent
 wrote:
> See below. I'm not saying we need it _right now_, but at the very
> least I'd like to argue that we should not close the door on varlena
> TIDs, because there _are_ reasons for those TID types. See also below.

Perhaps I was a bit too strident. It's a question of trade-offs.

> Although I agree that abstract definitions are tools, I disagree that
> they're seldom useful as a starting point. Many have implemented b
> (plus) trees based on the original paper exactly due to the guarantees
> that are provided by the abstract definition as defined in the paper.

I was unclear. I agree with this much. I don't think it applies in
this situation, though.

There are many competing considerations besides that. What I'm
concerned about is avoiding a strategically ruinous direction that
works against our strengths -- we should play to our strengths. For
example, it's no coincidence that Postgres is the only system that has
B-Tree deduplication and enables it by default, with next to no
downside. This is more or less a consequence of the fact that indexes
in Postgres don't need their own locks, unlike any relational DB system
that more or less uses a design like ARIES -- indexes are just
physical data structures that back the logical database. They are not
part of the logical database per se.

Consider a 2015 paper from Goetz Graefe about index locking, entitled
"Orthogonal key-value locking":

https://subs.emis.de/LNI/Proceedings/Proceedings241/237.pdf

Some kind of locking in indexes (something like ARIES/IM or ARIES/KVL)
is needed to make TIDs stable identifiers of logical rows -- even in a
system like Oracle (but most obviously in a system like InnoDB or SQL
Server). According to Graefe, "further improvements [to index
concurrency control] are possible despite multiple available
techniques and decades with little progress". This is why terms like
"ruinously expensive" don't feel like hyperbole when talking about
pursuing TID stability/clustered indexes/whatever -- it is a radically
different design. And maybe even a basically inferior design these
days, all things considered.

I think that the simple approach to storage taken by Postgres has aged
incredibly well -- there are many ways in which it suits modern
hardware more than traditional designs. Modern hardware is generally
much more sensitive to the cost of any kind of synchronization [1],
but is otherwise very fast -- rather a reversal from the
environment that ARIES was created in. While there are certainly
downsides to the Postgres approach to storage, concurrency control and
recovery, these downsides can be fixed directly. Including in the ways
we've been discussing on other threads these past few months. I think
that we can push the idea of teaching heapam (and maybe nbtree) just
enough about the logical database to not make really dumb decisions in
some cases. Why not see how far that can go first?

Graefe's perspective in the key-value locking paper is roughly the
Microsoft SQL Server perspective. It's not hard for me to imagine why
he undertook to research index locking used for concurrency control
(not to be confused with what they call latches and what we call
LWLocks) while still working for Microsoft. The TPC-E benchmark is
something that Microsoft alone targets with SQL Server (no other DB
system has an official entry on the TPC website). Pity TPC-E isn't
more successful, since it seems to have a lot more real world
relevance than TPC-C. I bet we could learn a lot from it because it's
actually realistic (TPC-C isn't that realistic, and TPC-B/pgbench is
the furthest possible thing from reality).

The best analysis I've been able to find about TPC-E is probably "From
A to E: Analyzing TPC’s OLTP Benchmarks" [2]. It concludes that the
big bottleneck for TPC-E is "logical lock contention", which is the
overwhelming bottleneck at high client counts. Though I haven't run
TPC-E properly myself (because of issues with bitrot), I'd speculate
that this gives Postgres an edge in TPC-E. I've heard quite a few
reports of Postgres having much faster writes compared to any of the
big proprietary systems. Some of these reports are many years old. And
yet we routinely seem to talk about our approach as if it's obviously
inferior. The main weakness that Postgres storage has seems to me to
be largely in the area of stability, perhaps only in theoretically
rare or extreme cases that nevertheless cause real problems in the
real world.

Of course I cannot prove that adopting stable TIDs necessarily means
accepting a similar burden from lock contention in an affected table
AM, or in general. I cannot prove much of anything when we're
considering such an abstract question. But perhaps this gives you a
better idea about where my skepticism comes from.

> When trying to create completely new things I would agree that
> starting with the abstract is probably not the right idea, but we're
> not in a gree

Re: Forget close an open relation in ReorderBufferProcessTXN()

2021-05-18 Thread Amit Kapila
On Tue, May 18, 2021 at 5:29 PM Amit Kapila  wrote:
>
> On Tue, May 18, 2021 at 1:29 PM osumi.takami...@fujitsu.com
>  wrote:
> >
> > On Monday, May 17, 2021 6:45 PM Amit Kapila  wrote:
> > >
> > > We allow taking locks on system catalogs, so why prohibit
> > > user_catalog_tables? However, I agree that if we want plugins to acquire 
> > > the
> > > lock on user_catalog_tables then we should either prohibit decoding of 
> > > such
> > > relations or do something else to avoid deadlock hazards.
> > OK.
> >
> > Although we have not concluded the range of logical decoding of 
> > user_catalog_table
> > (like we should exclude TRUNCATE command only or all operations on that 
> > type of table),
> > I'm worried that disallowing the logical decoding of user_catalog_table 
> > produces
> > the deadlock still. It's because disabling it by itself does not affect the
> > lock taken by TRUNCATE command. What I have in mind is an example below.
> >
> > (1) plugin (e.g. pgoutput) is designed to take a lock on user_catalog_table.
> > (2) logical replication is set up in synchronous mode.
> > (3) TRUNCATE command takes an access exclusive lock on the 
> > user_catalog_table.
> > (4) This time, we don't do anything for the TRUNCATE decoding.
> > (5) the plugin tries to take a lock on the truncated table
> > but, it can't due to the lock by TRUNCATE command.
> >
>
> If you skip decoding of truncate then we won't invoke plugin API so
> step 5 will be skipped.
>

I think you were right here: even if we skip step-4, the plugin might
take a lock on user_catalog_table for something else. I am not sure,
but I think we should prohibit truncate on user_catalog_tables the way
we prohibit truncate on system catalog tables (see below [1]) if we
want plugins to lock them; otherwise, as you said, it might lead to
deadlock. For that matter, I think we should also check all other
operations that can take an exclusive lock on a [user]_catalog_table,
say the CLUSTER command, and compare the behavior with that on system
catalog tables.

[1]
postgres=# truncate pg_class;
ERROR:  permission denied: "pg_class" is a system catalog
postgres=# cluster pg_class;
ERROR:  there is no previously clustered index for table "pg_class"

-- 
With Regards,
Amit Kapila.




Re: Forget close an open relation in ReorderBufferProcessTXN()

2021-05-18 Thread Amit Kapila
On Wed, May 19, 2021 at 7:59 AM Amit Kapila  wrote:
>
> On Tue, May 18, 2021 at 5:29 PM Amit Kapila  wrote:
> >
> > On Tue, May 18, 2021 at 1:29 PM osumi.takami...@fujitsu.com
> >  wrote:
> > >
> > > On Monday, May 17, 2021 6:45 PM Amit Kapila  
> > > wrote:
> > > >
> > > > We allow taking locks on system catalogs, so why prohibit
> > > > user_catalog_tables? However, I agree that if we want plugins to 
> > > > acquire the
> > > > lock on user_catalog_tables then we should either prohibit decoding of 
> > > > such
> > > > relations or do something else to avoid deadlock hazards.
> > > OK.
> > >
> > > Although we have not concluded the range of logical decoding of 
> > > user_catalog_table
> > > (like we should exclude TRUNCATE command only or all operations on that 
> > > type of table),
> > > I'm worried that disallowing the logical decoding of user_catalog_table 
> > > produces
> > > the deadlock still. It's because disabling it by itself does not affect 
> > > the
> > > lock taken by TRUNCATE command. What I have in mind is an example below.
> > >
> > > (1) plugin (e.g. pgoutput) is designed to take a lock on 
> > > user_catalog_table.
> > > (2) logical replication is set up in synchronous mode.
> > > (3) TRUNCATE command takes an access exclusive lock on the 
> > > user_catalog_table.
> > > (4) This time, we don't do anything for the TRUNCATE decoding.
> > > (5) the plugin tries to take a lock on the truncated table
> > > but, it can't due to the lock by TRUNCATE command.
> > >
> >
> > If you skip decoding of truncate then we won't invoke plugin API so
> > step 5 will be skipped.
> >
>
> I think you were right here even if skip step-4, the plugin might take
> a lock on user_catalog_table for something else. I am not sure but I
> think we should prohibit truncate on user_catalog_tables as we
> prohibit truncate on system catalog tables (see below [1]) if we want
> plugin to lock them, otherwise, as you said it might lead to deadlock.
> For the matter, I think we should once check all other operations
> where we can take an exclusive lock on [user]_catalog_table, say
> Cluster command, and compare the behavior of same on system catalog
> tables.
>
> [1]
> postgres=# truncate pg_class;
> ERROR:  permission denied: "pg_class" is a system catalog
> postgres=# cluster pg_class;
> ERROR:  there is no previously clustered index for table "pg_class"
>

Please ignore the cluster command, as we need to use 'using index' with
that command to make it succeed. I just wanted to show the truncate
command behavior that you asked about.


-- 
With Regards,
Amit Kapila.




Re: postgres_fdw - should we tighten up batch_size, fetch_size options against non-numeric values?

2021-05-18 Thread Kyotaro Horiguchi
At Tue, 18 May 2021 19:46:39 +0530, Bharath Rupireddy 
 wrote in 
> On Tue, May 18, 2021 at 7:19 PM Tom Lane  wrote:
> >
> > Fujii Masao  writes:
> > > On 2021/05/17 18:58, Bharath Rupireddy wrote:
> > >> It looks like the values such as '123.456', '789.123' '100$%$#$#',
> > >> '9,223,372,' are accepted and treated as valid integers for
> > >> postgres_fdw options batch_size and fetch_size. Whereas this is not
> > >> the case with fdw_startup_cost and fdw_tuple_cost options for which an
> > >> error is thrown. Attaching a patch to fix that.
> >
> > > This looks an improvement. But one issue is that the restore of
> > > dump file taken by pg_dump from v13 may fail for v14 with this patch
> > > if it contains invalid setting of fetch_size, e.g., "fetch_size 
> > > '123.456'".
> > > OTOH, since batch_size was added in v14, it has no such issue.
> >
> > Maybe better to just silently round to integer?  I think that's
> > what we generally do with integer GUCs these days, eg
> >
> > regression=# set work_mem = 102.9;
> > SET
> > regression=# show work_mem;
> >  work_mem
> > --
> >  103kB
> > (1 row)
> 
> I think we can use parse_int to parse the fetch_size and batch_size as
> integers, which also rounds off decimals to integers and returns false
> for non-numeric junk. But, it accepts both quoted and unquoted
> integers, something like batch_size 100 and batch_size '100', which I
> feel is okay because the reloptions also accept both.
> 
> While on this, we can also use parse_real for fdw_startup_cost and
> fdw_tuple_cost options but with that they will accept both quoted and
> unquoted real values.
> 
> Thoughts?

They are more or less a kind of reloption. So I think it is
reasonable to treat the option the same way as RELOPT_TYPE_INT.  That
is, it would be better to use our standard functions rather than
ad-hoc code calling bare libc functions for user input. The same
can be said for the parameters with real numbers, regardless of the
"quoted" discussion.

> > I agree with throwing an error for non-numeric junk though.
> > Allowing that on the grounds of backwards compatibility
> > seems like too much of a stretch.
> 
> +1.

+1.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: wal stats questions

2021-05-18 Thread Fujii Masao




On 2021/05/18 9:57, Masahiro Ikeda wrote:



On 2021/05/17 22:34, Fujii Masao wrote:



On 2021/05/17 16:39, Masahiro Ikeda wrote:


Thanks for your comments!


+ * is executed, wal records aren't nomally generated (although HOT makes


nomally -> normally?


Yes, fixed.


+ * It's not enough to check the number of generated wal records, for
+ * example the walwriter may write/sync the WAL although it doesn't


You use both 'wal' and 'WAL' in the comments, but is there any intention behind that?


No, I changed to use 'WAL' only. Although some other comments have 'wal' and
'WAL', it seems that 'WAL' is often used.


Thanks for updating the patch!


Thanks a lot for the comments!


+ * Buffer and generated WAL usage counters.
+ *
+ * The counters are accumulated values. There are infrastructures
+ * to add the values and calculate the difference within a specific period.

Is it really worth adding these comments here?


BufferUsage has almost the same comments. So, I removed them.


+ * Note: regarding to WAL statistics counters, it's not enough to check
+ * only whether the WAL record is generated or not. If a read transaction
+ * is executed, WAL records aren't normally generated (although HOT makes
+ * WAL records). But, just writes and syncs the WAL data may happen when
+ * to flush buffers.

Aren't the following comments better?

--
To determine whether any WAL activity has occurred since last time, not only
the number of generated WAL records but also the numbers of WAL writes and
syncs need to be checked. Because even transaction that generates no WAL
records can write or sync WAL data when flushing the data pages.
--


Thanks. Yes, I fixed it.


- * This function can be called even if nothing at all has happened. In
- * this case, avoid sending a completely empty message to the stats
- * collector.

I think that it's better to leave this comment as it is.


OK. I leave it.


+ * First, to check the WAL stats counters were updated.
+ *
+ * Even if the 'force' is true, we don't need to send the stats if the
+ * counters were not updated.
+ *
+ * We can know whether the counters were updated or not to check only
+ * three counters. So, for performance, we don't allocate another memory
+ * spaces and check the all stats like pgstat_send_slru().

Is it really worth adding these comments here?


I removed them because the following comments are enough.

* This function can be called even if nothing at all has happened. In
* this case, avoid sending a completely empty message to the stats
* collector.


+ * The current 'wal_records' is the same as the previous one means that
+ * WAL records weren't generated. So, the counters of 'wal_fpi',
+ * 'wal_bytes', 'm_wal_buffers_full' are not updated neither.
+ *
+ * It's not enough to check the number of generated WAL records, for
+ * example the walwriter may write/sync the WAL although it doesn't
+ * generate WAL records. 'm_wal_write' and 'm_wal_sync' are zero means the
+ * counters of time spent are zero too.

Aren't the following comments better?


Check wal_records counter to determine whether any WAL activity has happened
since last time. Note that other WalUsage counters don't need to be checked
because they are incremented always together with wal_records counter.

m_wal_buffers_full also doesn't need to be checked because it's incremented
only when at least one WAL record is generated (i.e., wal_records counter is
incremented). But for safety, we assert that m_wal_buffers_full is always zero
when no WAL record is generated.

This function can be called by a process like walwriter that normally
generates no WAL records. To determine whether any WAL activity has happened
at that process since the last time, the numbers of WAL writes and syncs are
also checked.



Yes, I modified them.
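
For reference, a sketch of the check those comments describe, roughly
as it ends up in the patch (exact variable names may differ):

/* m_wal_buffers_full must be zero if no WAL record has been generated */
Assert(pgWalUsage.wal_records != prevWalUsage.wal_records ||
       WalStats.m_wal_buffers_full == 0);

/* fast exit if no WAL activity has happened since the last report */
if (pgWalUsage.wal_records == prevWalUsage.wal_records &&
    WalStats.m_wal_write == 0 && WalStats.m_wal_sync == 0)
    return;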


+ * The accumulated counters for buffer usage.
+ *
+ * The reason the counters are accumulated values is to avoid unexpected
+ * reset because many callers may use them.

Aren't the following comments better?


These counters keep being incremented infinitely, i.e., must never be reset to
zero, so that we can calculate how much the counters are incremented in an
arbitrary period.



Yes, thanks.
I fixed it.


Thanks for updating the patch! I modified some comments slightly and
pushed that version of the patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Multiple pg_waldump --rmgr options

2021-05-18 Thread Kyotaro Horiguchi
At Tue, 18 May 2021 23:23:02 +0800, Julien Rouhaud  wrote 
in 
> On Tue, May 18, 2021 at 04:50:31PM +0300, Heikki Linnakangas wrote:
> > That didn't do what I wanted. It only printed the Heap2 records, not Heap,
> > even though I specified both. The reason is that if you specify multiple
> > --rmgr options, only the last one takes effect.
> > 
> > I propose the attached to allow selecting multiple rmgrs, by giving multiple
> > --rmgr options. With that, it works the way I expected:
> 
> The change and the patch look sensible to me.

+1.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Performance degradation of REFRESH MATERIALIZED VIEW

2021-05-18 Thread Masahiko Sawada
On Wed, May 19, 2021 at 3:08 AM Andres Freund  wrote:
>
> Hi,
>
> On 2021-05-18 11:20:07 +0900, Masahiko Sawada wrote:
> > Yes. It depends on how much the matview refresh gets slower but I
> > think the problem here is that users always are forced to pay the cost
> > for freezing tuple during refreshing the matview. There is no way to
> > disable it unlike FREEZE option of COPY command.
> >
> > I’ve done benchmarks for matview refresh on my machine (FreeBSD 12.1,
> > AMD Ryzen 5 PRO 3400GE, 24GB RAM) with four codes: HEAD, HEAD +
> > Andres’s patch, one before 39b66a91b, and HEAD without
> > TABLE_INSERT_FROZEN.
> >
> > The workload is to refresh the matview that simply selects 50M tuples
> > (about 1.7 GB). Here are the average execution times of three trials
> > for each code:
> >
> > 1) head: 42.263 sec
> > 2) head w/ Andres’s patch: 40.194 sec
> > 3) before 39b66a91b commit: 38.143 sec
> > 4) head w/o freezing tuples: 32.413 sec
>
> I don't see such a big difference between andres-freeze/non-freeze. Is
> there any chance there's some noise in there? I found that I need to
> disable autovacuum and ensure that there's a checkpoint just before the
> REFRESH to get halfway meaningful numbers, as well as a min/max_wal_size
> ensuring that only recycled WAL is used.

I've run the same benchmarks with the following parameters:

shared_buffers = 10GB
max_wal_size = 50GB
min_wal_size = 50GB
checkpoint_timeout = 1h
maintenance_work_mem = 1GB
work_mem = 512MB
autovacuum = off

1) head: 42.397 sec
2) head w/ Andres’s patch: 34.857 sec
3) before 39b66a91b commit: 32.556 sec
4) head w/o freezing tuples: 32.752 sec

There is a 6% degradation between 2 and 4, but 2 is much better than in the
previous tests.
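
For anyone wanting to reproduce this, a rough reconstruction of the
workload as described above (table shape and exact row count are my
assumptions; run in psql):

CREATE TABLE src AS SELECT generate_series(1, 50000000) AS id;  -- ~1.7 GB heap
CREATE MATERIALIZED VIEW mv AS SELECT id FROM src;
\timing on
REFRESH MATERIALIZED VIEW mv;   -- repeat a few times and average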

>
>
> > I also observed 5% degradation by comparing 1 and 2 but am not sure
> > where the overhead came from. I agree with Andres’s proposal. It’s a
> > straightforward approach.
>
> What degradation are you referencing here?

Sorry, I meant comparing 2 to 3 and 4.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




RE: Forget close an open relation in ReorderBufferProcessTXN()

2021-05-18 Thread osumi.takami...@fujitsu.com
On Wednesday, May 19, 2021 11:33 AM Amit Kapila  wrote:
> On Wed, May 19, 2021 at 7:59 AM Amit Kapila 
> wrote:
> >
> > On Tue, May 18, 2021 at 5:29 PM Amit Kapila 
> wrote:
> > >
> > > On Tue, May 18, 2021 at 1:29 PM osumi.takami...@fujitsu.com
> > >  wrote:
> > > >
> > > > On Monday, May 17, 2021 6:45 PM Amit Kapila
>  wrote:
> > > > >
> > > > > We allow taking locks on system catalogs, so why prohibit
> > > > > user_catalog_tables? However, I agree that if we want plugins to
> > > > > acquire the lock on user_catalog_tables then we should either
> > > > > prohibit decoding of such relations or do something else to avoid
> deadlock hazards.
> > > > OK.
> > > >
> > > > Although we have not concluded the range of logical decoding of
> > > > user_catalog_table (like we should exclude TRUNCATE command only
> > > > or all operations on that type of table), I'm worried that
> > > > disallowing the logical decoding of user_catalog_table produces
> > > > the deadlock still. It's because disabling it by itself does not affect 
> > > > the
> lock taken by TRUNCATE command. What I have in mind is an example
> below.
> > > >
> > > > (1) plugin (e.g. pgoutput) is designed to take a lock on
> user_catalog_table.
> > > > (2) logical replication is set up in synchronous mode.
> > > > (3) TRUNCATE command takes an access exclusive lock on the
> user_catalog_table.
> > > > (4) This time, we don't do anything for the TRUNCATE decoding.
> > > > (5) the plugin tries to take a lock on the truncated table
> > > > but, it can't due to the lock by TRUNCATE command.
> > > >
> > >
> > > If you skip decoding of truncate then we won't invoke plugin API so
> > > step 5 will be skipped.
> > >
> >
> > I think you were right here even if skip step-4, the plugin might take
> > a lock on user_catalog_table for something else. 
Yes, we can't know the exact place where the user wants to use the
user_catalog_table feature. Even if we imagine that the user skips
the truncate decoding (I imagined continuing past, and skipping, the
REORDER_BUFFER_CHANGE_TRUNCATE case in pgoutput),
it's possible that the user accesses the table somewhere else, for a
different purpose, with a lock.


> I am not sure but I
> > think we should prohibit truncate on user_catalog_tables as we
> > prohibit truncate on system catalog tables (see below [1]) if we want
> > plugin to lock them, otherwise, as you said it might lead to deadlock.
> > For the matter, I think we should once check all other operations
> > where we can take an exclusive lock on [user]_catalog_table, say
> > Cluster command, and compare the behavior of same on system catalog
> > tables.
> >
> > [1]
> > postgres=# truncate pg_class;
> > ERROR:  permission denied: "pg_class" is a system catalog postgres=#
> > cluster pg_class;
> > ERROR:  there is no previously clustered index for table "pg_class"
> >
> 
> Please ignore the cluster command as we need to use 'using index' with that
> command to make it successful. I just want to show the truncate command
> behavior for which you have asked the question.
Thank you so much for clarifying the direction.
I agree with changing the TRUNCATE side.
I'll make a patch based on this.
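
A rough sketch of the sort of check I have in mind (placement and error
wording are guesses, not the final patch), modeled on how system
catalogs are rejected:

/* given the Relation opened by the TRUNCATE checks */
if (RelationIsUsedAsCatalogTable(rel))
    ereport(ERROR,
            (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
             errmsg("cannot truncate \"%s\" because it is a user catalog table",
                    RelationGetRelationName(rel))));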


Best Regards,
Takamichi Osumi



Re: postgres_fdw - should we tighten up batch_size, fetch_size options against non-numeric values?

2021-05-18 Thread Fujii Masao




On 2021/05/19 11:36, Kyotaro Horiguchi wrote:

At Tue, 18 May 2021 19:46:39 +0530, Bharath Rupireddy 
 wrote in

On Tue, May 18, 2021 at 7:19 PM Tom Lane  wrote:


Fujii Masao  writes:

On 2021/05/17 18:58, Bharath Rupireddy wrote:

It looks like the values such as '123.456', '789.123' '100$%$#$#',
'9,223,372,' are accepted and treated as valid integers for
postgres_fdw options batch_size and fetch_size. Whereas this is not
the case with fdw_startup_cost and fdw_tuple_cost options for which an
error is thrown. Attaching a patch to fix that.



This looks an improvement. But one issue is that the restore of
dump file taken by pg_dump from v13 may fail for v14 with this patch
if it contains invalid setting of fetch_size, e.g., "fetch_size '123.456'".
OTOH, since batch_size was added in v14, it has no such issue.


Maybe better to just silently round to integer?  I think that's
what we generally do with integer GUCs these days, eg

regression=# set work_mem = 102.9;
SET
regression=# show work_mem;
  work_mem
--
  103kB
(1 row)


I think we can use parse_int to parse the fetch_size and batch_size as
integers, which also rounds off decimals to integers and returns false
for non-numeric junk. But, it accepts both quoted and unquoted
integers, something like batch_size 100 and batch_size '100', which I
feel is okay because the reloptions also accept both.

While on this, we can also use parse_real for fdw_startup_cost and
fdw_tuple_cost options but with that they will accept both quoted and
unquoted real values.

Thoughts?


They are more or less a kind of reloptions. So I think it is
reasonable to treat the option same way with RELOPT_TYPE_INT.  That
is, it would be better to use our standard functions rather than
random codes using bare libc functions for input from users. The same
can be said for parameters with real numbers, regardless of the
"quoted" discussion.


Sounds reasonable. Since parse_int() is used to parse RELOPT_TYPE_INT values
in reloptions.c, your idea seems to be almost the same as Bharath's.
That is, use parse_int() and parse_real() to parse integer and real option
values, respectively.
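
A sketch of what that could look like in postgres_fdw's option
validator (placement and error wording here are assumptions):

else if (strcmp(def->defname, "fetch_size") == 0 ||
         strcmp(def->defname, "batch_size") == 0)
{
    int     intval;

    /* parse_int() rounds decimals to integers and rejects non-numeric junk */
    if (!parse_int(defGetString(def), &intval, 0, NULL) || intval <= 0)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("\"%s\" must be an integer value greater than zero",
                        def->defname)));
}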




I agree with throwing an error for non-numeric junk though.
Allowing that on the grounds of backwards compatibility
seems like too much of a stretch.


+1.


+1.


+1

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Subscription tests fail under CLOBBER_CACHE_ALWAYS

2021-05-18 Thread Michael Paquier
On Tue, May 18, 2021 at 07:42:08PM -0400, Tom Lane wrote:
> I count three distinct bugs that were exposed by this attempt:
> 
> 1. In the part of 013_partition.pl that tests firing AFTER
> triggers on partitioned tables, we have a case of continuing
> to access a relcache entry that's already been closed.
> (I'm not quite sure why prion's -DRELCACHE_FORCE_RELEASE
> hasn't exposed this.)  It looks to me like instead we had
> a relcache reference leak before f3b141c48, but now, the
> only relcache reference count on a partition child table
> is dropped by ExecCleanupTupleRouting, which logical/worker.c
> invokes before it fires triggers on that table.  Kaboom.
> This might go away if worker.c weren't so creatively different
> from the other code paths concerned with executor shutdown.

The tuple routing has made the whole worker logic considerably messier
than when this stuff was only able to apply DML changes
on the partition leaves.
creative here, but we need to make sure that AfterTriggerEndQuery() is
moved before ExecCleanupTupleRouting().  We could either keep the
ExecCleanupTupleRouting() calls as they are now, and call
AfterTriggerEndQuery() in more code paths.  Or we could have one
PartitionTupleRouting and one ModifyTableState created beforehand
in create_estate_for_relation() if applying the change on a
partitioned table but that means manipulating more structures across 
more layers of this code.  Something like the attached fixes the
problem for me, but honestly it does not help in clarifying this code
more.  I was not patient enough to wait for CLOBBER_CACHE_ALWAYS to
initialize the nodes in the TAP tests, so I have tested that with a
setup initialized with a non-clobber build, and reproduced the problem
with CLOBBER_CACHE_ALWAYS builds on those same nodes.

You are right that this is not a problem of 14~.  I can reproduce the
problem on 13 as well, and we have no coverage for tuple routing with
triggers on this branch, so this would never have been stressed in the
buildfarm.  There is a good argument to be made here in cherry-picking
2ecfeda3 to REL_13_STABLE.

> 2. Said bug causes a segfault in the apply worker process.
> This causes the parent postmaster to give up and die.
> I don't understand why we don't treat that like a crash
> in a regular backend, considering that an apply worker
> is running largely user-defined code.

CleanupBackgroundWorker() and CleanupBackend() have a lot of common
points.  Are you referring to an inconsistent behavior with
restart_after_crash that gets ignored for bgworkers?  We disable it by
default in the TAP tests.

> 3. Once the subscriber1 postmaster has exited, the TAP
> test will eventually time out, and then this happens:
>
> [.. logs ..]
>
> That is, because we failed to shut down subscriber1, the
> test script neglects to shut down subscriber2, and now
> things just sit indefinitely.  So that's a robustness
> problem in the TAP infrastructure, rather than a bug in
> PG proper; but I still say it's a bug that needs fixing.

This one comes down to teardown_node() that uses system_or_bail(),
leaving things unfinished.  I guess that we could be more aggressive
and ignore failures if we have a non-zero error code and that not all
the tests have passed within the END block of PostgresNode.pm.
--
Michael
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 1432554d5a..01c1bd14f6 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -394,9 +394,6 @@ create_estate_for_relation(LogicalRepRelMapEntry *rel,
 static void
 finish_estate(EState *estate)
 {
-	/* Handle any queued AFTER triggers. */
-	AfterTriggerEndQuery(estate);
-
 	/* Cleanup. */
 	ExecResetTupleTable(estate->es_tupleTable, false);
 	FreeExecutorState(estate);
@@ -1226,9 +1223,14 @@ apply_handle_insert(StringInfo s)
 		apply_handle_tuple_routing(resultRelInfo, estate,
    remoteslot, NULL, rel, CMD_INSERT);
 	else
+	{
 		apply_handle_insert_internal(resultRelInfo, estate,
 	 remoteslot);
 
+		/* Handle any queued AFTER triggers */
+		AfterTriggerEndQuery(estate);
+	}
+
 	PopActiveSnapshot();
 
 	finish_estate(estate);
@@ -1371,9 +1373,14 @@ apply_handle_update(StringInfo s)
 		apply_handle_tuple_routing(resultRelInfo, estate,
    remoteslot, &newtup, rel, CMD_UPDATE);
 	else
+	{
 		apply_handle_update_internal(resultRelInfo, estate,
 	 remoteslot, &newtup, rel);
 
+		/* Handle any queued AFTER triggers */
+		AfterTriggerEndQuery(estate);
+	}
+
 	PopActiveSnapshot();
 
 	finish_estate(estate);
@@ -1494,9 +1501,14 @@ apply_handle_delete(StringInfo s)
 		apply_handle_tuple_routing(resultRelInfo, estate,
    remoteslot, NULL, rel, CMD_DELETE);
 	else
+	{
 		apply_handle_delete_internal(resultRelInfo, estate,
 	 remoteslot, &rel->remoterel);
 
+		/* Handle any queued AFTER triggers */
+		AfterTriggerEn

Re: Different compression methods for FPI

2021-05-18 Thread Justin Pryzby
On Mon, May 17, 2021 at 04:44:11PM +0900, Michael Paquier wrote:
> On Sun, Mar 21, 2021 at 02:30:04PM -0500, Justin Pryzby wrote:
> 
> For this patch, this is going to require a bit more in terms of library
> linking as the block decompression is done in xlogreader.c, so that's one
> thing to worry about.

I'm not sure what you mean here ?

> +   {
> +   {"wal_compression_method", PGC_SIGHUP, WAL_SETTINGS,
> +   gettext_noop("Set the method used to compress full page images in 
> the WAL."),
> +   NULL
> +   },
> +   &wal_compression_method,
> +   WAL_COMPRESSION_PGLZ, wal_compression_options,
> +   NULL, NULL, NULL
> +   },
> The interface is not quite right to me.  I think that we should just
> change wal_compression to become an enum, with extra values for pglz
> and the new method.  "on" would be a synonym for "pglz".

Andrey gave a reason in March:

| I hope one day we will compress all WAL, not just FPIs. Advanced archive 
management tools already do so, why not compress it in walwriter?
| When this will be implemented, we could have wal_compression = {off, fpi, 
all}.
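
(For what it's worth, the enum shape Michael suggests would presumably
look something like the following; the labels and enum values are
illustrative, not taken from the patch.)

static const struct config_enum_entry wal_compression_options[] = {
    {"off", WAL_COMPRESSION_NONE, false},
    {"pglz", WAL_COMPRESSION_PGLZ, false},
    {"zlib", WAL_COMPRESSION_ZLIB, false},
    {"on", WAL_COMPRESSION_PGLZ, false},    /* backward-compatible synonym */
    {NULL, 0, false}
};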

> +/* This is a mapping indexed by wal_compression */
> +// XXX: maybe this is better done as a GUC hook to assign the 1)
> method; and 2) level
> +struct walcompression walmethods[] = {
> +   {"pglz",WAL_COMPRESSION_PGLZ},
> +   {"zlib",WAL_COMPRESSION_ZLIB},
> +};
> Don't think you need a hook here, but zlib, or any other method which
> is not supported by the build, had better not be listed instead.  This
> ensures that the value cannot be assigned if the binaries don't
> support that.

I think you're confusing the walmethods struct (which is unconditional) with
wal_compression_options, which is conditional.

> The patch set is a gathering of various things, and not only things
> associated to the compression method used for FPIs.
> What is the point of that in patch 0002?
> > Subject: [PATCH 03/12] Make sure published XIDs are persistent
> Patch 0003 looks unrelated to this thread.

..for the reason that I gave:

| And 2ndary patches from another thread to allow passing recovery tests.
|These two patches are a prerequisite for this patch to progress:
| * Run 011_crash_recovery.pl with wal_level=minimal
| * Make sure published XIDs are persistent

> > Subject: [PATCH 04/12] wal_compression_method: default to zlib..
> 
> Patch 0004 could happen, however there are no reasons given why this
> is adapted.  Changing the default is not going to happen for the time
> release where this feature is added, anyway.

From the commit message:
| this is meant to exercise the CIs, and not meant to be merged

> +   default:
> +   report_invalid_record(record, "image at %X/%X is compressed with 
> unsupported codec, block %d (%d/%s)",
> + (uint32) (record->ReadRecPtr >> 32),
> + (uint32) record->ReadRecPtr,
> + block_id,
> + compression_method,
> + wal_compression_name(compression_method));
> +   return false;
> In xlogreader.c, the error message is helpful this way.  However, we
> would not know which compression method failed if there is a
> decompression failure for a method supported by the build restoring
> this block.  That would be good to add.

I don't understand you here - that's what wal_compression_name is for?
2021-05-18 21:38:04.324 CDT [26984] FATAL:  unknown compression method 
requested: 2(lz4)

> I think that what we actually need for this thread are patches 0001,
> 0005 and 0006 merged together to study first the performance we have
> with each one of the compression methods proposed, and then let's just
> pick one.  Reading around, zstd and zlib compresse more but take
> longer.  LZ4 is faster than the others, but can compress less.
> With limited bandwidth, less data makes sense, and my guess is that
> most users care most about the speed of recovery if we can afford
> speed with an acceptable compression ratio.

I don't see why we'd add a GUC for configuring compression but not include
the 30 lines of code needed to support a 3rd method that is already used by the
core server.

-- 
Justin
From 0e477fdcdf25a1e59163d748df03e570684a1955 Mon Sep 17 00:00:00 2001
From: Andrey Borodin 
Date: Sat, 27 Feb 2021 09:03:50 +0500
Subject: [PATCH v7 1/9] Allow alternate compression methods for
 wal_compression

TODO: bump XLOG_PAGE_MAGIC
---
 doc/src/sgml/config.sgml  | 17 +
 src/backend/Makefile  |  2 +-
 src/backend/access/transam/xlog.c | 10 +++
 src/backend/access/transam/xloginsert.c   | 67 ---
 src/backend/access/transam/xlogreader.c   | 64 +-
 src/backend/utils/misc/guc.c  | 11 +++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/access/xlog.h |  1 +
 src/incl

Re: Multiple pg_waldump --rmgr options

2021-05-18 Thread Michael Paquier
On Wed, May 19, 2021 at 11:50:52AM +0900, Kyotaro Horiguchi wrote:
> At Tue, 18 May 2021 23:23:02 +0800, Julien Rouhaud  wrote 
> in 
>> The change and the patch look sensible to me.
> 
> +1.

Agreed.
--
Michael


signature.asc
Description: PGP signature


Re: Subscription tests fail under CLOBBER_CACHE_ALWAYS

2021-05-18 Thread Amit Langote
On Wed, May 19, 2021 at 12:04 PM Michael Paquier  wrote:
> On Tue, May 18, 2021 at 07:42:08PM -0400, Tom Lane wrote:
> > I count three distinct bugs that were exposed by this attempt:
> >
> > 1. In the part of 013_partition.pl that tests firing AFTER
> > triggers on partitioned tables, we have a case of continuing
> > to access a relcache entry that's already been closed.
> > (I'm not quite sure why prion's -DRELCACHE_FORCE_RELEASE
> > hasn't exposed this.)  It looks to me like instead we had
> > a relcache reference leak before f3b141c48, but now, the
> > only relcache reference count on a partition child table
> > is dropped by ExecCleanupTupleRouting, which logical/worker.c
> > invokes before it fires triggers on that table.  Kaboom.

Oops.

> > This might go away if worker.c weren't so creatively different
> > from the other code paths concerned with executor shutdown.
>
> The tuple routing has made the whole worker logic messier by a larger
> degree compared to when this stuff was only able to apply DMLs changes
> on the partition leaves.  I know that it is not that great to be more
> creative here, but we need to make sure that AfterTriggerEndQuery() is
> moved before ExecCleanupTupleRouting().  We could either keep the
> ExecCleanupTupleRouting() calls as they are now, and call
> AfterTriggerEndQuery() in more code paths.

Yeah, that's what I thought to propose doing too.  Your patch looks
enough in that regard.  Thanks for writing it.

> Or we could have one
> PartitionTupleRouting and one ModifyTableState created beforehand
> in create_estate_for_relation() if applying the change on a
> partitioned table but that means manipulating more structures across
> more layers of this code.

Yeah, that seems like too much change to me too.

>  Something like the attached fixes the
> problem for me, but honestly it does not help in clarifying this code
> more.  I was not patient enough to wait for CLOBBER_CACHE_ALWAYS to
> initialize the nodes in the TAP tests, so I have tested that with a
> setup initialized with a non-clobber build, and reproduced the problem
> with CLOBBER_CACHE_ALWAYS builds on those same nodes.

I have checked the fix works with a CLOBBER_CACHE_ALWAYS build.

> You are right that this is not a problem of 14~.  I can reproduce the
> problem on 13 as well, and we have no coverage for tuple routing with
> triggers on this branch, so this would never have been stressed in the
> buildfarm.  There is a good argument to be made here in cherry-picking
> 2ecfeda3 to REL_13_STABLE.

+1

-- 
Amit Langote
EDB: http://www.enterprisedb.com




Re: Subscription tests fail under CLOBBER_CACHE_ALWAYS

2021-05-18 Thread Tom Lane
Michael Paquier  writes:
> On Tue, May 18, 2021 at 07:42:08PM -0400, Tom Lane wrote:
>> This might go away if worker.c weren't so creatively different
>> from the other code paths concerned with executor shutdown.

> The tuple routing has made the whole worker logic messier by a larger
> degree compared to when this stuff was only able to apply DMLs changes
> on the partition leaves.  I know that it is not that great to be more
> creative here, but we need to make sure that AfterTriggerEndQuery() is
> moved before ExecCleanupTupleRouting().  We could either keep the
> ExecCleanupTupleRouting() calls as they are now, and call
> AfterTriggerEndQuery() in more code paths.  Or we could have one
> PartitionTupleRouting and one ModifyTableState created beforehand
> in create_estate_for_relation() if applying the change on a
> partitioned table but that means manipulating more structures across 
> more layers of this code.  Something like the attached fixes the
> problem for me, but honestly it does not help in clarifying this code
> more.

I was wondering if we could move the ExecCleanupTupleRouting call
into finish_estate.  copyfrom.c, for example, does that during
its shutdown function.  Compare also the worker.c changes proposed
in

https://www.postgresql.org/message-id/3362608.1621367104%40sss.pgh.pa.us

which are because I discovered it's unsafe to pop the snapshot
before running AfterTriggerEndQuery.
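
Something along these lines, as a sketch only (the ModifyTableState and
routing state would have to be plumbed through, which is part of the
churn Michael mentions):

static void
finish_estate(EState *estate, ModifyTableState *mtstate,
              PartitionTupleRouting *proute)
{
    /* Handle any queued AFTER triggers while the rels are still open. */
    AfterTriggerEndQuery(estate);

    /* Only the tuple-routing path has routing state to clean up. */
    if (proute)
        ExecCleanupTupleRouting(mtstate, proute);

    ExecResetTupleTable(estate->es_tupleTable, false);
    FreeExecutorState(estate);
}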

regards, tom lane




Re: Subscription tests fail under CLOBBER_CACHE_ALWAYS

2021-05-18 Thread Michael Paquier
On Tue, May 18, 2021 at 11:46:25PM -0400, Tom Lane wrote:
> I was wondering if we could move the ExecCleanupTupleRouting call
> into finish_estate.  copyfrom.c, for example, does that during
> its shutdown function.  Compare also the worker.c changes proposed
> in

Yeah, the first patch I wrote for this thread was pushing out
PopActiveSnapshot() into the finish() routine, but I really found the
creation of the ModifyTableState stuff needed for a partitioned table
done in create_estate_for_relation() to make the code more confusing,
as that's only a piece needed for the tuple routing path.  That
said, minimizing calls to PopActiveSnapshot() and PushActiveSnapshot()
is an improvement.  That's why I settled on more calls to
AfterTriggerEndQuery() in the 4 code paths of any apply (tuple routing
+ 3 DMLs).

> https://www.postgresql.org/message-id/3362608.1621367104%40sss.pgh.pa.us
> 
> which are because I discovered it's unsafe to pop the snapshot
> before running AfterTriggerEndQuery.

Didn't remember this one.  This reminds me of something similar I did
a couple of weeks ago for the worker code, similar to what you have
here.  Moving the snapshot push to be earlier, as your other patch is
doing, brought a bit more sanity when it came to opening the
indexes of the relation on which a change is applied, as we need an
active snapshot for predicates and expressions (aka ExecOpenIndices
and ExecCloseIndices).
--
Michael


signature.asc
Description: PGP signature


Re: pg_get_wal_replay_pause_state() should not return 'paused' while a promotion is ongoing.

2021-05-18 Thread Fujii Masao




On 2021/05/18 15:46, Michael Paquier wrote:

On Tue, May 18, 2021 at 12:48:38PM +0900, Fujii Masao wrote:

Currently a promotion causes all available WAL to be replayed before
a standby becomes a primary whether it was in paused state or not.
OTOH, something like immediate promotion (i.e., standby becomes
a primary without replaying outstanding WAL) might be useful for
some cases. I don't object to that.


Sounds like a "promotion immediate" mode.  It does not sound difficult
or expensive to add a small test for that in one of the existing
recovery tests triggering a promotion.  Could you add one based on
pg_get_wal_replay_pause_state()?


You're thinking to add the test like the following?
#1. Pause the recovery
#2. Confirm that pg_get_wal_replay_pause_state() returns 'paused'
#3. Trigger standby promotion
#4. Confirm that pg_get_wal_replay_pause_state() returns 'not paused'

It doesn't seem easy to do test #4 stably, because
pg_get_wal_replay_pause_state() would need to be executed
before the promotion finishes.
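
In SQL terms the sequence would be roughly this, all run on the standby
(step #4 is the one with the timing problem):

SELECT pg_wal_replay_pause();
SELECT pg_get_wal_replay_pause_state();   -- expect 'paused'
SELECT pg_promote(wait => false);
SELECT pg_get_wal_replay_pause_state();   -- only meaningful before the promotion completes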

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: pg_get_wal_replay_pause_state() should not return 'paused' while a promotion is ongoing.

2021-05-18 Thread Fujii Masao




On 2021/05/19 9:53, Kyotaro Horiguchi wrote:

At Tue, 18 May 2021 12:48:38 +0900, Fujii Masao  
wrote in

Currently a promotion causes all available WAL to be replayed before
a standby becomes a primary whether it was in paused state or not.
OTOH, something like immediate promotion (i.e., standby becomes
a primary without replaying outstanding WAL) might be useful for
some cases. I don't object to that.


Mmm. I was confused with recovery target + pause. Actually promotion
works as so and it is documented.  Anyway it is a matter of the next
version.

I forgot to mention the patch itself, but what the patch does looks
fine to me.  Disabling pause after setting SharedPromoteIsTriggered
prevents later re-pausing (from the sql function).


Thanks for the review! I pushed the patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Forget close an open relation in ReorderBufferProcessTXN()

2021-05-18 Thread Amit Kapila
On Wed, May 19, 2021 at 8:28 AM osumi.takami...@fujitsu.com
 wrote:
>
> On Wednesday, May 19, 2021 11:33 AM Amit Kapila  
> wrote:
> > On Wed, May 19, 2021 at 7:59 AM Amit Kapila 
> > wrote:
> > >
> > > On Tue, May 18, 2021 at 5:29 PM Amit Kapila 
> > wrote:
> > > >
> > > > On Tue, May 18, 2021 at 1:29 PM osumi.takami...@fujitsu.com
> > > >  wrote:
> > > > >
> > > > > On Monday, May 17, 2021 6:45 PM Amit Kapila
> >  wrote:
> > > > > >
> > > > > > We allow taking locks on system catalogs, so why prohibit
> > > > > > user_catalog_tables? However, I agree that if we want plugins to
> > > > > > acquire the lock on user_catalog_tables then we should either
> > > > > > prohibit decoding of such relations or do something else to avoid
> > deadlock hazards.
> > > > > OK.
> > > > >
> > > > > Although we have not concluded the range of logical decoding of
> > > > > user_catalog_table (like we should exclude TRUNCATE command only
> > > > > or all operations on that type of table), I'm worried that
> > > > > disallowing the logical decoding of user_catalog_table produces
> > > > > the deadlock still. It's because disabling it by itself does not 
> > > > > affect the
> > lock taken by TRUNCATE command. What I have in mind is an example
> > below.
> > > > >
> > > > > (1) plugin (e.g. pgoutput) is designed to take a lock on
> > user_catalog_table.
> > > > > (2) logical replication is set up in synchronous mode.
> > > > > (3) TRUNCATE command takes an access exclusive lock on the
> > user_catalog_table.
> > > > > (4) This time, we don't do anything for the TRUNCATE decoding.
> > > > > (5) the plugin tries to take a lock on the truncated table
> > > > > but, it can't due to the lock by TRUNCATE command.
> > > > >
> > > >
> > > > If you skip decoding of truncate then we won't invoke plugin API so
> > > > step 5 will be skipped.
> > > >
> > >
> > > I think you were right here even if skip step-4, the plugin might take
> > > a lock on user_catalog_table for something else.
> Yes, we can't know the exact place where the user wants to use the feature
> of user_catalog_table. Even if we imagine that the user skips
> the truncate decoding (I imagined continuing and skipping a case in
> REORDER_BUFFER_CHANGE_TRUNCATE of pgoutput),
> it's possible that the user accesses it somewhere else for different purpose 
> with lock.
>
>
> > > I am not sure, but I
> > > think we should prohibit truncate on user_catalog_tables, as we
> > > prohibit truncate on system catalog tables (see below [1]), if we want
> > > plugins to lock them; otherwise, as you said, it might lead to deadlock.
> > > On that matter, I think we should also check all other operations
> > > where we can take an exclusive lock on a [user]_catalog_table, say the
> > > CLUSTER command, and compare the behavior with that on system catalog
> > > tables.
> > >
> > > [1]
> > > postgres=# truncate pg_class;
> > > ERROR:  permission denied: "pg_class" is a system catalog postgres=#
> > > cluster pg_class;
> > > ERROR:  there is no previously clustered index for table "pg_class"
> > >
> >
> > Please ignore the CLUSTER command, as we need to use 'USING index' with that
> > command to make it succeed. I just wanted to show the TRUNCATE command
> > behavior that you asked about.
> Thank you so much for clarifying the direction.
> I agree with changing the TRUNCATE side.
> I'll make a patch based on this.
>

Isn't it a better idea to start a new thread where you can summarize
whatever we have discussed here about user_catalog_tables? We might
get opinions from others on the behavior change you are
proposing.

-- 
With Regards,
Amit Kapila.




Re: Subscription tests fail under CLOBBER_CACHE_ALWAYS

2021-05-18 Thread Amit Kapila
On Wed, May 19, 2021 at 9:54 AM Michael Paquier  wrote:
>
> On Tue, May 18, 2021 at 11:46:25PM -0400, Tom Lane wrote:
> > I was wondering if we could move the ExecCleanupTupleRouting call
> > into finish_estate.  copyfrom.c, for example, does that during
> > its shutdown function.  Compare also the worker.c changes proposed
> > in
>
> Yeah, the first patch I wrote for this thread pushed
> PopActiveSnapshot() out into the finish() routine, but I found that
> creating the ModifyTableState stuff needed for a partitioned table
> in create_estate_for_relation() made the code more confusing,
> as that piece is only needed for the tuple routing path.
>

How about moving AfterTriggerEndQuery() to apply_handle_*_internal
calls? That way, we might not even need to change Push/Pop calls.

-- 
With Regards,
Amit Kapila.




Re: subscriptioncheck failure

2021-05-18 Thread Amit Kapila
On Tue, May 18, 2021 at 11:25 AM vignesh C  wrote:
>
> Thanks for the comments, the attached patch has the changes for the same.
>

Thanks, I have pushed your patch after making minor changes in the comments.

-- 
With Regards,
Amit Kapila.




Re: Subscription tests fail under CLOBBER_CACHE_ALWAYS

2021-05-18 Thread Michael Paquier
On Wed, May 19, 2021 at 10:26:28AM +0530, Amit Kapila wrote:
> How about moving AfterTriggerEndQuery() to apply_handle_*_internal
> calls? That way, we might not even need to change Push/Pop calls.

Isn't that going to be a problem when a tuple is moved to a new
partition in the tuple routing?  This does a DELETE followed by an
INSERT, but the operation is an UPDATE.
--
Michael




Re: pg_get_wal_replay_pause_state() should not return 'paused' while a promotion is ongoing.

2021-05-18 Thread Michael Paquier
On Wed, May 19, 2021 at 01:46:45PM +0900, Fujii Masao wrote:
> You're thinking of adding a test like the following?
> #1. Pause the recovery
> #2. Confirm that pg_get_wal_replay_pause_state() returns 'paused'
> #3. Trigger standby promotion
> #4. Confirm that pg_get_wal_replay_pause_state() returns 'not paused'
> 
> It doesn't seem easy to do test #4 stably, because
> pg_get_wal_replay_pause_state() needs to be executed
> before the promotion finishes.

Couldn't you rely on recovery_end_command for number #4?  The shared
memory state tracked by SharedRecoveryState is updated after the
end-recovery command is triggered, so pg_get_wal_replay_pause_state()
can be executed at this point.  A bit hairy, I agree, but that would
work :)

Still, it would be easy enough to have something for
pg_get_wal_replay_pause_state() called on a standby when there is no 
pause (your case #2) and a second case on a standby with a pause
triggered, though (not listed above).
--
Michael




Re: Subscription tests fail under CLOBBER_CACHE_ALWAYS

2021-05-18 Thread Amit Kapila
On Wed, May 19, 2021 at 10:35 AM Michael Paquier  wrote:
>
> On Wed, May 19, 2021 at 10:26:28AM +0530, Amit Kapila wrote:
> > How about moving AfterTriggerEndQuery() to apply_handle_*_internal
> > calls? That way, we might not even need to change Push/Pop calls.
>
> Isn't that going to be a problem when a tuple is moved to a new
> partition in the tuple routing?
>

Right, it won't work.

-- 
With Regards,
Amit Kapila.




Re: postgres_fdw - should we tighten up batch_size, fetch_size options against non-numeric values?

2021-05-18 Thread Bharath Rupireddy
On Wed, May 19, 2021 at 8:28 AM Fujii Masao  wrote:
> >>> I agree with throwing an error for non-numeric junk though.
> >>> Allowing that on the grounds of backwards compatibility
> >>> seems like too much of a stretch.
> >>
> >> +1.
> >
> > +1.
>
> +1

Thanks all for your inputs. PSA a v2 patch that uses parse_int for
batch_size/fetch_size and parse_real for fdw_startup_cost and
fdw_tuple_cost.
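For reference, a validator built on those helpers could look roughly like
the sketch below; the function name, option name, and error wording here are
assumptions for illustration, and the attached patch is the authoritative
change:

/*
 * Hypothetical sketch of validating an integer option value with
 * parse_int().  A real-valued option would use parse_real() the same way.
 */
#include "postgres.h"

#include "utils/guc.h"

static int
get_integer_option(const char *name, const char *value)
{
	int			int_val;
	const char *hintmsg = NULL;

	if (!parse_int(value, &int_val, 0, &hintmsg) || int_val <= 0)
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
				 errmsg("invalid value for integer option \"%s\": \"%s\"",
						name, value),
				 hintmsg ? errhint("%s", _(hintmsg)) : 0));

	return int_val;
}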

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


v2-0001-Tighten-up-batch_size-fetch_size-options-against-.patch
Description: Binary data


Re: pg_get_wal_replay_pause_state() should not return 'paused' while a promotion is ongoing.

2021-05-18 Thread Dilip Kumar
On Wed, May 19, 2021 at 10:16 AM Fujii Masao
 wrote:
>
> On 2021/05/18 15:46, Michael Paquier wrote:
> > On Tue, May 18, 2021 at 12:48:38PM +0900, Fujii Masao wrote:
> >> Currently a promotion causes all available WAL to be replayed before
> >> a standby becomes a primary whether it was in paused state or not.
> >> OTOH, something like immediate promotion (i.e., standby becomes
> >> a primary without replaying outstanding WAL) might be useful for
> >> some cases. I don't object to that.
> >
> > Sounds like a "promotion immediate" mode.  It sounds neither difficult
> > nor expensive to add a small test for that in one of the existing
> > recovery tests triggering a promotion.  Could you add one based on
> > pg_get_wal_replay_pause_state()?
>
> > You're thinking of adding a test like the following?
> #1. Pause the recovery
> #2. Confirm that pg_get_wal_replay_pause_state() returns 'paused'
> #3. Trigger standby promotion
> #4. Confirm that pg_get_wal_replay_pause_state() returns 'not paused'
>
> > It doesn't seem easy to do test #4 stably, because
> > pg_get_wal_replay_pause_state() needs to be executed
> > before the promotion finishes.

Even for #2, we cannot be sure whether it will be 'paused' or
'pause requested'.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: pg_get_wal_replay_pause_state() should not return 'paused' while a promotion is ongoing.

2021-05-18 Thread Kyotaro Horiguchi
At Wed, 19 May 2021 11:19:13 +0530, Dilip Kumar  wrote 
in 
> On Wed, May 19, 2021 at 10:16 AM Fujii Masao
>  wrote:
> >
> > On 2021/05/18 15:46, Michael Paquier wrote:
> > > On Tue, May 18, 2021 at 12:48:38PM +0900, Fujii Masao wrote:
> > >> Currently a promotion causes all available WAL to be replayed before
> > >> a standby becomes a primary whether it was in paused state or not.
> > >> OTOH, something like immediate promotion (i.e., standby becomes
> > >> a primary without replaying outstanding WAL) might be useful for
> > >> some cases. I don't object to that.
> > >
> > > Sounds like a "promotion immediate" mode.  It sounds neither difficult
> > > nor expensive to add a small test for that in one of the existing
> > > recovery tests triggering a promotion.  Could you add one based on
> > > pg_get_wal_replay_pause_state()?
> >
> > You're thinking of adding a test like the following?
> > #1. Pause the recovery
> > #2. Confirm that pg_get_wal_replay_pause_state() returns 'paused'
> > #3. Trigger standby promotion
> > #4. Confirm that pg_get_wal_replay_pause_state() returns 'not paused'
> >
> > It doesn't seem easy to do test #4 stably, because
> > pg_get_wal_replay_pause_state() needs to be executed
> > before the promotion finishes.
> 
> Even for #2, we cannot be sure whether it will be 'paused' or
> 'pause requested'.

We often use poll_query_until() to make sure some desired state is
reached.  And, as Michael suggested, pg_get_wal_replay_pause_state()
still works while recovery_end_command is running.  So, in a bit more
detail, the steps are as follows (a rough sketch comes after the list):

#0. Equip the server with a recovery_end_command that waits for some
trigger, then start the server.
#1. Pause the recovery
#2. Wait until pg_get_wal_replay_pause_state() returns 'paused'
#3. Trigger standby promotion
#4. Wait until pg_get_wal_replay_pause_state() returns 'not paused'
#5. Trigger recovery_end_command to let promotion proceed.
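Sketched as a TAP test with the PostgresNode/TestLib infrastructure, under
the assumptions that pg_promote(false) is used for a non-blocking promotion
and that a simple trigger file releases recovery_end_command; node names,
paths and timing details are illustrative only, not a tested recipe:

use strict;
use warnings;
use PostgresNode;
use TestLib;

my $primary = get_new_node('primary');
$primary->init(allows_streaming => 1);
$primary->start;
$primary->backup('bkp');

my $standby = get_new_node('standby');
$standby->init_from_backup($primary, 'bkp', has_streaming => 1);

# #0: recovery_end_command blocks until a trigger file shows up
my $trigger = $standby->basedir . '/resume_promotion';
$standby->append_conf('postgresql.conf',
	"recovery_end_command = 'while [ ! -f $trigger ]; do sleep 1; done'");
$standby->start;

# #1 and #2: request a pause, then wait until the startup process confirms it
$standby->safe_psql('postgres', 'SELECT pg_wal_replay_pause()');
$standby->poll_query_until('postgres',
	"SELECT pg_get_wal_replay_pause_state() = 'paused'");

# #3 and #4: trigger promotion without waiting, then wait for 'not paused'
$standby->safe_psql('postgres', 'SELECT pg_promote(false)');
$standby->poll_query_until('postgres',
	"SELECT pg_get_wal_replay_pause_state() = 'not paused'");

# #5: release recovery_end_command so the promotion can complete
TestLib::append_to_file($trigger, "done\n");
$standby->poll_query_until('postgres', 'SELECT NOT pg_is_in_recovery()');

Step #4 relies on the pause state being cleared once promotion is
triggered, i.e. the behavior of the patch pushed up-thread.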

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Subscription tests fail under CLOBBER_CACHE_ALWAYS

2021-05-18 Thread Amit Langote
On Wed, May 19, 2021 at 2:05 PM Michael Paquier  wrote:
> On Wed, May 19, 2021 at 10:26:28AM +0530, Amit Kapila wrote:
> > How about moving AfterTriggerEndQuery() to apply_handle_*_internal
> > calls? That way, we might not even need to change Push/Pop calls.
>
> Isn't that going to be a problem when a tuple is moved to a new
> partition in the tuple routing?  This does a DELETE followed by an
> INSERT, but the operation is an UPDATE.

That indeed doesn't work.  Once AfterTriggerEndQuery() gets
called for the DELETE from apply_handle_delete_internal(), the after
triggers of the subsequent INSERT can't be processed, instead causing:

ERROR:  AfterTriggerSaveEvent() called outside of query

IOW, the patch you posted earlier seems like the way to go.

-- 
Amit Langote
EDB: http://www.enterprisedb.com




Re: subscriptioncheck failure

2021-05-18 Thread vignesh C
On Wed, May 19, 2021 at 10:28 AM Amit Kapila  wrote:
>
> On Tue, May 18, 2021 at 11:25 AM vignesh C  wrote:
> >
> > Thanks for the comments, the attached patch has the changes for the same.
> >
>
> Thanks, I have pushed your patch after making minor changes in the comments.

Thanks for pushing this patch.

Regards,
Vignesh