Re: EXEC_BACKEND vs bgworkers without BGWORKER_SHMEM_ACCESS

2021-08-02 Thread Andres Freund
Hi,

On 2021-08-01 23:51:16 -0700, Andres Freund wrote:
> In practice this will lead pretty quickly to a segfault, because process exit
> will run proc_exit callbacks, which in turn will try to do a ProcKill(). Or
> logging dereferences MyProc, or ...
> 
> It seems the above code block would need to at least do shmem_exit() before
> the PGSharedMemoryDetach()?

> I tested both the crash and that a shmem_exit() fixes it with an ugly hack in
> regress.c. I don't really know how to write a good testcase for this, given
> that the communication facilities of an unconnected bgworker are quite
> limited... I guess the bgworker could create files or something :(.
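
(For reference, a minimal sketch of the ordering suggested above - purely
illustrative, not an actual patch; the flag test mirrors the existing check
for workers started without shared memory access:)

if ((worker->bgw_flags & BGWORKER_SHMEM_ACCESS) == 0)
{
	/* run the shmem_exit callbacks while shared memory is still mapped */
	shmem_exit(0);
	PGSharedMemoryDetach();
}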

Hm. There may be another way: Because BackgroundWorkerEntry() does not
actually need a lock, we could transport BackgroundWorkerData via
backend_save_variables(). To make the connected case work, we'd of course
continue to call CreateSharedMemoryAndSemaphores (and thus InitProcess()) for
those.

But that doesn't really seem better? Sure, it's nice to not need a pgproc
entry unnecessarily, but it's not like unconnected bgworkers are commonly
used and it does seem like it'd end up with more complicated code?

Greetings,

Andres Freund




Re: [HACKERS] logical decoding of two-phase transactions

2021-08-02 Thread Peter Smith
Please find attached the latest patch v101

Differences:

* Rebased to HEAD @ today.

* Addresses all v100 review comments from Vignesh [1], Greg [2], Tang
[3], and Amit [4].


[1] 
https://www.postgresql.org/message-id/CALDaNm2N3qgSv3XyHW%2Bop_SJcLmz1s%3D0jJc-taxUmeEBXW5EPw%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CAJcOf-eGCg8s%2BtT_Mo5xKksAhA%3D%3D1QAH_Sj7SqBotHQhwapdEw%40mail.gmail.com
[3] 
https://www.postgresql.org/message-id/OS0PR01MB6113B6F3C88C3C11A2F62CFFFBEC9%40OS0PR01MB6113.jpnprd01.prod.outlook.com
[4] 
https://www.postgresql.org/message-id/CAA4eK1%2BVcNDUYSZVm3xNg4YLzaMqcZHqxznfbAYvJWoVzvLqFQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia


v101-0001-Add-prepare-API-support-for-streaming-transacti.patch
Description: Binary data


Re: [HACKERS] logical decoding of two-phase transactions

2021-08-02 Thread Peter Smith
On Sat, Jul 31, 2021 at 9:36 PM Amit Kapila  wrote:
>
> On Fri, Jul 30, 2021 at 9:32 AM Peter Smith  wrote:
> >
> > Please find attached the latest patch set v100*
> >
>
> Few minor comments:
> 1.
> CREATE SUBSCRIPTION regress_testsub CONNECTION
> 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect =
> false, two_phase = true);
>
>  \dRs+
> +
>  --fail - alter of two_phase option not supported.
>  ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
>
> Spurious line addition.
>

OK. Fixed in v101.

> 2.
> +TransactionId
> +logicalrep_read_stream_prepare(StringInfo in,
> LogicalRepPreparedTxnData *prepare_data)
> +{
> + logicalrep_read_prepare_common(in, "stream prepare", prepare_data);
> +
> + return prepare_data->xid;
> +}
>
> There is no need to return the TransactionId separately. The caller can
> get it from prepare_data, if required.
>

OK. Modified in v101.
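
For clarity, the revised shape is roughly the following (a sketch only; the
actual v101 code may differ in details):

void
logicalrep_read_stream_prepare(StringInfo in,
							   LogicalRepPreparedTxnData *prepare_data)
{
	logicalrep_read_prepare_common(in, "stream prepare", prepare_data);
}

and the caller simply uses prepare_data->xid where it needs the xid.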

> 3.
>  extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
>   TransactionId *subxid);
>
> +extern void logicalrep_write_stream_prepare(StringInfo out,
> ReorderBufferTXN *txn,
> + XLogRecPtr prepare_lsn);
> +extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
> + LogicalRepPreparedTxnData *prepare_data);
> +
> +
>
> Keep the order of declarations the same as its definitions in proto.c
> which means move these after logicalrep_read_rollback_prepared() and
> be careful about extra blank lines.
>

OK. Reordered in v101.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: [HACKERS] logical decoding of two-phase transactions

2021-08-02 Thread Peter Smith
On Fri, Jul 30, 2021 at 6:25 PM tanghy.f...@fujitsu.com
 wrote:
>
> On Friday, July 30, 2021 12:02 PM Peter Smith wrote:
> >
> > Please find attached the latest patch set v100*
> >
> > v99-0002 --> v100-0001
> >
>
> Thanks for your patch. A few comments on the test file:
>
> 1. src/test/subscription/t/022_twophase_cascade.pl
>
> 1.1
> I saw your test cases for "PREPARE / COMMIT PREPARED" and "PREPARE with a 
> nested ROLLBACK TO SAVEPOINT", but didn't see cases for "PREPARE / ROLLBACK 
> PREPARED". Is it needless or just missing?
>

Yes, that test used to exist but it was removed in response to a
previous review (see [1] comment #10, where Amit said there were too many
tests).

> 1.2
> +# check inserts are visible at subscriber(s).
> +# All the streamed data (prior to the SAVEPOINT) should be rolled back.
> +# (3, 'foobar') should be committed.
>
> I think it should be (, 'foobar') here.
>

Good catch. Fixed in v101.

> 1.3
> +$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab 
> where b = 'foobar';");
> +is($result, qq(1), 'Rows committed are present on subscriber B');
> +$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
> +
>
> It seems the test is not finished yet. We didn't check the value of 'result'. 
> Besides, maybe we should also check node_C, right?
>

Oops. Thanks for finding this! Fixed in v101 by adding the missing tests.

> 1.4
> +$node_B->append_conf('postgresql.conf',qq(max_prepared_transactions 
> = 10));
> +$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 
> 64kB));
>
> You see, the first line uses a TAB but the second line uses a space.
> Also, we could use only one statement to append these two settings to run 
> tests a bit faster. Thoughts?
> Something like:
>
> $node_B->append_conf(
> 'postgresql.conf', qq(
> max_prepared_transactions = 10
> logical_decoding_work_mem = 64kB
> ));
>

OK. In v101 I changed the config as you suggested for both the 022 and
023 TAP tests.

--
[1] 
https://www.postgresql.org/message-id/CAHut%2BPts_bWx_RrXu%2BYwbiJva33nTROoQQP5H4pVrF%2BNcCMkRA%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia.




Re: Failed transaction statistics to measure the logical replication progress

2021-08-02 Thread Masahiko Sawada
On Mon, Aug 2, 2021 at 2:52 PM osumi.takami...@fujitsu.com
 wrote:
>
> On Thursday, July 29, 2021 10:50 AM Masahiko Sawada  
> wrote:
> > On Thu, Jul 8, 2021 at 3:55 PM osumi.takami...@fujitsu.com
> >  wrote:
> > > When the current HEAD fails during logical decoding, the failure
> > > increments txns count in pg_stat_replication_slots - [1] and adds
> > > the transaction size to the sum of bytes in the same repeatedly on
> > > the publisher, until the problem is solved.
> > > One of the good examples is duplication error on the subscriber side
> > > and this applies to both streaming and spill cases as well.
> > >
> > > This update prevents users from grasping the exact number and size
> > > of successful and unsuccessful transactions. Accordingly, we need to
> > > have new columns of failed transactions that will work to
> > > differentiate both of them for all types, which means spill,
> > > streaming and normal transactions. This will help users to measure
> > > the exact status of logical replication.
> >
> > Could you please elaborate on use cases of the proposed statistics?
> > For example, the current statistics on pg_replication_slots can be
> > used for tuning logical_decoding_work_mem as well as inferring the
> > total amount of bytes passed to the output plugin. How will the user use 
> > those statistics?
> >
> > Also, if we want the stats of successful transactions why don't we
> > show the stats of successful transactions in the view instead of ones
> > of failed transactions?
> It works to show the ratio of successful and unsuccessful transactions,
> which should be helpful from an administrator's perspective.
> FYI, the POC patch added columns whose names I prefixed with 'failed'.
> Substantially, it provides that ratio when the user compares the existing
> columns with the newly introduced ones in pg_stat_replication_slots.

What can the administrator use the ratio of successful to
unsuccessful logical replication transactions for? For example, IIUC,
if a conflict happens on the subscriber as you mentioned, the
successful transaction ratio calculated from those statistics gets
low, perhaps meaning that logical replication has stopped. But that can
also be checked via the pg_stat_replication view or the
pg_replication_slots view (or the error counts of the
pg_stat_subscription_errors view I’m proposing [1]). Do you have other
use cases?

> > Moreover, if we want to add more statistics on the view in the future,
> > it further increases the usage of shared memory. If we want to track
> > the stats of successful transactions, I think it's easier to track
> > them on the subscriber side rather than the publisher side. We can
> > increase counters when applying [stream]commit/abort logical changes on the 
> > subscriber.
> It's true that tracking the stats of successful and unsuccessful transactions
> on the *sub* is easier than on the pub. After some survey, it turned out that,
> on the pub, I cannot distinguish from the protocol messages the failure cases
> (e.g. a duplication error on the sub) from intentional and successful user
> operations (e.g. DROP SUBSCRIPTION and ALTER SUBSCRIPTION DISABLE).
>
> If we truly want to achieve this change on the publisher side, a protocol
> change would be required to make the above cases distinguishable, so now I
> feel that it is better to do this on the subscriber side.
>
> Accordingly, I'm thinking of having unsuccessful and successful stats on the
> sub side.
> Sawada-san is now implementing a new view in [1].
> Do you think that I should write a patch to introduce a new separate view,
> or write a patch to add more columns to the new view
> "pg_stat_subscription_errors" that is added at [1]?

The pg_stat_subscription_errors view I'm proposing shows the details of
errors happening during logical replication. So I think a separate view,
or the pg_stat_subscription view, would be a more appropriate place.

Regards,

[1] 
https://www.postgresql.org/message-id/flat/CAD21AoDeScrsHhLyEPYqN3sydg6PxAPVBboK=30xjfuvihn...@mail.gmail.com

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [HACKERS] logical decoding of two-phase transactions

2021-08-02 Thread Peter Smith
On Fri, Jul 30, 2021 at 3:18 PM vignesh C  wrote:
>
> On Fri, Jul 30, 2021 at 9:32 AM Peter Smith  wrote:
> >
> > Please find attached the latest patch set v100*
> >
> > v99-0002 --> v100-0001
> >
> > Differences:
> >
> > * Rebased to HEAD @ today (needed because some recent commits [1][2] broke 
> > v99)
> >
>
> The patch applies neatly, tests passes and documentation looks good.
> A Few minor comments.
> 1) This blank line is not required:
> +-- two_phase and streaming are compatible.
> +CREATE SUBSCRIPTION regress_testsub CONNECTION
> 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect =
> false, streaming = true, two_phase = true);
> +
>

Fixed in v101.

> 2) Few points have punctuation mark and few don't have, we can make it
> consistent:
> +###
> +# Test 2PC PREPARE / ROLLBACK PREPARED.
> +# 1. Table is deleted back to 2 rows which are replicated on subscriber.
> +# 2. Data is streamed using 2PC
> +# 3. Do rollback prepared.
> +#
> +# Expect data rolls back leaving only the original 2 rows.
> +###
>

Fixed in v101.

> 3) similarly here too:
> +###
> +# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
> +# 1. Table is deleted back to 2 rows which are replicated on subscriber.
> +# 2. Data is streamed using 2PC.
> +# 3. A single row INSERT is done which is after the PREPARE
> +# 4. Then do a ROLLBACK PREPARED.
> +#
> +# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
> +# (the original 2 + inserted 1)
> +###
>

Fixed in v101.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: [HACKERS] logical decoding of two-phase transactions

2021-08-02 Thread Peter Smith
On Fri, Jul 30, 2021 at 4:33 PM Greg Nancarrow  wrote:
>
> On Fri, Jul 30, 2021 at 2:02 PM Peter Smith  wrote:
> >
> > Please find attached the latest patch set v100*
> >
> > v99-0002 --> v100-0001
> >
>
> A few minor comments:
>
> (1) doc/src/sgml/protocol.sgml
>
> In the following description, is the word "large" really needed? Also
> "the message ... for a ... message" sounds a bit odd, as does
> "two-phase prepare".
>
> What about the following:
>
> BEFORE:
> +Identifies the message as a two-phase prepare for a
> large in-progress transaction message.
> AFTER:
> +Identifies the message as a prepare for an
> in-progress two-phase transaction.
>

Updated in v101.

The other nearby messages refer to a “streamed
transaction”, so I’ve changed this to say “Identifies the message as a
two-phase prepare for a streamed transaction message.” (e.g. compare
this text with the existing similar text for ‘P’).

BTW, I agree with you that "the message ... for a ... message" seems
odd; it was written in this way only to be consistent with existing
documentation, which all uses the same odd phrasing.

> (2) src/backend/replication/logical/worker.c
>
> Similar format comment, but one uses a full-stop and the other
> doesn't, looks a bit odd, since the lines are near each other.
>
> * 1. Replay all the spooled operations - Similar code as for
>
> * 2. Mark the transaction as prepared. - Similar code as for
>

Updated in v101 to make the comments consistent.

> (3) src/test/subscription/t/023_twophase_stream.pl
>
> Shouldn't the following comment mention, for example, "with streaming"
> or something to that effect?
>
> # logical replication of 2PC test
>

Fixed as suggested in v101.

--
Kind Regards,
Peter Smith.
Fujitsu Australia.




Re: Record a Bitmapset of non-pruned partitions

2021-08-02 Thread David Rowley
On Mon, 2 Aug 2021 at 00:31, David Rowley  wrote:
> I had another self review of these and I'm pretty happy with them. I'm
> quite glad to see the performance of querying a single partition of a
> table with large numbers of partitions no longer tails off as much as
> it used to.

I did some profiling and benchmarking on master and with the v4 patch.
With a hash partitioned table containing 8192 partitions I see the
following when running a query that selects a value from a single
partition:

  19.39%  postgres      [.] apply_scanjoin_target_to_paths
   5.35%  postgres      [.] base_yyparse
   4.71%  postgres      [.] AllocSetAlloc
   2.86%  libc-2.33.so  [.] __memset_avx2_unaligned_erms
   2.17%  postgres      [.] SearchCatCacheInternal

With the patched version, I see:

   5.89%  postgres      [.] AllocSetAlloc
   3.97%  postgres      [.] base_yyparse
   3.87%  libc-2.33.so  [.] __memset_avx2_unaligned_erms
   2.44%  postgres      [.] SearchCatCacheInternal
   1.29%  postgres      [.] hash_search_with_hash_value

I'm getting:
master: 16613 tps
patched: 22078 tps

So there's about 32% performance improvement with this number of
partitions. These results are not the same as my original email here
as I've only recently discovered that I really need to pin pgbench and
the postgres backend to the same CPU core to get good and stable
performance from a single threaded pgbench job.

FWIW, the next thing of note on the profile is the following line in
expand_partitioned_rtentry():

relinfo->part_rels = (RelOptInfo **) palloc0(relinfo->nparts *
                                             sizeof(RelOptInfo *));

If anyone has any objections to either the v4 0001 or 0002 patch, can
they let me know soon? I'm currently seeing no reason that they can't
go in.

David




Re: logical replication empty transactions

2021-08-02 Thread Amit Kapila
On Fri, Jul 23, 2021 at 3:39 PM Ajin Cherian  wrote:
>

Let's first split the patch for prepared and non-prepared cases as
that will help to focus on each of them separately. BTW, why haven't
you considered implementing point 1b as explained by Andres in his
email [1]? I think we can send a keepalive message in case of
synchronous replication when we skip an empty transaction; otherwise,
it might delay responses to transactions in synchronous_commit mode.
I think in the tests done in the thread, this might not have shown up
because we are already sending keepalives too frequently. But what if
someone disables wal_sender_timeout or sets it to a very large value?
See WalSndKeepaliveIfNecessary. The other thing you might want to look
at is if the reason for frequent keepalives is the same as described
in the email [2].

Few other miscellaneous comments:
1.
static void
 pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
- XLogRecPtr commit_lsn)
+ XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+ TimestampTz prepare_time)
 {
+ PGOutputTxnData*txndata = (PGOutputTxnData *) txn->output_plugin_private;
+
  OutputPluginUpdateProgress(ctx);

+ /*
+ * If the BEGIN PREPARE was not yet sent, then it means there were no
+ * relevant changes encountered, so we can skip the COMMIT PREPARED
+ * message too.
+ */
+ if (txndata)
+ {
+ bool skip = !txndata->sent_begin_txn;
+ pfree(txndata);
+ txn->output_plugin_private = NULL;

How is this supposed to work after the restart when prepared is sent
before the restart and we are just sending commit_prepared after
restart? Won't this lead to sending commit_prepared even when the
corresponding prepare is not sent? Can we think of a better way to
deal with this?

2.
@@ -222,8 +224,10 @@ logicalrep_write_commit_prepared(StringInfo out,
ReorderBufferTXN *txn,
  pq_sendbyte(out, flags);

  /* send fields */
+ pq_sendint64(out, prepare_end_lsn);
  pq_sendint64(out, commit_lsn);
  pq_sendint64(out, txn->end_lsn);
+ pq_sendint64(out, prepare_time);

Doesn't this mean a change of protocol? And how is it supposed to work
when, say, the publisher is 15 and the subscriber is 14, which I think works
without such a change?


[1] - 
https://www.postgresql.org/message-id/20200309183018.tzkzwu635sd366ej%40alap3.anarazel.de
[2] - 
https://www.postgresql.org/message-id/CALtH27cip5uQNJb4uHjLXtx1R52ELqXVfcP9fhHr%3DAvFo1dtqw%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: [HACKERS] logical decoding of two-phase transactions

2021-08-02 Thread Amit Kapila
On Sun, Aug 1, 2021 at 3:51 PM Peter Smith  wrote:
>
> On Sun, Aug 1, 2021 at 3:05 AM vignesh C  wrote:
> >
> > On Sat, Jul 31, 2021 at 11:12 AM Ajin Cherian  wrote:
> > >
> > > On Sat, Jul 31, 2021 at 2:39 PM Amit Kapila  
> > > wrote:
> > >
> > > > Here, the test is expecting 2 prepared transactions corresponding to
> > > > two subscriptions but it waits for just one subscription via
> > > > appname_copy. It should wait for the second subscription using
> > > > $appname as well.
> > > >
> > > > What do you think?
> > >
> > > I agree with this analysis. The test needs to wait for both
> > > subscriptions to catch up.
> > > Attached is a patch that addresses this issue.
> >
> > The changes look good to me.
> >
>
> The patch to the test code posted by Ajin LGTM also.
>

Pushed.

-- 
With Regards,
Amit Kapila.




[PATCH]Comment improvement in publication.sql

2021-08-02 Thread tanghy.f...@fujitsu.com
Hi Hackers

While reviewing and testing another patch at [1], I found that some comments in
the existing test code of src/test/regress/sql/publication.sql are a little bit
confusing. Attached is a patch to fix them; please take a look.

Here is the detail:

Existing code:
CREATE TABLE testpub_tbl2 (id serial primary key, data text);
-- fail - can't add to for all tables publication
ALTER PUBLICATION testpub_foralltables ADD TABLE testpub_tbl2;
-- fail - can't drop from all tables publication
ALTER PUBLICATION testpub_foralltables DROP TABLE testpub_tbl2;
-- fail - can't add to for all tables publication
ALTER PUBLICATION testpub_foralltables SET TABLE pub_test.testpub_nopk;

After patch:
CREATE TABLE testpub_tbl2 (id serial primary key, data text);
-- fail - tables can't be added to or dropped from FOR ALL TABLES publications
ALTER PUBLICATION testpub_foralltables ADD TABLE testpub_tbl2;
ALTER PUBLICATION testpub_foralltables DROP TABLE testpub_tbl2;
ALTER PUBLICATION testpub_foralltables SET TABLE pub_test.testpub_nopk;

You can see that the comment for SET TABLE is not appropriate.
All three operations above (ADD, DROP, SET) output the same message:
"DETAIL:  Tables cannot be added to or dropped from FOR ALL TABLES
publications."

So maybe we can combine the existing three comments into one. Thoughts?

Besides, another comment in the same file is not clear enough to me:
-- fail - already added
CREATE PUBLICATION testpub_fortbl FOR TABLE testpub_tbl1;

Maybe it would be better to use 'already exists'. Thoughts?

[1] 
https://www.postgresql.org/message-id/OS0PR01MB6113CC160D0F134448567FDDFBE99%40OS0PR01MB6113.jpnprd01.prod.outlook.com

Regards
Tang


improvement_on_comment.patch
Description: improvement_on_comment.patch


Re: Speeding up GIST index creation for tsvectors

2021-08-02 Thread John Naylor
On Sun, Aug 1, 2021 at 11:41 PM Amit Khandekar 
wrote:
>
> > FWIW, I anticipate some push back from the community because of the
> > fact that the optimization relies on statistical phenomena.
>
> I dug into this issue for tsvector type. Found out that it's the way
> in which the sign array elements are arranged that is causing the
> pointers to be misaligned:
[...]
> If siglen is not a multiple of 8 (say 700), cache[j].sign will in some
> cases point to non-8-byte-aligned addresses, as you can see in the
> above code snippet.
>
> Replacing siglen by MAXALIGN64(siglen) in the above snippet gets rid
> of the misalignment. This change applied over the 0001-v3 patch gives
> additional ~15% benefit. MAXALIGN64(siglen) will cause a bit more
> space, but for not-so-small siglens, this looks worth doing. Haven't
> yet checked into types other than tsvector.

Sounds good.
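
For illustration, the kind of change described above might look roughly like
this (a hypothetical sketch; the snippet itself is elided in the quote, so
"ptr" and the exact layout of cache[] are made-up stand-ins):

/* pad each signature to an 8-byte multiple so every sign pointer stays aligned */
cache[j].sign = ptr + j * MAXALIGN64(siglen);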

> Will get back with your other review comments. I thought, meanwhile, I
> can post the above update first.

Thinking some more, my discomfort with inline functions that call a global
function doesn't make logical sense, so feel free to do it that way if you
like.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Use generation context to speed up tuplesorts

2021-08-02 Thread David Rowley
On Sat, 31 Jul 2021 at 14:34, Tomas Vondra
 wrote:
> I spent a bit of time hacking on the Generation context, adding the two
> improvements discussed in this thread:
>
> 1) internal handling of block sizes, similar to what AllocSet does (it
> pretty much just copies parts of it)
>
> 2) keeper block (we keep one empry block instead of freeing it)
>
> 3) I've also added allocChunkLimit, which makes it look a bit more like
> AllocSet (instead of using just blockSize/8, which does not work too
> well with dynamic blockSize)
>
> I haven't done any extensive tests on it, but it does pass check-world
> with asserts etc. I haven't touched the comments, those need updating.
> regards

Thanks for starting work on that. I've only had a quick look, but I
can have a more detailed look once you've got it more complete.

For now it does not really look like the keeper block stuff is wired
up the same way as in aset.c. I'd expect you to be allocating that in
the same malloc as you're using to allocate the context struct itself
in GenerationContextCreate().

Also, likely as a result of the above, minContextSize does not seem to
be wired up to anything apart from an Assert().

David




Re: Minimal logical decoding on standbys

2021-08-02 Thread Ronan Dunklau
On Tuesday 27 July 2021 at 09:23:48 CEST, Drouvot, Bertrand wrote:
> Thanks for the warning, rebase done and new v21 version attached.
> 
> Bertrand

Hello,

I've taken a look at this patch, and it looks like you addressed every prior
remark, including the race condition Andres was worried about.

As for the basics: make check-world and make installcheck-world pass. 

I think the behaviour when dropping a database on the primary should be
documented, and proper procedures for handling it correctly should be 
suggested.

Something along the lines of:

"If a database is dropped on the primary server, the logical replication slot 
on the standby will be dropped as well. This means that you should ensure that 
the client usually connected to this slot has had the opportunity to stream 
the latest changes before the database is dropped."

As for the patches themselves, I only have two small comments to make.

In patch 0002, in InvalidateConflictingLogicalReplicationSlots, I don't see the 
need to check for an InvalidOid since we already check the SlotIsLogical:

+   /* We are only dealing with *logical* slot conflicts. */
+   if (!SlotIsLogical(s))
+       continue;
+
+   /* not our database and we don't want all the database, skip */
+   if ((s->data.database != InvalidOid && s->data.database != dboid) &&
+       TransactionIdIsValid(xid))
+       continue;

In patch 0004, small typo in the test file:
+##
+# Test standby promotion and logical decoding bheavior
+# after the standby gets promoted.
+##

Thank you for working on this !

Regards,

-- 
Ronan Dunklau






Update of partition key on foreign server

2021-08-02 Thread Илья Гладышев
Hi,

I am currently looking into a partition constraint violation that occurs on 
update of a partition key on a foreign server. To reproduce it you can run:

On server 1 using port 5432:

create extension postgres_fdw;
create table players (id integer, name text) partition by list(name);
create table players_0 partition of players for values in ('name1');
create server neighbor foreign data wrapper postgres_fdw
    options (host 'localhost', port '5433', dbname 'postgres');
create foreign table players_1 partition of players for values in ('name2')
    server neighbor;
create user mapping for current_user server neighbor;

On server 2 using port 5433:

create extension postgres_fdw;
create server neighbor foreign data wrapper postgres_fdw
    options (host 'localhost', port '5432', dbname 'postgres');
create table players (id integer, name text) partition by list(name);
create table players_1 partition of players for values in ('name2');
create user mapping for current_user server neighbor;
create foreign table players_0 partition of players for values in ('name1')
    server neighbor;

insert into players values (1, 'name1');
update players set name='name2' where name='name1';
ERROR:  new row for relation "players_0" violates partition constraint
DETAIL:  Failing row contains (1, name2).
CONTEXT:  remote SQL command: UPDATE public.players_0 SET name = 'name2'::text
    WHERE ((name = 'name1'::text))

From what I have read on the mailing list, I understand that this is a known 
problem, but I haven't found any thread discussing it in particular. Is this 
something that needs fixing? If it is, I want to try to do it, but I’m 
wondering if there are any known caveats and looking for any tips on how to 
implement it. 

My understanding is that this should be implemented in a similar way to how the 
row is routed from a local partition in ExecCrossPartitionUpdate, so the update 
should be replaced with a foreign delete + local/foreign insert. In addition, a 
direct update should be forbidden when the query modifies the partition key. 
I’m probably missing a lot of details (feel free to point them out), but is the
general idea correct? I will be grateful for any feedback.

Thanks,
Ilya Gladyshev

Re: .ready and .done files considered harmful

2021-08-02 Thread Dipesh Pandit
Hi,

> I think what you are saying is true before v14, but not in v14 and master.
Yes, we can use archiver specific shared memory. Thanks.

> I don't think it's great that we're using up SIGINT for this purpose.
> There aren't that many signals available at the O/S level that we can
> use for our purposes, and we generally try to multiplex them at the
> application layer, e.g. by setting a latch or a flag in shared memory,
> rather than using a separate signal. Can we do something of that sort
> here? Or maybe we don't even need a signal. ThisTimeLineID is already
> visible in shared memory, so why not just have the archiver just check
> and see whether it's changed, say via a new accessor function
> GetCurrentTimeLineID()? I guess there could be a concern about the
> expensive of that, because we'd probably be taking a spinlock or an
> lwlock for every cycle, but I don't think it's probably that bad,
> because I doubt we can archive much more than a double-digit number of
> files per second even with a very fast archive_command, and contention
> on a lock generally requires a five digit number of acquisitions per
> second. It would be worth testing to see if we can see a problem here,
> but I'm fairly hopeful that it's not an issue. If we do feel that it's
> important to avoid repeatedly taking a lock, let's see if we can find
> a way to do it without dedicating a signal to this purpose.

We can maintain the current timeline ID in archiver-specific shared memory.
If we switch to a new timeline then the backend process can update the new
timeline ID in shared memory. The archiver can keep track of the current
timeline ID, and if it finds that there has been a timeline switch it can
perform a full directory scan to make sure that archiving history files
takes precedence over WAL files. Access to the shared memory area can be
protected by adding a WALArchiverLock. If we take this approach then we
don't need a dedicated signal to notify the archiver of a timeline switch.

> The problem with all this is that you can't understand either function
> in isolation. Unless you read them both together and look at all of
> the ways these three variables are manipulated, you can't really
> understand the logic. And there's really no reason why that needs to
> be true. The job of cleaning timeline_switch and setting dirScan could
> be done entirely within pgarch_readyXlog(), and so could the job of
> incrementing nextLogSegNo, because we're not going to again call
> pgarch_readyXlog() unless archiving succeeded.

> Also note that the TLI which is stored in curFileTLI corresponds to
> the segment number stored in nextLogSegNo, yet one of them has "cur"
> for "current" in the name and the other has "next". It would be easier
> to read the code if the names were chosen more consistently.

> My tentative idea as to how to clean this up is: declare a new struct
> with a name like readyXlogState and members lastTLI and lastSegNo.
> Have pgarch_ArchiverCopyLoop() declare a variable of this type, zero
> it, pass it as a parameter to pgarch_readyXlog(), and otherwise leave
> it alone. Then let pgarch_readyXlog() do all of the manipulation of
> the values stored therein.

Makes sense; we can move the entire logic to a single function,
pgarch_readyXlog(), and declare a new struct readyXLogState.

I think we cannot declare a variable of this type in
pgarch_ArchiverCopyLoop(), because that function is called every time the
archiver wakes up. Initializing readyXLogState there would reset the next
anticipated log segment number whenever the archiver wakes up from a wait
state. We can instead declare and initialize it in pgarch_MainLoop() to
avoid resetting the next anticipated log segment number when the archiver
wakes up.
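
For reference, a minimal sketch of such a struct (illustrative only; the
actual patch may use different member names or placement):

typedef struct readyXLogState
{
	TimeLineID	lastTLI;	/* timeline of the last segment returned */
	XLogSegNo	lastSegNo;	/* last segment number returned */
} readyXLogState;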

> You've moved this comment from its original location, but the trouble
> is that the comment is 100% false. In fact, the whole reason why you
> wrote this patch is *because* this comment is 100% false. In fact it
> is not difficult to create cases where each scan finds many files, and
> the purpose of the patch is precisely to optimize the code that the
> person who wrote this thought didn't need optimizing. Now it may take
> some work to figure out what we want to say here exactly, but
> preserving the comment as it's written here is certainly misleading.

Yes, I agree. We can update the comments here to list the scenarios
where we may need to perform a full directory scan.

I have incorporated these changes into an updated patch. Please find
the attached patch v5.

Thanks,
Dipesh
From 3e0f690d1b8fd1d07fa503a807ee97cb9ed7377b Mon Sep 17 00:00:00 2001
From: Dipesh Pandit 
Date: Wed, 30 Jun 2021 14:05:58 +0530
Subject: [PATCH] mitigate directory scan for WAL archiver

WAL archiver scans the status directory to identify the next WAL file
that needs to be archived. This directory scan can be minimised by
maintaining the log segment number of current file which is being
archived and incrementing it by '1' to get the next WAL fi

Re: Background writer and checkpointer in crash recovery

2021-08-02 Thread Robert Haas
On Fri, Jul 30, 2021 at 4:00 PM Andres Freund  wrote:
> I don't agree with that? If (user+system) << wall then it is very likely
> that recovery is IO bound. If system is a large percentage of wall, then
> shared buffers is likely too small (or we're replacing the wrong
> buffers) because you spend a lot of time copying data in/out of the
> kernel page cache. If user is the majority, you're CPU bound.
>
> Without user & system time it's much harder to figure that out - at
> least for me.

Oh, that's an interesting point. At least now I'll know why I am
supposed to care about that log line the next time I see it. I guess
we could include both things, though the line might get a little long.
Or maybe there's some other subset that would make sense.
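
FWIW, a rough sketch of one way to get all three numbers together, using the
existing pg_rusage helpers (illustrative only, not a proposal for the exact
log line):

#include "utils/pg_rusage.h"

PGRUsage	ru0;

pg_rusage_init(&ru0);
/* ... run the redo loop ... */
ereport(LOG,
		(errmsg("redo done; %s", pg_rusage_show(&ru0))));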

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Background writer and checkpointer in crash recovery

2021-08-02 Thread Robert Haas
On Mon, Aug 2, 2021 at 1:37 AM Thomas Munro  wrote:
> I pushed 0001.

That's great. I just realized that this leaves us with identical
RequestCheckpoint() calls in two nearby places. Is there any reason
not to further simplify as in the attached?

-- 
Robert Haas
EDB: http://www.enterprisedb.com


0001-Further-simplify-a-bit-of-logic-in-StartupXLOG.patch
Description: Binary data


Re: Background writer and checkpointer in crash recovery

2021-08-02 Thread Amul Sul
On Mon, Aug 2, 2021 at 6:47 PM Robert Haas  wrote:
>
> On Mon, Aug 2, 2021 at 1:37 AM Thomas Munro  wrote:
> > I pushed 0001.
>
> That's great. I just realized that this leaves us with identical
> RequestCheckpoint() calls in two nearby places. Is there any reason
> not to further simplify as in the attached?
>
+1. Also, can we just get rid of the "promoted" flag? The only
inconvenience is that we might need to check three flags instead of one to
perform the checkpoint at the end.




RE: Background writer and checkpointer in crash recovery

2021-08-02 Thread Jakub Wartak
> On Fri, Jul 30, 2021 at 4:00 PM Andres Freund  wrote:
> > I don't agree with that? If (user+system) << wall then it is very
> > likely that recovery is IO bound. If system is a large percentage of
> > wall, then shared buffers is likely too small (or we're replacing the
> > wrong
> > buffers) because you spend a lot of time copying data in/out of the
> > kernel page cache. If user is the majority, you're CPU bound.
> >
> > Without user & system time it's much harder to figure that out - at
> > least for me.
> 
> Oh, that's an interesting point. At least now I'll know why I am supposed to 
> care
> about that log line the next time I see it. I guess we could include both 
> things,
> though the line might get a little long.
> Or maybe there's some other subset that would make sense.

Hi Robert,

The email threads from [1] can serve as an indication that having a complete
view of resource usage (user+system+elapsed) is advantageous in different
situations and pretty platform-generic. Also, as Andres and Simon pointed out
earlier, the insight into crash recovery/replication performance is next to
nothing judging just by the currently emitted log messages. The more there
is, the better, I think.

BTW, if you know there's this big push for refactoring StartupXLOG(), then what
frustrating^H^H^H^H^H could be done better - at least from an end-user point of
view - is that there is a lack of near-real-time cyclic messages (every 1 min?)
about current status, performance and maybe even an ETA (simplistic case;
assuming it is linear).

[1] - https://commitfest.postgresql.org/30/2799/ 




[BUG] wrong refresh when ALTER SUBSCRIPTION ADD/DROP PUBLICATION

2021-08-02 Thread houzj.f...@fujitsu.com

Hi hackers,

When testing some other logical replication related patches, I found two
unexpected behaviours about ALTER SUBSCRIPTION ADD/DROP PUBLICATION.

(1)
when I execute the following SQL statements[1], the data of the tables registered to
'pub' wasn't copied to the subscriber side, which means the tablesync worker didn't
start.

-sub originally had 2 pub nodes(pub,pub2)
ALTER SUBSCRIPTION sub drop PUBLICATION pub;
ALTER SUBSCRIPTION sub add PUBLICATION pub;
-

(2)
And when I execute the following SQL statements, the data of the table registered
to 'pub2' is synced again.

-sub originally had 2 pub nodes(pub,pub2)
ALTER SUBSCRIPTION sub drop PUBLICATION pub;
ALTER SUBSCRIPTION sub REFRESH PUBLICATION;
-

After looking into this problem, I think the reason is that
ALTER SUBSCRIPTION ADD/DROP PUBLICATION misuses the function
AlterSubscription_refresh().

For DROP cases:

Currently, AlterSubscription_refresh() first fetches the target tables from
the target publication, and also fetches the tables on the subscriber side
from pg_subscription_rel. Then it checks each table from the local
pg_subscription_rel, and if the table does not exist among the tables fetched
from the target publication it drops it.

The logic above only works for SET PUBLICATION. However, for DROP PUBLICATION,
the tables fetched from the target publication are actually the tables that need
to be dropped. If we reuse the above logic, it will drop the wrong tables, which
results in the unexpected behaviour in (1) and (2). (ADD PUBLICATION has a
similar problem.)

So, I think we'd better fix this problem. I tried adding some additional checks
in AlterSubscription_refresh() to avoid the problem, as in the attached patch.
I'm not sure whether we need to refactor further.

Best regards,
houzj



0001-fix-ALTER-SUB-ADD-DROP-PUBLICATION.patch
Description: 0001-fix-ALTER-SUB-ADD-DROP-PUBLICATION.patch


Re: Background writer and checkpointer in crash recovery

2021-08-02 Thread Robert Haas
On Mon, Aug 2, 2021 at 9:40 AM Amul Sul  wrote:
> > That's great. I just realized that this leaves us with identical
> > RequestCheckpoint() calls in two nearby places. Is there any reason
> > not to further simplify as in the attached?
> >
> +1, also, can we just get rid off "promoted" flag? The only
> inconvenience is we might need to check three flags instead of one to
> perform the checkpoint at the end.

I'm not sure that's really a win, because if we use the same
conditional in both places then it might not be clear to somebody that
they're supposed to match.

I do think we ought to consider renaming the flag, though, because
LocalPromoteIsTriggered is already tracking whether we were promoted,
and what this flag is tracking is not quite that thing. Its real
purpose is to track whether we should request a non-immediate
checkpoint at the end of recovery, and I think we ought to name it
some way that reflects that, e.g. perform_checkpoint_later.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Background writer and checkpointer in crash recovery

2021-08-02 Thread Robert Haas
On Mon, Aug 2, 2021 at 9:51 AM Jakub Wartak  wrote:
> BTW, if you know there's this big push for refactoring StartupXLOG(), then what
> frustrating^H^H^H^H^H could be done better - at least from an end-user point of
> view - is that there is a lack of near-real-time cyclic messages (every 1 min?)
> about current status, performance and maybe even an ETA (simplistic case;
> assuming it is linear).

I agree. See 
https://www.postgresql.org/message-id/ca+tgmoahqrgdfobwgy16xcomtxxsrvgfb2jncvb7-ubuee1...@mail.gmail.com
and subsequent discussion.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: EXEC_BACKEND vs bgworkers without BGWORKER_SHMEM_ACCESS

2021-08-02 Thread Robert Haas
On Mon, Aug 2, 2021 at 2:51 AM Andres Freund  wrote:
> which presents a problem: We've initialized all kind of references to shared
> memory, own a PGPROC, but have detached from shared memory.
>
> In practice this will lead pretty quickly to a segfault, because process exit
> will run proc_exit callbacks, which in turn will try to do a ProcKill(). Or
> logging dereferences MyProc, or ...
>
> It seems the above code block would need to at least do shmem_exit() before
> the PGSharedMemoryDetach()?
>
> This code has been introduced in
>
> commit 4d155d8b08fe08c1a1649fdbad61c6dcf4a8671f
> Author: Robert Haas 
> Date:   2014-05-07 14:54:43 -0400
>
> Detach shared memory from bgworkers without shmem access.
>
> Since the postmaster won't perform a crash-and-restart sequence
> for background workers which don't request shared memory access,
> we'd better make sure that they can't corrupt shared memory.
>
> Patch by me, review by Tom Lane.
>
> but before that things were just slightly differently broken...

If you're saying that this code has been 100% broken for 7 years and
nobody's noticed until now, then that suggests that nobody actually
uses non-shmem-connected bgworkers. I sort of hate to give up on that
concept but if we've really gone that many years without anyone
noticing obvious breakage then maybe we should.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: EXEC_BACKEND vs bgworkers without BGWORKER_SHMEM_ACCESS

2021-08-02 Thread Tom Lane
Robert Haas  writes:
> If you're saying that this code has been 100% broken for 7 years and
> nobody's noticed until now, then that suggests that nobody actually
> uses non-shmem-connected bgworkers. I sort of hate to give up on that
> concept but if we've really gone that many years without anyone
> noticing obvious breakage then maybe we should.

Well, the problem only exists on Windows so maybe this indeed
escaped notice.  Still, this is good evidence that the case isn't
used *much*, and TBH I don't see many applications for it.
I can't say I'm excited about putting effort into fixing it.

regards, tom lane




Re: 2021-07 CF now in progress

2021-08-02 Thread Ibrar Ahmed
On Mon, Jul 26, 2021 at 5:52 PM Ibrar Ahmed  wrote:

>
>
> On Mon, Jul 19, 2021 at 4:37 PM Ibrar Ahmed  wrote:
>
>>
>>
>> On Mon, Jul 12, 2021 at 4:59 PM Ibrar Ahmed 
>> wrote:
>>
>>>
>>> Hackers,
>>>
>>> The Commitfest 2021-07 is now in progress. It is one of the biggest one.
>>> Total number of patches of this commitfest is 342.
>>>
>>> Needs review: 204.
>>> Waiting on Author: 40.
>>> Ready for Committer: 18.
>>> Committed: 57.
>>> Moved to next CF: 3.
>>> Withdrawn: 15.
>>> Rejected: 3.
>>> Returned with Feedback: 2.
>>> Total: 342.
>>>
>>> If you are a patch author, please check http://commitfest.cputube.org to
>>> be sure your patch still applies, compiles, and passes tests.
>>>
>>> We need your involvement and participation in reviewing the patches.
>>> Let's try and make this happen.
>>>
>>> --
>>> Regards.
>>> Ibrar Ahmed
>>>
>>
>>
>> Over the past one week, statuses of 47 patches have been changed from
>> "Needs review". This still leaves us with 157 patches
>> requiring reviews. As always, your continuous support is appreciated to
>> get us over the line.
>>
>> Please look at the patches requiring review in the current commitfest.
>> Test, provide feedback where needed, and update the patch status.
>>
>> Total: 342.
>>
>> Needs review: 157.
>> Waiting on Author: 74.
>> Ready for Committer: 15.
>> Committed: 68.
>> Moved to next CF: 3.
>> Withdrawn: 19.
>> Rejected: 4.
>> Returned with Feedback: 2.
>>
>>
>> --
>> Ibrar Ahmed
>>
>
> Over the past one week, some progress was made, however, there are still
> 155 patches in total that require reviews. Time to continue pushing for
> maximising patch reviews and getting stuff committed in PostgreSQL.
>
> Total: 342.
> Needs review: 155.
> Waiting on Author: 67.
> Ready for Committer: 20.
> Committed: 70.
> Moved to next CF: 3.
> Withdrawn: 20.
> Rejected: 5.
> Returned with Feedback: 2.
>
> Thank you for your continued effort and support.
>
> --
> Ibrar Ahmed
>

Here is the current state of the commitfest. It looks like it should be
closed now. I don't have permission to do that.

Needs review: 148.
Waiting on Author: 61.
Ready for Committer: 22.
Committed: 79.
Moved to next CF: 4.
Returned with Feedback: 2.
Rejected: 6.
Withdrawn: 20.

Thanks to everyone who worked on the commitfest.


-- 
Ibrar Ahmed


Re: Corrected documentation of data type for the logical replication message formats.

2021-08-02 Thread Tom Lane
Peter Smith  writes:
> I agree. The specified value looks better when it comes first, as you did it.

Actually, it looks to me like we don't have to resolve the question of
which should come first, because I don't see any cases where it's
useful to have both.  I don't agree with appending "uint8" to those
field descriptions, because it's adding no information, especially
when the high bit couldn't be set anyway.

At some point it might be useful to add UInt to the set of base
data types, and then go through all the message types and decide
which fields we think are unsigned.  But that is not this patch,
and there would be questions about whether it constituted a protocol
break.

I noticed also that having to add "(Oid)" was sort of self-inflicted
damage, because the field descriptions were using the very vague
term "ID", when they could have said "OID" and been clear.  I left
the "(Oid)" additions in place but also changed the text.

Pushed with those changes.  I couldn't resist copy-editing the section
intro, too.

regards, tom lane




Re: Minimal logical decoding on standbys

2021-08-02 Thread Ronan Dunklau
On Monday 2 August 2021 at 17:31:46 CEST, Drouvot, Bertrand wrote:
> > I think the beahviour when dropping a database on the primary should be
> > documented, and proper procedures for handling it correctly should be
> > suggested.
> > 
> > Something along the lines of:
> > 
> > "If a database is dropped on the primary server, the logical replication
> > slot on the standby will be dropped as well. This means that you should
> > ensure that the client usually connected to this slot has had the
> > opportunity to stream the latest changes before the database is dropped."
> 
> I am not sure we should highlight this as part of this patch.
> 
> I mean, the same would currently occur on a non standby if you drop a
> database that has a replication slot linked to it.

The way I see it, the main difference is that you drop an additional object on
the standby, one which doesn't exist on the primary and which you don't
necessarily have any knowledge of there. As such, I thought it would be better
to be explicit about it and warn users of that possible case.


Regards,

--
Ronan Dunklau




Re: EXEC_BACKEND vs bgworkers without BGWORKER_SHMEM_ACCESS

2021-08-02 Thread Andres Freund
Hi,

On 2021-08-02 11:00:49 -0400, Tom Lane wrote:
> Robert Haas  writes:
> > If you're saying that this code has been 100% broken for 7 years and
> > nobody's noticed until now, then that suggests that nobody actually
> > uses non-shmem-connected bgworkers. I sort of hate to give up on that
> > concept but if we've really gone that many years without anyone
> > noticing obvious breakage then maybe we should.
> 
> Well, the problem only exists on Windows so maybe this indeed
> escaped notice.

Right. I did briefly look around and I didn't find bgworkers without
shmem attachment...


> Still, this is good evidence that the case isn't used *much*, and TBH
> I don't see many applications for it.  I can't say I'm excited about
> putting effort into fixing it.

Yea, I don't think it adds that much - without e.g. sharing a file
descriptor with the unconnected bgworker one can't implement something
like syslogger.

Greetings,

Andres Freund




Re: Corrected documentation of data type for the logical replication message formats.

2021-08-02 Thread vignesh C
On Mon, Aug 2, 2021 at 9:10 PM Tom Lane  wrote:
>
> Peter Smith  writes:
> > I agree. The specified value looks better when it comes first, as you did 
> > it.
>
> Actually, it looks to me like we don't have to resolve the question of
> which should come first, because I don't see any cases where it's
> useful to have both.  I don't agree with appending "uint8" to those
> field descriptions, because it's adding no information, especially
> when the high bit couldn't be set anyway.
>
> At some point it might be useful to add UInt to the set of base
> data types, and then go through all the message types and decide
> which fields we think are unsigned.  But that is not this patch,
> and there would be questions about whether it constituted a protocol
> break.
>
> I noticed also that having to add "(Oid)" was sort of self-inflicted
> damage, because the field descriptions were using the very vague
> term "ID", when they could have said "OID" and been clear.  I left
> the "(Oid)" additions in place but also changed the text.
>
> Pushed with those changes.  I couldn't resist copy-editing the section
> intro, too.

Thanks for pushing the patch.

Regards,
Vignesh




Re: [PATCH]Comment improvement in publication.sql

2021-08-02 Thread vignesh C
On Mon, Aug 2, 2021 at 3:31 PM tanghy.f...@fujitsu.com
 wrote:
>
> Hi Hackers
>
> When review and test another patch at [1], I found some comments in existing 
> test code of " src/test/regress/sql/publication.sql " is a little bit 
> confused.
> Attached a patch to fix them, please take a check.
>
> Here is the detail:
>
> Existing code:
> CREATE TABLE testpub_tbl2 (id serial primary key, data text);
> -- fail - can't add to for all tables publication
> ALTER PUBLICATION testpub_foralltables ADD TABLE testpub_tbl2;
> -- fail - can't drop from all tables publication
> ALTER PUBLICATION testpub_foralltables DROP TABLE testpub_tbl2;
> -- fail - can't add to for all tables publication
> ALTER PUBLICATION testpub_foralltables SET TABLE pub_test.testpub_nopk;
>
> After patch:
> CREATE TABLE testpub_tbl2 (id serial primary key, data text);
> -- fail - tables can't be added to or dropped from FOR ALL TABLES publications
> ALTER PUBLICATION testpub_foralltables ADD TABLE testpub_tbl2;
> ALTER PUBLICATION testpub_foralltables DROP TABLE testpub_tbl2;
> ALTER PUBLICATION testpub_foralltables SET TABLE pub_test.testpub_nopk;
>
> You see the comment for SET TABLE is not appropriate.
> And above three operations(ADD DROP SET) output the same message as below:
> "DETAIL:  Tables cannot be added to or dropped from FOR ALL TABLES 
> publications."
>
> So maybe we can combine the existing three comments to one, thoughts?
>
> Besides, another comment in the same file is not clear enough to me:
> -- fail - already added
> CREATE PUBLICATION testpub_fortbl FOR TABLE testpub_tbl1;
>
> Maybe it will be better if we use 'already exists'. Thoughts?
>
> [1] 
> https://www.postgresql.org/message-id/OS0PR01MB6113CC160D0F134448567FDDFBE99%40OS0PR01MB6113.jpnprd01.prod.outlook.com

Few minor suggestions:
1) Should we change below to "fail - tables can't be added, dropped or
set to "FOR ALL TABLES" publications"
--- fail - can't add to for all tables publication
+-- fail - tables can't be added to or dropped from FOR ALL TABLES publications

2) Should we change this to "--fail - already existing publication"
--- fail - already added
+-- fail - already exists

Regards,
Vignesh




Re: Minimal logical decoding on standbys

2021-08-02 Thread Andres Freund
Hi,

On 2021-08-02 16:45:23 +0200, Drouvot, Bertrand wrote:
> On 7/27/21 7:22 PM, Andres Freund wrote:
> > On 2021-07-27 09:23:48 +0200, Drouvot, Bertrand wrote:
> > > Thanks for the warning, rebase done and new v21 version attached.
> > Did you have a go at fixing the walsender race conditions I
> > (re-)discovered? Without fixing those I don't see this patch going in...
>
> Those new patches should be addressing all your previous code and TAP tests
> remarks, except those 2 for which I would need your input:
>
> 1. The first one is linked to your remarks:
> "
>
> > While working on this I found a, somewhat substantial, issue:
> >
> > When the primary is idle, on the standby logical decoding via walsender
> > will typically not process the records until further WAL writes come in
> > from the primary, or until a 10s lapsed.
> >
> > The problem is that WalSndWaitForWal() waits for the *replay* LSN to
> > increase, but gets woken up by walreceiver when new WAL has been
> > flushed. Which means that typically walsenders will get woken up at the
> > same time that the startup process will be - which means that by the
> > time the logical walsender checks GetXLogReplayRecPtr() it's unlikely
> > that the startup process already replayed the record and updated
> > XLogCtl->lastReplayedEndRecPtr.
> >
> > I think fixing this would require too invasive changes at this point. I
> > think we might be able to live with 10s delay issue for one release, but
> > it sure is ugly :(.
>
> This is indeed pretty painful. It's a lot more regularly occurring if you
> either have a slow disk, or you switch around the order of
> WakeupRecovery() and WalSndWakeup() in XLogWalRcvFlush().
>
> "
>
> Is that what you are referring to as the “walsender race conditions”?

Yes.


> If so, do you already have in mind a way to handle this? (I thought you
> already had in mind a way to handle it so the question)

Yes. I think we need to add a condition variable to be able to wait for
WAL positions to change. Either multiple condition variables (one for
the flush position, one for the replay position), or one that just
changes more often. That way one can wait for apply without a race
condition.
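
Roughly (a hypothetical sketch; the condition variable and wait event names
here are made up, and where exactly the CV would live in shared memory is an
open question):

/* walsender: wait until the startup process has replayed up to "loc" */
ConditionVariablePrepareToSleep(&XLogCtl->replayedCV);
while (GetXLogReplayRecPtr(NULL) < loc)
	ConditionVariableSleep(&XLogCtl->replayedCV, WAIT_EVENT_WAL_REPLAY_WAIT);
ConditionVariableCancelSleep();

/* startup process: after advancing lastReplayedEndRecPtr */
ConditionVariableBroadcast(&XLogCtl->replayedCV);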


> 2. The second one is linked to your remark:
>
> "There's also no test for a recovery conflict due to row removal"
>
> Don't you think that the "vacuum full" conflict test is enough?

It's not. It'll cause conflicts due to exclusive locks etc.


> If not, what kind of additional tests would you like to see?

A few catalog rows being removed (e.g. due to DELETE and then VACUUM
*without* full) and a standby without hot_standby_feedback catching
that.

Greetings,

Andres Freund




straightening out backend process startup

2021-08-02 Thread Andres Freund
Hi,

I've previously complained ([1]) that process initialization has gotten
very complicated. I hit this once more last week when trying to commit
one of the shared memory stats pieces...

I think there's quite a few different issues around this - here I'm just
trying to tackle a few of the most glaring (to me):

- AuxiliaryProcessMain() is used for two independent tasks: starting bootstrap /
  checker mode and starting auxiliary processes. In HEAD there are maybe 5 lines
  out of 250 that are actually common to both uses.

  A related oddity is that we reserve shared memory resources for bootstrap &
  checker aux processes, despite those never existing.

  This is addressed in patches 1-7

- The order of invocation of InitProcess()/InitAuxiliaryProcess() and
  BaseInit() depends on EXEC_BACKEND. Due to that there often is no single
  place initialization code can be put if it needs any locks.

  This is addressed in patches 8-9

- PostgresMain() has code for single user and multi user interleaved, making
  it unnecessarily hard to understand what's going on.

  This is addressed in patch 10


This isn't a patch series ready to commit; there's a bunch of polishing that
needs to be done if there's agreement.


Questions I have:

- What exactly to do with checker mode: Keep it as part of bootstrap, separate
  it out completely? What commandline flags?

- I used a separate argv entry to pass the aux proc type - do we rather want
  to go for the approach that e.g. bgworker went for? Adding code for string
  splitting seems a bit unnecessary to me.

- PostgresMainSingle() should probably not be in postgres.c. We could put it
  into postinit.c or ..?

- My first attempt at PostgresMainSingle() separated the single/multi user
  cases a bit more than the code right now, by having a PostgresMainCommon()
  which was called by PostgresMainSingle(), PostgresMain(). *Common only
  started with the MessageContext allocation, which did have the advantage of
  splitting out a few of the remaining conditional actions in PostgresMain()
  (PostmasterContext, welcome banner, Log_disconnections). But lead to a bit
  more duplication. I don't really have an opinion on what's better.

- I had to move the PgStartTime computation to a bit earlier for single user
  mode. That seems to make sense to me anyway, given that postmaster does so
  fairly early too.

  Any reason that'd be a bad idea?

  Arguably it should even be a tad earlier to be symmetric.


There's one further issue that I think is big enough to be worth
tackling in the near future: Too many things depend on BackendIds.

Aux processes need procsignal and backend status reporting, which use
BackendId for indexing. But they don't use sinval, so they don't have a
BackendId - so we have hacks to work around that in a few places. If we
instead make those places use pgprocno for indexing the whole issue
vanishes.

In fact, I think there's a good argument to be made that we should
entirely remove the concept of BackendIds and just use pgprocnos. We
have a fair number of places like SignalVirtualTransaction() that need
to search the procarray just to find the proc to signal based on the
BackendId. If we used pgprocno instead, that'd not be needed.
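
A minimal illustrative fragment (not part of the patch series; the
function name is invented, and it assumes storage/proc.h): with a
pgprocno in hand, the target PGPROC is just a direct index into the
shared proc array, so no procarray search is needed to translate a
BackendId first.

static PGPROC *
ProcByProcNumber(int pgprocno)
{
	/* direct lookup instead of scanning the procarray for a BackendId */
	return &ProcGlobal->allProcs[pgprocno];
}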

But perhaps that's a separate thread.

Greetings,

Andres Freund

[1] https://postgr.es/m/20210402002240.56cuz3uo3alnqwae%40alap3.anarazel.de
>From 9ad9ca5aa6684ab15000913fd101e329b1e2a74d Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Fri, 30 Jul 2021 18:53:52 -0700
Subject: [PATCH v1 01/10] process startup: Rename postmaster's --forkboot to
 --forkaux.

It is confusing that aux processes are started with --forkboot, given that
bootstrap mode cannot run below postmaster.

Author:
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/postmaster/postmaster.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 122c2b05bdb..df4b115c1e6 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -4930,7 +4930,7 @@ SubPostmasterMain(int argc, char *argv[])
 	if (strcmp(argv[1], "--forkbackend") == 0 ||
 		strcmp(argv[1], "--forkavlauncher") == 0 ||
 		strcmp(argv[1], "--forkavworker") == 0 ||
-		strcmp(argv[1], "--forkboot") == 0 ||
+		strcmp(argv[1], "--forkaux") == 0 ||
 		strncmp(argv[1], "--forkbgworker=", 15) == 0)
 		PGSharedMemoryReAttach();
 	else
@@ -5018,7 +5018,7 @@ SubPostmasterMain(int argc, char *argv[])
 		/* And run the backend */
 		BackendRun(&port);		/* does not return */
 	}
-	if (strcmp(argv[1], "--forkboot") == 0)
+	if (strcmp(argv[1], "--forkaux") == 0)
 	{
 		/* Restore basic shared memory pointers */
 		InitShmemAccess(UsedShmemSegAddr);
@@ -5427,7 +5427,7 @@ StartChildProcess(AuxProcType type)
 	av[ac++] = "postgres";
 
 #ifdef EXEC_BACKEND
-	av[ac++] = "--forkboot";
+	av[ac++] = "--forkaux";
 	av[ac++] = NULL;			/* filled in by postmaster_forkexec */
 #end

Passing data out of indexam or tableam

2021-08-02 Thread Chris Cleveland
Hi,

I'm attempting to develop a new type of full-text index. I've started
building a new index access method.

Is it possible to calculate values during the index scan and then make
them available for display in a select clause? For example,

SELECT title, score, total_hits
FROM mytable
WHERE (some search expression here)
ORDER BY score DESC
LIMIT 10

In this case, the index access method would calculate a score for each
row and attach it to the row for display. It would also calculate the
total number of hits, that is, rows that would be returned if there
were no LIMIT clause, and attach it only to the first row.

Is this possible?

Is it possible to do it with a table access method, as opposed to an
index access method?

-- 
Chris Cleveland
312-339-2677 mobile




Re: straightening out backend process startup

2021-08-02 Thread Tom Lane
Andres Freund  writes:
> I think there's quite a few different issues around this - here I'm just
> trying to tackle a few of the most glaring (to me):

No opinion yet about most of this, but I did want to react to

> In fact, I think there's a good argument to be made that we should
> entirely remove the concept of BackendIds and just use pgprocnos. We
> have a fair number of places like SignalVirtualTransaction() that need
> to search the procarray just to find the proc to signal based on the
> BackendId. If we used pgprocno instead, that'd not be needed.

If I understand what you're suggesting, it'd result in unused slots
in sinvaladt.c's procState[] array, which could create an annoying
drag on performance.  However, I think it might be reasonable to
switch other things over to using pgprocnos, with an eye to
eventually making BackendIds private to the sinval mechanism.
There's certainly no strong reason why sinval's array indexes
need to be used as identifiers by other modules.

regards, tom lane




Re: straightening out backend process startup

2021-08-02 Thread Andres Freund
Hi,

On 2021-08-02 12:57:36 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > I think there's quite a few different issues around this - here I'm just
> > trying to tackle a few of the most glaring (to me):
> 
> No opinion yet about most of this, but I did want to react to
> 
> > In fact, I think there's a good argument to be made that we should
> > entirely remove the concept of BackendIds and just use pgprocnos. We
> > have a fair number of places like SignalVirtualTransaction() that need
> > to search the procarray just to find the proc to signal based on the
> > BackendId. If we used pgprocno instead, that'd not be needed.
> 
> If I understand what you're suggesting, it'd result in unused slots
> in sinvaladt.c's procState[] array, which could create an annoying
> drag on performance.

Yea, I was looking into that, which is why I don't yet have a patch. I think
it might be reasonable to address this by making pgprocnos more dense. We
right now actually have a kind of maximally bad allocation pattern for
pgprocnos - InitProcGlobal puts them onto the free lists in reverse
order. Which means that ProcArrayAdd() will have to move all procs...


But, as you say:

> However, I think it might be reasonable to switch other things over to
> using pgprocnos, with an eye to eventually making BackendIds private
> to the sinval mechanism.  There's certainly no strong reason why
> sinval's array indexes need to be used as identifiers by other
> modules.

I think it'd entirely be reasonable to switch over at least backend_status.c,
procsignal.c to pgprocnos without doing anything about the density of
allocation. We'd likely would want to do that as independent steps anyway,
even if we were to switch over sinval as well.


Another approach to deal with this could be to simply not do the O(n) work in
SIInsertDataEntries(). It's not obvious that ->hasMessages is actually
necessary - we could atomically read maxMsgNum without acquiring a lock
instead of needing the per-backend ->hasMessages.  I don't think the density would
be a relevant factor in SICleanupQueue().
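
Sketched with invented names (this is not an actual sinvaladt.c change,
just an illustration of the idea): if maxMsgNum were published through
an atomic counter, a backend could detect pending messages by comparing
it against the last message number it has consumed, with no lock and no
per-backend hasMessages flag.

static bool
SinvalMessagesPending(pg_atomic_uint64 *maxMsgNum, uint64 nextMsgNum)
{
	/* plain atomic read; no sinval lock traffic on the fast path */
	return pg_atomic_read_u64(maxMsgNum) > nextMsgNum;
}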

Greetings,

Andres Freund




Re: EXEC_BACKEND vs bgworkers without BGWORKER_SHMEM_ACCESS

2021-08-02 Thread Alvaro Herrera
On 2021-Aug-02, Tom Lane wrote:

> Robert Haas  writes:
> > If you're saying that this code has been 100% broken for 7 years and
> > nobody's noticed until now, then that suggests that nobody actually
> > uses non-shmem-connected bgworkers. I sort of hate to give up on that
> > concept but if we've really gone that many years without anyone
> > noticing obvious breakage then maybe we should.
> 
> Well, the problem only exists on Windows so maybe this indeed
> escaped notice.  Still, this is good evidence that the case isn't
> used *much*, and TBH I don't see many applications for it.
> I can't say I'm excited about putting effort into fixing it.

When I included this case I was thinking in tasks which would just run
stuff not directly connected to data.  Something like a sub-daemon: say
a connection pooler, which is a bgworker just so that it starts and
stops together with postmaster, and share facilities like GUC
configuration and SIGHUP handling, etc.  It doesn't look like anybody
has had an interest in developing such a thing, so if this is
obstructing your work, I don't object to removing the no-shmem case.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/




Re: EXEC_BACKEND vs bgworkers without BGWORKER_SHMEM_ACCESS

2021-08-02 Thread Andres Freund
Hi,

On Mon, Aug 2, 2021, at 11:08, Alvaro Herrera wrote:
> On 2021-Aug-02, Tom Lane wrote:
> 
> > Robert Haas  writes:
> > > If you're saying that this code has been 100% broken for 7 years and
> > > nobody's noticed until now, then that suggests that nobody actually
> > > uses non-shmem-connected bgworkers. I sort of hate to give up on that
> > > concept but if we've really gone that many years without anyone
> > > noticing obvious breakage then maybe we should.
> > 
> > Well, the problem only exists on Windows so maybe this indeed
> > escaped notice.  Still, this is good evidence that the case isn't
> > used *much*, and TBH I don't see many applications for it.
> > I can't say I'm excited about putting effort into fixing it.
> 
> When I included this case I was thinking in tasks which would just run
> stuff not directly connected to data.  Something like a sub-daemon: say
> a connection pooler, which is a bgworker just so that it starts and
> stops together with postmaster, and share facilities like GUC
> configuration and SIGHUP handling, etc.

I think nearly all such cases are going to want some monitoring from within the 
database - which then needs shared memory.

And even if not - it's not really that useful to avoid a crash-restart if your 
worker dies with a segfault. There are no really legitimate reasons for 
completely unhandled errors even if not connected to shmem.


> It doesn't look like anybody
> has had an interest in developing such a thing, so if this is
> obstructing your work, I don't object to removing the no-shmem case.

It's not obstructing me right now. I noticed it'd crash on EXEC_BACKEND when I 
tried to understand some code (see the nearby thread about straightening out 
process startup). 

I do think there's some potential gains in simplicity and robustness that are 
made mildly harder by a subprocess that first attaches and detaches from shm 
(it's the only case where we can't easily unify the place InitProcess() is 
called between EB and ! EB right now). There's several ways that could be 
tackled. Removing the need to have that is obviously one of them.

Regards,

Andres




Re: EXEC_BACKEND vs bgworkers without BGWORKER_SHMEM_ACCESS

2021-08-02 Thread Alvaro Herrera
On 2021-Aug-02, Andres Freund wrote:

> > When I included this case I was thinking in tasks which would just run
> > stuff not directly connected to data.  Something like a sub-daemon: say
> > a connection pooler, which is a bgworker just so that it starts and
> > stops together with postmaster, and share facilities like GUC
> > configuration and SIGHUP handling, etc.
> 
> I think nearly all such cases are going to want some monitoring from
> within the database - which then needs shared memory.

True.  Observability for such things is critical (pgbouncer goes to quite
some trouble to offer SQL-queryable views into its metrics), which kills
the argument.

> I do think there's some potential gains in simplicity and robustness
> that are made mildly harder by a subprocess that first attaches and
> detaches from shm (it's the only case where we can't easily unify the
> place InitProcess() is called between EB and ! EB right now). There's
> several ways that could be tackled. Removing the need to have that if
> obviously one of them.

Hmm, I don't remember that an shmem-unconnected bgworker first connected
to it and then let go.  It seems weird to do it that way.  My intention,
as far as I recall, is that they would just never connect to shmem,
period.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"I think my standards have lowered enough that now I think 'good design'
is when the page doesn't irritate the living f*ck out of me." (JWZ)




Re: EXEC_BACKEND vs bgworkers without BGWORKER_SHMEM_ACCESS

2021-08-02 Thread Andres Freund
Hi,

On Mon, Aug 2, 2021, at 12:12, Alvaro Herrera wrote:
> On 2021-Aug-02, Andres Freund wrote:
> > I do think there's some potential gains in simplicity and robustness
> > that are made mildly harder by a subprocess that first attaches and
> > detaches from shm (it's the only case where we can't easily unify the
> > place InitProcess() is called between EB and ! EB right now). There's
> > several ways that could be tackled. Removing the need to have that if
> > obviously one of them.
> 
> Hmm, I don't remember that an shmem-unconnected bgworker first connected
> to it and then let go.  It seems weird to do it that way.  My intention,
> as far as I recall, is that they would just never connect to shmem,
> period.

They currently do for EXEC_BACKEND. See SubPostmasterMain(). There the 
definition of the worker is passed via shared memory. So it does the full 
reattach thing, which requires lwlock, which requires PGPROC. We could get away 
without that by passing more through the variables file (either the worker 
definition or the address of the bgworker shmem piece).

Greetings,

Andres




Re: EXEC_BACKEND vs bgworkers without BGWORKER_SHMEM_ACCESS

2021-08-02 Thread Alvaro Herrera
On 2021-Aug-02, Andres Freund wrote:

> On Mon, Aug 2, 2021, at 12:12, Alvaro Herrera wrote:

> > Hmm, I don't remember that an shmem-unconnected bgworker first connected
> > to it and then let go.  It seems weird to do it that way.  My intention,
> > as far as I recall, is that they would just never connect to shmem,
> > period.
> 
> They currently do for EXEC_BACKEND. See SubPostmasterMain(). There the
> definition of the worker is passed via shared memory. So it does the
> full reattach thing, which requires lwlock, which requires PGPROC. We
> could get away without that by passing more through the variables file
> (either the worker definition or the address of the bgworker shmem
> piece).

Ah, that makes sense.  That doesn't sound super fragile, but it is odd
and it's probably a good argument for removing the feature, particularly
since nobody seems to be using it.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Ninguna manada de bestias tiene una voz tan horrible como la humana" (Orual)




Re: Make relfile tombstone files conditional on WAL level

2021-08-02 Thread Robert Haas
On Thu, Jun 10, 2021 at 6:47 AM Heikki Linnakangas  wrote:
> It would indeed be nice to have some other mechanism to prevent the
> issue with wal_level=minimal, the tombstone files feel hacky and
> complicated. Maybe a new shared memory hash table to track the
> relfilenodes of dropped tables.

Just to summarize the issue here as I understand it, if a relfilenode
is used for two unrelated relations during the same checkpoint cycle
with wal_level=minimal, and if the WAL-skipping optimization is
applied to the second of those but not to the first, then crash
recovery will lose our only copy of the new relation's data, because
we'll replay the removal of the old relfilenode but will not have
logged the new data. Furthermore, we've wondered about writing an
end-of-recovery record in all cases rather than sometimes writing an
end-of-recovery record and sometimes a checkpoint record. That would
allow another version of this same problem, since a single checkpoint
cycle could now span multiple server lifetimes. At present, we dodge
all this by keeping the first segment of the main fork around as a
zero-length file for the rest of the checkpoint cycle, which I think
prevents the problem in both cases. Now, apparently that caused some
problem with the AIO patch set so Thomas is curious about getting rid
of it, and Heikki concurs that it's a hack.

I guess my concern about this patch is that it just seems to be
reducing the number of cases where that hack is used without actually
getting rid of it. Rarely-taken code paths are more likely to have
undiscovered bugs, and that seems particularly likely in this case,
because this is a low-probability scenario to begin with. A lot of
clusters probably never have an OID counter wraparound ever, and even
in those that do, getting an OID collision with just the right timing
followed by a crash before a checkpoint can intervene has got to be
super-unlikely. Even as things are today, if this mechanism has subtle
bugs, it seems entirely possible that they could have escaped notice
up until now.

So I spent some time thinking about the question of getting rid of
tombstone files altogether. I don't think that Heikki's idea of a
shared memory hash table to track dropped relfilenodes can work. The
hash table will have to be of some fixed size N, and whatever the
value of N, the approach will break down if N+1 relfilenodes are
dropped in the same checkpoint cycle.

The two most principled solutions to this problem that I can see are
(1) remove wal_level=minimal and (2) use 64-bit relfilenodes. I have
been reluctant to support #1 because it's hard for me to believe that
there aren't cases where being able to skip a whole lot of WAL-logging
doesn't work out to a nice performance win, but I realize opinions on
that topic vary. And I'm pretty sure that Andres, at least, will hate
#2 because he's unhappy with the width of buffer tags already. So I
don't really have a good idea. I agree this tombstone system is a bit
of a wart, but I'm not sure that this patch really makes anything any
better, and I'm not really seeing another idea that seems better
either.

Maybe I am missing something...

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: make MaxBackends available in _PG_init

2021-08-02 Thread Robert Haas
On Mon, Aug 2, 2021 at 2:18 PM Bossart, Nathan  wrote:
> I just encountered the same thing, so I am bumping this thread.  I was
> trying to use MaxBackends in a call to RequestAddinShmemSpace() in a
> _PG_init() function for a module, but since MaxBackends is not yet
> initialized, you essentially need to open-code InitializeMaxBackends()
> instead.
>
> I think the comments about needing to register background workers
> before initializing MaxBackends have been incorrect since the addition
> of max_worker_processes in v9.4 (6bc8ef0b).  Furthermore, I think the
> suggested reordering is a good idea because it is not obvious that
> MaxBackends will be uninitialized in _PG_init(), and use-cases like
> the RequestAddinShmemSpace() one are not guaranteed to fail when
> MaxBackends is used incorrectly (presumably due to the 100 KB buffer
> added in CreateSharedMemoryAndSemaphores()).
>
> I've attached a new version of the proposed patch with some slight
> adjustments and an attempt at a commit message.

I think this is a good idea. Anyone object?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: make MaxBackends available in _PG_init

2021-08-02 Thread Alvaro Herrera
On 2021-Aug-02, Robert Haas wrote:

> On Mon, Aug 2, 2021 at 2:18 PM Bossart, Nathan  wrote:
> >
> > I think the comments about needing to register background workers
> > before initializing MaxBackends have been incorrect since the addition
> > of max_worker_processes in v9.4 (6bc8ef0b).

> I think this is a good idea. Anyone object?

No objection here.  AFAICS Nathan is correct.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Los dioses no protegen a los insensatos.  Éstos reciben protección de
otros insensatos mejor dotados" (Luis Wu, Mundo Anillo)




Re: make MaxBackends available in _PG_init

2021-08-02 Thread Andres Freund
Hi,

On 2021-08-02 16:06:15 -0400, Robert Haas wrote:
> On Mon, Aug 2, 2021 at 2:18 PM Bossart, Nathan  wrote:
> > I just encountered the same thing, so I am bumping this thread.  I was
> > trying to use MaxBackends in a call to RequestAddinShmemSpace() in a
> > _PG_init() function for a module, but since MaxBackends is not yet
> > initialized, you essentially need to open-code InitializeMaxBackends()
> > instead.
> >
> > I think the comments about needing to register background workers
> > before initializing MaxBackends have been incorrect since the addition
> > of max_worker_processes in v9.4 (6bc8ef0b).  Furthermore, I think the
> > suggested reordering is a good idea because it is not obvious that
> > MaxBackends will be uninitialized in _PG_init(), and use-cases like
> > the RequestAddinShmemSpace() one are not guaranteed to fail when
> > MaxBackends is used incorrectly (presumably due to the 100 KB buffer
> > added in CreateSharedMemoryAndSemaphores()).
> >
> > I've attached a new version of the proposed patch with some slight
> > adjustments and an attempt at a commit message.
> 
> I think this is a good idea. Anyone object?

I'm not so sure it's a good idea. I've seen several shared_preload_library
using extensions that adjust some GUCs (e.g. max_prepared_transactions)
because they need some more resources internally - that's perhaps not a great
idea, but there's also not an obviously better way.

ISTM this would better be solved with making the hook or config logic for
libraries a bit more elaborate (e.g. having a hook you can register for that's
called once all libraries are loaded).

Greetings,

Andres Freund




Release 13 of the PostgreSQL BuildFarm client

2021-08-02 Thread Andrew Dunstan


I have pushed Release 13 of the PostgreSQL BuildFarm client.


Change highlights:

  * add tests for a few cases that were previously missing
  * more fine-grained control over which TAP test sets run
  * --no-keepall can be specified on the command line
  * repair a repo if it crashed during a copy operation
  * make sure the repo is really clean (including free of ignored files)
  * generate stack traces on CYGWIN
  * collect stack traces for TAP tests
  * Extract MSVC settings at runtime rather than hard-coding them in the
config file (see below)
  * cross version upgrade tests now run on Windows, both for msys2 and
MSVC builds
  * add support for inhibit-runs and force-one-run trigger files (see below)
  * add experimental module for running arbitrary out of tree TAP tests
  * Adjust if an upstream repo changes the default branch name (see below)
  * add use_discard_caches setting (see below)


MSVC animals are now very much simpler to set up, and to upgrade to a
new compiler. Using the new mechanism, as shown in the sample config
file, all that's needed is to specify a location where the standard
script vcvarsall.bat can be found. The script will then run that script
and extract the settings and apply them. Tha means that upgrading to a
new version of Visual Studio would entail just a one line change in the
config file.

If you put a file called .inhibit-runs in the build root,
all runs will be stopped until the file is removed. If you put a file
called .force-one-run in the build root, each configured
branch will be forced to run once, and the file will be removed. These only
apply if you use the run_branches.pl script.

The client should transparently deal with any change that is made in the
upstream repository's default branch name. This avoids the need for a
flag day when we eventually change the default branch name for
postgresql, as I assume we will do before long. The branch bf_HEAD which
the client creates now refers to the upstream default whatever it might
be, rather than the hardcoded name 'master'. The code of the SCM module
underwent quite a lot of change in order to make this happen; the
checkout code had become quite long and convoluted and I had to refactor
it somewhat before I was able to make and test this change. The changes
have been fairly extensively tested, but I'm still slightly nervous
about them. Owners are asked to report any issues promptly.

the `use_discard_caches` setting reflects a change in the way postgres
handles this - it's now a runtime setting rather than a compile time
setting. On older branches it sets "-DCLOBBER_CACHE_ALWAYS". If you use
this setting don't use that define.


Downloads are available at
 and



cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Slow standby snapshot

2021-08-02 Thread Michail Nikolaev
Hello.

> I have tried such an approach but looks like it is not effective,
> probably because of CPU caching issues.

It was a mistake by me. I have repeated the approach and got good
results with a small and non-invasive patch.

The main idea is a simple optimistic optimization - store the offset to
the next valid entry. So, in most cases, we can just skip over all the
gaps. Of course, it adds some overhead for workloads without long
(multi-second) transactions, but it is almost undetectable (because of
CPU caches).
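
To make that concrete, here is a rough sketch of the read side
(illustrative only, not the attached patch, and it ignores wraparound of
the circular buffer): invalid entries are hopped over using a per-slot
"distance to the next possibly-valid slot" hint kept in
KnownAssignedXidsNext.

static int
KnownAssignedXidsCollect(TransactionId *xids, int tail, int head)
{
	int		count = 0;
	int		i = tail;

	while (i < head)
	{
		if (KnownAssignedXidsValid[i])
		{
			xids[count++] = KnownAssignedXids[i];
			i++;
		}
		else
		{
			/* jump across the whole gap in one step, but always advance */
			i += Max(1, (int) pg_atomic_read_u32(&KnownAssignedXidsNext[i]));
		}
	}
	return count;
}

The only invariant such a hint needs is that it never points past a slot
that is still valid; a hint that is too small is safe and merely costs a
few extra iterations.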

* TEST

The next testing setup was used:

max_connections=5000 (taken from real RDS installation)
pgbench -i -s 10 -U postgres -d postgres

# emulate typical OLTP load
pgbench -b simple-update -j 1 -c 50 -n -P 1 -T 18000 -U postgres postgres

#emulate typical cache load on replica
pgbench -b select-only -p 6543 -j 1 -c 50 -n -P 1 -T 18000 -U postgres postgres

# emulate some typical long transactions up to 60 seconds on primary
echo "\set t random(0, 60)
BEGIN;
select txid_current();
select pg_sleep(:t);
COMMIT;" > slow_transactions.bench
pgbench -f /home/nkey/pg/slow_transactions.bench -p 5432 -j 1 -c 10 -n
-P 1 -T 18000 -U postgres postgres

* RESULTS

*REL_13_STABLE* - 23.02% vs 0.76%

non-patched:
  23.02%  postgres  [.] KnownAssignedXidsGetAndSetXmin
   2.56%  postgres  [.] base_yyparse
   2.15%  postgres  [.] AllocSetAlloc
   1.68%  postgres  [.] MemoryContextAllocZeroAligned
   1.51%  postgres  [.] hash_search_with_hash_value
   1.26%  postgres  [.] SearchCatCacheInternal
   1.03%  postgres  [.] hash_bytes
   0.92%  postgres  [.] pg_checksum_block
   0.89%  postgres  [.] expression_tree_walker
   0.81%  postgres  [.] core_yylex
   0.69%  postgres  [.] palloc
   0.68%  [kernel]  [k] do_syscall_64
   0.59%  postgres  [.] _bt_compare
   0.54%  postgres  [.] new_list

patched:
   3.09%  postgres  [.] base_yyparse
   3.00%  postgres  [.] AllocSetAlloc
   2.01%  postgres  [.] MemoryContextAllocZeroAligned
   1.89%  postgres  [.] SearchCatCacheInternal
   1.80%  postgres  [.] hash_search_with_hash_value
   1.27%  postgres  [.] expression_tree_walker
   1.27%  postgres  [.] pg_checksum_block
   1.18%  postgres  [.] hash_bytes
   1.10%  postgres  [.] core_yylex
   0.96%  [kernel]  [k] do_syscall_64
   0.86%  postgres  [.] palloc
   0.84%  postgres  [.] _bt_compare
   0.79%  postgres  [.] new_list
   0.76%  postgres  [.] KnownAssignedXidsGetAndSetXmin

*MASTER* - 6.16% vs ~0%
(includes snapshot scalability optimization by Andres Freund (1))

non-patched:
   6.16%  postgres  [.] KnownAssignedXidsGetAndSetXmin
   3.05%  postgres  [.] AllocSetAlloc
   2.59%  postgres  [.] base_yyparse
   1.95%  postgres  [.] hash_search_with_hash_value
   1.87%  postgres  [.] MemoryContextAllocZeroAligned
   1.85%  postgres  [.] SearchCatCacheInternal
   1.27%  postgres  [.] hash_bytes
   1.16%  postgres  [.] expression_tree_walker
   1.06%  postgres  [.] core_yylex
   0.94%  [kernel]  [k] do_syscall_64

patched:
   3.35%  postgres  [.] base_yyparse
   2.84%  postgres  [.] AllocSetAlloc
   1.89%  postgres  [.] hash_search_with_hash_value
   1.82%  postgres  [.] MemoryContextAllocZeroAligned
   1.79%  postgres  [.] SearchCatCacheInternal
   1.49%  postgres  [.] pg_checksum_block
   1.26%  postgres  [.] hash_bytes
   1.26%  postgres  [.] expression_tree_walker
   1.08%  postgres  [.] core_yylex
   1.04%  [kernel]  [k] do_syscall_64
   0.81%  postgres  [.] palloc

Looks like it is possible to get a significant TPS increase on a very
typical standby workload.
Currently, I have no environment to measure TPS accurately. Could you
please try it on yours?

I have attached two versions of the patch - for master and REL_13_STABLE.
Also, I am going to add a patch to commitfest (2).

Thanks,
Michail.

(1): https://commitfest.postgresql.org/29/2500/
(2): https://commitfest.postgresql.org/34/3271/
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index c7816fcfb3..27486c096b 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -267,6 +267,7 @@ static PGPROC *allProcs;
  */
 static TransactionId *KnownAssignedXids;
 static bool *KnownAssignedXidsValid;
+static pg_atomic_uint32 *KnownAssignedXidsNext;
 static TransactionId latestObservedXid = InvalidTransactionId;
 
 /*
@@ -405,6 +406,7 @@ void
 CreateSharedProcArray(void)
 {
 	bool		found;
+	int			i;
 
 	/* Create or attach to the ProcArray shared structure */
 	procArray = (ProcArrayStruct *)
@@ -446,6

Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace

2021-08-02 Thread Gilles Darold
Le 02/08/2021 à 01:21, Tom Lane a écrit :
> Gilles Darold  writes:
>> [ v5-0001-regexp-foo-functions.patch ]
> I've gone through this whole patch now, and found quite a lot that I did
> not like.  In no particular order:
>
> * Wrapping parentheses around the user's regexp doesn't work.  It can
> turn an invalid regexp into a valid one: for example 'a)(b' should draw
> a syntax error.  With this patch, no error would be thrown, but the
> "outer" parens wouldn't do what you expected.  Worse, it can turn a
> valid regexp into an invalid one: the metasyntax options described in
> 9.7.3.4 only work at the start of the regexp.  So we have to handle
> whole-regexp cases honestly rather than trying to turn them into an
> instance of the parenthesized-subexpression case.
>
> * You did a lot of things quite inefficiently, apparently to avoid
> touching any existing code.  I think it's better to extend
> setup_regexp_matches() and replace_text_regexp() a little bit so that
> they can support the behaviors these new functions need.  In both of
> them, it's absolutely trivial to allow a search start position to be
> passed in; and it doesn't take much to teach replace_text_regexp()
> to replace only the N'th match.
>
> * Speaking of N'th, there is not much of anything that I like
> about Oracle's terminology for the function arguments, and I don't
> think we ought to adopt it.  If we're documenting the functions as
> processing the "N'th match", it seems to me to be natural to call
> the parameter "N" not "occurrence".  Speaking of the "occurrence'th
> occurrence" is just silly, not to mention long and easy to misspell.
> Likewise, "position" is a horribly vague term for the search start
> position; it could be interpreted to mean several other things.
> "start" seems much better.  "return_opt" is likewise awfully unclear.
> I went with "endoption" below, though I could be talked into something
> else.  The only one of Oracle's choices that I like is "subexpr" for
> subexpression number ... but you went with DB2's rather vague "group"
> instead.  I don't want to use their "capture group" terminology,
> because that appears nowhere else in our documentation.  Our existing
> terminology is "parenthesized subexpression", which seems fine to me
> (and also agrees with Oracle's docs).
>
> * I spent a lot of time on the docs too.  A lot of the syntax specs
> were wrong (where you put the brackets matters), many of the examples
> seemed confusingly overcomplicated, and the text explanations needed
> copy-editing.
>
> * Also, the regression tests seemed misguided.  This patch is not
> responsible for testing the regexp engine as such; we have tests
> elsewhere that do that.  So I don't think we need complex regexps
> here.  We just need to verify that the parameters of these functions
> act properly, and check their error cases.  That can be done much
> more quickly and straightforwardly than what you had.
>
>
> So here's a revised version that I like better.  I think this
> is pretty nearly committable, aside from the question of whether
> a too-large subexpression number should be an error or not.


Thanks a lot for the patch improvement and the guidance. I have read the
patch and I agree with your choices; I think I was trying too hard to
mimic the Oracle-isms. I don't think we should take care of the too-large
subexpression number: the regexp writer should always test their regular
expression, and an error here would not prevent them from choosing the
wrong capture group number anyway, only a non-existing one.


Best regards,

-- 
Gilles Darold






Re: Implementing Incremental View Maintenance

2021-08-02 Thread Zhihong Yu
On Sun, Aug 1, 2021 at 11:30 PM Yugo NAGATA  wrote:

> Hi hackers,
>
> On Mon, 19 Jul 2021 09:24:30 +0900
> Yugo NAGATA  wrote:
>
> > On Wed, 14 Jul 2021 21:22:37 +0530
> > vignesh C  wrote:
>
> > > The patch does not apply on Head anymore, could you rebase and post a
> > > patch. I'm changing the status to "Waiting for Author".
> >
> > Ok. I'll update the patch in a few days.
>
> Attached is the latest patch set to add support for Incremental
> Materialized View Maintenance (IVM)
>
> The patches are rebased to the master and also revised with some
> code cleaning.
>
> IVM is a way to make materialized views up-to-date in which only
> incremental changes are computed and applied on views rather than
> recomputing the contents from scratch as REFRESH MATERIALIZED VIEW
> does. IVM can update materialized views more efficiently
> than recomputation when only a small part of the view needs updates.
>
> The patch set implements a feature so that materialized views could be
> updated automatically and immediately when a base table is modified.
>
> Currently, our IVM implementation supports views which could contain
> tuple duplicates whose definition includes:
>
>  - inner and outer joins including self-join
>  - DISTINCT
>  - some built-in aggregate functions (count, sum, avg, min, and max)
>  - a part of subqueries
>-- simple subqueries in FROM clause
>-- EXISTS subqueries in WHERE clause
>  - CTEs
>
> We hope the IVM feature would be adopted into pg15. However, the size of
> patch set has grown too large through supporting above features.
> Therefore,
> I think it is better to consider only a part of these features for the
> first
> release. Especially, I would like propose the following features for pg15.
>
>  - inner joins including self-join
>  - DISTINCT and views with tuple duplicates
>  - some built-in aggregate functions (count, sum, avg, min, and max)
>
> By omitting outer-join, sub-queries, and CTE features, the patch size
> becomes
> less than half. I hope this will make it a bit easier to review the IVM patch
> set.
>
> Here is a list of separated  patches.
>
> - 0001: Add a new syntax:
>   CREATE INCREMENTAL MATERIALIZED VIEW
> - 0002: Add a new column relisivm to pg_class
> - 0003: Add new deptype option 'm' in pg_depend
> - 0004: Change trigger.c to allow to prolong life span of tupestores
> containing Transition Tables generated via AFTER trigger
> - 0005: Add IVM support for pg_dump
> - 0006: Add IVM support for psql
> - 0007: Add the basic IVM feature:
> This supports inner joins, DISTINCT, and tuple duplicates.
> - 0008: Add aggregates (count, sum, avg, min, max) support for IVM
> - 0009: Add regression tests for IVM
> - 0010: Add documentation for IVM
>
> We could split the patch further if this would make reviews much
> easier.
> For example, I think 0007 could be split into the more basic part and the
> part
> for handling tuple duplicates. Moreover, 0008 could be split into "min/max"
> and other aggregates because handling min/max is a bit more complicated
> than
> others.
>
> I also attached IVM_extra.tar.gz that contains patches for sub-queries,
> outer-join, CTE support, just for your information.
>
> Regards,
> Yugo Nagata
>
> --
> Yugo NAGATA 
>
>
> --
> Yugo NAGATA 
>
Hi,
For v23-0008-Add-aggregates-support-in-IVM.patch :

As a restriction, expressions specified in GROUP BY must appear in
the target list because tuples to be updated in IMMV are identified
by using this group keys.

IMMV ->  IMVM (Incremental Materialized View Maintenance, as said above)
Or maybe it means 'incrementally maintainable materialized view'. It would
be better to use the same abbreviation.

this group keys -> this group key

+errmsg("GROUP BY expression not appeared in select
list is not supported on incrementally maintainable materialized view")));

expression not appeared in select list -> expression not appearing in
select list

+* For aggregate functions except to count

except to count -> except count

Cheers


Re: archive status ".ready" files may be created too early

2021-08-02 Thread Alvaro Herrera
On 2021-Jul-31, Bossart, Nathan wrote:

> This is why I was trying to get away with just using info_lck for
> reading lastNotifiedSeg.  ArchNotifyLock is mostly intended to protect
> RecordBoundaryMap.  However, since lastNotifiedSeg is used in
> functions like GetLatestRecordBoundarySegment() that access the map, I
> found it easier to reason about things if I knew that it wouldn't
> change as long as I held ArchNotifyLock.

I think it's okay to make lastNotifiedSeg protected by just info_lck,
and RecordBoundaryMap protected by just ArchNotifyLock.  It's okay to
acquire the spinlock inside the lwlock-protected area, as long as we
make sure never to do the opposite.  (And we sure don't want to hold
info_lck long enough that a LWLock acquisition would occur in the
meantime).  So I modified things that way, and also added another
function to set the seg if it's unset, with a single spinlock
acquisition (rather than acquire, read, release, acquire, set,
release, which sounds like it would have trouble behaving.)
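
A minimal sketch of that helper (not taken from the attached patch; the
function name and the "zero means unset" convention are assumptions),
using the lastNotifiedSeg field and info_lck spinlock shown in the patch
below:

static void
SetLastNotifiedSegmentIfUnset(XLogSegNo seg)
{
	SpinLockAcquire(&XLogCtl->info_lck);
	if (XLogCtl->lastNotifiedSeg == 0)
		XLogCtl->lastNotifiedSeg = seg;
	SpinLockRelease(&XLogCtl->info_lck);
}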

I haven't tried your repro with this yet.

I find it highly suspicious that the patch does an archiver notify (i.e.
creation of the .ready file) in XLogInsertRecord().  Is that a sane
thing to do?  Sounds to me like that should be attempted in XLogFlush
only.  This appeared after Kyotaro's patch at [1] and before your patch
at [2].

[1] 
https://postgr.es/m/20201014.090628.839639906081252194.horikyota@gmail.com
[2] https://postgr.es/m/eff40306-8e8a-4259-b181-c84f3f066...@amazon.com

I also just realized that Kyotaro's patch there also tried to handle the
streaming replication issue I was talking about.

> I think the main downside of making lastNotifiedSeg an atomic is that
> the value we first read in NotifySegmentsReadyForArchive() might not
> be up-to-date, which means we might hold off creating .ready files
> longer than necessary.

I'm not sure I understand how this would be a problem.  If we block
somebody from setting a newer value, they'll just set the value
immediately after we release the lock.  Will we reread the value
afterwards to see if it changed?

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
>From 87c77464049c719c2075bd94d4e9ed168c7cf159 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Fri, 30 Jul 2021 18:35:52 -0400
Subject: [PATCH v3] Avoid creating archive status ".ready" files too early.

WAL records may span multiple segments, but XLogWrite() does not
wait for the entire record to be written out to disk before
creating archive status files.  Instead, as soon as the last WAL
page of the segment is written, the archive status file will be
created.  If PostgreSQL crashes before it is able to write the rest
of the record, it will end up reusing segments that have already
been marked as ready-for-archival.  However, the archiver process
may have already processed the old version of the segment, so the
wrong version of the segment may be backed-up.  This backed-up
segment will cause operations such as point-in-time restores to
fail.

To fix this, we keep track of records that span across segments and
ensure that segments are only marked ready-for-archival once such
records have been completely written to disk.
---
 src/backend/access/transam/timeline.c|   2 +-
 src/backend/access/transam/xlog.c| 286 ++-
 src/backend/access/transam/xlogarchive.c |  14 +-
 src/backend/replication/walreceiver.c|   6 +-
 src/backend/storage/lmgr/lwlocknames.txt |   1 +
 src/include/access/xlogarchive.h |   4 +-
 src/include/access/xlogdefs.h|   5 +
 7 files changed, 296 insertions(+), 22 deletions(-)

diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index 8d0903c175..acd5c2431d 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -452,7 +452,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	if (XLogArchivingActive())
 	{
 		TLHistoryFileName(histfname, newTLI);
-		XLogArchiveNotify(histfname);
+		XLogArchiveNotify(histfname, true);
 	}
 }
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f84c0bb01e..99fe1ac0a2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -512,6 +512,13 @@ typedef enum ExclusiveBackupState
  */
 static SessionBackupState sessionBackupState = SESSION_BACKUP_NONE;
 
+/* entries for RecordBoundaryMap, used to mark segments ready for archival */
+typedef struct RecordBoundaryEntry
+{
+	XLogSegNo seg;	/* must be first */
+	XLogRecPtr pos;
+} RecordBoundaryEntry;
+
 /*
  * Shared state data for WAL insertion.
  */
@@ -723,6 +730,12 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/*
+	 * The last segment we've marked ready for archival.  Protected by
+	 * info_lck.
+	 */
+	XLogSegNo lastNotifiedSeg;
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 

Re: Background writer and checkpointer in crash recovery

2021-08-02 Thread Thomas Munro
On Tue, Aug 3, 2021 at 1:17 AM Robert Haas  wrote:
> That's great. I just realized that this leaves us with identical
> RequestCheckpoint() calls in two nearby places. Is there any reason
> not to further simplify as in the attached?

LGTM.




Re: make MaxBackends available in _PG_init

2021-08-02 Thread Bossart, Nathan
On 8/2/21, 1:37 PM, "Andres Freund"  wrote:
> On 2021-08-02 16:06:15 -0400, Robert Haas wrote:
>> I think this is a good idea. Anyone object?
>
> I'm not so sure it's a good idea. I've seen several shared_preload_library
> using extensions that adjust some GUCs (e.g. max_prepared_transactions)
> because they need some more resources internally - that's perhaps not a great
> idea, but there's also not an obviously better way.

Interesting.  I hadn't heard of extensions adjusting GUCs in
_PG_init() before.  I think the other way to handle that scenario is
to check the GUCs in _PG_init() and fail startup if necessary.
However, while it's probably good to avoid changing GUCs from what
users specified, failing startup isn't exactly user-friendly, either.
In any case, changing a GUC in _PG_init() seems quite risky.  You're
effectively bypassing all of the usual checks.

> ISTM this would better be solved with making the hook or config logic for
> libraries a bit more elaborate (e.g. having a hook you can register for that's
> called once all libraries are loaded).

My worry is that this requires even more specialized knowledge to get
things right.  If I need to do anything based on the value of a GUC,
I have to register another hook.  In this new hook, we'd likely want
to somehow prevent modules from further adjusting GUCs.  We'd also
need modules to check the GUCs they care about again in case another
module's _PG_init() adjusted it in an incompatible way.  If you detect
a problem in this new hook, you probably need to fail startup.

Perhaps I am making a mountain out of a molehill and the modules that
are adjusting GUCs in _PG_init() are doing so in a generally safe way,
but IMO it ought to ordinarily be discouraged.

Nathan



Re: Slow standby snapshot

2021-08-02 Thread Andres Freund
Hi,

On 2021-08-03 00:07:23 +0300, Michail Nikolaev wrote:
> The main idea is simple optimistic optimization - store offset to next
> valid entry. So, in most cases, we could just skip all the gaps.
> Of course, it adds some additional impact for workloads without long
> (few seconds) transactions but it is almost not detectable (because of
> CPU caches).

I'm doubtful that that's really the right direction. For workloads that
are replay heavy we already often can see the cost of maintaining the
known xids datastructures show up significantly - not surprising, given
the datastructure. And for standby workloads with active primaries the
cost of searching through the array in all backends is noticeable as
well.  I think this needs a bigger data structure redesign.

Greetings,

Andres Freund




Re: make MaxBackends available in _PG_init

2021-08-02 Thread Andres Freund
Hi,

On 2021-08-02 21:57:18 +, Bossart, Nathan wrote:
> On 8/2/21, 1:37 PM, "Andres Freund"  wrote:
> > On 2021-08-02 16:06:15 -0400, Robert Haas wrote:
> >> I think this is a good idea. Anyone object?
> >
> > I'm not so sure it's a good idea. I've seen several shared_preload_library
> > using extensions that adjust some GUCs (e.g. max_prepared_transactions)
> > because they need some more resources internally - that's perhaps not a 
> > great
> > idea, but there's also not an obviously better way.
> 
> Interesting.  I hadn't heard of extensions adjusting GUCs in
> _PG_init() before.  I think the other way to handle that scenario is
> to check the GUCs in _PG_init() and fail startup if necessary.

The problem is that that makes it hard to configure things for the user's own
needs. If e.g. the user's workload needs a certain number of prepared
transactions and the extension internally as well (as e.g. citus does), the
extension can't know if the configured number is "aimed" to be for the
extension, or for the user's own needs. If you instead just increase
max_prepared_transactions in _PG_init(), the story is different.


> However, while it's probably good to avoid changing GUCs from what
> users specified, failing startup isn't exactly user-friendly, either.
> In any case, changing a GUC in _PG_init() seems quite risky.  You're
> effectively bypassing all of the usual checks.

It doesn't need to bypass all - you can override them using GUC mechanisms at
that point.


> > ISTM this would better be solved with making the hook or config logic for
> > libraries a bit more elaborate (e.g. having a hook you can register for 
> > that's
> > called once all libraries are loaded).
> 
> My worry is that this requires even more specialized knowledge to get
> things right.  If I need to do anything based on the value of a GUC,
> I have to register another hook.  In this new hook, we'd likely want
> to somehow prevent modules from further adjusting GUCs.  We'd also
> need modules to check the GUCs they care about again in case another
> module's _PG_init() adjusted it in an incompatible way.

I think this is overblown. We already size resources *after*
shared_preload_libraries' _PG_init() runs, because that's the whole point of
shared_preload_libraries. What's proposed in this thread is to *disallow*
increasing resource usage in s_p_l _PG_init(), to make one specific case
simpler - but it'll actually also make things more complicated, because other
resources will still only be sized after all of s_p_l has been processed.


> If you detect a problem in this new hook, you probably need to fail startup.

Same is true for _PG_init().


> Perhaps I am making a mountain out of a molehill and the modules that
> are adjusting GUCs in _PG_init() are doing so in a generally safe way,
> but IMO it ought to ordinarily be discouraged.

It's not something I would recommend doing blithely - but I also don't think
we have a better answer for some cases.

Greetings,

Andres Freund




Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2021-08-02 Thread Melanie Plageman
On Fri, Jun 4, 2021 at 5:52 PM Alvaro Herrera  wrote:
>
> On 2021-Apr-12, Melanie Plageman wrote:
>
> > As for the way I have recorded strategy writes -- it is quite inelegant,
> > but, I wanted to make sure that I only counted a strategy write as one
> > in which the backend wrote out the dirty buffer from its strategy ring
> > but did not check if there was any clean buffer in shared buffers more
> > generally (so, it is *potentially* an avoidable write). I'm not sure if
> > this distinction is useful to anyone. I haven't done enough with
> > BufferAccessStrategies to know what I'd want to know about them when
> > developing or using Postgres. However, if I don't need to be so careful,
> > it will make the code much simpler (though, I'm sure I can improve the
> > code regardless).
>
> I was bitten last year by REFRESH MATERIALIZED VIEW counting its writes
> via buffers_backend, and I was very surprised/confused about it.  So it
> seems definitely worthwhile to count writes via strategy separately.
> For a DBA tuning the server configuration it is very useful.
>
> The main thing is to *not* let these writes end up regular
> buffers_backend (or whatever you call these now).  I didn't read your
> patch, but the way you have described it seems okay to me.
>

Thanks for the feedback!

I agree it makes sense to count strategy writes separately.

I thought about this some more, and I don't know if it makes sense to
only count "avoidable" strategy writes.

This would mean that a backend writing out a buffer from the strategy
ring when no clean shared buffers (as well as no clean strategy buffers)
are available would not count that write as a strategy write (even
though it is writing out a buffer from its strategy ring). But, it
obviously doesn't make sense to count it as a regular buffer being
written out. So, I plan to change this code.

On another note, I've updated the patch with more correct concurrency
control mechanisms (had some data races and other problems
before). Now, I am using atomics for the buffer action counters, though
the code includes several #TODO questions around the correctness of what
I have now too.

I also wrapped the buffer action types in a struct to make them easier
to work with.

The most substantial missing piece of the patch right now is persisting
the data across reboots.

The two places in the code I can see to persist the buffer action stats
data are:
1) using the stats collector code (like in
pgstat_read/write_statsfiles())
2) using a before_shmem_exit() hook which writes the data structure to a
file and then read from it when making the shared memory array initially

It feels a bit weird to me to wedge the buffer actions stats into the
stats collector code--since the stats collector isn't receiving and
aggregating the buffer action stats.

Also, I'm unsure how writing the buffer action stats out in
pgstat_write_statsfiles() will work, since I think that backends can
update their buffer action stats after we would have already persisted
the data from the BufferActionStatsArray -- causing us to lose those
updates.

And, I don't think I can use pgstat_read_statsfiles() since the
BufferActionStatsArray should have the data from the file as soon as the
view containing the buffer action stats can be queried. Thus, it seems
like I would need to read the file while initializing the array in
CreateBufferActionStatsCounters().

I am registering the patch for September commitfest but plan to update
the stats persistence before then (and docs, etc).

-- Melanie
From 2753bf0dc3ff54a515bc0729b51ef56b6715a703 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Mon, 2 Aug 2021 17:56:07 -0400
Subject: [PATCH v3] Add system view tracking shared buffer actions

Add a system view which tracks
- number of shared buffers the checkpointer and bgwriter write out
- number of shared buffers a regular backend is forced to flush
- number of extends done by a regular backend through shared buffers
- number of buffers flushed by a backend or autovacuum using a
  BufferAccessStrategy which, were they not to use this strategy, could
  perhaps have been avoided if a clean shared buffer was available
- number of fsyncs done by a backend which could have been done by
  checkpointer if sync queue had not been full
- number of buffers allocated by a regular backend or autovacuum worker
  for either a new block or an existing block of a relation which is not
  currently in a buffer

All of these stats which were in the system view pg_stat_bgwriter have
been removed from that view.

All backends, on exit, will update a shared memory array with the
buffers they wrote or extended.

When the view is queried, add all live backend's statuses
to the totals in the shared memory array and return that as the full
total.

Each row of the view is for a particular backend type and each column is
the number of a particular kind of buffer action taken by the various
backends.

TODO:
- Some kind of test?
- Docs change
-

Re: make MaxBackends available in _PG_init

2021-08-02 Thread Bossart, Nathan
On 8/2/21, 3:12 PM, "Andres Freund"  wrote:
> I think this is overblown. We already size resources *after*
> shared_preload_libraries' _PG_init() runs, because that's the whole point of
> shared_preload_libraries. What's proposed in this thread is to *disallow*
> increasing resource usage in s_p_l _PG_init(), to make one specific case
> simpler - but it'll actually also make things more complicated, because other
> resources will still only be sized after all of s_p_l has been processed.

True.  Perhaps the comments should reference the possibility that a
library will adjust resource usage to explain why
InitializeMaxBackends() is where it is.

Nathan



Re: Make relfile tombstone files conditional on WAL level

2021-08-02 Thread Andres Freund
Hi,

On 2021-08-02 16:03:31 -0400, Robert Haas wrote:
> The two most principled solutions to this problem that I can see are
> (1) remove wal_level=minimal and

I'm personally not opposed to this. It's not practically relevant and makes a
lot of stuff more complicated. We imo should rather focus on optimizing the
things wal_level=minimal accelerates a lot than adding complications for
wal_level=minimal. Such optimizations would have practical relevance, and
there's plenty of low-hanging fruit.


> (2) use 64-bit relfilenodes. I have
> been reluctant to support #1 because it's hard for me to believe that
> there aren't cases where being able to skip a whole lot of WAL-logging
> doesn't work out to a nice performance win, but I realize opinions on
> that topic vary. And I'm pretty sure that Andres, at least, will hate
> #2 because he's unhappy with the width of buffer tags already.

Yep :/

I guess there's a somewhat hacky way to get somewhere without actually
increasing the size. We could take 3 bytes from the fork number and use that
to get to a 7 byte relfilenode portion. 7 bytes are probably enough for
everyone.

It's not like we can use those bytes in a useful way, due to alignment
requirements. Declaring that the high 7 bytes are for the relNode portion and
the low byte for the fork would still allow efficient comparisons and doesn't
seem too ugly.
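
Illustrative only (names invented, not a proposed API): packing a 7-byte
relNode and a 1-byte fork number into a single uint64 keeps comparisons
a single integer compare.

static inline uint64
PackRelFork(uint64 relNode, uint8 forkNum)
{
	Assert(relNode < (UINT64CONST(1) << 56));
	return (relNode << 8) | forkNum;
}

#define PackedRelNode(packed)	((packed) >> 8)
#define PackedForkNum(packed)	((uint8) ((packed) & 0xFF))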


> So I don't really have a good idea. I agree this tombstone system is a
> bit of a wart, but I'm not sure that this patch really makes anything
> any better, and I'm not really seeing another idea that seems better
> either.

> Maybe I am missing something...

What I proposed in the past was to have a new shared table that tracks
relfilenodes. I still think that's a decent solution for just the problem at
hand. But it'd also potentially be the way to redesign relation forks and even
slim down buffer tags:

Right now a buffer tag is:
- 4 byte tablespace oid
- 4 byte database oid
- 4 byte "relfilenode oid" (don't think we have a good name for this)
- 4 byte fork number
- 4 byte block number

If we had such a shared table we could put at least tablespace, fork number
into that table mapping them to an 8 byte "new relfilenode". That'd only make
the "new relfilenode" unique within a database, but that'd be sufficient for
our purposes.  It'd give us a buffertag consisting of the following (sketched below):
- 4 byte database oid
- 8 byte "relfilenode"
- 4 byte block number
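
Roughly, as a sketch (struct and field names are invented here, not the
actual buf_internals.h definitions):

/* today: five 4-byte fields */
typedef struct BufferTagToday
{
	Oid			spcOid;			/* tablespace */
	Oid			dbOid;			/* database */
	Oid			relNumber;		/* "relfilenode oid" */
	int32		forkNum;
	BlockNumber	blockNum;
} BufferTagToday;

/* with a shared relfilenode table: tablespace and fork live in that table */
typedef struct BufferTagWithMap
{
	Oid			dbOid;			/* database */
	uint64		relFileId;		/* 8-byte "new relfilenode", unique per database */
	BlockNumber	blockNum;		/* 4 + 8 + 4 bytes, before alignment padding */
} BufferTagWithMap;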

Of course, it'd add some complexity too, because a buffertag alone wouldn't be
sufficient to read data (as you'd need the tablespace oid from elsewhere). But
that's probably ok, I think all relevant places would have that information.


It's probably possible to remove the database oid from the tag as well, but
it'd make CREATE DATABASE trickier - we'd need to change the filenames of
tables as we copy, to adjust them to the differing oid.

Greetings,

Andres Freund




Re: make MaxBackends available in _PG_init

2021-08-02 Thread Andres Freund
Hi,

On 2021-08-02 22:35:13 +, Bossart, Nathan wrote:
> On 8/2/21, 3:12 PM, "Andres Freund"  wrote:
> > I think this is overblown. We already size resources *after*
> > shared_preload_libraries' _PG_init() runs, because that's the whole point of
> > shared_preload_libraries. What's proposed in this thread is to *disallow*
> > increasing resource usage in s_p_l _PG_init(), to make one specific case
> > simpler - but it'll actually also make things more complicated, because 
> > other
> > resources will still only be sized after all of s_p_l has been processed.
> 
> True.  Perhaps the comments should reference the possibility that a
> library will adjust resource usage to explain why
> InitializeMaxBackends() is where it is.

I've wondered, independent of this thread, about not making MaxBackends
externally visible, and requiring a function call to access it. It should be
easier to find cases of misuse if we errored out when accessed at the wrong
time. And we could use that opportunity to add flags that determine which
types of backends are included (e.g. not including autovac, or additionally
including aux workers or prepared xacts).
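
Something like the following hypothetical accessor (the name, flag, and
error check are all invented for illustration):

#define MAXBACKENDS_EXCLUDE_AUTOVACUUM	0x01

static int	MaxBackendsValue = 0;	/* set by InitializeMaxBackends() */

int
GetMaxBackends(int flags)
{
	if (MaxBackendsValue == 0)
		elog(ERROR, "GetMaxBackends() called before MaxBackends was initialized");

	if (flags & MAXBACKENDS_EXCLUDE_AUTOVACUUM)
		return MaxBackendsValue - (autovacuum_max_workers + 1);	/* workers + launcher */

	return MaxBackendsValue;
}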

Greetings,

Andres Freund




Re: make MaxBackends available in _PG_init

2021-08-02 Thread Bossart, Nathan
On 8/2/21, 3:42 PM, "Andres Freund"  wrote:
> I've wondered, independent of this thread, about not making MaxBackends
> externally visible, and requiring a function call to access it. It should be
> easier to find cases of misuse if we errored out when accessed at the wrong
> time. And we could use that opportunity to add flags that determine which
> types of backends are included (e.g. not including autovac, or additionally
> including aux workers or prepared xacts).

I'm not opposed to this.  I can work on putting a patch together if no
opposition materializes.

Nathan



Re: Use generation context to speed up tuplesorts

2021-08-02 Thread David Rowley
On Sat, 31 Jul 2021 at 08:38, Andres Freund  wrote:
> There is at least one path using tuplecontext that reaches code that
> could end up freeing memory to a significant enough degree to care about
> generation.c effectively not using that memory:
> tuplesort_putdatum()->datumCopy()->EOH_flatten_into()
> On a quick look I didn't find any expanded record user that frees
> nontrivial amounts of memory, but I didn't look all that carefully.

I guess we could just use a normal context for datum sorts if we
thought that might be a problem.

I'm not too familiar with the expanded object code, but I'm struggling
to imagine why anything would need to do a pfree in there. We just do
EOH_get_flat_size() to determine how big to make the allocation, then
allocate some memory for EOH_flatten_into() to expand the object into.

David




Re: disfavoring unparameterized nested loops

2021-08-02 Thread Mike Klaas
I think that it is worth paying more than nothing to avoid plans that are
so far from optimal that they might as well take forever to execute


I recently came across this article from 2016 that expounded upon a bad
plan of the sort discussed in this thread:
https://heap.io/blog/when-to-avoid-jsonb-in-a-postgresql-schema

(The proximate cause in this case was Postgresql not collecting statistics
for fields in a JSONB column, estimating rowcount of 1, and thus creating a
pathological slowdown.)

–Mike


On Tue, Jun 22, 2021 at 7:37 PM, Peter Geoghegan  wrote:

> On Tue, Jun 22, 2021 at 2:53 AM Tomas Vondra
>  wrote:
>
> Yeah, I like the insurance analogy - it gets to the crux of the problem,
> because insurance is pretty much exactly about managing risk.
>
> The user's exposure to harm is what truly matters. I admit that that's
> very hard to quantify, but we should at least try to do so.
>
> We sometimes think about a plan that is 10x slower as if it's infinitely
> slow, or might as well be. But it's usually not like that
> -- it is generally meaningfully much better than the plan being 100x
> slower, which is itself sometimes appreciably better than 1000x slower. And
> besides, users often don't get anything like the optimal plan, even on what
> they would consider to be a good day (which is most days). So maybe 10x
> slower is actually the baseline good case already, without anybody knowing
> it. Most individual queries are not executed very often, even on the
> busiest databases. The extremes really do matter a lot.
>
> If a web app or OLTP query is ~10x slower than optimal then it might be
> the practical equivalent of an outage that affects the query alone
> (i.e. "infinitely slow") -- but probably not. I think that it is worth
> paying more than nothing to avoid plans that are so far from optimal that
> they might as well take forever to execute. This is not meaningfully
> different from a database outage affecting one particular query. It kind of
> is in a category of its own that surpasses "slow plan", albeit one that is
> very difficult or impossible to define formally.
>
> There may be a huge amount of variation in risk tolerance among basically
> reasonable people. For example, if somebody chooses to engage in some kind
> of extreme sport, to me it seems understandable. It's just not my cup of
> tea. Whereas if somebody chooses to never wear a seatbelt while driving,
> then to me they're simply behaving foolishly. They're not willing to incur
> the tiniest inconvenience in order to get a huge reduction in potential
> harm -- including a very real risk of approximately the worst thing that
> can happen to you. Sure, refusing to wear a seatbelt can theoretically be
> classified as just another point on the risk tolerance spectrum, but that
> seems utterly contrived to me. Some things (maybe not that many) really are
> like that, or can at least be assumed to work that way as a practical
> matter.
>
> But making
> everything slower will be a hard sell, because vast majority of workloads
> already running on Postgres don't have this issue at all, so for them it's
> not worth the expense.
>
> I think that we're accepting too much risk here. But I bet it's also true
> that we're not taking enough risk in other areas. That was really my point
> with the insurance analogy -- we can afford to take lots of individual
> risks as long as they don't increase our exposure to truly disastrous
> outcomes -- by which I mean queries that might as well take forever to
> execute as far as the user is concerned. (Easier said than done, of
> course.)
>
> A simple trade-off between fast and robust doesn't seem like a universally
> helpful thing. Sometimes it's a very unhelpful way of looking at the
> situation. If you make something more robust to extreme bad outcomes, then
> you may have simultaneously made it *faster* (not slower) for all practical
> purposes. This can happen when the increase in robustness allows the user
> to tune the system aggressively, and only take on new risks that they can
> truly live with (which wouldn't have been possible without the increase in
> robustness). For example, I imagine that the failsafe mechanism added to
> VACUUM will actually make it possible to tune VACUUM much more aggressively
> -- it might actually end up significantly improving performance for all
> practical purposes, even though technically it has nothing to do with
> performance.
>
> Having your indexes a little more bloated because the failsafe kicked-in
> is a survivable event -- the DBA lives to fight another day, and *learns*
> to tune vacuum/the app so it doesn't happen again and again. An
> anti-wraparound failure is perhaps not a survivable event -- the DBA gets
> fired. This really does seem like a fundamental difference to me.
>
> Following the insurance analogy,
> selling tornado insurance in Europe is mostly pointless.
>
> Principled skepticism of this kind of thing is of course necessary and
> welcome. 

Re: archive status ".ready" files may be created too early

2021-08-02 Thread Bossart, Nathan
On 8/2/21, 2:42 PM, "Alvaro Herrera"  wrote:
> I think it's okay to make lastNotifiedSeg protected by just info_lck,
> and RecordBoundaryMap protected by just ArchNotifyLock.  It's okay to
> acquire the spinlock inside the lwlock-protected area, as long as we
> make sure never to do the opposite.  (And we sure don't want to hold
> info_lck long enough that a LWLock acquisition would occur in the
> meantime).  So I modified things that way, and also added another
> function to set the seg if it's unset, with a single spinlock
> acquisition (rather than acquire, read, release, acquire, set,
> release, which sounds like it would have trouble behaving.)

The patch looks good to me.

> I find it highly suspicious that the patch does an archiver notify (i.e.
> creation of the .ready file) in XLogInsertRecord().  Is that a sane
> thing to do?  Sounds to me like that should be attempted in XLogFlush
> only.  This appeared after Kyotaro's patch at [1] and before your patch
> at [2].

I believe my worry was that we'd miss notifying a segment as soon as
possible if the record was somehow flushed prior to registering the
record boundary in the map.  If that's actually impossible, then I
would agree that the extra call to NotifySegmentsReadyForArchive() is
unnecessary.

>> I think the main downside of making lastNotifiedSeg an atomic is that
>> the value we first read in NotifySegmentsReadyForArchive() might not
>> be up-to-date, which means we might hold off creating .ready files
>> longer than necessary.
>
> I'm not sure I understand how this would be a problem.  If we block
> somebody from setting a newer value, they'll just set the value
> immediately after we release the lock.  Will we reread the value
> afterwards to see if it changed?

I think you are right.  If we see an old value for lastNotifiedSeg,
the worst case is that we take the ArchNotifyLock, read
lastNotifiedSeg again (which should then be up-to-date), and then
basically do nothing.  I suspect initializing lastNotifiedSeg might
still be a little tricky, though.  Do you think it is important to try
to avoid this spinlock for lastNotifiedSeg?  IIUC it's acquired at the
end of every call to XLogWrite() already, and we'd still need to
acquire it for the flush pointer, anyway.

Nathan



RE: make MaxBackends available in _PG_init

2021-08-02 Thread wangsh.f...@fujitsu.com
Hi,

Bossart, Nathan  wrote:

> I just encountered the same thing, so I am bumping this thread.
Thank you for bumping this thread.

> > I've wondered, independent of this thread, about not making MaxBackends
> > externally visible, and requiring a function call to access it. It should be
> > easier to find cases of misuse if we errored out when accessed at the wrong
> > time. 

> I'm not opposed to this.  I can work on putting a patch together if no
> opposition materializes.

I think this is a good idea.

Shenhao Wang
Best regards


Re: Record a Bitmapset of non-pruned partitions

2021-08-02 Thread David Rowley
On Mon, 2 Aug 2021 at 20:16, David Rowley  wrote:
> If anyone has any objections to both the v4 0001 and 0002 patch, can
> they let me know soon. I'm currently seeing no reason that they can't
> go in.

I've now pushed both of these. Thanks for the reviews.

David




Re: [HACKERS] logical decoding of two-phase transactions

2021-08-02 Thread Peter Smith
Please find attached the latest patch set v102*

Differences:

* Rebased to HEAD @ today.

* This is a documentation change only. A recent commit [1] has changed
the documentation style for the message formats slightly to annotate
the data types. For consistency, the same style change needs to be
adopted for the newly added message of this patch.  This same change
also finally addresses some old review comments [2] from Vignesh.


[1] 
https://github.com/postgres/postgres/commit/a5cb4f9829fbfd68655543d2d371a18a8eb43b84
[2] 
https://www.postgresql.org/message-id/CALDaNm3U4fGxTnQfaT1TqUkgX5c0CSDvmW12Bfksis8zB_XinA%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia


v102-0001-Add-prepare-API-support-for-streaming-transacti.patch
Description: Binary data


Re: [HACKERS] logical decoding of two-phase transactions

2021-08-02 Thread Peter Smith
On Mon, May 10, 2021 at 1:31 PM vignesh C  wrote:
>
...
>
> 2) I felt we can change lsn data type from Int64 to XLogRecPtr
> +
> +Int64
> +
> +The LSN of the prepare.
> +
> +
> +
> +
> +Int64
> +
> +The end LSN of the transaction.
> +
> +
>
> 3) I felt we can change lsn data type from Int32 to TransactionId
> +
> +Int32
> +
> +Xid of the subtransaction (will be same as xid of the
> transaction for top-level
> +transactions).
> +
> +
>
...
>
> Similar problems related to comments 2 and 3 are being discussed at
> [1], we can change it accordingly based on the conclusion in the other
> thread.
> [1] - 
> https://www.postgresql.org/message-id/flat/CAHut%2BPs2JsSd_OpBR9kXt1Rt4bwyXAjh875gUpFw6T210ttO7Q%40mail.gmail.com#cf2a85d0623dcadfbb1204a196681313
>

Earlier today the other documentation patch mentioned above was
committed by Tom Lane.

The 2PC patch v102 now fixes your review comments 2 and 3 by matching
the same datatype annotation style of that commit.

--
Kind Regards,
Peter Smith
Fujitsu Australia




Re: 2021-07 CF now in progress

2021-08-02 Thread Masahiko Sawada
On Tue, Aug 3, 2021 at 12:34 AM Ibrar Ahmed  wrote:
>
>
>
> On Mon, Jul 26, 2021 at 5:52 PM Ibrar Ahmed  wrote:
>>
>>
>>
>> On Mon, Jul 19, 2021 at 4:37 PM Ibrar Ahmed  wrote:
>>>
>>>
>>>
>>> On Mon, Jul 12, 2021 at 4:59 PM Ibrar Ahmed  wrote:


 Hackers,

 The Commitfest 2021-07 is now in progress. It is one of the biggest ones. 
 Total number of patches of this commitfest is 342.

 Needs review: 204.
 Waiting on Author: 40.
 Ready for Committer: 18.
 Committed: 57.
 Moved to next CF: 3.
 Withdrawn: 15.
 Rejected: 3.
 Returned with Feedback: 2.
 Total: 342.

 If you are a patch author, please check http://commitfest.cputube.org to 
 be sure your patch still applies, compiles, and passes tests.

 We need your involvement and participation in reviewing the patches. Let's 
 try and make this happen.

 --
 Regards.
 Ibrar Ahmed
>>>
>>>
>>>
>>> Over the past one week, statuses of 47 patches have been changed from 
>>> "Needs review". This still leaves us with 157 patches
>>> requiring reviews. As always, your continuous support is appreciated to get 
>>> us over the line.
>>>
>>> Please look at the patches requiring review in the current commitfest. 
>>> Test, provide feedback where needed, and update the patch status.
>>>
>>> Total: 342.
>>>
>>> Needs review: 157.
>>> Waiting on Author: 74.
>>> Ready for Committer: 15.
>>> Committed: 68.
>>> Moved to next CF: 3.
>>> Withdrawn: 19.
>>> Rejected: 4.
>>> Returned with Feedback: 2.
>>>
>>>
>>> --
>>> Ibrar Ahmed
>>
>>
>> Over the past one week, some progress was made, however, there are still 155 
>> patches in total that require reviews. Time to continue pushing for 
>> maximising patch reviews and getting stuff committed in PostgreSQL.
>>
>> Total: 342.
>> Needs review: 155.
>> Waiting on Author: 67.
>> Ready for Committer: 20.
>> Committed: 70.
>> Moved to next CF: 3.
>> Withdrawn: 20.
>> Rejected: 5.
>> Returned with Feedback: 2.
>>
>> Thank you for your continued effort and support.
>>
>> --
>> Ibrar Ahmed
>

Thank you for working as a commitfest manager.

> Here is the current state of the commitfest. It looks like it should be 
> closed now. I don't have permission to do that.

I can close this commitfest. But should we mark uncommitted patches as
"Moved to next CF" or "Returned with Feedback" beforehand?

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: 2021-07 CF now in progress

2021-08-02 Thread Tom Lane
Masahiko Sawada  writes:
> I can close this commitfest. But should we mark uncommitted patches as
> "Moved to next CF" or "Returned with Feedback" beforehand?

Should be moved to next CF, or at least most of them should be.
(But I doubt anyone wants to try to kill off patches that
aren't going anywhere right now.)

regards, tom lane




Re: A micro-optimisation for ProcSendSignal()

2021-08-02 Thread Thomas Munro
Hi Soumyadeep,

On Sat, Jul 24, 2021 at 5:26 PM Soumyadeep Chakraborty
 wrote:
> On Tue, Jul 20, 2021 at 10:40 PM Thomas Munro  wrote:
> > I wonder why we need this member anyway, when you can compute it from
> > the address... #define GetPGProcNumber(p) ((p) - ProcGlobal->allProcs)
> > or something like that?  Kinda wonder why we don't use
> > GetPGProcByNumber() in more places instead of open-coding access to
> > ProcGlobal->allProcs, too...
>
> I tried this out. See attached v4 of your patch with these changes.

I like it.  I've moved these changes to a separate patch, 0002, for
separate commit.  I tweaked a couple of comments (it's not a position
in the "procarray", well it's a position stored in the procarray, but
that's confusing; I also found a stray comment about leader->pgprocno
that is obsoleted by this change).  Does anyone have objections to
this?

I was going to commit the earlier change this morning, but then I read [1].

New idea.  Instead of adding pgprocno to SERIALIZABLEXACT (which we
should really be trying to shrink, not grow), let's look it up by
vxid->backendId.  I didn't consider that before, because I was trying
not to get tangled up with BackendIds for various reasons, not least
that that's yet another lock + O(n) search.

It seems likely that getting from vxid to latch will be less clumsy in
the near future.  That refactoring and harmonising of backend
identifiers seems like a great idea to me.  Here's a version that
anticipates that, using vxid->backendId to wake a sleeping
SERIALIZABLE READ ONLY DEFERRABLE backend, without having to add a new
member to the struct.
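
The shape of it is roughly the following (simplified, locking and validity
checks omitted, so the real patch differs):

#include "postgres.h"
#include "storage/lock.h"
#include "storage/proc.h"
#include "storage/sinvaladt.h"

/* wake the backend identified by a virtual transaction id, if still around */
static void
wake_vxid_owner(VirtualTransactionId vxid)
{
	PGPROC	   *proc = BackendIdGetProc(vxid.backendId);

	if (proc != NULL)
		SetLatch(&proc->procLatch);
}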

> A session ID seems a bit heavy just to indicate whether a backend has
> exited.

Yeah.  A Greenplum-like session ID might eventually be necessary in a
world where sessions are divorced from processes and handled by a pool
of worker threads, though.  /me gazes towards the horizon

[1] 
https://www.postgresql.org/message-id/flat/20210802164124.ufo5buo4apl6yuvs%40alap3.anarazel.de
From b284d8f29efc1c16c3aa75b64d9a940bcb74872c Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Tue, 3 Aug 2021 10:02:15 +1200
Subject: [PATCH v5 1/2] Optimize ProcSendSignal().

Instead of referring to target backends by pid, use pgprocno.  This
means that we don't have to scan the ProcArray, and we can drop some
special case code for dealing with the startup process.

In the case of buffer pin waits, we switch to storing the pgprocno of
the waiter.  In the case of SERIALIZABLE READ ONLY DEFERRABLE waits, we
derive the pgprocno from the vxid (though that's not yet as efficient as
it could be).

Reviewed-by: Soumyadeep Chakraborty 
Reviewed-by: Ashwin Agrawal 
Discussion: https://postgr.es/m/CA%2BhUKGLYRyDaneEwz5Uya_OgFLMx5BgJfkQSD%3Dq9HmwsfRRb-w%40mail.gmail.com
---
 src/backend/access/transam/xlog.c |  3 --
 src/backend/storage/buffer/buf_init.c |  3 +-
 src/backend/storage/buffer/bufmgr.c   | 10 +++---
 src/backend/storage/lmgr/predicate.c  |  8 -
 src/backend/storage/lmgr/proc.c   | 49 ++-
 src/include/storage/buf_internals.h   |  2 +-
 src/include/storage/proc.h|  6 +---
 7 files changed, 19 insertions(+), 62 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f84c0bb01e..1574362724 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7308,9 +7308,6 @@ StartupXLOG(void)
 		/* Also ensure XLogReceiptTime has a sane value */
 		XLogReceiptTime = GetCurrentTimestamp();
 
-		/* Allow ProcSendSignal() to find us, for buffer pin wakeups. */
-		PublishStartupProcessInformation();
-
 		/*
 		 * Let postmaster know we've started redo now, so that it can launch
 		 * the archiver if necessary.
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index a299be1043..b9a83c0e9b 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -16,6 +16,7 @@
 
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
+#include "storage/proc.h"
 
 BufferDescPadded *BufferDescriptors;
 char	   *BufferBlocks;
@@ -118,7 +119,7 @@ InitBufferPool(void)
 			CLEAR_BUFFERTAG(buf->tag);
 
 			pg_atomic_init_u32(&buf->state, 0);
-			buf->wait_backend_pid = 0;
+			buf->wait_backend_pgprocno = INVALID_PGPROCNO;
 
 			buf->buf_id = i;
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 33d99f604a..fb86c8706c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1890,11 +1890,11 @@ UnpinBuffer(BufferDesc *buf, bool fixOwner)
 BUF_STATE_GET_REFCOUNT(buf_state) == 1)
 			{
 /* we just released the last pin other than the waiter's */
-int			wait_backend_pid = buf->wait_backend_pid;
+int			wait_backend_pgprocno = buf->wait_backend_pgprocno;
 
 buf_state &= ~BM_PIN_COUNT_WAITER;
 UnlockBufHdr(buf, buf_state);
-ProcSendSignal(wait_backend_pid);
+Pr

Re: 2021-07 CF now in progress

2021-08-02 Thread Masahiko Sawada
On Tue, Aug 3, 2021 at 10:43 AM Tom Lane  wrote:
>
> Masahiko Sawada  writes:
> > I can close this commitfest. But should we mark uncommitted patches as
> > "Moved to next CF" or "Returned with Feedback" beforehand?
>
> Should be moved to next CF, or at least most of them should be.
> (But I doubt anyone wants to try to kill off patches that
> aren't going anywhere right now.)

Agreed. I'll close this commitfest soon. The patches are automatically
marked as "Moved to next CF" by closing the commitfest?

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




fix DECLARE tab completion

2021-08-02 Thread Shinya11.Kato
Hi!

In the discussion of [1], the option ASENSITIVE was added to DECLARE.
I have improved its tab completion and attached a patch.

What do you think?

[1] 
https://www.postgresql.org/message-id/flat/96ee8b30-9889-9e1b-b053-90e10c050e85%40enterprisedb.com

Regards,
Shinya Kato


fix_declare_tab_completion.patch
Description: fix_declare_tab_completion.patch


Re: 2021-07 CF now in progress

2021-08-02 Thread Masahiko Sawada
On Tue, Aug 3, 2021 at 10:50 AM Masahiko Sawada  wrote:
>
> On Tue, Aug 3, 2021 at 10:43 AM Tom Lane  wrote:
> >
> > Masahiko Sawada  writes:
> > > I can close this commitfest. But should we mark uncommitted patches as
> > > "Moved to next CF" or "Returned with Feedback" beforehand?
> >
> > Should be moved to next CF, or at least most of them should be.
> > (But I doubt anyone wants to try to kill off patches that
> > aren't going anywhere right now.)
>
> Agreed. I'll close this commitfest soon. The patches are automatically
> marked as "Moved to next CF" by closing the commitfest?

It seems not.

Anyway, I've closed this commitfest. I'll move uncommitted patch
entries to the next commitfest.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Background writer and checkpointer in crash recovery

2021-08-02 Thread Thomas Munro
On Tue, Aug 3, 2021 at 9:52 AM Thomas Munro  wrote:
> On Tue, Aug 3, 2021 at 1:17 AM Robert Haas  wrote:
> > That's great. I just realized that this leaves us with identical
> > RequestCheckpoint() calls in two nearby places. Is there any reason
> > not to further simplify as in the attached?
>
> LGTM.

And pushed.




Dividing line between before_shmem_exit and on_shmem_exit?

2021-08-02 Thread Andres Freund
Hi,

I don't think I really understand what decides what should use
before_shmem_exit() and what into on_shmem_exit(). The existing comments say:

/* 
 *  before_shmem_exit
 *
 *  Register early callback to perform user-level cleanup,
 *  e.g. transaction abort, before we begin shutting down
 *  low-level subsystems.
 * 
 */

/* 
 *  on_shmem_exit
 *
 *  Register ordinary callback to perform low-level shutdown
 *  (e.g. releasing our PGPROC); run after before_shmem_exit
 *  callbacks and before on_proc_exit callbacks.
 * 
 */

but I don't find it super obvious when a subsystem is "low-level" or
"user-level". Which is not helped by a fairly inconsistent use of the
mechanism. E.g. ParallelWorkerShutdown() is on_*, ShutdownPostgres,
ShutdownAuxiliaryProcess are before_* and I can't really make sense of that.

Nor do I really understand why dsm_backend_shutdown() is scheduled to be
exactly between those phases? That seems to indicate that no "low-level"
subsystem could ever use DSM, but I don't really understand that either. I
realize it has to be reliably re-invoked in case of errors as the comment
in shmem_exit() explains - but does not really explain why that's the right
moment to shut down DSM.

I suspect it's mostly pragmatic, because it has to happen before we tear down
"really fundamental" subsystems, but there's no easy way to call
dsm_backend_shutdown() repeatedly at that point?


The concrete reason I am asking is that for the shared memory stats patch I
needed to adjust a few on/before determinations. The fundamental reason for
that is that the shared memory stats patch uses DSM (since stats are variably
sized), which means that anything that could report stats needs to happen in
before_shmem_exit().
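
(For context, by "happen in before_shmem_exit()" I just mean a registration
like the below; the callback name/body is made up:)

#include "postgres.h"
#include "storage/ipc.h"

static void
flush_pending_stats(int code, Datum arg)
{
	/* report whatever stats are still pending, while DSM is still usable */
}

void
my_subsystem_init(void)
{
	before_shmem_exit(flush_pending_stats, (Datum) 0);
}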

To make things work in a halfway reasonable fashion I needed to adjust:

1) Single user mode using on_shmem_exit(ShutdownXLOG()) to before*. From the
   stats side that makes sense - clearly checkpoints can produce stat worthy
   stuff, so it has to happen before the stats subsystem is inactive. I don't
   know if it's "low-level" in the sense referenced in the comments above
   though.

2) Change ParallelWorkerShutdown to before*. That seems consistent with other
   *Shutdown* routines, so ...

3) Have ParallelWorkerShutdown() also trigger detaching from its dsm
   segment. Some on_dsm_detach() callbacks trigger stats (e.g. sharedfileset.c
   causes fd.c to do ReportTemporaryFileUsage()), which still need to be
   reported.

   Unfortunately ordering on_dsm_detach() callbacks is *not* sufficient,
   because dsa.c needs to be able to create new dsm segments if the amount of
   stats grow - and those will be put at the head of dsm_segment_list, which
   in turn will lead to dsm_backend_shutdown() to detach from that newly added
   segment last. Oops.

Pragmatically I can live with all of these changes and I don't think they
should be blockers for the shmem stats patch.


However, this situation feels deeply unsatisfying. We have to be very careful
about nesting initialization of subsystems correctly (see e.g. comments for
ShutdownPostgres's registration), there's no way to order dsm callbacks in a
proper order, on_shmem_exit is used for things ranging from
pgss_shmem_shutdown over ProcKill to IpcMemoryDelete, ...

IOW, the reason that I (and I think others) are struggling with
before/on_shmem_exit is that the division is not really good enough.

I'm not even convinced that dynamic registration of such callbacks is a good
idea. At least for the important in core stuff it might be better to have a
centrally maintained ordering of callbacks, plus a bit of state describing how
far teardown has progressed (to deal with failing callback issues).

Greetings,

Andres Freund




Re: archive status ".ready" files may be created too early

2021-08-02 Thread Kyotaro Horiguchi
At Mon, 2 Aug 2021 23:28:19 +, "Bossart, Nathan"  
wrote in 
> On 8/2/21, 2:42 PM, "Alvaro Herrera"  wrote:
> > I think it's okay to make lastNotifiedSeg protected by just info_lck,
> > and RecordBoundaryMap protected by just ArchNotifyLock.  It's okay to
> > acquire the spinlock inside the lwlock-protected area, as long as we
> > make sure never to do the opposite.  (And we sure don't want to hold
> > info_lck long enough that a LWLock acquisition would occur in the
> > meantime).  So I modified things that way, and also added another
> > function to set the seg if it's unset, with a single spinlock
> > acquisition (rather than acquire, read, release, acquire, set,
> > release, which sounds like it would have trouble behaving.)
> 
> The patch looks good to me.

+   for (seg = flushed_seg; seg > last_notified; seg--)
+   {
+   RecordBoundaryEntry *entry;
+
+   entry = (RecordBoundaryEntry *) hash_search(RecordBoundaryMap,
+   (void *) &seg, HASH_FIND,

I'm afraid that using a hash to store the boundary info is too much. Isn't a
ring buffer enough for this use?  In that case it is enough to
remember only the end LSN of the segment-spanning records.  It is easy
to expand the buffer if needed.
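
Something like the following is what I have in mind (purely illustrative;
names and sizing are invented, and locking is omitted):

#include "postgres.h"
#include "access/xlogdefs.h"

#define BOUNDARY_RING_SIZE 16	/* expand if it ever fills up */

typedef struct SegmentBoundary
{
	XLogSegNo	seg;			/* segment the cross-boundary record ends in */
	XLogRecPtr	endpos;			/* end LSN of that record */
} SegmentBoundary;

static SegmentBoundary boundary_ring[BOUNDARY_RING_SIZE];
static int	boundary_head = 0;	/* next slot to fill */
static int	boundary_tail = 0;	/* oldest unconsumed entry */
static int	boundary_count = 0;

static void
remember_boundary(XLogSegNo seg, XLogRecPtr endpos)
{
	Assert(boundary_count < BOUNDARY_RING_SIZE);	/* or expand here */
	boundary_ring[boundary_head].seg = seg;
	boundary_ring[boundary_head].endpos = endpos;
	boundary_head = (boundary_head + 1) % BOUNDARY_RING_SIZE;
	boundary_count++;
}

/* on flush, consume entries up to the flushed LSN and notify the archiver */
static void
consume_boundaries(XLogRecPtr flushed)
{
	while (boundary_count > 0 &&
		   boundary_ring[boundary_tail].endpos <= flushed)
	{
		/* XLogArchiveNotifySeg(boundary_ring[boundary_tail].seg), etc. */
		boundary_tail = (boundary_tail + 1) % BOUNDARY_RING_SIZE;
		boundary_count--;
	}
}

Entries are appended in LSN order, so consumption from the tail stays ordered.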

+   if (!XLogSegNoIsInvalid(latest_boundary_seg))

It is a matter of taste, but I see latest_boundary_seg !=
InvalidXLogSegNo used more frequently, maybe to avoid the double negation.


@@ -1167,10 +1195,33 @@ XLogInsertRecord(XLogRecData *rdata,
SpinLockRelease(&XLogCtl->info_lck);
}
 
+   /*
+* Record the record boundary if we crossed the segment boundary.  This is
...
+   XLByteToSeg(StartPos, StartSeg, wal_segment_size);
+   XLByteToSeg(EndPos, EndSeg, wal_segment_size);
+
+   if (StartSeg != EndSeg && XLogArchivingActive())
+   {

The immediately preceding if block is for cross-page records, so we can
reduce the overhead of the above calculations by moving them into that
if-block.


+RegisterRecordBoundaryEntry(XLogSegNo seg, XLogRecPtr pos)

The seg is restricted to the segment that pos resides on.  The caller
would be freed from caring about that restriction if the function took only
pos.  It adds a small overhead to calculate the segment number from the LSN,
but I think that doesn't matter much. (Or, if we don't use a hash, that
calculation is not required at all.)


@@ -2602,7 +2855,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
LogwrtResult.Flush = LogwrtResult.Write;
/* end of page */
 
if (XLogArchivingActive())
-   XLogArchiveNotifySeg(openLogSegNo);
+   SetLastNotifiedSegmentIfInvalid(openLogSegNo - 1);

Is it safe?  If the server didn't notify the archiver of the WAL files for
the 3 most recently finished segments in its previous life, they need to be
archived in this lifetime. But this maybe omits all three of them.
(I didn't confirm that behavior.)

> > I find it highly suspicious that the patch does an archiver notify (i.e.
> > creation of the .ready file) in XLogInsertRecord().  Is that a sane
> > thing to do?  Sounds to me like that should be attempted in XLogFlush
> > only.  This appeared after Kyotaro's patch at [1] and before your patch
> > at [2].
> 
> I believe my worry was that we'd miss notifying a segment as soon as
> possible if the record was somehow flushed prior to registering the
> record boundary in the map.  If that's actually impossible, then I
> would agree that the extra call to NotifySegmentsReadyForArchive() is
> unnecessary.

I don't think that XLogWrite(up to LSN=X) can happen before
XLogInsert(endpos = X) ends.

> >> I think the main downside of making lastNotifiedSeg an atomic is that
> >> the value we first read in NotifySegmentsReadyForArchive() might not
> >> be up-to-date, which means we might hold off creating .ready files
> >> longer than necessary.
> >
> > I'm not sure I understand how this would be a problem.  If we block
> > somebody from setting a newer value, they'll just set the value
> > immediately after we release the lock.  Will we reread the value
> > afterwards to see if it changed?
> 
> I think you are right.  If we see an old value for lastNotifiedSeg,
> the worst case is that we take the ArchNotifyLock, read
> lastNotifiedSeg again (which should then be up-to-date), and then

Agreed.

> basically do nothing.  I suspect initializing lastNotifiedSeg might
> still be a little tricky, though.  Do you think it is important to try
> to avoid this spinlock for lastNotifiedSeg?  IIUC it's acquired at the
> end of every call to XLogWrite() already, and we'd still need to
> acquire it for the flush pointer, anyway.

As mentioned above, I think it needs more consideration.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: A micro-optimisation for ProcSendSignal()

2021-08-02 Thread Soumyadeep Chakraborty
Hey Thomas,

On Mon, Aug 2, 2021 at 6:45 PM Thomas Munro  wrote:
>
> Hi Soumyadeep,
>
> On Sat, Jul 24, 2021 at 5:26 PM Soumyadeep Chakraborty
>  wrote:
> > On Tue, Jul 20, 2021 at 10:40 PM Thomas Munro  
> > wrote:
> > > I wonder why we need this member anyway, when you can compute it from
> > > the address... #define GetPGProcNumber(p) ((p) - ProcGlobal->allProcs)
> > > or something like that?  Kinda wonder why we don't use
> > > GetPGProcByNumber() in more places instead of open-coding access to
> > > ProcGlobal->allProcs, too...
> >
> > I tried this out. See attached v4 of your patch with these changes.
>
> I like it.  I've moved these changes to a separate patch, 0002, for
> separate commit.  I tweaked a couple of comments (it's not a position
> in the "procarray", well it's a position stored in the procarray, but
> that's confusing; I also found a stray comment about leader->pgprocno
> that is obsoleted by this change).  Does anyone have objections to
> this?

Awesome, thanks! Looks good.

> I was going to commit the earlier change this morning, but then I read [1].
>
> New idea.  Instead of adding pgprocno to SERIALIZABLEXACT (which we
> should really be trying to shrink, not grow), let's look it up by
> vxid->backendId.  I didn't consider that before, because I was trying
> not to get tangled up with BackendIds for various reasons, not least
> that that's yet another lock + O(n) search.
>
> It seems likely that getting from vxid to latch will be less clumsy in
> the near future.  That refactoring and harmonising of backend
> identifiers seems like a great idea to me.  Here's a version that
> anticipates that, using vxid->backendId to wake a sleeping
> SERIALIZABLE READ ONLY DEFERRABLE backend, without having to add a new
> member to the struct.
>

Neat! A vxid -> PGPROC lookup eventually becomes an O(1) operation with the
changes proposed in the final paragraph of [1].

[1] 
https://www.postgresql.org/message-id/20210802164124.ufo5buo4apl6yuvs%40alap3.anarazel.de

Regards,
Soumyadeep (VMware)




Re: Failed transaction statistics to measure the logical replication progress

2021-08-02 Thread Amit Kapila
On Mon, Aug 2, 2021 at 1:13 PM Masahiko Sawada  wrote:
>
> On Mon, Aug 2, 2021 at 2:52 PM osumi.takami...@fujitsu.com
>  wrote:
> >
> >
> > Accordingly, I'm thinking to have unsuccessful and successful stats on the 
> > sub side.
> > Sawada-san is now implementing a new view in [1].
> > Do you think that I should write a patch to introduce a new separate view
> > or write a patch to add more columns to the new view 
> > "pg_stat_subscription_errors" that is added at [1] ?
>
> pg_stat_subscriptions_errors view I'm proposing is a view showing the
> details of error happening during logical replication. So I think a
> separate view or pg_stat_subscription view would be a more appropriate
> place.
>

+1 for having these stats in pg_stat_subscription. Do we want to add
two columns (xact_commit: number of transactions successfully applied
in this subscription, xact_rollback: number of transactions that have
been rolled back in this subscription) or do you guys have something
else in mind?

-- 
With Regards,
Amit Kapila.




Re: A micro-optimisation for ProcSendSignal()

2021-08-02 Thread Andres Freund
Hi,


On 2021-08-03 13:44:58 +1200, Thomas Munro wrote:
> New idea.  Instead of adding pgprocno to SERIALIZABLEXACT (which we
> should really be trying to shrink, not grow), let's look it up by
> vxid->backendId.  I didn't consider that before, because I was trying
> not to get tangled up with BackendIds for various reasons, not least
> that that's yet another lock + O(n) search.
>
> It seems likely that getting from vxid to latch will be less clumsy in
> the near future.

So this change only makes sense if vxids would start to use pgprocno instead
of backendid, basically?


> From b284d8f29efc1c16c3aa75b64d9a940bcb74872c Mon Sep 17 00:00:00 2001
> From: Thomas Munro 
> Date: Tue, 3 Aug 2021 10:02:15 +1200
> Subject: [PATCH v5 1/2] Optimize ProcSendSignal().
>
> Instead of referring to target backends by pid, use pgprocno.  This
> means that we don't have to scan the ProcArray, and we can drop some
> special case code for dealing with the startup process.
>
> In the case of buffer pin waits, we switch to storing the pgprocno of
> the waiter.  In the case of SERIALIZABLE READ ONLY DEFERRABLE waits, we
> derive the pgprocno from the vxid (though that's not yet as efficient as
> it could be).

That's kind of an understatement :)




> -ProcSendSignal(int pid)
> +ProcSendSignal(int pgprocno)
>  {
> - PGPROC *proc = NULL;
> -
> - if (RecoveryInProgress())
> - {
> - SpinLockAcquire(ProcStructLock);
> -
> - /*
> -  * Check to see whether it is the Startup process we wish to 
> signal.
> -  * This call is made by the buffer manager when it wishes to 
> wake up a
> -  * process that has been waiting for a pin in so it can obtain a
> -  * cleanup lock using LockBufferForCleanup(). Startup is not a 
> normal
> -  * backend, so BackendPidGetProc() will not return any pid at 
> all. So
> -  * we remember the information for this special case.
> -  */
> - if (pid == ProcGlobal->startupProcPid)
> - proc = ProcGlobal->startupProc;
> -
> - SpinLockRelease(ProcStructLock);
> - }
> -
> - if (proc == NULL)
> - proc = BackendPidGetProc(pid);
> -
> - if (proc != NULL)
> - {
> - SetLatch(&proc->procLatch);
> - }
> + SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
>  }

I think some validation of the pgprocno here would be a good idea. I'm worried
that something could cause e.g. INVALID_PGPROCNO to be passed in, and suddenly
we're modifying random memory. That could end up being a pretty hard bug to
catch, because we might not even notice that the right latch isn't set...
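
E.g. just a range check would do (the exact upper bound to use is debatable):

void
ProcSendSignal(int pgprocno)
{
	/* also catches INVALID_PGPROCNO and other garbage values */
	Assert(pgprocno >= 0 && pgprocno < (int) ProcGlobal->allProcCount);

	SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
}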


> From 562657ea3f7be124a6c6b6d1e7450da2431a54a0 Mon Sep 17 00:00:00 2001
> From: Thomas Munro 
> Date: Thu, 11 Mar 2021 23:09:11 +1300
> Subject: [PATCH v5 2/2] Remove PGPROC's redundant pgprocno member.
>
> It's derivable with pointer arithmetic.
>
> Author: Soumyadeep Chakraborty 
> Discussion:
> https://postgr.es/m/CA%2BhUKGLYRyDaneEwz5Uya_OgFLMx5BgJfkQSD%3Dq9HmwsfRRb-w%40mail.gmail.com


>  /* Accessor for PGPROC given a pgprocno. */
>  #define GetPGProcByNumber(n) (&ProcGlobal->allProcs[(n)])
> +/* Accessor for pgprocno given a pointer to PGPROC. */
> +#define GetPGProcNumber(proc) ((proc) - ProcGlobal->allProcs)

I'm not sure this is a good idea. There's no really cheap way for the compiler
to compute this, because sizeof(PGPROC) isn't a power of two. Given that
PGPROC is ~880 bytes, I don't see all that much gain in getting rid of
->pgprocno.

Yes, compilers can optimize away the super expensive division, but it'll end
up as something like subtraction, shift, multiplication - not that cheap
either. And I suspect it'll often have to first load the ProcGlobal via the
GOT as well...

Greetings,

Andres Freund




RE: [PATCH]Comment improvement in publication.sql

2021-08-02 Thread tanghy.f...@fujitsu.com
On Monday, August 2, 2021 11:56 PM vignesh C  wrote:
> 
> Few minor suggestions:
> 1) Should we change below to "fail - tables can't be added, dropped or
> set to "FOR ALL TABLES" publications"
> --- fail - can't add to for all tables publication
> +-- fail - tables can't be added to or dropped from FOR ALL TABLES 
> publications

Thanks for your kindly comments.

I'm not sure whether we should ignore that 'drop' should be followed by 'from',
but I think there's no misunderstanding. So, I modified it as you described.

Besides, in the comments of another file (src/test/subscription/t/100_bugs.pl)
I saw FOR ALL TABLES used without quotes, so I don't think we need the quotes
either.

> 2) Should we change this to "--fail - already existing publication"
> --- fail - already added
> +-- fail - already exists

Modified.

A modified patch is attached.

Regards
Tang


v2-improvement-on-comment.patch
Description: v2-improvement-on-comment.patch


Re: row filtering for logical replication

2021-08-02 Thread Peter Smith
FYI - v20 --> v21

(Only very minimal changes)

* I noticed that the v20 TAP test (023_row_filter.pl) began failing
due to a recent commit [1], so I have rebased it to keep the cfbot
happy.

--
[1] 
https://github.com/postgres/postgres/commit/201a76183e2056c2217129e12d68c25ec9c559c8

Kind Regards,
Peter Smith.
Fujitsu Australia


v21-0001-Row-filter-for-logical-replication.patch
Description: Binary data


Remove unused 'len' from pg_stat_recv_* functions

2021-08-02 Thread Masahiko Sawada
Hi all,

While working on a patch adding new stats, houzj pointed out that
'len' function arguments of all pgstat_recv_* functions are not used
at all. Looking at the git history, the pgstat_recv_* functions have had
‘len’ since the stats collector was introduced by commit 140ddb78fe 20 years
ago, but it was not used at all even in that first commit. It seems the
improvements to the stats collector since then gave new pgstat_recv_*
functions a ‘len’ argument simply for consistency with the existing ones.
Is there any historical reason for having the 'len' argument? Or can we
remove them?

I've attached the patch that removes 'len' from all pgstat_recv_* functions.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


remove_len_from_pgstat_recv.patch
Description: Binary data


Sort keys are omitted incorrectly after grouping sets

2021-08-02 Thread Richard Guo
Hi hackers,

In each PathKey we have an EC representing the value being sorted on.
This works correctly and efficiently in scan/join planning since the
members in an EC are always equal to each other.

However, after applying GROUP BY grouping sets, the members in an EC may
not be that equivalent any more, because NULLS may be emitted by
grouping sets. As a result, we may lose some sort keys incorrectly.

# explain (costs off) select a, b from foo where a = b group by cube(a, b)
order by a, b;
  QUERY PLAN
---
 Sort
   Sort Key: a
   ->  MixedAggregate
 Hash Key: a, b
 Hash Key: a
 Hash Key: b
 Group Key: ()
 ->  Seq Scan on foo
   Filter: (a = b)
(9 rows)

I believe we should not ignore sort key 'b' in the query above.

Is this a problem we should be worried about?

Thanks
Richard


Changes to recovery_min_apply_delay are ignored while waiting for delay

2021-08-02 Thread Soumyadeep Chakraborty
Hello,

I came across this issue while I was tweaking a TAP test with Ashwin for
this thread [1].

We noticed that while the startup process waits for a recovery delay, it does
not respect changes to the recovery_min_apply_delay GUC. So:

// Standby is waiting on recoveryWakeupLatch with a large
recovery_min_apply_delay value
node_standby->safe_psql('postgres', 'ALTER SYSTEM SET
recovery_min_apply_delay TO 0;');
node_standby->reload;
// Standby is still waiting; it does not proceed until the old timeout
has elapsed.

I attached a small patch to fix this. It makes it so that
recovery_min_apply_delay is reconsulted in the loop to redetermine the wait
interval. This makes it similar to the loop in CheckpointerMain, where
CheckPointTimeout is consulted after interrupts are handled:

if (elapsed_secs >= CheckPointTimeout)
continue; /* no sleep for us ... */

I have also added a test to 005_replay_delay.pl.

Regards,
Soumyadeep(VMware)

[1] 
https://www.postgresql.org/message-id/flat/CANXE4Tc3FNvZ_xAimempJWv_RH9pCvsZH7Yq93o1VuNLjUT-mQ%40mail.gmail.com
From e1b93d4b3874eb0923e18dca9a9eccd453d6cd56 Mon Sep 17 00:00:00 2001
From: Soumyadeep Chakraborty 
Date: Mon, 2 Aug 2021 20:50:37 -0700
Subject: [PATCH v1 1/1] Refresh delayUntil in recovery delay loop

This ensures that the wait interval in the loop is correctly
recalculated with the updated value of recovery_min_apply_delay.

Now, if one changes the GUC while we are waiting, those changes will
take effect. Practical applications include instantly cancelling a long
wait time by setting recovery_min_apply_delay to 0. This can be useful
in tests.
---
 src/backend/access/transam/xlog.c   | 11 ++---
 src/test/recovery/t/005_replay_delay.pl | 32 +
 2 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f84c0bb01eb..89dc759586c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6234,12 +6234,11 @@ recoveryApplyDelay(XLogReaderState *record)
 	if (!getRecordTimestamp(record, &xtime))
 		return false;
 
-	delayUntil = TimestampTzPlusMilliseconds(xtime, recovery_min_apply_delay);
-
 	/*
 	 * Exit without arming the latch if it's already past time to apply this
 	 * record
 	 */
+	delayUntil = TimestampTzPlusMilliseconds(xtime, recovery_min_apply_delay);
 	msecs = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), delayUntil);
 	if (msecs <= 0)
 		return false;
@@ -6255,7 +6254,13 @@ recoveryApplyDelay(XLogReaderState *record)
 			break;
 
 		/*
-		 * Wait for difference between GetCurrentTimestamp() and delayUntil
+		 * Recalculate delayUntil as recovery_min_apply_delay could have changed
+		 * while we were waiting in the loop.
+		 */
+		delayUntil = TimestampTzPlusMilliseconds(xtime, recovery_min_apply_delay);
+
+		/*
+		 * Wait for difference between GetCurrentTimestamp() and delayUntil.
 		 */
 		msecs = TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
 delayUntil);
diff --git a/src/test/recovery/t/005_replay_delay.pl b/src/test/recovery/t/005_replay_delay.pl
index 0b56380e0a7..74f821b3248 100644
--- a/src/test/recovery/t/005_replay_delay.pl
+++ b/src/test/recovery/t/005_replay_delay.pl
@@ -7,7 +7,7 @@ use warnings;
 
 use PostgresNode;
 use TestLib;
-use Test::More tests => 3;
+use Test::More tests => 4;
 
 # Initialize primary node
 my $node_primary = PostgresNode->new('primary');
@@ -56,6 +56,30 @@ $node_standby->poll_query_until('postgres',
 ok(time() - $primary_insert_time >= $delay,
 	"standby applies WAL only after replication delay");
 
+# Now set the delay for the next commit to some obscenely high value.
+$node_standby->safe_psql('postgres', "ALTER SYSTEM SET recovery_min_apply_delay TO '2h';");
+$node_standby->reload;
+
+$node_primary->safe_psql('postgres',
+	"INSERT INTO tab_int VALUES (generate_series(21, 30))");
+
+# We should not have replayed the LSN from the last insert on the standby yet,
+# even in spite of it being received and flushed eventually. In other words, we
+# should be stuck in recovery_min_apply_delay.
+my $last_insert_lsn =
+	$node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+$node_primary->wait_for_catchup('standby', 'flush', $last_insert_lsn);
+is( $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn() < '$last_insert_lsn'::pg_lsn;"),
+	't', 'standby stuck in recovery_min_apply_delay');
+
+# Clear the recovery_min_apply_delay timeout so that the wait is immediately
+# cancelled and replay can proceed.
+$node_standby->safe_psql('postgres', "ALTER SYSTEM SET recovery_min_apply_delay TO 0;");
+$node_standby->reload;
+
+# Now the standby should catch up.
+$node_primary->wait_for_catchup('standby');
 
 # Check that recovery can be paused or resumed expectedly.
 my $node_standby2 = PostgresNode->new('standby2');
@@ -72,7 +96,7 @@ is( $node_standby2->safe_psql(
 # Request to pause recovery and wait until it's actually paused.
 $n

Re: Failed transaction statistics to measure the logical replication progress

2021-08-02 Thread Masahiko Sawada
On Tue, Aug 3, 2021 at 11:47 AM Amit Kapila  wrote:
>
> On Mon, Aug 2, 2021 at 1:13 PM Masahiko Sawada  wrote:
> >
> > On Mon, Aug 2, 2021 at 2:52 PM osumi.takami...@fujitsu.com
> >  wrote:
> > >
> > >
> > > Accordingly, I'm thinking to have unsuccessful and successful stats on 
> > > the sub side.
> > > Sawada-san is now implementing a new view in [1].
> > > Do you think that I should write a patch to introduce a new separate view
> > > or write a patch to add more columns to the new view 
> > > "pg_stat_subscription_errors" that is added at [1] ?
> >
> > pg_stat_subscriptions_errors view I'm proposing is a view showing the
> > details of error happening during logical replication. So I think a
> > separate view or pg_stat_subscription view would be a more appropriate
> > place.
> >
>
> +1 for having these stats in pg_stat_subscription. Do we want to add
> two columns (xact_commit: number of transactions successfully applied
> in this subscription, xact_rollback: number of transactions that have
> been rolled back in this subscription)

Sounds good. We might want to have separate counters for the number of
transactions failed due to an error and transactions rolled back by
stream_abort.

pg_stat_subscription currently shows logical replication worker stats kept
in shared memory. I think xact_commit and xact_rollback should survive
server restarts, so it would be better for them to be collected by the
stats collector. My skip-transaction patch adds a hash table whose entries
each represent one subscription's stats. I guess we can use that hash table
so that a single subscription stats entry holds both the transaction stats
and the errors.
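
Roughly something like this for the hash entry (field names here are
invented, not taken from the actual patch):

typedef struct PgStat_SubscriptionStats
{
	Oid			subid;			/* hash key: subscription OID */

	/* error stats, from the error-reporting patch */
	PgStat_Counter apply_error_count;

	/* transaction stats proposed in this thread */
	PgStat_Counter xact_commit;
	PgStat_Counter xact_error;			/* failed due to an error */
	PgStat_Counter xact_stream_abort;	/* rolled back by stream_abort */
} PgStat_SubscriptionStats;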

>  or do you guys have something else in mind?

Osumi-san was originally concerned that there is no way to grasp the
exact number and size of successful and unsuccessful transactions. The
above idea covers only the number of successful and unsuccessful
transactions but not the size. What do you think, Osumi-san?

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Failed transaction statistics to measure the logical replication progress

2021-08-02 Thread vignesh C
On Mon, Aug 2, 2021 at 1:13 PM Masahiko Sawada  wrote:
>
> On Mon, Aug 2, 2021 at 2:52 PM osumi.takami...@fujitsu.com
>  wrote:
> >
> > On Thursday, July 29, 2021 10:50 AM Masahiko Sawada  
> > wrote:
> > > On Thu, Jul 8, 2021 at 3:55 PM osumi.takami...@fujitsu.com
> > >  wrote:
> > > > When the current HEAD fails during logical decoding, the failure
> > > > increments txns count in pg_stat_replication_slots - [1] and adds
> > > > the transaction size to the sum of bytes in the same repeatedly on
> > > > the publisher, until the problem is solved.
> > > > One of the good examples is duplication error on the subscriber side
> > > > and this applies to both streaming and spill cases as well.
> > > >
> > > > This update prevents users from grasping the exact number and size
> > > > of successful and unsuccessful transactions. Accordingly, we need to
> > > > have new columns of failed transactions that will work to
> > > > differentiate both of them for all types, which means spill,
> > > > streaming and normal transactions. This will help users to measure
> > > > the exact status of logical replication.
> > >
> > > Could you please elaborate on use cases of the proposed statistics?
> > > For example, the current statistics on pg_replication_slots can be
> > > used for tuning logical_decoding_work_mem as well as inferring the
> > > total amount of bytes passed to the output plugin. How will the user use 
> > > those statistics?
> > >
> > > Also, if we want the stats of successful transactions why don't we
> > > show the stats of successful transactions in the view instead of ones
> > > of failed transactions?
> > It works to show the ratio of successful and unsuccessful transactions,
> > which should be helpful in terms of administrator perspective.
> > FYI, the POC patch added the columns where I prefixed 'failed' to those 
> > names.
> > But, substantially, it meant the ratio when user compared normal columns and
> > newly introduced columns by this POC in the pg_stat_replication_slots.
>
> What can the administrator use the ratio of successful and
> unsuccessful logical replication transactions for? For example, IIUC
> if a conflict happens on the subscriber as you mentioned, the
> successful transaction ratio calculated by those statistics is getting
> low, perhaps meaning the logical replication stopped. But it can be
> checked also by checking pg_stat_replication view or
> pg_replication_slots view (or error counts of
> pg_stat_subscription_errors view I’m proposing[1]). Do you have other
> use cases?

We could also include failed_data_size; this will help us to identify
the actual bandwidth consumed by failed transactions and help DBAs
understand the network consumption better. Currently only the total
transaction count and total data size are available, but when there is a
failure the failed transaction's data is sent repeatedly. If the DBA does
not fix the actual cause of the failure, there can be significant network
consumption from the failed transaction being resent, and the DBA will not
be able to understand why so much network bandwidth is being consumed. If we
provide the failed transaction information, the DBA need not be alarmed in
this case and can see that the network consumption is genuine. It will also
help monitoring tools to project this value.

Regards,
Vignesh




Re: Slow standby snapshot

2021-08-02 Thread Andrey Borodin
Hi,

> 3 авг. 2021 г., в 03:01, Andres Freund  написал(а):
> 
> Hi,
> 
> On 2021-08-03 00:07:23 +0300, Michail Nikolaev wrote:
>> The main idea is simple optimistic optimization - store offset to next
>> valid entry. So, in most cases, we could just skip all the gaps.
>> Of course, it adds some additional impact for workloads without long
>> (few seconds) transactions but it is almost not detectable (because of
>> CPU caches).
> 
> I'm doubtful that that's really the right direction. For workloads that
> are replay heavy we already often can see the cost of maintaining the
> known xids datastructures show up significantly - not surprising, given
> the datastructure. And for standby workloads with active primaries the
> cost of searching through the array in all backends is noticeable as
> well.  I think this needs a bigger data structure redesign.

KnownAssignedXids implements a simple membership test. What kind of redesign 
would you suggest? The proposed optimisation makes it close to optimal, but it 
still needs eventual compression.
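
(For reference, a minimal sketch of the skip-offset idea as I understand it;
names are invented:)

#include "postgres.h"
#include "access/transam.h"

typedef struct
{
	TransactionId xid;			/* InvalidTransactionId if slot was removed */
	int			skip;			/* slots to jump to reach the next valid entry */
} KnownXidSlot;

/* scan valid entries, hopping over removed ones */
static void
scan_known_xids(KnownXidSlot *slots, int tail, int head)
{
	int			i = tail;

	while (i < head)
	{
		if (TransactionIdIsValid(slots[i].xid))
		{
			/* membership test / snapshot copy happens here */
			i++;
		}
		else
			i += slots[i].skip;	/* skip the whole gap in one step */
	}
}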

Maybe use a hashtable of running transactions? It would be slightly faster when 
adding/removing single transactions, but much worse when doing 
KnownAssignedXidsRemove().

Maybe use a tree (AVL/RB or something like that)? It would be slightly better, 
because it does not need eventual compression like the existing array does.

Best regards, Andrey Borodin.



RE: Failed transaction statistics to measure the logical replication progress

2021-08-02 Thread osumi.takami...@fujitsu.com
On Tuesday, August 3, 2021 2:29 PM  Masahiko Sawada  
wrote:
> On Tue, Aug 3, 2021 at 11:47 AM Amit Kapila 
> wrote:
> >
> > On Mon, Aug 2, 2021 at 1:13 PM Masahiko Sawada
>  wrote:
> > >
> > > On Mon, Aug 2, 2021 at 2:52 PM osumi.takami...@fujitsu.com
> > >  wrote:
> > > >
> > > >
> > > > Accordingly, I'm thinking to have unsuccessful and successful stats on 
> > > > the
> sub side.
> > > > Sawada-san is now implementing a new view in [1].
> > > > Do you think that I should write a patch to introduce a new
> > > > separate view or write a patch to add more columns to the new view
> "pg_stat_subscription_errors" that is added at [1] ?
> > >
> > > pg_stat_subscriptions_errors view I'm proposing is a view showing
> > > the details of error happening during logical replication. So I
> > > think a separate view or pg_stat_subscription view would be a more
> > > appropriate place.
> > >
> >
> > +1 for having these stats in pg_stat_subscription. Do we want to add
> > two columns (xact_commit: number of transactions successfully applied
> > in this subscription, xact_rollback: number of transactions that have
> > been rolled back in this subscription)
> 
> Sounds good. We might want to have separate counters for the number of
> transactions failed due to an error and transactions rolled back by 
> stream_abort.
Okay. I want to make those separate as well for this feature.


> pg_stat_subscription currently shows logical replication worker stats on the
> shared memory. I think xact_commit and xact_rollback should survive beyond
> the server restarts, so it would be better to be collected by the stats 
> collector.
> My skipping transaction patch adds a hash table of which entry represents a
> subscription stats. I guess we can use the hash table so that one subscription
> stats entry has both transaction stats and errors.
> 
> >  or do you guys have something else in mind?
> 
> Osumi-san was originally concerned that there is no way to grasp the exact
> number and size of successful and unsuccessful transactions. The above idea
> covers only the number of successful and unsuccessful transactions but not the
> size. What do you think, Osumi-san?
Yeah, I think tracking the sizes of failed and rolled-back transactions
is helpful to identify the genuine network consumption,
as mentioned by Vignesh in another mail.
I'd like to include those as well and post a patch for this.

Best Regards,
Takamichi Osumi



Re: Remove unused 'len' from pg_stat_recv_* functions

2021-08-02 Thread Kyotaro Horiguchi
At Tue, 3 Aug 2021 12:40:23 +0900, Masahiko Sawada  
wrote in 
> Hi all,
> 
> While working on a patch adding new stats, houzj pointed out that
> 'len' function arguments of all pgstat_recv_* functions are not used
> at all. Looking at git history, pgstat_recv_* functions have been
> having ‘len’ since the stats collector was introduced by commit
> 140ddb78fe 20 years ago but it was not used at all even in the first
> commit. It seems like the improvements so far for the stats collector
> had pgstat_recv_* function have ‘len’ for consistency with the
> existing pgstat_recv_* functions. Is there any historical reason for
> having 'len' argument? Or can we remove them?
> 
> I've attached the patch that removes 'len' from all pgstat_recv_* functions.

At first look I thought that giving "len" as a parameter was reasonable for
message-processing functions, but the given message struct contains the same
value, and the functions can refer to the message length without the
parameter if they want.  So I'm +-0 on the removal.

It applies cleanly on the master and compiled without an error.

That being said, I'm not sure it is worthwhile to change the parameters of
functions that are going to be removed anyway (if the shared-memory stats
collector is successfully introduced).

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


RE: Added schema level support for publication.

2021-08-02 Thread tanghy.f...@fujitsu.com
On Monday, August 2, 2021 11:40 PM vignesh C wrote:
> 
> Thanks for the comments, attached v17 patches has the fixes for the same.

Thanks for your new patch.

I saw the following warning when compiling. It seems we don't need this 
variable any more.

publicationcmds.c: In function ‘AlterPublicationSchemas’:
publicationcmds.c:592:15: warning: unused variable ‘oldlc’ [-Wunused-variable]
   ListCell   *oldlc;
   ^

Regards
Tang


Re: Changes to recovery_min_apply_delay are ignored while waiting for delay

2021-08-02 Thread Kyotaro Horiguchi
At Mon, 2 Aug 2021 22:21:56 -0700, Soumyadeep Chakraborty 
 wrote in 
> Hello,
> 
> I came across this issue while I was tweaking a TAP test with Ashwin for
> this thread [1].
> 
> We noticed that while the startup process waits for a recovery delay, it does
> not respect changes to the recovery_min_apply_delay GUC. So:
> 
> // Standby is waiting on recoveryWakeupLatch with a large
> recovery_min_apply_delay value
> node_standby->safe_psql('postgres', 'ALTER SYSTEM SET
> recovery_min_apply_delay TO 0;');
> node_standby->reload;
> // Standby is still waiting, it does not proceed until the old timeout
> is elapsed.
> 
> I attached a small patch to fix this. It makes it so that
> recovery_min_apply_delay is reconsulted in the loop to redetermine the wait
> interval. This makes it similar to the loop in CheckpointerMain, where
> CheckPointTimeout is consulted after interrupts are handled:
> 
> if (elapsed_secs >= CheckPointTimeout)
> continue; /* no sleep for us ... */
> 
> I have also added a test to 005_replay_delay.pl.
> 
> Regards,
> Soumyadeep(VMware)
> 
> [1] 
> https://www.postgresql.org/message-id/flat/CANXE4Tc3FNvZ_xAimempJWv_RH9pCvsZH7Yq93o1VuNLjUT-mQ%40mail.gmail.com

Sounds reasonable, and the patch looks fine as a whole. It applied cleanly
and works as expected; the added test properly catches the issue.
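
Reduced to a minimal standalone sketch (hypothetical names; the real
startup code waits on recoveryWakeupLatch and handles interrupts rather
than sleeping), the pattern is to recompute the deadline from the current
GUC value on every pass through the loop, so a reload takes effect
immediately:

#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Stand-in for the recovery_min_apply_delay GUC (milliseconds). */
static int	min_apply_delay_ms = 5000;

static void
handle_interrupts(void)
{
	/* In the real startup process a SIGHUP reload would update the GUC
	 * here; to keep the sketch self-contained we just drop it to zero. */
	min_apply_delay_ms = 0;
}

static void
apply_delay(time_t record_time)
{
	for (;;)
	{
		/* Recompute the deadline each iteration so a reload is honored. */
		time_t		deadline = record_time + min_apply_delay_ms / 1000;

		if (time(NULL) >= deadline)
			return;

		sleep(1);				/* the real code waits on recoveryWakeupLatch */
		handle_interrupts();
	}
}

int
main(void)
{
	apply_delay(time(NULL));
	printf("apply delay satisfied\n");
	return 0;
}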

One comment from me.

+$node_standby->safe_psql('postgres', "ALTER SYSTEM SET 
recovery_min_apply_delay TO 0;");

It might be better to do "SET reco.. TO DEFAULT" instead.

And how about adding the new test at the end of the existing tests? That
way we can get rid of the extra lines in the diff.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: when the startup process doesn't (logging startup delays)

2021-08-02 Thread Nitin Jadhav
> Thanks. Can you please have a look at what I suggested down toward the
> bottom of 
> http://postgr.es/m/ca+tgmoap2wefsktmcgwt9lxuz7y99hnduyshpk7qnfuqb98...@mail.gmail.com
?

I have implemented the above approach and verified the patch. Kindly have a
look and share your thoughts.
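
To recap the shape under discussion (this is only an illustration with
hypothetical names; the attached v9 patch is authoritative): a single
helper decides whether another progress line is due, and each call site
supplies its own message, so no enum mapping operations to message
strings is needed.

#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical names, for illustration only. */
static time_t	phase_start;
static time_t	last_report;
static int		log_startup_progress_interval = 10;	/* seconds */

static void
begin_startup_progress_phase(void)
{
	phase_start = time(NULL);
	last_report = phase_start;
}

/* Returns true when it is time to emit another progress line, and
 * reports the elapsed time since the current phase started. */
static bool
startup_progress_timeout_has_expired(long *elapsed_secs)
{
	time_t		now = time(NULL);

	if (now - last_report < log_startup_progress_interval)
		return false;
	last_report = now;
	*elapsed_secs = (long) (now - phase_start);
	return true;
}

int
main(void)
{
	long		secs;

	begin_startup_progress_phase();
	/* ... each call site owns its own message text ... */
	if (startup_progress_timeout_has_expired(&secs))
		printf("resetting unlogged relations (init), elapsed time: %ld s\n",
			   secs);
	return 0;
}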

Thanks & Regards,
Nitin Jadhav

On Fri, Jul 30, 2021 at 10:40 AM Nitin Jadhav
 wrote:
>
> > Thanks. Can you please have a look at what I suggested down toward the
> > bottom of 
> > http://postgr.es/m/ca+tgmoap2wefsktmcgwt9lxuz7y99hnduyshpk7qnfuqb98...@mail.gmail.com
> > ?
> >
> > If we're going to go the route of combining the functions, I agree
> > that a Boolean is the way to go; I think we have some existing
> > precedent for 'bool finished' rather than 'bool done'.
> >
> > But I kind of wonder if we should have an enum in the first place. It
> > feels like we've got code in a bunch of places that just exists to
> > decide which enum value to use, and then code someplace else that
> > turns around and decides which message to produce. That would be
> > sensible if we were using the same enum values in lots of places, but
> > that's not the case here. So suppose we just moved the messages to the
> > places where we're now calling LogStartupProgress() or
> > LogStartupProgressComplete()? So something like this:
>
> Sorry. I thought it was related to the discussion of deciding whether
> LogStartupProgress() and LogStartupProgressComplete() should be
> combined or not. I feel it's a really nice design. With this we avoid
> an "action at a distance" issue and it's easy to use. If we were
> reporting the same kind of message in multiple places, then the
> current approach of using an enum would be more suitable, since we
> wouldn't have to worry about keeping the log message strings in sync.
> But in the current scenario we are not using the same kind of message
> in multiple places (and I feel such a scenario will not occur in the
> future either; even a similar operation can be distinguished, the way
> resetting unlogged relations is distinguished by init and clean.
> Kindly mention it if you can foresee any such scenario), hence the
> approach you are suggesting is the best fit.
>
> Thanks & Regards,
> Nitin Jadhav
>
> On Thu, Jul 29, 2021 at 9:48 PM Robert Haas  wrote:
> >
> > On Thu, Jul 29, 2021 at 4:56 AM Nitin Jadhav
> >  wrote:
> > > Thanks for sharing the information. I have done the necessary changes
> > > to show the logs during the latter case (postgres --single) and
> > > verified the log messages.
> >
> > Thanks. Can you please have a look at what I suggested down toward the
> > bottom of 
> > http://postgr.es/m/ca+tgmoap2wefsktmcgwt9lxuz7y99hnduyshpk7qnfuqb98...@mail.gmail.com
> > ?
> >
> > --
> > Robert Haas
> > EDB: http://www.enterprisedb.com


v9-0001-startup-process-progress.patch
Description: Binary data


Re: Lowering the ever-growing heap->pd_lower

2021-08-02 Thread Simon Riggs
On Tue, 18 May 2021 at 20:33, Peter Geoghegan  wrote:
>
> On Tue, May 18, 2021 at 12:29 PM Matthias van de Meent
>  wrote:
> > PFA the updated version of this patch. Apart from adding line pointer
> > truncation in PageRepairFragmentation (as in the earlier patches), I
> > also altered PageTruncateLinePointerArray to clean up all trailing
> > line pointers, even if it was the last item on the page.
>
> Can you show a practical benefit to this patch, such as an improvement
> > in throughput or in efficiency for a given workload?
>
> It was easy to see that having something was better than having
> nothing at all. But things are of course different now that we have
> PageTruncateLinePointerArray().

There does seem to be utility in Matthias' patch, which currently does
two things:
1. Allow the same thing as PageTruncateLinePointerArray() during HOT
cleanup (see the sketch at the end of this mail for what that truncation
does).
That is going to have a clear benefit for HOT workloads, which by
their nature will use a lot of line pointers.
Many applications are updated much more frequently than they are vacuumed.
Peter - what is your concern about doing this more frequently? Why
would we *not* do this?

2. Reduce the number of line pointers to 0 in some cases.
Matthias - I don't think you've made a full case for doing this, nor
looked at the implications.
The comment clearly says "it seems like a good idea to avoid leaving a
PageIsEmpty() page behind".
So I would be inclined to remove that part from the patch and consider
it as a separate issue, or close this.
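
For readers following along, here is a heavily simplified standalone
model of the truncation in question (toy structs, not the real bufpage.h
definitions): trailing unused line pointers are trimmed by pulling
pd_lower back, while keeping at least one item so the page never ends up
looking PageIsEmpty().

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define LP_UNUSED	0
#define LP_NORMAL	1
#define MAX_ITEMS	16

/* Toy page layout: a small header followed by the line pointer array. */
typedef struct ToyItemId
{
	uint8_t		lp_flags;
} ToyItemId;

typedef struct ToyPage
{
	uint16_t	pd_lower;		/* byte offset just past the last line pointer */
	ToyItemId	items[MAX_ITEMS];
} ToyPage;

#define HEADER_SIZE		offsetof(ToyPage, items)

static void
truncate_line_pointer_array(ToyPage *page)
{
	size_t		nitems = (page->pd_lower - HEADER_SIZE) / sizeof(ToyItemId);

	/* Trim trailing unused entries, but keep at least one item so the
	 * page does not become PageIsEmpty(). */
	while (nitems > 1 && page->items[nitems - 1].lp_flags == LP_UNUSED)
		nitems--;

	page->pd_lower = (uint16_t) (HEADER_SIZE + nitems * sizeof(ToyItemId));
}

int
main(void)
{
	ToyPage		page;

	memset(&page, 0, sizeof(page));
	page.items[0].lp_flags = LP_NORMAL;
	page.items[1].lp_flags = LP_UNUSED;
	page.items[2].lp_flags = LP_UNUSED;
	page.pd_lower = HEADER_SIZE + 3 * sizeof(ToyItemId);

	truncate_line_pointer_array(&page);
	printf("pd_lower now covers %zu line pointers\n",
		   (page.pd_lower - HEADER_SIZE) / sizeof(ToyItemId));
	return 0;
}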

-- 
Simon Riggs    http://www.EnterpriseDB.com/