Re: Using each rel as both outer and inner for JOIN_ANTI

2023-04-07 Thread Richard Guo
On Tue, Aug 2, 2022 at 3:13 PM Richard Guo  wrote:

> On Sun, Jul 31, 2022 at 12:07 AM Tom Lane  wrote:
>
>> [ wanders away wondering if JOIN_RIGHT_SEMI should become a thing ... ]
>
> Maybe this is something we can do. Currently for the query below:
>
> # explain select * from foo where a in (select c from bar);
>                              QUERY PLAN
> ------------------------------------------------------------------------
>  Hash Semi Join  (cost=154156.00..173691.29 rows=10 width=8)
>Hash Cond: (foo.a = bar.c)
>->  Seq Scan on foo  (cost=0.00..1.10 rows=10 width=8)
>->  Hash  (cost=72124.00..72124.00 rows=500 width=4)
>  ->  Seq Scan on bar  (cost=0.00..72124.00 rows=500 width=4)
> (5 rows)
>
> I believe we can get a cheaper plan if we are able to swap the outer and
> inner for a SEMI JOIN and use the smaller 'foo' as the inner rel.
>

I'm thinking about the JOIN_RIGHT_SEMI idea, and it seems that it can be
implemented for HashJoin with a very small change.  What we want to do is
to emit just the first match for each inner tuple.  So after scanning
the hash bucket for matches, we just need to check whether the inner
tuple has already been marked as matched and skip it if so, something like

  {
  if (!ExecScanHashBucket(node, econtext))
  {
  /* out of matches; check for possible outer-join fill */
  node->hj_JoinState = HJ_FILL_OUTER_TUPLE;
  continue;
  }
  }

+ /*
+  * In a right-semijoin, we only need the first match for each
+  * inner tuple.
+  */
+ if (node->js.jointype == JOIN_RIGHT_SEMI &&
+ HeapTupleHeaderHasMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple)))
+ continue;
+

I have a simple implementation locally and tried it with the query below,
seeing a speedup from 2055.617 ms to 1156.772 ms (both best of 3).

# explain (costs off, analyze)
select * from foo where a in (select c from bar);
                                 QUERY PLAN
------------------------------------------------------------------------------
 Hash Semi Join (actual time=1957.748..2055.058 rows=10 loops=1)
   Hash Cond: (foo.a = bar.c)
   ->  Seq Scan on foo (actual time=0.026..0.029 rows=10 loops=1)
   ->  Hash (actual time=1938.818..1938.819 rows=500 loops=1)
 Buckets: 262144  Batches: 64  Memory Usage: 4802kB
         ->  Seq Scan on bar (actual time=0.016..853.010 rows=500 loops=1)
 Planning Time: 0.327 ms
 Execution Time: 2055.617 ms
(8 rows)

# explain (costs off, analyze)
select * from foo where a in (select c from bar);
                                 QUERY PLAN
------------------------------------------------------------------------------
 Hash Right Semi Join (actual time=11.525..1156.713 rows=10 loops=1)
   Hash Cond: (bar.c = foo.a)
   ->  Seq Scan on bar (actual time=0.034..523.036 rows=500 loops=1)
   ->  Hash (actual time=0.027..0.029 rows=10 loops=1)
 Buckets: 1024  Batches: 1  Memory Usage: 9kB
 ->  Seq Scan on foo (actual time=0.009..0.014 rows=10 loops=1)
 Planning Time: 0.312 ms
 Execution Time: 1156.772 ms
(8 rows)

It may not be easy for MergeJoin and NestLoop though, as we do not have
a way to know whether an inner tuple has already been matched.  But
the benefit of swapping inputs for MergeJoin and NestLoop seems to be
small, so I think it's OK to ignore them.
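For illustration, the intended first-match-only behavior can be simulated
outside the executor — a toy Python sketch (names and data invented; the
per-entry flag stands in for the HeapTupleHeaderHasMatch() bit on real
hash-join tuples):

```python
def hash_right_semi_join(probe_rows, hash_rows, key=lambda r: r):
    """Emit each hashed-side row at most once, on its first match."""
    # Build the hash table on the smaller input (the semi-join's outer
    # rel, e.g. foo), tracking a per-tuple "matched" flag.
    table = {}
    for r in hash_rows:
        table.setdefault(key(r), []).append([r, False])

    result = []
    for p in probe_rows:            # one sequential pass over the big input
        for entry in table.get(key(p), []):
            if entry[1]:
                continue            # already matched: first match suffices
            entry[1] = True
            result.append(entry[0])
    return result

# foo = [2, 3, 4] is hashed; bar = [1, 2, 2, 3, 3, 3] is probed.
assert hash_right_semi_join([1, 2, 2, 3, 3, 3], [2, 3, 4]) == [2, 3]
```

Duplicate probe values hit an already-matched entry and are skipped, which
is exactly what the `continue` in the patch fragment above achieves.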

So, is it worthwhile to make JOIN_RIGHT_SEMI a reality?

Thanks
Richard


Re: CREATE SUBSCRIPTION -- add missing tab-completes

2023-04-07 Thread Masahiko Sawada
On Fri, Apr 7, 2023 at 2:28 PM Amit Kapila  wrote:
>
> On Wed, Apr 5, 2023 at 5:58 AM Peter Smith  wrote:
> >
> > There are some recent commits that added new options for CREATE SUBSCRIPTION.
> >
> ...
> > PSA patches to add those tab completions.
> >
>
> LGTM, so pushed. BTW, while looking at this, I noticed that the newly
> added options "password_required" and "run_as_owner" have their
> datatype incorrectly mentioned as string in the docs. It should be
> boolean.

+1

> I think "password_required" belongs to the first section of the docs,
> which says: "The following parameters control what happens during
> subscription creation".

But the documentation of ALTER SUBSCRIPTION says:

The parameters that can be altered are slot_name, synchronous_commit,
binary, streaming, disable_on_error, password_required, run_as_owner,
and origin. Only a superuser can set password_required = false.

ISTM that both password_required and run_as_owner are parameters to
control the subscription's behavior, like disable_on_error and
streaming. So it looks good to me that password_required belongs to
the second section.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Minimal logical decoding on standbys

2023-04-07 Thread Andres Freund
Hi,

On 2023-04-07 08:09:50 +0200, Drouvot, Bertrand wrote:
> Hi,
>
> On 4/7/23 7:56 AM, Andres Freund wrote:
> > Hi,
> >
> > On 2023-04-07 07:02:04 +0200, Drouvot, Bertrand wrote:
> > > Done in V63 attached and did change the associated comment a bit.
> >
> > Can you send your changes incrementally, relative to V62? I'm polishing them
> > right now, and that'd make it a lot easier to apply your changes ontop.
> >
>
> Sure, please find them enclosed.

Thanks.


Here's my current working state - I'll go to bed soon.

Changes:

- shared catalog relations weren't handled correctly, because the dboid is
  InvalidOid for them. I wrote a test for that as well.

- ReplicationSlotsComputeRequiredXmin() took invalidated logical slots into
  account (ReplicationSlotsComputeLogicalRestartLSN() too, but it never looks
  at logical slots)

- I don't think the subset of slot xids that were checked when invalidating
  was right. We need to check effective_xmin and effective_catalog_xmin - the
  latter was using catalog_xmin.

- similarly, it wasn't right that specifically those two fields were
  overwritten when invalidated - as that was done, I suspect the changes might
  get lost on a restart...

- As mentioned previously, I did not like all the functions in slot.h, nor
  their naming. Not yet quite finished with that, but a good bit further

- There were a lot of unrelated changes, e.g. removing comments like
 * NB - this runs as part of checkpoint, so avoid raising errors if possible.

- I still don't like the order of the patches, fixing the walsender patches
  after introducing support for logical decoding on standby. Reordered.

- I don't think invalidation of logical slots is checked everywhere it
  should be, e.g. in pg_logical_replication_slot_advance()

- I didn't like much that InvalidatePossiblyObsoleteSlot() switched between
  kill() and SendProcSignal() based on the "conflict". There very well could
  be reasons to use InvalidatePossiblyObsoleteSlot() with an xid from outside
  of the startup process in the future. Instead I made it differentiate based
  on MyBackendType == B_STARTUP.


I also:

Added new patch that replaces invalidated_at with a new enum, 'invalidated',
listing the reason for the invalidation. I added a check for !invalidated to
ReplicationSlotsComputeRequiredLSN() etc.

Added new patch moving checks for invalid logical slots into
CreateDecodingContext(). Otherwise we end up with 5 or so checks, which makes
no sense. As far as I can tell the old message in
pg_logical_slot_get_changes_guts() was bogus, one couldn't get there having
"never previously reserved WAL"

Split "Handle logical slot conflicts on standby." into two. I'm not sure that
should stay that way, but it made it easier to hack on
InvalidateObsoleteReplicationSlots.


Todo:
- write a test that invalidated logical slots stay invalidated across a restart
- write a test that invalidated logical slots do not lead to retaining WAL
- Further evolve the API of InvalidateObsoleteReplicationSlots()
  - pass in the ReplicationSlotInvalidationCause we're trying to conflict on?
  - rename xid to snapshotConflictHorizon, that'd be more in line with the
ResolveRecoveryConflictWithSnapshot and easier to understand, I think

- The test could stand a bit of cleanup and consolidation
  - No need to start 4 psql processes to do 4 updates, just do it in one
safe_psql()
  - the sequence of drop_logical_slots(), create_logical_slots(),
change_hot_standby_feedback_and_wait_for_xmins(), make_slot_active() is
repeated quite a few times
  - the stats queries checking for specific conflict counts, including
preceding tests, is pretty painful. I suggest to reset the stats at the
end of the test instead (likely also do the drop_logical_slot() there).
  - it's hard to correlate postgres log and the tap test, because the slots
are named the same across all tests. Perhaps they could have a per-test
prefix?
  - numbering tests is a PITA, I had to renumber the later ones, when adding a
test for shared catalog tables


My attached version does include your v62-63 incremental changes.

Greetings,

Andres Freund
>From 1e5461e0019678a92192b0dd5d9bf3f7105f504d Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Thu, 6 Apr 2023 20:00:07 -0700
Subject: [PATCH va65 1/9] replication slots: replace invalidated_at LSN with
 an enum

---
 src/include/replication/slot.h  | 15 +--
 src/backend/replication/slot.c  | 21 ++---
 src/backend/replication/slotfuncs.c |  8 +++-
 3 files changed, 34 insertions(+), 10 deletions(-)

diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 8872c80cdfe..793f0701b88 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -37,6 +37,17 @@ typedef enum ReplicationSlotPersistency
 	RS_TEMPORARY
 } ReplicationSlotPersistency;
 
+/*
+ * Slots can be invalidated, e.g. due to max_slot_wal_keep_size. If so, the
+ * 'invalidated' field is se

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2023-04-07 Thread Denis Laxalde

The patch set does not apply any more.

I tried to rebase locally; even leaving out patch 1 ("libpq: Run pgindent
after a9e9a9f32b3"), patch 4 ("Start using new libpq cancel APIs") is
harder to resolve following 983ec23007b (I suppose).


Apart from that, the implementation in v19 sounds good to me, and seems
worthwhile. FWIW, as said before, I also implemented it in Psycopg as a
sort of end-to-end validation.





Re: refactoring relation extension and BufferAlloc(), faster COPY

2023-04-07 Thread Andres Freund
Hi,

On 2023-04-06 18:15:14 -0700, Andres Freund wrote:
> I think it might be worth having a C test for some of the bufmgr.c API. Things
> like testing that retrying a failed relation extension works the second time
> round.

A few hours after this I hit a stupid copy-pasto (21d7c05a5cf) that would
hopefully have been uncovered by such a test...

I guess we could even test this specific instance without a more complicated
framework.  Create table with some data, rename the file, checkpoint - should
fail, rename back, checkpoint - should succeed.

It's much harder to exercise the error paths inside the backend extending the
relation unfortunately, because we require the file to be opened rw before
doing much. And once the FD is open, removing the permissions doesn't help.
The least complicated approach I can see is creating directory quotas, but
that's quite file system specific...

Greetings,

Andres Freund




RE: Fix the description of GUC "max_locks_per_transaction" and "max_pred_locks_per_transaction" in guc_table.c

2023-04-07 Thread wangw.f...@fujitsu.com
On Tue, Apr 4, 2023 at 11:48 PM Tom Lane  wrote:
> Nathan Bossart  writes:
> > On Wed, Feb 22, 2023 at 12:40:07PM +, wangw.f...@fujitsu.com wrote:
> >> After some rethinking, I think users can easily get the exact value
> >> according to the exact formula, and I think using the accurate formula
> >> can help users adjust max_locks_per_transaction or
> >> max_predicate_locks_per_transaction if needed. So, I used the exact
> >> formulas in the attached v2 patch.
> 
> > IMHO this is too verbose.
> 
> Yeah, it's impossibly verbose.  Even the current wording does not fit
> nicely in pg_settings output.
> 
> > Perhaps it could be simplified to something like
> > The shared lock table is sized on the assumption that at most
> > max_locks_per_transaction objects per eligible process or prepared
> > transaction will need to be locked at any one time.
> 
> I like the "per eligible process" wording, at least for guc_tables.c;
> or maybe it could be "per server process"?  That would be more
> accurate and not much longer than what we have now.
> 
> I've got mixed emotions about trying to put the exact formulas into
> the SGML docs either.  Space isn't such a constraint there, but I
> think the info would soon go out of date (indeed, I think the existing
> wording was once exactly accurate), and I'm not sure it's worth trying
> to maintain it precisely.

Thanks both for sharing your opinions.
I agree that verbose descriptions make maintenance difficult.
For consistency, I unified the formulas in guc_tables.c and pg-doc into the
same suggested short formula. Attached is the new patch.

> One reason that I'm not very excited about this is that in fact the
> formula seen in the source code is not exact either; it's a lower
> bound for how much space will be available.  That's because we throw
> in 100K slop at the bottom of the shmem sizing calculation, and a
> large chunk of that remains available to be eaten by the lock table
> if necessary.

Thanks for sharing this.
Since no one has reported related issues, I'm also fine with closing this
entry if this modification is not necessary.
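As an illustration of the sizing assumption being discussed, here is a
simplified sketch (illustrative only; the real shared-memory sizing also
folds in roughly 100K of slop, as noted above, so this is a lower bound):

```python
def lock_table_entries(max_locks_per_xact, max_backends, max_prepared_xacts):
    # The shared lock table is sized assuming at most
    # max_locks_per_transaction locks per eligible process or prepared
    # transaction; the effective limit is a bit higher because leftover
    # shmem slop can also be consumed by the lock table.
    return max_locks_per_xact * (max_backends + max_prepared_xacts)

# Example: default 64 locks/xact, 100 backends, no prepared transactions.
assert lock_table_entries(64, 100, 0) == 6400
```

This is the "per eligible process or prepared transaction" wording restated
as arithmetic, not the exact formula from the source.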

Regards,
Wang Wei


v3-0001-Fix-the-description-of-shared-lock-table-size-and.patch
Description:  v3-0001-Fix-the-description-of-shared-lock-table-size-and.patch


Re: CREATE SUBSCRIPTION -- add missing tab-completes

2023-04-07 Thread Amit Kapila
On Fri, Apr 7, 2023 at 1:12 PM Masahiko Sawada  wrote:
>
> On Fri, Apr 7, 2023 at 2:28 PM Amit Kapila  wrote:
> >
> > On Wed, Apr 5, 2023 at 5:58 AM Peter Smith  wrote:
> > >
> >
> > LGTM, so pushed. BTW, while looking at this, I noticed that the newly
> > added options "password_required" and "run_as_owner" have their
> > datatype incorrectly mentioned as string in the docs. It should be
> > boolean.
>
> +1
>
> > I think "password_required" belongs to the first section of the docs,
> > which says: "The following parameters control what happens during
> > subscription creation".
>
> But the documentation of ALTER SUBSCRIPTION says:
>
> The parameters that can be altered are slot_name, synchronous_commit,
> binary, streaming, disable_on_error, password_required, run_as_owner,
> and origin. Only a superuser can set password_required = false.
>

By the above, do you intend to say that all the parameters that can be
altered are in the second list? If so, slot_name belongs to the first
category.

> ISTM that both password_required and run_as_owner are parameters to
> control the subscription's behavior, like disable_on_error and
> streaming. So it looks good to me that password_required belongs to
> the second section.
>

Do you mean that because 'password_required' is used each time we make
a connection to a publisher during replication, it should be in the
second category? If so, slot_name is also used each time replication
starts.

BTW, do we need to check one or both of these parameters in
maybe_reread_subscription(), where we "Exit if any parameter that
affects the remote connection was changed"?

-- 
With Regards,
Amit Kapila.




RE: Partial aggregates pushdown

2023-04-07 Thread fujii.y...@df.mitsubishielectric.co.jp
Hi Mr. Momjian,

> First, my apologies for not addressing this sooner.  I was so focused on my
> own tasks that I didn't realize this very important patch was not getting
> attention.  I will try my best to get it into PG 17.
Thank you very much for your comments.
I will improve this patch for PG 17.
I believe that this patch will help us use PostgreSQL's built-in sharding
for OLAP.

> What amazes me is that you didn't need to create _any_ actual aggregate
> functions.  Rather, you just needed to hook existing functions into the
> aggregate tables for partial FDW execution.
Yes. This patch enables partial aggregate pushdown using only existing
functions that belong to existing aggregate functions and are needed by
parallel query (such as the state transition function and the
serialization function).
It needs neither new kinds of functions attached to aggregate functions
nor new functions added to existing aggregate functions.

> I suggest we remove the version check requirement --- instead just document
> that the FDW Postgres version should be the same or newer than the calling
> Postgres server --- that way, we can assume that whatever is in the system
> catalogs of the caller is in the receiving side.  
Thanks for the comment. I will modify this patch according to your comment.

> We should add a GUC to turn off
> this optimization for cases where the FDW Postgres version is older than the
> caller.  This handles case 1-2.
Thanks for the advice here too.
I thought it would be more appropriate to add a foreign server option to
postgres_fdw rather than adding a GUC.
Would you mind if I ask what you think about that?

> > 2. Automation of creating the definition of partialaggfuncs
> > In development of v17, I manually created the definitions of
> > partialaggfuncs for avg, min, max, sum, and count.
> > I am concerned that this may be undesirable.
> > So I am thinking that v17 should be modified to automate creating
> > definition of partialaggfuncs for all built-in aggregate functions.
> 
> Are there any other builtin functions that need this?  I think we can just
> provide documention for extensions on how to do this.
For practical purposes, it is sufficient
if partial aggregates for the above functions can be pushed down.
I think you are right; it would be sufficient to document how to achieve
partial aggregate pushdown for other built-in functions.

> > 3. Documentation
> > I need to add an explanation of partialaggfunc to the documentation of
> > postgres_fdw and other places.
> 
> I can help with that once we decide on the above.
Thank you. In the next version of this patch, I will add documentation on
postgres_fdw and other places.

> I think 'partialaggfn' should be named 'aggpartialfn' to match other
> columns in pg_aggregate.
Thanks for the comment. I will modify this patch according to your comment.

> For case 3, I don't even know how much pushdown those do of _any_
> aggregates to non-PG servers, let alone parallel FDW ones.  Does anyone
> know the details?
To allow partial aggregate pushdown for non-PG FDWs,
I think we need to add pushdown logic to each such FDW, per function.
For example, for avg we need to add the rewrite avg() -> sum()/count().
To allow parallel partial aggregate by non-PG FDWs,
I think we need to add FDW Routines for Asynchronous Execution to their FDWs[1].

> I am confused by these changes to pg_aggegate:
> 
> +{ aggfnoid => 'sum_p_int8', aggtransfn => 'int8_avg_accum',
> +  aggfinalfn => 'int8_avg_serialize', aggcombinefn =>
> +'int8_avg_combine',
> +  aggserialfn => 'int8_avg_serialize', aggdeserialfn =>
> +'int8_avg_deserialize',
> +  aggtranstype => 'internal', aggtransspace => '48' },
> 
> ...
> 
> +{ aggfnoid => 'sum_p_numeric', aggtransfn => 'numeric_avg_accum',
> +  aggfinalfn => 'numeric_avg_serialize', aggcombinefn =>
> +'numeric_avg_combine',
> +  aggserialfn => 'numeric_avg_serialize',
> +  aggdeserialfn => 'numeric_avg_deserialize',
> +  aggtranstype => 'internal', aggtransspace => '128' },
> 
> Why are these marked as 'sum' but use 'avg' functions?
The reason is that sum(int8)/sum(numeric) share some functions with
avg(int8)/avg(numeric), and sum_p_int8 is the aggpartialfn of sum(int8)
while sum_p_numeric is the aggpartialfn of sum(numeric).

--Part of avg(int8) in BKI file in PostgreSQL15.0[2].
{ aggfnoid => 'avg(int8)', aggtransfn => 'int8_avg_accum',
  aggfinalfn => 'numeric_poly_avg', aggcombinefn => 'int8_avg_combine',
  aggserialfn => 'int8_avg_serialize', aggdeserialfn => 'int8_avg_deserialize',
  aggmtransfn => 'int8_avg_accum', aggminvtransfn => 'int8_avg_accum_inv',
  aggmfinalfn => 'numeric_poly_avg', aggtranstype => 'internal',
  aggtransspace => '48', aggmtranstype => 'internal', aggmtransspace => '48' },
--

--Part of sum(int8) in BKI file in PostgreSQL15.0[2].
{ aggfnoid => 'sum(int8)', aggtransfn => 'int8_avg_accum',
  aggfinalfn => 'numeric_poly_sum', aggcombinefn => 'int8_avg_combine',
  aggserialfn => 'int8_avg_serialize', aggdeserialfn => 'i

Re: Initial Schema Sync for Logical Replication

2023-04-07 Thread Amit Kapila
On Thu, Apr 6, 2023 at 6:57 PM Masahiko Sawada  wrote:
>
> On Thu, Mar 30, 2023 at 10:11 PM Masahiko Sawada  
> wrote:
> >
> > On Thu, Mar 30, 2023 at 12:18 AM Masahiko Sawada  
> > wrote:
> > >
> > >
> > > How can we postpone creating the pg_subscription_rel entries until the
> > > tablesync worker starts and does the schema sync? I think that since
> > > pg_subscription_rel entry needs the table OID, we need either to do
> > > the schema sync before creating the entry (i.e, during CREATE
> > > SUBSCRIPTION) or to postpone creating entries as Amit proposed[1]. The
> > > apply worker needs the information of tables to sync in order to
> > > launch the tablesync workers, but it needs to create the table schema
> > > to get that information.
> >
> > For the above reason, I think that step 6 of the initial proposal won't 
> > work.
> >
> > If we can have the tablesync worker create an entry of
> > pg_subscription_rel after creating the table, it may give us the
> > flexibility to perform the initial sync. One idea is that we add a
> > relname field to pg_subscription_rel so that we can create entries
> > with relname instead of OID if the table is not created yet. Once the
> > table is created, we clear the relname field and set the OID of the
> > table instead. It's not an ideal solution but we might make it simpler
> > later.
>
> While writing a PoC patch, I found some difficulties in this idea.
> First, I tried to add schemaname+relname to pg_subscription_rel but I
> could not define the primary key of pg_subscription_rel. The primary
> key on (srsubid, srrelid) doesn't work since srrelid could be NULL.
> Similarly, the primary key on (srsubid, srrelid, schemaname, relname)
> also doesn't work.
>

Can we think of having a separate catalog table say
pg_subscription_remote_rel for this? You can have srsubid,
remote_schema_name, remote_rel_name, etc. We may need some other state
to be maintained during the initial schema sync where this table can
be used. Basically, this can be used to maintain the state till the
initial schema sync is complete because we can create a relation entry
in pg_subscription_rel only after the initial schema sync is
complete.

> So I tried another idea: that we generate a new OID
> for srrelid and the tablesync worker will replace it with the new
> table's OID once it creates the table. However, since we use srrelid
> in replication slot names, changing srrelid during the initial
> schema+data sync is not straightforward (please note that the slot is
> created by the tablesync worker but is removed by the apply worker).
> Using relname in slot name instead of srrelid is not a good idea since
> it requires all pg_subscription_rel entries have relname, and slot
> names could be duplicated, for example, when the relname is very long
> and we cut it.
>
> I'm trying to consider the idea from another angle: the apply worker
> fetches the table list and passes the relname to the tablesync worker.
> But a problem of this approach is that the table list is not
> persisted. If the apply worker restarts during the initial table sync,
> it could not get the same list as before.
>

Agreed, this has some drawbacks. We can try to explore this if the
above idea of the new catalog table doesn't solve this problem.


-- 
With Regards,
Amit Kapila.




RE: [PoC] pg_upgrade: allow to upgrade publisher node

2023-04-07 Thread Hayato Kuroda (Fujitsu)
Dear Julien,

Thank you for giving comments!

> As I mentioned in my original thread, I'm not very familiar with that code, 
> but
> I'm a bit worried about "all the changes generated on publisher must be send
> and applied".  Is that a hard requirement for the feature to work reliably?

I think the requirement is needed because the existing WAL on the old node
cannot be transported to the new instance. The WAL hole from confirmed_flush
to the current position could not be filled by the newer instance.

> If
> yes, how does this work if some subscriber node isn't connected when the
> publisher node is stopped?  I guess you could add a check in pg_upgrade to 
> make
> sure that all logical slot are indeed caught up and fail if that's not the 
> case
> rather than assuming that a clean shutdown implies it.  It would be good to
> cover that in the TAP test, and also cover some corner cases, like any new row
> added on the publisher node after the pg_upgrade but before the subscriber is
> reconnected is also replicated as expected.

Hmm, good point. The current patch cannot handle that case because walsenders
for such slots do not exist. I have tested your approach; however, I found
that a CHECKPOINT_SHUTDOWN record was generated twice when the publisher was
shut down and restarted. This meant that the confirmed_lsn of slots always
lagged behind the WAL insert location, and the upgrade failed every time.
I do not have a good idea to solve it yet... Does anyone have one?
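A caught-up check on the pg_upgrade side could look roughly like this
sketch (illustrative Python; LSNs modeled as plain integers, names invented):

```python
def lagging_slots(slots, wal_end_lsn):
    # Old-cluster WAL is not carried over, so any slot whose
    # confirmed_flush is behind the end of WAL would leave an
    # untransportable "hole"; such slots should make the check fail.
    return [name for name, confirmed_flush in slots
            if confirmed_flush < wal_end_lsn]

slots = [("sub1", 0x1000), ("sub2", 0x0F00)]
assert lagging_slots(slots, 0x1000) == ["sub2"]  # sub2 blocks the upgrade
assert lagging_slots(slots, 0x0F00) == []
```

The open question above is precisely what `wal_end_lsn` should be compared
against, given the extra shutdown-checkpoint records observed.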

> Agreed, but then shouldn't the option be named "--logical-slots-only" or
> something like that, same for all internal function names?

Seems right. Will be fixed in the next version. Maybe
"--logical-replication-slots-only" will be used, per Peter's suggestion [1].

[1]: 
https://www.postgresql.org/message-id/CAHut%2BPvpBsyxj9SrB1ZZ9gP7r1AA5QoTYjpzMcVSjQO2xQy7aw%40mail.gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED





Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-04-07 Thread John Naylor
On Thu, Feb 16, 2023 at 11:44 PM Andres Freund  wrote:
>
> We really ought to replace the tid bitmap used for bitmap heap scans. The
> hashtable we use is a pretty awful data structure for it. And that's not
> filled in-order, for example.

I spent some time studying tidbitmap.c, and not only does it make sense to
use a radix tree there, but since it has more complex behavior and stricter
runtime requirements, it should really be the thing driving the design and
tradeoffs, not vacuum:

- With lazy expansion and single-value leaves, the root of a radix tree can
point to a single leaf. That might get rid of the need to track TBMStatus,
since setting a single-leaf tree should be cheap.

- Fixed-size PagetableEntry's are pretty large, but the tid compression
scheme used in this thread (in addition to being complex) is not a great
fit for tidbitmap because it makes it more difficult to track per-block
metadata (see also next point). With the "combined pointer-value slots"
technique, if a page's max tid offset is 63 or less, the offsets can be
stored directly in the pointer for the exact case. The lowest bit can tag
to indicate a pointer to a single-value leaf. That would complicate
operations like union/intersection and tracking "needs recheck", but it
would reduce memory use and node-traversal in common cases.
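A sketch of that combined pointer-value encoding (illustrative Python; this
bit layout is one possible choice, not a settled design): offsets 1..63
occupy bits 1..63 of a 64-bit slot, and bit 0 — always clear in an aligned
pointer — tags whether the slot instead points to a single-value leaf.

```python
LEAF_TAG = 1  # bit 0 set -> slot holds a tagged pointer to a leaf

def embed_offsets(offsets):
    # Store a page's tid offsets directly in one 64-bit slot when the
    # max offset fits; bit 0 stays clear to distinguish this form from
    # a tagged leaf pointer.
    assert all(1 <= off <= 63 for off in offsets)
    slot = 0
    for off in offsets:
        slot |= 1 << off
    return slot

def is_leaf_pointer(slot):
    return (slot & LEAF_TAG) != 0

def extract_offsets(slot):
    assert not is_leaf_pointer(slot)
    return [off for off in range(1, 64) if (slot >> off) & 1]

slot = embed_offsets([1, 7, 63])
assert not is_leaf_pointer(slot)
assert extract_offsets(slot) == [1, 7, 63]
```

As noted, union/intersection and "needs recheck" tracking would have to
handle both slot forms, which is the complication being traded off here.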

- Managing lossy storage. With pure blocknumber keys, replacing exact
storage for a range of 256 pages amounts to replacing a last-level node
with a single leaf containing one lossy PagetableEntry. The leader could
iterate over the nodes, and rank the last-level nodes by how much storage
they (possibly with leaf children) are using, and come up with an optimal
lossy-conversion plan.
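The lossy-conversion idea with pure block-number keys can be sketched like
this (a toy model, not tidbitmap code; the 256-page span mirrors a
last-level node's fanout):

```python
CHUNK_SHIFT = 8  # a last-level node spans 256 consecutive block numbers

def lossify(exact_pages):
    # Replacing a last-level node with one lossy entry amounts to
    # keeping only the 256-page chunk number (block >> 8); every page
    # in a lossy chunk then requires recheck at scan time.
    return sorted({blk >> CHUNK_SHIFT for blk in exact_pages})

def lossy_contains(chunks, blk):
    return (blk >> CHUNK_SHIFT) in chunks

chunks = lossify([5, 6, 300])
assert chunks == [0, 1]
assert lossy_contains(chunks, 200)       # swept into chunk 0's lossy range
assert not lossy_contains(chunks, 600)   # chunk 2 was never stored
```

Ranking nodes by exact-storage footprint before converting them, as
described above, would then pick which chunks to lossify first.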

The above would address the points (not including better iteration and
parallel bitmap index scans) raised in

https://www.postgresql.org/message-id/capsanrn5ywsows8ghqwbwajx1selxlntv54biq0z-j_e86f...@mail.gmail.com

Ironically, by targeting a more difficult use case, it's easier since there
is less freedom. There are many ways to beat a binary search, but fewer
good ways to improve bitmap heap scan. I'd like to put aside vacuum for
some time and try killing two birds with one stone, building upon our work
thus far.

Note: I've moved the CF entry to the next CF, and set to waiting on
author for now. Since no action is currently required from Masahiko, I've
added myself as author as well. If tackling bitmap heap scan shows promise,
we could RWF and resurrect at a later time.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Add index scan progress to pg_stat_progress_vacuum

2023-04-07 Thread Michael Paquier
On Thu, Apr 06, 2023 at 03:14:20PM +, Imseih (AWS), Sami wrote:
>> Could it be worth thinking about a different design where
>> the value incremented and the parameters of
>> pgstat_progress_update_param() are passed through the 'P' message
>> instead?
> 
> I am not sure how this is different than the approach suggested.
> In the current design, the 'P' message is used to pass the
> ParallelvacuumState to parallel_vacuum_update_progress which then
> calls pgstat_progress_update_param.

The arguments of pgstat_progress_update_param() would be given by the
worker directly as components of the 'P' message.  It seems to me that
this approach would have the simplicity to not require the setup of a
shmem area for the extra counters, and there would be no need for a
callback.  Hence, the only thing the code paths of workers would need
to do is to call this routine, then the leaders would increment their
progress when they see a CFI to process the 'P' message.  Also, I
guess that we would only need an interface in backend_progress.c to
increment counters, like pgstat_progress_incr_param(), but usable by
workers.  Like a pgstat_progress_worker_incr_param()?
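A toy model of that flow, with an ordinary queue standing in for the
worker-to-leader 'P' message (the message shape and names are invented for
illustration):

```python
from queue import SimpleQueue

PROGRESS_INDEX_DONE = 2  # stand-in for a progress-array slot number

def run_demo():
    progress = [0] * 4           # the leader's progress counters
    mq = SimpleQueue()           # stand-in for the worker->leader queue

    # Each worker sends the increment arguments in the message itself,
    # so no shared-memory counters or callback are needed.
    for _ in range(3):
        mq.put(("P", PROGRESS_INDEX_DONE, 1))

    # The leader applies the increments when it processes the messages.
    while not mq.empty():
        kind, index, amount = mq.get()
        if kind == "P":
            progress[index] += amount
    return progress

assert run_demo() == [0, 0, 3, 0]
```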
--
Michael




RE: CREATE SUBSCRIPTION -- add missing tab-completes

2023-04-07 Thread Zhijie Hou (Fujitsu)
On Friday, April 7, 2023 5:11 PM Amit Kapila  wrote:
> 
> On Fri, Apr 7, 2023 at 1:12 PM Masahiko Sawada 
> wrote:
> >
> > On Fri, Apr 7, 2023 at 2:28 PM Amit Kapila 
> wrote:
> > >
> > > On Wed, Apr 5, 2023 at 5:58 AM Peter Smith 
> wrote:
> > > >
> > >
> > > LGTM, so pushed. BTW, while looking at this, I noticed that newly
> > > added options "password_required" and "run_as_owner" has incorrectly
> > > mentioned their datatype as a string in the docs. It should be
> > > boolean.
> >
> > +1
> >
> > > I think "password_required" belongs to first section of docs which
> > > says: "The following parameters control what happens during
> > > subscription creation".
> >
> > But the documentation of ALTER SUBSCRIPTION says:
> >
> > The parameters that can be altered are slot_name, synchronous_commit,
> > binary, streaming, disable_on_error, password_required, run_as_owner,
> > and origin. Only a superuser can set password_required = false.
> >
> 
> By the above, do you intend to say that all the parameters that can be altered
> are in the second list? If so, slot_name belongs to the first category.
> 
> > ISTM that both password_required and run_as_owner are parameters to
> > control the subscription's behavior, like disable_on_error and
> > streaming. So it looks good to me that password_required belongs to
> > the second section.
> >
> 
> Do you mean that because 'password_required' is used each time we make a
> connection to a publisher during replication, it should be in the second
> category? If so, slot_name is also used during the start replication each 
> time.
> 
> BTW, do we need to check one or both of these parameters in
> maybe_reread_subscription() where we "Exit if any parameter that affects the
> remote connection was changed."

I think changing run_as_owner doesn't need to be checked, as it only affects
the role used to perform the apply. But it seems password_required needs to
be checked in maybe_reread_subscription(), because we use this parameter for
the connection.

Best Regards,
Hou zj


Re: Should vacuum process config file reload more often

2023-04-07 Thread Daniel Gustafsson
> On 7 Apr 2023, at 08:52, Masahiko Sawada  wrote:
> On Fri, Apr 7, 2023 at 8:08 AM Daniel Gustafsson  wrote:

>> I had another read-through and test-through of this version, and have applied
>> it with some minor changes to comments and whitespace.  Thanks for the quick
>> turnaround times on reviews in this thread!
> 
> Cool!
> 
> Regarding the commit 7d71d3dd08, I have one comment:
> 
> +   /* Only log updates to cost-related variables */
> +   if (vacuum_cost_delay == original_cost_delay &&
> +   vacuum_cost_limit == original_cost_limit)
> +   return;
> 
> IIUC by default, we log not only before starting the vacuum but also
> when changing cost-related variables. Which is good, I think, because
> logging the initial values would also be helpful for investigation.
> However, I think that we don't log the initial vacuum cost values
> depending on the values. For example, if the
> autovacuum_vacuum_cost_delay storage option is set to 0, we don't log
> the initial values. I think that instead of comparing old and new
> values, we can write the log only if
> message_level_is_interesting(DEBUG2) is true. That way, we don't need
> to acquire the lwlock unnecessarily. And the code looks cleaner to me.
> I've attached the patch (use_message_level_is_interesting.patch)

That's a good idea, unless Melanie has conflicting opinions I think we should
go ahead with this.  Avoiding taking a lock here is a good save.

> Also, while testing the autovacuum delay with relopt
> autovacuum_vacuum_cost_delay = 0, I realized that even if we set
> autovacuum_vacuum_cost_delay = 0 to a table, wi_dobalance is set to
> true. wi_dobalance comes from the following expression:
> 
>/*
> * If any of the cost delay parameters has been set individually for
> * this table, disable the balancing algorithm.
> */
>tab->at_dobalance =
>!(avopts && (avopts->vacuum_cost_limit > 0 ||
> avopts->vacuum_cost_delay > 0));
> 
> The initial values of both avopts->vacuum_cost_limit and
> avopts->vacuum_cost_delay are -1. I think we should use ">= 0" instead
> of "> 0". Otherwise, we include the autovacuum worker working on a
> table whose autovacuum_vacuum_cost_delay is 0 to the balancing
> algorithm. Probably this behavior has existed also on back branches
> but I haven't checked it yet.

Interesting, good find.  Looking quickly at the back branches, I think there is
a variant of this for vacuum_cost_limit even there, but it needs more investigation.

--
Daniel Gustafsson





Re: [PATCH] Allow Postgres to pick an unused port to listen

2023-04-07 Thread Andrew Dunstan


On 2023-03-29 We 07:55, Tom Lane wrote:

Yurii Rashkovskii  writes:

I would like to suggest a patch against master (although it may be worth
backporting it) that makes it possible to listen on any unused port.

I think this is a bad idea, mainly because this:


Instead, with this patch, one can specify `port` as `0` (the "wildcard"
port) and retrieve the assigned port from postmaster.pid

is a horrid way to find out what was picked, and yet there could
be no other.

Our existing design for this sort of thing is to let the testing
framework choose the port, and I don't really see what's wrong
with that approach.  Yes, I know it's theoretically subject to
race conditions, but that hasn't seemed to be a problem in
practice.  It's especially not a problem given that modern
testing practice tends to not open any TCP port at all, just
a Unix socket in a test-private directory, so that port
conflicts are a non-issue.



For TAP tests we have pretty much resolved the port collisions issue for 
TCP ports too. See commit 9b4eafcaf4


Perhaps the OP could adapt that logic to his use case.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com


Re: Direct I/O

2023-04-07 Thread Thomas Munro
On Wed, Jan 25, 2023 at 8:57 PM Bharath Rupireddy
 wrote:
> Thanks. I have some comments on
> v3-0002-Add-io_direct-setting-developer-only.patch:
>
> 1. I think we don't need to overwrite the io_direct_string in
> check_io_direct so that show_io_direct can be avoided.

Thanks for looking at this, and sorry for the late response.  Yeah, agreed.

> 2. check_io_direct can leak the flags memory - when io_direct is not
> supported or for an invalid list syntax or an invalid option is
> specified.
>
> I have addressed my review comments as a delta patch on top of v3-0002
> and added it here as v1-0001-Review-comments-io_direct-GUC.txt.

Thanks.  Your way is nicer.  I merged your patch and added you as a co-author.

> Some comments on the tests added:
>
> 1. Is there a way to know if Direct IO for WAL and data has been
> picked up programmatically? IOW, can we know if the OS page cache is
> bypassed? I know an external extension pgfincore which can help here,
> but nothing in the core exists AFAICS.

Right, that extension can tell you how many pages are in the kernel
page cache which is quite interesting for this.  I also once hacked up
something primitive to see *which* pages are in kernel cache, so I
could join that against pg_buffercache to measure double buffering,
when I was studying the "smile" shape where pgbench TPS goes down and
then back up again as you increase shared_buffers if the working set
is nearly as big as physical memory (code available in a link from
[1]).

Yeah, I agree it might be nice for human investigators to put
something like that in contrib/pg_buffercache, but I'm not sure you
could rely on it enough for an automated test, though, 'cause it
probably won't work on some file systems and the tests would probably
fail for random transient reasons (for example: some systems won't
kick pages out of kernel cache if they were already there, just
because we decided to open the file with O_DIRECT).  (I got curious
about why mincore() wasn't standardised along with mmap() and all that
jazz; it seems the BSD and later Sun people who invented all those
interfaces didn't think that one was quite good enough[2], but every
(?) Unixoid OS copied it anyway, with variations...  Apparently the
Windows thing is called VirtualQuery()).

> 2. Can we combine these two append_conf to a single statement?
> +$node->append_conf('io_direct', 'data,wal,wal_init');
> +$node->append_conf('shared_buffers', '64kB'); # tiny to force I/O

OK, sure, done.  And also oops, that was completely wrong and not
working the way I had it in that version...

> 3. A nitpick: Can we split these queries multi-line instead of in a single 
> line?
> +$node->safe_psql('postgres', 'begin; create temporary table t2 as
> select 1 as i from generate_series(1, 1); update t2 set i = i;
> insert into t2count select count(*) from t2; commit;');

OK.

> 4. I don't think we need to stop the node before the test ends, no?
> +$node->stop;
> +
> +done_testing();

Sure, but why not?

Otherwise, I rebased, and made a couple more changes:

I found a line of the manual about wal_sync_method that needed to be removed:

-The open_* options also use
O_DIRECT if available.

In fact that sentence didn't correctly document the behaviour in
released branches (wal_level=minimal is also required for that, so
probably very few people ever used it).  I think we should adjust that
misleading sentence in back-branches, separately from this patch set.

I also updated the commit message to highlight the only expected
user-visible change for this, namely the loss of the above incorrectly
documented obscure special case, which is replaced by the less obscure
new setting io_direct=wal, if someone still wants that behaviour.

Also a few minor comment changes.

[1] https://twitter.com/MengTangmu/status/994770040745615361
[2] http://kos.enix.org/pub/gingell8.pdf
From c6e01d506762fb7c11a3fb31d56902fa53ea822b Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Tue, 13 Dec 2022 16:25:59 +1300
Subject: [PATCH v4 1/3] Introduce PG_IO_ALIGN_SIZE and align all I/O buffers.

In order to be able to use O_DIRECT/FILE_FLAG_NO_BUFFERING on common
systems in a later commit, we need the address and length of user space
buffers to align with the sector size of the storage.  O_DIRECT would
either fail to work or fail to work efficiently without that on various
platforms.  Even without O_DIRECT, aligning on memory pages is known to
improve traditional buffered I/O performance.

The alignment size is set to 4096, which is enough for currently known
systems: it covers traditional 512 byte sectors and modern 4096 byte
sectors, as well as common 4096 byte memory pages.  There is no standard
governing the requirements for O_DIRECT so it's possible we might have
to reconsider this approach or fail to work on some exotic system, but
for now this simplistic approach works and it can be changed at compile
time.

Three classes of I/O buffers for regular data pages are adjusted:
(1) Heap bu

RE: [PoC] pg_upgrade: allow to upgrade publisher node

2023-04-07 Thread Hayato Kuroda (Fujitsu)
Dear Julien,

> > Agreed, but then shouldn't the option be named "--logical-slots-only" or
> > something like that, same for all internal function names?
> 
> Seems right. Will be fixed in next version. Maybe
> "--logical-replication-slots-only"
> will be used, per Peter's suggestion [1].

After considering it more, I decided not to include the word "logical" in the option
at this point. This is because we have not decided yet whether we dump physical
replication slots or not. The current restriction exists just because of a lack of
analysis and consideration. If we decide not to do that, then the options will
be renamed accordingly.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED





Re: Should vacuum process config file reload more often

2023-04-07 Thread Melanie Plageman
On Fri, Apr 7, 2023 at 2:53 AM Masahiko Sawada  wrote:
>
> On Fri, Apr 7, 2023 at 8:08 AM Daniel Gustafsson  wrote:
> >
> > > On 7 Apr 2023, at 00:12, Melanie Plageman  
> > > wrote:
> > >
> > > On Thu, Apr 6, 2023 at 5:45 PM Daniel Gustafsson  wrote:
> > >>
> > >>> On 6 Apr 2023, at 23:06, Melanie Plageman  
> > >>> wrote:
> > >>
> > >>> Autovacuum workers, at the end of VacuumUpdateCosts(), check if cost
> > >>> limit or cost delay have been changed. If they have, they assert that
> > >>> they don't already hold the AutovacuumLock, take it in shared mode, and
> > >>> do the logging.
> > >>
> > >> Another idea would be to copy the values to local temp variables while 
> > >> holding
> > >> the lock, and release the lock before calling elog() to avoid holding 
> > >> the lock
> > >> over potential IO.
> > >
> > > Good idea. I've done this in attached v19.
> > > Also I looked through the docs and everything still looks correct for
> > > balancing algo.
> >
> > I had another read-through and test-through of this version, and have 
> > applied
> > it with some minor changes to comments and whitespace.  Thanks for the quick
> > turnaround times on reviews in this thread!
>
> Cool!
>
> Regarding the commit 7d71d3dd08, I have one comment:
>
> +   /* Only log updates to cost-related variables */
> +   if (vacuum_cost_delay == original_cost_delay &&
> +   vacuum_cost_limit == original_cost_limit)
> +   return;
>
> IIUC by default, we log not only before starting the vacuum but also
> when changing cost-related variables. Which is good, I think, because
> logging the initial values would also be helpful for investigation.
> However, I think that we don't log the initial vacuum cost values
> depending on the values. For example, if the
> autovacuum_vacuum_cost_delay storage option is set to 0, we don't log
> the initial values. I think that instead of comparing old and new
> values, we can write the log only if
> message_level_is_interesting(DEBUG2) is true. That way, we don't need
> to acquire the lwlock unnecessarily. And the code looks cleaner to me.
> I've attached the patch (use_message_level_is_interesting.patch)

Thanks for coming up with the case you thought of with storage param for
cost delay = 0. In that case we wouldn't print the message initially and
we should fix that.

I disagree, however, that we should condition it only on
message_level_is_interesting().

Actually, outside of printing initial values when the autovacuum worker
first starts (before vacuuming all tables), I don't think we should log
these values except when they are being updated. Autovacuum workers
could vacuum tons of small tables and having this print out at least
once per table (which I know is how it is on master) would be
distracting. Also, you could be reloading the config to update some
other GUCs and be oblivious to an ongoing autovacuum and get these
messages printed out, which I would also find distracting.

You will have to stare very hard at the logs to tell if your changes to
vacuum cost delay and limit took effect when you reload config. I think
with our changes to update the values more often, we should take the
opportunity to make this logging more useful by making it happen only
when the values are changed.

I would be open to elevating the log level to DEBUG1 for logging only
updates and, perhaps, having an option if you set log level to DEBUG2,
for example, to always log these values in VacuumUpdateCosts().

I'd even argue that, potentially, having the cost-delay related
parameters printed at the beginning of vacuuming could be interesting to
regular VACUUM as well (even though it doesn't benefit from config
reload while in progress).

To fix the issue you mentioned and ensure the logging is printed when
autovacuum workers start up before vacuuming tables, we could either
initialize vacuum_cost_delay and vacuum_cost_limit to something invalid
that will always be different than what they are set to in
VacuumUpdateCosts() (not sure if this poses a problem for VACUUM using
these values since they are set to the defaults for VACUUM). Or, we
could duplicate this logging message in do_autovacuum().

Finally, one other point about message_level_is_interesting(). I liked
the idea of using it a lot, since log level DEBUG2 will not be the
common case. I thought of it but hesitated because all other users of
message_level_is_interesting() are avoiding some memory allocation or
string copying -- not avoiding taking a lock. Making this conditioned on
log level made me a bit uncomfortable. I can't think of a situation when
it would be a problem, but it felt a bit off.

> Also, while testing the autovacuum delay with relopt
> autovacuum_vacuum_cost_delay = 0, I realized that even if we set
> autovacuum_vacuum_cost_delay = 0 to a table, wi_dobalance is set to
> true. wi_dobalance comes from the following expression:
>
> /*
>  * If any of the cost delay parameters has been set individually for
>

Re: Should vacuum process config file reload more often

2023-04-07 Thread Daniel Gustafsson
> On 7 Apr 2023, at 15:07, Melanie Plageman  wrote:
> On Fri, Apr 7, 2023 at 2:53 AM Masahiko Sawada  wrote:

>> +   /* Only log updates to cost-related variables */
>> +   if (vacuum_cost_delay == original_cost_delay &&
>> +   vacuum_cost_limit == original_cost_limit)
>> +   return;
>> 
>> IIUC by default, we log not only before starting the vacuum but also
>> when changing cost-related variables. Which is good, I think, because
>> logging the initial values would also be helpful for investigation.
>> However, I think that we don't log the initial vacuum cost values
>> depending on the values. For example, if the
>> autovacuum_vacuum_cost_delay storage option is set to 0, we don't log
>> the initial values. I think that instead of comparing old and new
>> values, we can write the log only if
>> message_level_is_interesting(DEBUG2) is true. That way, we don't need
>> to acquire the lwlock unnecessarily. And the code looks cleaner to me.
>> I've attached the patch (use_message_level_is_interesting.patch)
> 
> Thanks for coming up with the case you thought of with storage param for
> cost delay = 0. In that case we wouldn't print the message initially and
> we should fix that.
> 
> I disagree, however, that we should condition it only on
> message_level_is_interesting().

I think we should keep the logging frequency as committed, but condition taking
the lock on message_level_is_interesting().

> Actually, outside of printing initial values when the autovacuum worker
> first starts (before vacuuming all tables), I don't think we should log
> these values except when they are being updated. Autovacuum workers
> could vacuum tons of small tables and having this print out at least
> once per table (which I know is how it is on master) would be
> distracting. Also, you could be reloading the config to update some
> other GUCs and be oblivious to an ongoing autovacuum and get these
> messages printed out, which I would also find distracting.
> 
> You will have to stare very hard at the logs to tell if your changes to
> vacuum cost delay and limit took effect when you reload config. I think
> with our changes to update the values more often, we should take the
> opportunity to make this logging more useful by making it happen only
> when the values are changed.
> 
> I would be open to elevating the log level to DEBUG1 for logging only
> updates and, perhaps, having an option if you set log level to DEBUG2,
> for example, to always log these values in VacuumUpdateCosts().
> 
> I'd even argue that, potentially, having the cost-delay related
> parameters printed at the beginning of vacuuming could be interesting to
> regular VACUUM as well (even though it doesn't benefit from config
> reload while in progress).
> 
> To fix the issue you mentioned and ensure the logging is printed when
> autovacuum workers start up before vacuuming tables, we could either
> initialize vacuum_cost_delay and vacuum_cost_limit to something invalid
> that will always be different than what they are set to in
> VacuumUpdateCosts() (not sure if this poses a problem for VACUUM using
> these values since they are set to the defaults for VACUUM). Or, we
> could duplicate this logging message in do_autovacuum().

Duplicating logging, maybe with a slightly tailored message, seem the least
bad option.

> Finally, one other point about message_level_is_interesting(). I liked
> the idea of using it a lot, since log level DEBUG2 will not be the
> common case. I thought of it but hesitated because all other users of
> message_level_is_interesting() are avoiding some memory allocation or
> string copying -- not avoiding taking a lock. Making this conditioned on
> log level made me a bit uncomfortable. I can't think of a situation when
> it would be a problem, but it felt a bit off.

Considering how uncommon DEBUG2 will be in production, I think conditioning
taking a lock on it makes sense.

>> Also, while testing the autovacuum delay with relopt
>> autovacuum_vacuum_cost_delay = 0, I realized that even if we set
>> autovacuum_vacuum_cost_delay = 0 to a table, wi_dobalance is set to
>> true. wi_dobalance comes from the following expression:
>> 
>>/*
>> * If any of the cost delay parameters has been set individually for
>> * this table, disable the balancing algorithm.
>> */
>>tab->at_dobalance =
>>!(avopts && (avopts->vacuum_cost_limit > 0 ||
>> avopts->vacuum_cost_delay > 0));
>> 
>> The initial values of both avopts->vacuum_cost_limit and
>> avopts->vacuum_cost_delay are -1. I think we should use ">= 0" instead
>> of "> 0". Otherwise, we include the autovacuum worker working on a
>> table whose autovacuum_vacuum_cost_delay is 0 to the balancing
>> algorithm. Probably this behavior has existed also on back branches
>> but I haven't checked it yet.
> 
> Thank you for catching this. Indeed this exists in master since
> 1021bd6a89b which was backpatched. I c

Re: CREATE SUBSCRIPTION -- add missing tab-completes

2023-04-07 Thread Masahiko Sawada
On Fri, Apr 7, 2023 at 6:10 PM Amit Kapila  wrote:
>
> On Fri, Apr 7, 2023 at 1:12 PM Masahiko Sawada  wrote:
> >
> > On Fri, Apr 7, 2023 at 2:28 PM Amit Kapila  wrote:
> > >
> > > On Wed, Apr 5, 2023 at 5:58 AM Peter Smith  wrote:
> > > >
> > >
> > > LGTM, so pushed. BTW, while looking at this, I noticed that newly
> > > added options "password_required" and "run_as_owner" has incorrectly
> > > mentioned their datatype as a string in the docs. It should be
> > > boolean.
> >
> > +1
> >
> > > I think "password_required" belongs to first section of docs
> > > which says: "The following parameters control what happens during
> > > subscription creation".
> >
> > But the documentation of ALTER SUBSCRIPTION says:
> >
> > The parameters that can be altered are slot_name, synchronous_commit,
> > binary, streaming, disable_on_error, password_required, run_as_owner,
> > and origin. Only a superuser can set password_required = false.
> >
>
> By the above, do you intend to say that all the parameters that can be
> altered are in the second list? If so, slot_name belongs to the first
> category.
>
> > ISTM that both password_required and run_as_owner are parameters to
> > control the subscription's behavior, like disable_on_error and
> > streaming. So it looks good to me that password_required belongs to
> > the second section.
> >
>
> Do you mean that because 'password_required' is used each time we make
> a connection to a publisher during replication, it should be in the
> second category? If so, slot_name is also used during the start
> replication each time.

I think that parameters used by the backend process when performing
CREATE SUBSCRIPTION belong to the first category. And other parameters
used by apply workers and tablesync workers belong to the second
category. Since slot_name is used by both I'm not sure it should be in
the second category, but password_required seems to be used by only
apply workers and tablesync workers, so it should be in the second
category.

>
> BTW, do we need to check one or both of these parameters in
> maybe_reread_subscription() where we "Exit if any parameter that
> affects the remote connection was changed."

As for run_as_owner, since we can dynamically switch the behavior I
think we don't need to reconnect. I'm not really sure about
password_required. From the implementation point of view, we don't
need to reconnect. Even if password_required is changed from false to
true, the apply worker already has the established connection. If it's
changed from true to false, we might not want to reconnect. I think we
need to consider it from the security point of view while checking the
motivation that password_required was introduced. So probably it's
better to discuss it on the original thread.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Making background psql nicer to use in tap tests

2023-04-07 Thread Daniel Gustafsson
> On 5 Apr 2023, at 23:44, Daniel Gustafsson  wrote:
> 
> Unless there are objections I plan to get this in before the freeze, in order
> to have better interactive tests starting with 16.  With a little bit of
> documentation polish I think it's ready.

When looking at the CFBot failure on Linux and Windows (not on macOS) I noticed
that it was down to the instance lacking IO::Pty.

[19:59:12.609](1.606s) ok 1 - scram_iterations in server side ROLE
Can't locate IO/Pty.pm in @INC (you may need to install the IO::Pty module) 
(@INC contains: /tmp/cirrus-ci-build/src/test/perl 
/tmp/cirrus-ci-build/src/test/authentication /etc/perl 
/usr/local/lib/i386-linux-gnu/perl/5.32.1 /usr/local/share/perl/5.32.1 
/usr/lib/i386-linux-gnu/perl5/5.32 /usr/share/perl5 
/usr/lib/i386-linux-gnu/perl/5.32 /usr/share/perl/5.32 
/usr/local/lib/site_perl) at /usr/share/perl5/IPC/Run.pm line 1828.

Skimming the VM creation [0] it seems like it should be though?  On macOS the
module is installed inside Cirrus and the test runs fine.

I don't think we should go ahead with a patch that refactors interactive_psql
only to SKIP over it in CI (which is what the tab_completion test does now), so
let's wait until we have that sorted before going ahead.

--
Daniel Gustafsson

[0] 
https://github.com/anarazel/pg-vm-images/blob/main/scripts/linux_debian_install_deps.sh



Re: Is RecoveryConflictInterrupt() entirely safe in a signal handler?

2023-04-07 Thread Thomas Munro
On Tue, Apr 4, 2023 at 1:25 AM Tom Lane  wrote:
> Sorry for not looking at this sooner.  I am okay with the regex
> changes proposed in v5-0001 through 0003, but I think you need to
> take another mopup pass there.  Some specific complaints:
> * header comment for pg_regprefix has been falsified (s/malloc/palloc/)

Thanks.  Fixed.

> * in spell.c, regex_affix_deletion_callback could be got rid of

Done in a separate patch.  I wondered if regex_t should be included
directly as a member of that union inside AFFIX, but decided it should
keep using a pointer (just without the extra wrapper struct).  A
direct member would make the AFFIX slightly larger, and it would
require us to assume that regex_t is movable, which it probably is in
practice, but that isn't written down anywhere and it seemed strange to
rely on it.

> * check other callers of pg_regerror for now-useless CHECK_FOR_INTERRUPTS

I found three of these to remove (jsonpath_gram.y, varlena.c, test_regex.c).

> In general there's a lot of comments referring to regexes being malloc'd.

There is also some remaining direct use of malloc() in
regc_pg_locale.c because "we mustn't lose control on out-of-memory".
At that time (2012) there was no MCXT_NO_OOM (2015), so we could
presumably bring that cache into an observable MemoryContext now too.
I haven't written a patch for that, though, because it's not in the
way of my recovery conflict mission.

> I'm disinclined to change the ones inside the engine, because as far as
> it knows it is still using malloc, but maybe we should work harder on
> our own comments.  In particular, it'd likely be useful to have something
> somewhere pointing out that pg_regfree is only needed when you can't
> get rid of the regex by context cleanup.  Maybe write a short section
> about memory management in backend/regex/README?

I'll try to write something for the README tomorrow.  Here's a new
version of the code changes.

> I've not really looked at 0004.

I'm hoping to get just the regex changes in ASAP, and then take a
little bit longer on the recovery conflict patch itself (v6-0005) on
the basis that it's bugfix work and not subject to the feature freeze.
From a21a43bf5b1ba073abb3238968b9f8d13b1b318a Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Wed, 4 Jan 2023 14:15:40 +1300
Subject: [PATCH v6 1/5] Use MemoryContext API for regex memory management.

Previously, regex_t objects' memory was managed with malloc() and free()
directly.  Switch to palloc()-based memory management instead.
Advantages:

 * memory used by cached regexes is now visible with MemoryContext
   observability tools

 * cleanup can be done automatically in certain failure modes
   (something that later commits will take advantage of)

 * cleanup can be done in bulk

On the downside, there may be more fragmentation (wasted memory) due to
per-regex MemoryContext objects.  This is a problem shared with other
cached objects in PostgreSQL and can probably be improved with later
tuning.

Thanks to Noah Misch for suggesting this general approach, which
unblocks later work on interrupts.

Reviewed-by: Tom Lane 
Discussion: https://postgr.es/m/CA%2BhUKGK3PGKwcKqzoosamn36YW-fsuTdOPPF1i_rtEO%3DnEYKSg%40mail.gmail.com
---
 src/backend/regex/regprefix.c  |  2 +-
 src/backend/utils/adt/regexp.c | 57 --
 src/include/regex/regcustom.h  |  6 ++--
 3 files changed, 45 insertions(+), 20 deletions(-)

diff --git a/src/backend/regex/regprefix.c b/src/backend/regex/regprefix.c
index 221f02da63..c09b2a9778 100644
--- a/src/backend/regex/regprefix.c
+++ b/src/backend/regex/regprefix.c
@@ -32,7 +32,7 @@ static int	findprefix(struct cnfa *cnfa, struct colormap *cm,
  *	REG_EXACT: all strings satisfying the regex must match the same string
  *	or a REG_XXX error code
  *
- * In the non-failure cases, *string is set to a malloc'd string containing
+ * In the non-failure cases, *string is set to a palloc'd string containing
  * the common prefix or exact value, of length *slength (measured in chrs
  * not bytes!).
  *
diff --git a/src/backend/utils/adt/regexp.c b/src/backend/utils/adt/regexp.c
index 810dcb85b6..81400ba150 100644
--- a/src/backend/utils/adt/regexp.c
+++ b/src/backend/utils/adt/regexp.c
@@ -96,9 +96,13 @@ typedef struct regexp_matches_ctx
 #define MAX_CACHED_RES	32
 #endif
 
+/* A parent memory context for regular expressions. */
+static MemoryContext RegexpCacheMemoryContext;
+
 /* this structure describes one cached regular expression */
 typedef struct cached_re_str
 {
+	MemoryContext cre_context;	/* memory context for this regexp */
 	char	   *cre_pat;		/* original RE (not null terminated!) */
 	int			cre_pat_len;	/* length of original RE, in bytes */
 	int			cre_flags;		/* compile flags: extended,icase etc */
@@ -145,6 +149,7 @@ RE_compile_and_cache(text *text_re, int cflags, Oid collation)
 	int			regcomp_result;
 	cached_re_str re_temp;
 	char		errMsg[100];
+	MemoryContext oldcontext;
 
 	/*
 	 * Look for a ma

RE: [PoC] pg_upgrade: allow to upgrade publisher node

2023-04-07 Thread Hayato Kuroda (Fujitsu)
Dear Peter,

Thank you for reviewing briefly. PSA new version.
If you can I want to ask the opinion about the checking by pg_upgrade [1].

> ==
> General
> 
> 1.
> Since these two new options are made to work together, I think the
> names should be more similar. e.g.
> 
> pg_dump: "--slot_only" --> "--replication-slots-only"
> pg_upgrade: "--include-replication-slot" --> "--include-replication-slots"
> 
> help/comments/commit-message all should change accordingly, but I did
> not give separate review comments for each of these.

OK, I renamed. By the way, what do you think about the suggestion raised by Julien?
Currently I did not address it, because the restriction is caused just by a lack of
analysis, and the change may not be agreed on in the community.
Or should we keep the name anyway?

> 2.
> I felt there maybe should be some pg_dump test cases for that new
> option, rather than the current patch where it only seems to be
> testing the new pg_dump option via the pg_upgrade TAP tests.

Hmm, I assumed that the option should be used only for upgrading, so I'm not sure
it needs to be tested by pg_dump alone.

> Commit message
> 
> 3.
> This commit introduces a new option called "--include-replication-slot".
> This allows nodes with logical replication slots to be upgraded. The commit 
> can
> be divided into two parts: one for pg_dump and another for pg_upgrade.
> 
> ~
> 
> "new option" --> "new pg_upgrade" option

Fixed.

> 4.
> For pg_upgrade, when '--include-replication-slot' is specified, it
> executes pg_dump
> with added option and restore from the dump. Apart from restoring
> schema, pg_resetwal
> must not be called after restoring replicaiton slots. This is because
> the command
> discards WAL files and starts from a new segment, even if they are required by
> replication slots. This leads an ERROR: "requested WAL segment XXX has already
> been removed". To avoid this, replication slots are restored at a different 
> time
> than other objects, after running pg_resetwal.
> 
> ~
> 
> 4a.
> "with added option and restore from the dump" --> "with the new
> "--slot-only" option and restores from the dump"

Fixed.

> 4b.
> Typo: /replicaiton/replication/

Fixed.

> 4c
> "leads an ERROR" --> "leads to an ERROR"

Fixed.

> doc/src/sgml/ref/pg_dump.sgml
> 
> 5.
> + 
> +  --slot-only
> +  
> +   
> +Dump only replication slots, neither the schema (data definitions) 
> nor
> +data. Mainly this is used for upgrading nodes.
> +   
> +  
> 
> SUGGESTION
> Dump only replication slots; not the schema (data definitions), nor
> data. This is mainly used when upgrading nodes.

Fixed.

> doc/src/sgml/ref/pgupgrade.sgml
> 
> 6.
> +   
> +Transport replication slots. Currently this can work only for logical
> +slots, and temporary slots are ignored. Note that pg_upgrade does not
> +check the installation of plugins.
> +   
> 
> SUGGESTION
> Upgrade replication slots. Only logical replication slots are
> currently supported, and temporary slots are ignored. Note that...

Fixed.

> src/bin/pg_dump/pg_dump.c
> 
> 7. main
>   {"exclude-table-data-and-children", required_argument, NULL, 14},
> -
> + {"slot-only", no_argument, NULL, 15},
>   {NULL, 0, NULL, 0}
> 
> The blank line is misplaced.

Fixed.

> 8. main
> + case 15: /* dump onlu replication slot(s) */
> + dopt.slot_only = true;
> + dopt.include_everything = false;
> + break;
> 
> typo: /onlu/only/

Fixed.

> 9. main
> + if (dopt.slot_only && dopt.dataOnly)
> + pg_fatal("options --replicatin-slots and -a/--data-only cannot be
> used together");
> + if (dopt.slot_only && dopt.schemaOnly)
> + pg_fatal("options --replicatin-slots and -s/--schema-only cannot be
> used together");
> +
> 
> 9a.
> typo: /replicatin/replication/

Fixed. Additionally, the wrong parameter reference was also corrected.

> 9b.
> I am wondering if these checks are enough. E.g. is "slots-only"
> compatible with "no-publications" ?

I think there are some more things that should be checked, but I'm not sure
about "no-publications". It is possible that non-core logical replication is
being used, and in that case these options do not contradict each other.

> 10. main
> + /*
> + * If dumping replication slots are request, dumping them and skip others.
> + */
> + if (dopt.slot_only)
> + {
> + getRepliactionSlots(fout);
> + goto dump;
> + }
> 
> 10a.
> SUGGESTION
> If dump replication-slots-only was requested, dump only them and skip
> everything else.

Fixed.

> 10b.
> This code seems mutually exclusive to every other option. I'm
> wondering if this code even needs 'collectRoleNames', or should the
> slots option check be moved  above that (and also above the 'Dumping
> LOs' etc...)

I read it again and found that the collected user names are used to check the
ownership of objects. IIUC replication slots are not owned by database users,
so that is not needed. Also, the LOs should not be dumped here. Based on that,
I moved getRepliactionSlots() above them.

> 11. help
> 
> + 

Re: Commitfest 2023-03 starting tomorrow!

2023-04-07 Thread Greg Stark
As announced on this list feature freeze is at 00:00 April 8 AoE.
That's less than 24 hours away. If you need to set your watches to AoE
timezone it's currently:

$ TZ=AOE+12 date
Fri 07 Apr 2023 02:05:50 AM AOE

As we stand we have:

Status summary:
  Needs review:             82
  Waiting on Author:        16
  Ready for Committer:      27
  Committed:               115
  Moved to next CF:         38
  Returned with Feedback:   10
  Rejected:                  9
  Withdrawn:                22
Total: 319.

In less than 24h most of the remaining patches will get rolled forward
to the next CF. The 16 that are Waiting on Author might be RwF
perhaps. The only exceptions would be non-features like Bug Fixes and
cleanup patches that have been intentionally held until the end --
those become Open Issues for the release.

So if we move forward all the remaining patches (so these numbers are
high by about half a dozen) the *next* CF would look like:

Commitfest 2023-07:      Now  April 8
  Needs review:           46      128
  Waiting on Author:      17       33
  Ready for Committer:     3       30
Total:                    66      191

I suppose that's better than the 319 we came into this CF with but
there's 3 months to accumulate more unreviewed patches...

I had hoped to find lots of patches that I could bring the hammer down
on and say there's just no interest in or there's no author still
maintaining. But that wasn't the case. Nearly all the patches still
had actively interested authors and looked like they were legitimately
interesting and worthwhile features that people just haven't had the
time to review or commit.


--
greg




Re: meson documentation build open issues

2023-04-07 Thread Andrew Dunstan


On 2023-04-06 Th 05:11, Peter Eisentraut wrote:

On 05.04.23 16:45, Andres Freund wrote:
I think it's still an issue that "make docs" builds html and man but
"ninja docs" only builds html.  For some reason the wiki page actually
claims that ninja docs builds both, but this does not happen for me.


It used to, but Tom insisted that it should not. I'm afraid that it's not
quite possible to emulate make here. 'make docs' at the toplevel builds
both HTML and manpages. But 'make -C doc/src/sgml' only builds HTML.


Ok, not a topic for this thread then.


5. There doesn't appear to be an equivalent of "make world" and "make
install-world" that includes documentation builds.


This has been addressed with the additional meson auto options.  But it
seems that these options only control building, not installing, so there
is no "install-world" aspect yet.


I'm not following - install-world installs docs if the docs feature is
available, and not if not?


I had expected that if meson setup enables the 'docs' feature, then 
meson compile will build the documentation, which happens, and meson 
install will install it, which does not happen.






"meson compile" doesn't seem to build the docs by default ( see 
), 
and I'd rather it didn't, building the docs is a separate and optional 
step for the buildfarm.



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com


Re: Making background psql nicer to use in tap tests

2023-04-07 Thread Andrew Dunstan


On 2023-04-07 Fr 09:32, Daniel Gustafsson wrote:

On 5 Apr 2023, at 23:44, Daniel Gustafsson  wrote:

Unless there are objections I plan to get this in before the freeze, in order
to have better interactive tests starting with 16.  With a little bit of
documentation polish I think it's ready.

When looking at the CFBot failure on Linux and Windows (not on macOS) I noticed
that it was down to the instance lacking IO::Pty.

[19:59:12.609](1.606s) ok 1 - scram_iterations in server side ROLE
Can't locate IO/Pty.pm in @INC (you may need to install the IO::Pty module) 
(@INC contains: /tmp/cirrus-ci-build/src/test/perl 
/tmp/cirrus-ci-build/src/test/authentication /etc/perl 
/usr/local/lib/i386-linux-gnu/perl/5.32.1 /usr/local/share/perl/5.32.1 
/usr/lib/i386-linux-gnu/perl5/5.32 /usr/share/perl5 
/usr/lib/i386-linux-gnu/perl/5.32 /usr/share/perl/5.32 
/usr/local/lib/site_perl) at /usr/share/perl5/IPC/Run.pm line 1828.

Skimming the VM creation [0] it seems like it should be though?  On macOS the
module is installed inside Cirrus and the test runs fine.

I don't think we should go ahead with a patch that refactors interactive_psql
only to SKIP over it in CI (which is what the tab_completion test does now), so
let's wait until we have that sorted before going ahead.



It should probably be added to config/check_modules.pl if we're going to 
use it, but it seems to be missing for Strawberry Perl and msys/ucrt64 
perl and I'm not sure how easy it will be to add there. It would 
certainly add an installation burden for test instances at the very least.



cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com


Re: cataloguing NOT NULL constraints

2023-04-07 Thread Justin Pryzby
On Fri, Apr 07, 2023 at 04:14:13AM +0200, Alvaro Herrera wrote:
> On 2023-Apr-06, Justin Pryzby wrote:

> > +ERROR:  relation "c" already exists
>
> Do you intend to make an error here ?

These still look like mistakes in the tests.

> Also, I think these table names may be too generic, and conflict with
> other parallel tests, now or in the future.
>
> > +create table d(a int not null, f1 int) inherits(inh_p3, c);
> > +ERROR:  relation "d" already exists

> Sadly, the binary-upgrade mode is a bit of a mess and thus the
> pg_upgrade test is failing.




Re: Making background psql nicer to use in tap tests

2023-04-07 Thread Andres Freund
Hi,

On 2023-04-07 15:32:12 +0200, Daniel Gustafsson wrote:
> > On 5 Apr 2023, at 23:44, Daniel Gustafsson  wrote:
> > 
> > Unless there are objections I plan to get this in before the freeze, in 
> > order
> > to have better interactive tests starting with 16.  With a little bit of
> > documentation polish I think it's ready.
> 
> When looking at the CFBot failure on Linux and Windows (not on macOS) I 
> noticed
> that it was down to the instance lacking IO::Pty.
> 
> [19:59:12.609](1.606s) ok 1 - scram_iterations in server side ROLE
> Can't locate IO/Pty.pm in @INC (you may need to install the IO::Pty module) 
> (@INC contains: /tmp/cirrus-ci-build/src/test/perl 
> /tmp/cirrus-ci-build/src/test/authentication /etc/perl 
> /usr/local/lib/i386-linux-gnu/perl/5.32.1 /usr/local/share/perl/5.32.1 
> /usr/lib/i386-linux-gnu/perl5/5.32 /usr/share/perl5 
> /usr/lib/i386-linux-gnu/perl/5.32 /usr/share/perl/5.32 
> /usr/local/lib/site_perl) at /usr/share/perl5/IPC/Run.pm line 1828.
> 
> Skimming the VM creation [0] it seems like it should be though?

Note it just fails on the 32bit build, not the 64bit build. Unfortunately I
don't think debian's multiarch in bullseye supports installing enough of perl
in both 32bit and 64bit.

We can't have a hard dependency on non-default modules like IO::Pty anyway, so
the test needs to skip if it's not available.

On windows IO::Pty can't be installed, IIRC.


> I don't think we should go ahead with a patch that refactors interactive_psql
> only to SKIP over it in CI (which is what the tab_completion test does now), 
> so
> let's wait until we have that sorted before going ahead.

Maybe I am a bit confused, but isn't that just an existing requirement? Why
would we expect this patchset to change what dependencies use of
interactive_psql() has?

Greetings,

Andres Freund




Re: Making background psql nicer to use in tap tests

2023-04-07 Thread Andres Freund
Hi,

On 2023-04-07 10:55:19 -0400, Andrew Dunstan wrote:
> It should probably be added to config/check_modules.pl if we're going to use
> it, but it seems to be missing for Strawberry Perl and msys/ucrt64 perl and
> I'm not sure how easy it will be to add there. It would certainly add an
> installation burden for test instances at the very least.

The last time I tried, it can't be installed on windows with cpan either, the
module simply doesn't have the necessary windows bits - likely because
traditionally windows didn't really have ptys. I think some stuff has been
added, but it probably would still require a bunch of portability work.

Note that we normally don't even build with readline support on windows - so
there's not really much point in using IO::Pty there. While I've gotten that
to work manually not too long ago, it's still manual and not documented etc.


Afaict the failures are purely about patch 2, not 1, right?

Greetings,

Andres Freund




Re: Minimal logical decoding on standbys

2023-04-07 Thread Drouvot, Bertrand

Hi,

On 4/7/23 9:50 AM, Andres Freund wrote:

Hi,
Here's my current working state - I'll go to bed soon.


Thanks a lot for this Andres!



Changes:

- shared catalog relations weren't handled correctly, because the dboid is
   InvalidOid for them. I wrote a test for that as well.

- ReplicationSlotsComputeRequiredXmin() took invalidated logical slots into
   account (ReplicationSlotsComputeLogicalRestartLSN() too, but it never looks
   at logical slots)

- I don't think the subset of slot xids that were checked when invalidating
   was right. We need to check effective_xmin and effective_catalog_xmin - the
   latter was using catalog_xmin.

- similarly, it wasn't right that specifically those two fields were
   overwritten when invalidated - as that was done, I suspect the changes might
   get lost on a restart...

- As mentioned previously, I did not like all the functions in slot.h, nor
   their naming. Not yet quite finished with that, but a good bit further

- There were a lot of unrelated changes, e.g. removing comments like
  * NB - this runs as part of checkpoint, so avoid raising errors if possible.

- I still don't like the order of the patches, fixing the walsender patches
   after introducing support for logical decoding on standby. Reordered.

- I don't think logical slots being invalidated is checked, e.g. in
   pg_logical_replication_slot_advance()

- I didn't like much that InvalidatePossiblyObsoleteSlot() switched between
   kill() and SendProcSignal() based on the "conflict". There very well could
   be reasons to use InvalidatePossiblyObsoleteSlot() with an xid from outside
   of the startup process in the future. Instead I made it differentiate based
   on MyBackendType == B_STARTUP.



Thanks for all of this and the above explanations.



I also:

Added new patch that replaces invalidated_at with a new enum, 'invalidated',
listing the reason for the invalidation.


Yeah, that's a great idea.


I added a check for !invalidated to
ReplicationSlotsComputeRequiredLSN() etc.



looked at 65-0001 and it looks good to me.


Added new patch moving checks for invalid logical slots into
CreateDecodingContext(). Otherwise we end up with 5 or so checks, which makes
no sense. As far as I can tell the old message in
pg_logical_slot_get_changes_guts() was bogus, one couldn't get there having
"never previously reserved WAL"



looked at 65-0002 and it looks good to me.


Split "Handle logical slot conflicts on standby." into two. I'm not sure that
should stay that way, but it made it easier to hack on
InvalidateObsoleteReplicationSlots.



looked at 65-0003 and the others.

It's easier to understand/read the code now that the
ReplicationSlotInvalidationCause enum has been created and that
data.invalidated also makes use of the enum. It does "simplify" the
review, and that looks good to me.



Todo:
- write a test that invalidated logical slots stay invalidated across a restart


Done in 65-66-0008 attached.


- write a test that invalidated logical slots do not lead to retaining WAL


I'm not sure how to do that since pg_switch_wal() and friends can't be
executed on a standby.


- Further evolve the API of InvalidateObsoleteReplicationSlots()
   - pass in the ReplicationSlotInvalidationCause we're trying to conflict on?
   - rename xid to snapshotConflictHorizon, that'd be more in line with the
 ResolveRecoveryConflictWithSnapshot and easier to understand, I think



Done. The new API can be found in
v65-66-InvalidateObsoleteReplicationSlots_API.patch attached. It propagates
the cause to InvalidatePossiblyObsoleteSlot() where a switch/case can now be
used. The "default" case does not emit an error since this code runs as part
of checkpoint.


- The test could stand a bit of cleanup and consolidation
   - No need to start 4 psql processes to do 4 updates, just do it in one
 safe_psql()


Right, done in v65-66-0008-New-TAP-test-for-logical-decoding-on-standby.patch 
attached.


   - the sequence of drop_logical_slots(), create_logical_slots(),
 change_hot_standby_feedback_and_wait_for_xmins(), make_slot_active() is
 repeated quite a few times


grouped in reactive_slots_change_hfs_and_wait_for_xmins() in 65-66-0008 
attached.


   - the stats queries checking for specific conflict counts, including
 preceding tests, is pretty painful. I suggest to reset the stats at the
 end of the test instead (likely also do the drop_logical_slot() there).


Good idea, done in 65-66-0008 attached.


   - it's hard to correlate postgres log and the tap test, because the slots
 are named the same across all tests. Perhaps they could have a per-test
 prefix?


Good point. Done in 65-66-0008 attached. Thanks to that and the stats reset,
the check for invalidation is now done in a single function
"check_for_invalidation" that looks for invalidation messages in the logfile
and in pg_stat_database_conflicts.

Thanks for the suggestions: the TAP test is now easier to read/understand.


   -

Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-04-07 Thread Julien Rouhaud
On Fri, Apr 07, 2023 at 09:40:14AM +, Hayato Kuroda (Fujitsu) wrote:
>
> > As I mentioned in my original thread, I'm not very familiar with that code, 
> > but
> > I'm a bit worried about "all the changes generated on publisher must be send
> > and applied".  Is that a hard requirement for the feature to work reliably?
>
> I think the requirement is needed because the existing WALs on old node 
> cannot be
> transported on new instance. The WAL hole from confirmed_flush to current 
> position
> could not be filled by newer instance.

I see, that was also the first blocker I could think of when Amit mentioned
that feature weeks ago, and I also don't see how that hole could be filled
either.

> > If
> > yes, how does this work if some subscriber node isn't connected when the
> > publisher node is stopped?  I guess you could add a check in pg_upgrade to 
> > make
> > sure that all logical slot are indeed caught up and fail if that's not the 
> > case
> > rather than assuming that a clean shutdown implies it.  It would be good to
> > cover that in the TAP test, and also cover some corner cases, like any new 
> > row
> > added on the publisher node after the pg_upgrade but before the subscriber 
> > is
> > reconnected is also replicated as expected.
>
> Hmm, good point. Current patch could not be handled the case because 
> walsenders
> for the such slots do not exist. I have tested your approach, however, I 
> found that
> CHECKPOINT_SHUTDOWN record were generated twice when publisher was
> shutted down and started. It led that the confirmed_lsn of slots always was 
> behind
> from WAL insert location and failed to upgrade every time.
> Now I do not have good idea to solve it... Do anyone have for this?

I'm wondering if we could just check that each slot's LSN is exactly
sizeof(CHECKPOINT_SHUTDOWN) ago or something like that?  That's hackish, but if
pg_upgrade can run it means it was a clean shutdown so it should be safe to
assume that what's the last record in the WAL was.  For the double
shutdown checkpoint, I'm not sure that I get the problem.  The check should
only be done at the very beginning of pg_upgrade, so there should have been
only one shutdown checkpoint done right?




Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-04-07 Thread Julien Rouhaud
On Fri, Apr 07, 2023 at 12:51:51PM +, Hayato Kuroda (Fujitsu) wrote:
> Dear Julien,
> 
> > > Agreed, but then shouldn't the option be named "--logical-slots-only" or
> > > something like that, same for all internal function names?
> > 
> > Seems right. Will be fixed in next version. Maybe
> > "--logical-replication-slots-only"
> > will be used, per Peter's suggestion [1].
> 
> After considering more, I decided not to include the word "logical" in the 
> option
> at this point. This is because we have not decided yet whether we dumps 
> physical
> replication slots or not. Current restriction has been occurred because of 
> just
> lack of analysis and considerations, If we decide not to do that, then they 
> will
> be renamed accordingly.

Well, even if physical replication slots were eventually preserved during
pg_upgrade, maybe users would like to only keep one kind or the other, so
having both options could make sense.

That being said, I have a hard time believing that we could actually preserve
physical replication slots.  I don't think that pg_upgrade final state is fully
reproducible:  not all object oids are preserved, and the various pg_restore
are run in parallel so you're very likely to end up with small physical
differences that would be incompatible with physical replication.  Even if we
could make it totally reproducible, it would probably be at the cost of making
pg_upgrade orders of magnitude slower.  And since many people are already
complaining that it's too slow, that doesn't seem like something we would want.




Re: Minimal logical decoding on standbys

2023-04-07 Thread Andres Freund
Hi,

On 2023-04-07 17:13:13 +0200, Drouvot, Bertrand wrote:
> On 4/7/23 9:50 AM, Andres Freund wrote:
> > I added a check for !invalidated to
> > ReplicationSlotsComputeRequiredLSN() etc.
> > 
> 
> looked at 65-0001 and it looks good to me.
> 
> > Added new patch moving checks for invalid logical slots into
> > CreateDecodingContext(). Otherwise we end up with 5 or so checks, which 
> > makes
> > no sense. As far as I can tell the old message in
> > pg_logical_slot_get_changes_guts() was bogus, one couldn't get there having
> > "never previously reserved WAL"
> > 
> 
> looked at 65-0002 and it looks good to me.
> 
> > Split "Handle logical slot conflicts on standby." into two. I'm not sure 
> > that
> > should stay that way, but it made it easier to hack on
> > InvalidateObsoleteReplicationSlots.
> > 
> 
> looked at 65-0003 and the others.

Thanks for checking!


> > Todo:
> > - write a test that invalidated logical slots stay invalidated across a 
> > restart
> 
> Done in 65-66-0008 attached.

Cool.


> > - write a test that invalidated logical slots do not lead to retaining WAL
> 
> I'm not sure how to do that since pg_switch_wal() and friends can't be 
> executed on
> a standby.

You can do it on the primary and wait for the records to have been applied.


> > - Further evolve the API of InvalidateObsoleteReplicationSlots()
> >- pass in the ReplicationSlotInvalidationCause we're trying to conflict 
> > on?
> >- rename xid to snapshotConflictHorizon, that'd be more in line with the
> >  ResolveRecoveryConflictWithSnapshot and easier to understand, I think
> > 
> 
> Done. The new API can be found in 
> v65-66-InvalidateObsoleteReplicationSlots_API.patch
> attached. It propagates the cause to InvalidatePossiblyObsoleteSlot() where a 
> switch/case
> can now be used.

Integrated. I moved the cause to the first argument, makes more sense to me
that way.


> The "default" case does not emit an error since this code runs as part
> of checkpoint.

I made it an error - it's a programming error, not some data level
inconsistency if that ever happens.


> > - The test could stand a bit of cleanup and consolidation
> >- No need to start 4 psql processes to do 4 updates, just do it in one
> >  safe_psql()
> 
> Right, done in v65-66-0008-New-TAP-test-for-logical-decoding-on-standby.patch 
> attached.

> >- the sequence of drop_logical_slots(), create_logical_slots(),
> >  change_hot_standby_feedback_and_wait_for_xmins(), make_slot_active() is
> >  repeated quite a few times
> 
> grouped in reactive_slots_change_hfs_and_wait_for_xmins() in 65-66-0008 
> attached.
> 
> >- the stats queries checking for specific conflict counts, including
> >  preceding tests, is pretty painful. I suggest to reset the stats at the
> >  end of the test instead (likely also do the drop_logical_slot() there).
> 
> Good idea, done in 65-66-0008 attached.
> 
> >- it's hard to correlate postgres log and the tap test, because the slots
> >  are named the same across all tests. Perhaps they could have a per-test
> >  prefix?
> 
> Good point. Done in 65-66-0008 attached. Thanks to that and the stats reset 
> the
> check for invalidation is now done in a single function 
> "check_for_invalidation" that looks
> for invalidation messages in the logfile and in pg_stat_database_conflicts.
> 
> Thanks for the suggestions: the TAP test is now easier to read/understand.

Integrated all of these.


I think pg_log_standby_snapshot() should be added in "Allow logical decoding
on standby", not the commit adding the tests.


Is this patchset sufficient to subscribe to a publication on a physical
standby, assuming the publication is created on the primary? If so, we should
have at least a minimal test. If not, we should note that restriction
explicitly.

Greetings,

Andres Freund




Re: [PATCH] Introduce array_shuffle() and array_sample()

2023-04-07 Thread Tom Lane
Daniel Gustafsson  writes:
> Ah, ok, now I see what you mean, thanks!  I'll try to fix up the patch like
> this tomorrow.

Since we're running out of time, I took the liberty of fixing and
pushing this.

regards, tom lane




Re: Making background psql nicer to use in tap tests

2023-04-07 Thread Tom Lane
Andres Freund  writes:
> On 2023-04-07 15:32:12 +0200, Daniel Gustafsson wrote:
>> I don't think we should go ahead with a patch that refactors interactive_psql
>> only to SKIP over it in CI (which is what the tab_completion test does now), 
>> so
>> let's wait until we have that sorted before going ahead.

> Maybe I am a bit confused, but isn't that just an existing requirement? Why
> would we expect this patchset to change what dependencies use of
> interactive_psql() has?

It is an existing requirement, but only for a test that's not too
critical.  If interactive_psql starts getting used for more interesting
things, we might be sad that the coverage is weak.

Having said that, weak coverage is better than no coverage.  I don't
think this point should be a show-stopper for committing.

regards, tom lane




Re: Making background psql nicer to use in tap tests

2023-04-07 Thread Daniel Gustafsson
> On 7 Apr 2023, at 16:58, Andres Freund  wrote:

> Note it just fails on the 32build, not the 64bit build. Unfortunately I don't
> think debian's multiarch in bullseye support installing enough of perl in
> 32bit and 64bit.

I should probably avoid parsing logfiles with fever-induced brainfog; I
confused myself into thinking it was both =(

> Maybe I am a bit confused, but isn't that just an existing requirement? Why
> would we expect this patchset to change what dependencies use of
> interactive_psql() has?

Correct, there is no change from the current implementation.

--
Daniel Gustafsson





Re: Making background psql nicer to use in tap tests

2023-04-07 Thread Andres Freund
Hi,

On 2023-04-07 11:52:37 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2023-04-07 15:32:12 +0200, Daniel Gustafsson wrote:
> >> I don't think we should go ahead with a patch that refactors 
> >> interactive_psql
> >> only to SKIP over it in CI (which is what the tab_completion test does 
> >> now), so
> >> let's wait until we have that sorted before going ahead.
> 
> > Maybe I am a bit confused, but isn't that just an existing requirement? Why
> > would we expect this patchset to change what dependencies use of
> > interactive_psql() has?
> 
> It is an existing requirement, but only for a test that's not too
> critical.  If interactive_psql starts getting used for more interesting
> things, we might be sad that the coverage is weak.

I don't really expect it to be used for non-critical things - after all,
interactive_psql() also depends on psql being built with readline support,
which we traditionally don't have on windows... For most tasks background_psql
should suffice...


> Having said that, weak coverage is better than no coverage.  I don't
> think this point should be a show-stopper for committing.

Yea.

One thing I wonder is whether we should have a central function for checking
if interactive_psql() is available, instead of copying 010_tab_completion.pl's
logic for it into multiple tests. But that could come later too.

Greetings,

Andres Freund




Re: [PATCH] Introduce array_shuffle() and array_sample()

2023-04-07 Thread Daniel Gustafsson
> On 7 Apr 2023, at 17:47, Tom Lane  wrote:
> 
> Daniel Gustafsson  writes:
>> Ah, ok, now I see what you mean, thanks!  I'll try to fix up the patch like
>> this tomorrow.
> 
> Since we're running out of time, I took the liberty of fixing and
> pushing this.

Great, thanks!

--
Daniel Gustafsson





Re: Making background psql nicer to use in tap tests

2023-04-07 Thread Daniel Gustafsson
> On 7 Apr 2023, at 17:04, Andres Freund  wrote:

> Afaict the failures are purely about patch 2, not 1, right?

Correct.  The attached v6 wraps the interactive_psql test in a SKIP block with
a conditional on IO::Pty being available.

--
Daniel Gustafsson



v6-0002-Test-SCRAM-iteration-changes-with-psql-password.patch
Description: Binary data


v6-0001-Refactor-background-psql-TAP-functions.patch
Description: Binary data


Re: Track IO times in pg_stat_io

2023-04-07 Thread Melanie Plageman
Attached v9 addresses review feedback as well as resolving merge
conflicts with recent relation extension patchset.

I've changed pgstat_count_io_op_time() to take a count and call
pgstat_count_io_op_n() so it can be used with smgrzeroextend(). I do
wish that the parameter to pgstat_count_io_op_n() was called "count" and
not "cnt"...

I've also reordered the call site of pgstat_count_io_op_time() in a few
locations, but I have some questions about this.

Before, I didn't think it mattered much that we didn't finish counting
IO time until after setting BM_VALID or BM_DIRTY and unsetting
BM_IO_IN_PROGRESS. With the relation extension code doing this for many
buffers at once, though, I wondered if this will make the IO timing too
inaccurate.

As such, I've moved pgstat_count_io_op_time() to before we set those
flags in all locations. I did wonder if it is bad to prolong having the
buffer pinned and not having those flags set, though.

On Tue, Apr 4, 2023 at 8:59 PM Andres Freund  wrote:
>
> Hi,
>
> On 2023-03-31 15:44:58 -0400, Melanie Plageman wrote:
> > From 789d4bf1fb749a26523dbcd2c69795916b711c68 Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman 
> > Date: Tue, 21 Mar 2023 16:00:55 -0400
> > Subject: [PATCH v8 1/4] Count IO time for temp relation writes
> >
> > Both pgstat_database and pgBufferUsage write times failed to count
> > timing for flushes of dirty local buffers when acquiring a new local
> > buffer for a temporary relation block.
>
> I think it'd be worth mentioning here that we do count read time? Otherwise
> it'd not be as clear that adding tracking increases consistency...

Done

> > From f4e0db5c833f33b30d4c0b4bebec1096a1745d81 Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman 
> > Date: Tue, 21 Mar 2023 18:20:44 -0400
> > Subject: [PATCH v8 2/4] FlushRelationBuffers() counts temp relation IO 
> > timing
> >
> > Add pgstat_database and pgBufferUsage IO timing counting to
> > FlushRelationBuffers() for writes of temporary relations.
> > ---
> >  src/backend/storage/buffer/bufmgr.c | 18 ++
> >  1 file changed, 18 insertions(+)
> >
> > diff --git a/src/backend/storage/buffer/bufmgr.c 
> > b/src/backend/storage/buffer/bufmgr.c
> > index b3adbbe7d2..05e98d5994 100644
> > --- a/src/backend/storage/buffer/bufmgr.c
> > +++ b/src/backend/storage/buffer/bufmgr.c
> > @@ -3571,6 +3571,8 @@ FlushRelationBuffers(Relation rel)
> >  {
> >   int i;
> >   BufferDesc *bufHdr;
> > + instr_time  io_start,
> > + io_time;
> >
> >   if (RelationUsesLocalBuffers(rel))
> >   {
> > @@ -3596,17 +3598,33 @@ FlushRelationBuffers(Relation rel)
> >
> >   PageSetChecksumInplace(localpage, 
> > bufHdr->tag.blockNum);
> >
> > + if (track_io_timing)
> > + INSTR_TIME_SET_CURRENT(io_start);
> > + else
> > + INSTR_TIME_SET_ZERO(io_start);
> > +
> >   smgrwrite(RelationGetSmgr(rel),
> > 
> > BufTagGetForkNum(&bufHdr->tag),
> > bufHdr->tag.blockNum,
> > localpage,
> > false);
> >
> > +
>
> Spurious newline.

Fixed.

> > From 2bdad725133395ded199ecc726096e052d6e654b Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman 
> > Date: Fri, 31 Mar 2023 15:32:36 -0400
> > Subject: [PATCH v8 3/4] Track IO times in pg_stat_io
> >
> > Add IO timing for reads, writes, extends, and fsyncs to pg_stat_io.
> >
> > Reviewed-by: Bertrand Drouvot 
> > Reviewed-by: Andres Freund 
> > Discussion: 
> > https://www.postgresql.org/message-id/flat/CAAKRu_ay5iKmnbXZ3DsauViF3eMxu4m1oNnJXqV_HyqYeg55Ww%40mail.gmail.com
> > ---
>
> > -static PgStat_BktypeIO PendingIOStats;
> > +typedef struct PgStat_PendingIO
> > +{
> > + PgStat_Counter 
> > counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> > + instr_time  
> > pending_times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> > +}PgStat_PendingIO;
>
> Probably will look less awful after adding the typedef to typedefs.list.

Done.
One day I will remember to add things to typedefs.list.

> > + /* we do track it */
> > + if (pgstat_tracks_io_op(bktype, io_object, 
> > io_context, io_op))
> > + {
> > + /* ensure that if IO times are 
> > non-zero, counts are > 0 */
> > + if 
> > (backend_io->times[io_object][io_context][io_op] != 0 &&
> > + 
> > backend_io->counts[io_object][io_context][io_op] <= 0)
> > + return false;
> > +
> >  

Re: Fix the description of GUC "max_locks_per_transaction" and "max_pred_locks_per_transaction" in guc_table.c

2023-04-07 Thread Tom Lane
"wangw.f...@fujitsu.com"  writes:
> On Tues, Apr 4, 2023 at 23:48 PM Tom Lane  wrote:
>> I like the "per eligible process" wording, at least for guc_tables.c;
>> or maybe it could be "per server process"?  That would be more
>> accurate and not much longer than what we have now.

> Thanks both for sharing your opinions.
> I agree that verbose descriptions make maintenance difficult.
> For consistency, I unified the formulas in guc_tables.c and pg-doc into the
> same suggested short formula. Attached is the new patch.

After studying this for a while, I decided "server process" is probably
the better term --- people will have some idea what that means, while
"eligible process" is not a term we use anywhere else.  Pushed with
that change and some other minor wordsmithing.

regards, tom lane




Unnecessary confirm work on logical replication

2023-04-07 Thread Emre Hasegeli
I was reading the logical replication code and found a little
unnecessary work we are doing.

The confirmed_flush_lsn cannot reasonably be ahead of the
current_lsn, so there is no point in calling
LogicalConfirmReceivedLocation() every time we update the candidate
xmin or restart_lsn.

Patch is attached.


v00-eliminate-unnecessary-confirm-work-on-logical-replication.patch
Description: Binary data


Re: Minimal logical decoding on standbys

2023-04-07 Thread Andres Freund
Hi,

On 2023-04-07 08:47:57 -0700, Andres Freund wrote:
> Integrated all of these.

Here's my current version. Changes:
- Integrated Bertrand's changes
- polished commit messages of 0001-0003
- edited code comments for 0003, including
  InvalidateObsoleteReplicationSlots()'s header
- added a bump of SLOT_VERSION to 0001
- moved addition of pg_log_standby_snapshot() to 0007
- added a catversion bump for pg_log_standby_snapshot()
- moved all the bits dealing with procsignals from 0003 to 0004, now the split
  makes sense IMO
- combined a few more successive ->safe_psql() calls

I see occasional failures in the tests, particularly in the new test using
pg_authid, but not solely. cfbot also seems to have seen these:
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F42%2F3740

I made a bogus attempt at a workaround for the pg_authid case last night. But
that didn't actually fix anything, it just changed the timing.

I think the issue is that VACUUM does not force WAL to be flushed at the end
(since it does not assign an xid). wait_for_replay_catchup() uses
$node->lsn('flush'), which, due to VACUUM not flushing, can be an LSN from
before VACUUM completed.

The problem can be made more likely by adding pg_usleep(100); before
walwriter.c's call to XLogBackgroundFlush().

We probably should introduce some infrastructure in Cluster.pm for this, but
for now I just added a 'flush_wal' table that we insert into after a
VACUUM. That guarantees a WAL flush.
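
A minimal SQL sketch of that workaround (the 'flush_wal' table is the one
mentioned above; the VACUUM target is just an example):

```sql
-- VACUUM does not flush WAL when it finishes (it assigns no xid),
-- so follow it with a trivial xid-assigning write whose commit
-- record does force a flush.
CREATE TABLE flush_wal ();

VACUUM pg_authid;
INSERT INTO flush_wal DEFAULT VALUES;  -- commit forces the WAL flush
```

After this, $node->lsn('flush') is guaranteed to be past the VACUUM's records.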


I think some of the patches might have more reviewers than really applicable,
and might also miss some. I'd appreciate it if you could go over that...

Greetings,

Andres Freund
>From 0e038eb5dfddec500fbf4625775d1fa508a208f6 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Thu, 6 Apr 2023 20:00:07 -0700
Subject: [PATCH va67 1/9] Replace a replication slot's invalidated_at LSN with
 an enum

This is mainly useful because the upcoming logical-decoding-on-standby feature
adds further reasons for invalidating slots, and we don't want to end up with
multiple invalidated_* fields, or check different attributes.

Eventually we should consider not resetting restart_lsn when invalidating a
slot due to max_slot_wal_keep_size. But that's a user visible change, so left
for later.

Increases SLOT_VERSION, due to the changed field (with a different alignment,
no less).

Reviewed-by: "Drouvot, Bertrand" 
Discussion: https://postgr.es/m/20230407075009.igg7be27ha2ht...@awork3.anarazel.de
---
 src/include/replication/slot.h  | 15 +--
 src/backend/replication/slot.c  | 28 
 src/backend/replication/slotfuncs.c |  8 +++-
 src/tools/pgindent/typedefs.list|  1 +
 4 files changed, 41 insertions(+), 11 deletions(-)

diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 8872c80cdfe..ebcb637baed 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -37,6 +37,17 @@ typedef enum ReplicationSlotPersistency
 	RS_TEMPORARY
 } ReplicationSlotPersistency;
 
+/*
+ * Slots can be invalidated, e.g. due to max_slot_wal_keep_size. If so, the
+ * 'invalidated' field is set to a value other than _NONE.
+ */
+typedef enum ReplicationSlotInvalidationCause
+{
+	RS_INVAL_NONE,
+	/* required WAL has been removed */
+	RS_INVAL_WAL,
+} ReplicationSlotInvalidationCause;
+
 /*
  * On-Disk data of a replication slot, preserved across restarts.
  */
@@ -72,8 +83,8 @@ typedef struct ReplicationSlotPersistentData
 	/* oldest LSN that might be required by this replication slot */
 	XLogRecPtr	restart_lsn;
 
-	/* restart_lsn is copied here when the slot is invalidated */
-	XLogRecPtr	invalidated_at;
+	/* RS_INVAL_NONE if valid, or the reason for having been invalidated */
+	ReplicationSlotInvalidationCause invalidated;
 
 	/*
 	 * Oldest LSN that the client has acked receipt for.  This is used as the
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 2293c0c6fc3..df23b7ed31e 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -89,7 +89,7 @@ typedef struct ReplicationSlotOnDisk
 	sizeof(ReplicationSlotOnDisk) - ReplicationSlotOnDiskConstantSize
 
 #define SLOT_MAGIC		0x1051CA1	/* format identifier */
-#define SLOT_VERSION	2		/* version for new files */
+#define SLOT_VERSION	3		/* version for new files */
 
 /* Control array for replication slot management */
 ReplicationSlotCtlData *ReplicationSlotCtl = NULL;
@@ -855,8 +855,7 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 		SpinLockAcquire(&s->mutex);
 		effective_xmin = s->effective_xmin;
 		effective_catalog_xmin = s->effective_catalog_xmin;
-		invalidated = (!XLogRecPtrIsInvalid(s->data.invalidated_at) &&
-	   XLogRecPtrIsInvalid(s->data.restart_lsn));
+		invalidated = s->data.invalidated != RS_INVAL_NONE;
 		SpinLockRelease(&s->mutex);
 
 		/* invalidated slots need not apply */
@@ -901,14 +900,20 @@ ReplicationSlotsComputeRequiredLSN(void)
 	{
 		Repl

Re: Minimal logical decoding on standbys

2023-04-07 Thread Drouvot, Bertrand

Hi,

On 4/7/23 5:47 PM, Andres Freund wrote:

Hi,


- write a test that invalidated logical slots do not lead to retaining WAL


I'm not sure how to do that since pg_switch_wal() and friends can't be
executed on a standby.


You can do it on the primary and wait for the records to have been applied.



Thanks, will give it a try in a couple of hours.




- Further evolve the API of InvalidateObsoleteReplicationSlots()
- pass in the ReplicationSlotInvalidationCause we're trying to conflict on?
- rename xid to snapshotConflictHorizon, that'd be more in line with the
  ResolveRecoveryConflictWithSnapshot and easier to understand, I think



Done. The new API can be found in
v65-66-InvalidateObsoleteReplicationSlots_API.patch attached. It propagates
the cause to InvalidatePossiblyObsoleteSlot() where a switch/case can now
be used.


Integrated. I moved the cause to the first argument, makes more sense to me
that way.


thanks!



I made it an error - it's a programming error, not some data level
inconsistency if that ever happens.


okay, makes sense.


Integrated all of these.


Thanks!




I think pg_log_standby_snapshot() should be added in "Allow logical decoding
on standby", not the commit adding the tests.


Yeah, that's a good point, I do agree.



Is this patchset sufficient to subscribe to a publication on a physical
standby, assuming the publication is created on the primary? If so, we should
have at least a minimal test. If not, we should note that restriction
explicitly.


I gave it a try and it does work.

"
node3 subscribes to node2 (standby).
Insert done in node1 (primary) where the publication is created => node3 sees
the changes.
"

I started to create the TAP test but currently stuck as the "create 
subscription" waits for a checkpoint/pg_log_standby_snapshot() on the primary.

So, trying to make use of things like:

"my %psql_subscriber = ('stdin' => '', 'stdout' => '');
$psql_subscriber{run} =
  $node_subscriber->background_psql('postgres', \$psql_subscriber{stdin},
\$psql_subscriber{stdout},
$psql_timeout);
$psql_subscriber{stdout} = '';
"

But in vain so far...

Will resume working on it in a couple of hours.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Minimal logical decoding on standbys

2023-04-07 Thread Drouvot, Bertrand

Hi,

On 4/7/23 8:12 PM, Andres Freund wrote:

Hi,

On 2023-04-07 08:47:57 -0700, Andres Freund wrote:

Integrated all of these.


Here's my current version. Changes:
- Integrated Bertrand's changes
- polished commit messages of 0001-0003
- edited code comments for 0003, including
   InvalidateObsoleteReplicationSlots()'s header
- added a bump of SLOT_VERSION to 0001
- moved addition of pg_log_standby_snapshot() to 0007
- added a catversion bump for pg_log_standby_snapshot()
- moved all the bits dealing with procsignals from 0003 to 0004, now the split
   makes sense IMO
- combined a few more successive ->safe_psql() calls



Thanks!


I see occasional failures in the tests, particularly in the new test using
pg_authid, but not solely. cfbot also seems to have seen these:
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F42%2F3740

I made a bogus attempt at a workaround for the pg_authid case last night. But
that didn't actually fix anything, it just changed the timing.

I think the issue is that VACUUM does not force WAL to be flushed at the end
(since it does not assign an xid). wait_for_replay_catchup() uses
$node->lsn('flush'), which, due to VACUUM not flushing, can be an LSN from
before VACUUM completed.

The problem can be made more likely by adding pg_usleep(100); before
walwriter.c's call to XLogBackgroundFlush().

We probably should introduce some infrastructure in Cluster.pm for this, but
for now I just added a 'flush_wal' table that we insert into after a
VACUUM. That guarantees a WAL flush.



Ack for the Cluster.pm "improvement" and thanks for the "workaround"!


I think some of the patches might have more reviewers than really applicable,
and might also miss some. I'd appreciate if you could go over that...



Sure, will do in a couple of hours.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: monitoring usage count distribution

2023-04-07 Thread Tom Lane
Nathan Bossart  writes:
> On Thu, Apr 06, 2023 at 01:32:35PM -0400, Tom Lane wrote:
>> There seems to be enough support for the existing summary function
>> definition to leave it as-is; Andres likes it for one, and I'm not
>> excited about trying to persuade him he's wrong.  But a second
>> slightly-less-aggregated summary function is clearly useful as well.
>> So I'm now thinking that we do want the patch as-submitted.
>> (Caveat: I've not read the patch, just the description.)

> In case we want to do both, here's a 0002 that changes usagecount_avg to an
> array of usage counts.

I'm not sure if there is consensus for 0002, but I reviewed and pushed
0001.  I made one non-cosmetic change: it no longer skips invalid
buffers.  Otherwise, the row for usage count 0 would be pretty useless.
Also it seemed to me that sum(buffers) ought to agree with the
shared_buffers setting.
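
For illustration, a rough sketch of how the distribution can be queried (the
function name is the one from the patch as I understand it; the second query
derives a similar distribution from the raw view, where invalid buffers show
up with NULL columns):

```sql
-- Distribution of shared buffers by usage count; since invalid buffers
-- are included, sum(buffers) should match the shared_buffers setting.
SELECT * FROM pg_buffercache_usage_counts();

-- A similar aggregation over the raw view:
SELECT usagecount, count(*) AS buffers
FROM pg_buffercache
GROUP BY usagecount
ORDER BY usagecount;
```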

regards, tom lane




Re: daitch_mokotoff module

2023-04-07 Thread Tom Lane
Tomas Vondra  writes:
> Hi, I think from the technical point of view it's sound and ready for
> commit. The patch stalled on the copyright/credit stuff, which is
> somewhat separate and mostly non-technical aspect of patches. Sorry for
> that, I'm sure it's annoying/frustrating :-(

> I see the current patch has two simple lines:

>  * This module was originally sponsored by Finance Norway /
>  * Trafikkforsikringsforeningen, and implemented by Dag Lem

> Any objections to this level of attribution in commnents?

That seems fine to me.  I'll check this over and see if I can get
it pushed today.

regards, tom lane




Re: Add index scan progress to pg_stat_progress_vacuum

2023-04-07 Thread Andres Freund
Hi,

On 2023-04-06 12:28:04 +0900, Michael Paquier wrote:
> As some say, the introduction of a new message type in pqmq.c would be
> basically a one-way door, because we'd have to maintain it in a stable
> branch.

Why would it mean that? Parallel workers are updated together with the leader,
so there's no compatibility issue?

Greetings,

Andres Freund




Re: Add index scan progress to pg_stat_progress_vacuum

2023-04-07 Thread Imseih (AWS), Sami
> The arguments of pgstat_progress_update_param() would be given by the
> worker directly as components of the 'P' message. It seems to me that
> this approach would have the simplicity to not require the setup of a
> shmem area for the extra counters, and there would be no need for a
> callback. Hence, the only thing the code paths of workers would need
> to do is to call this routine, then the leaders would increment their
> progress when they see a CFI to process the 'P' message. Also, I
> guess that we would only need an interface in backend_progress.c to
> increment counters, like pgstat_progress_incr_param(), but usable by
> workers. Like a pgstat_progress_worker_incr_param()?

So, here is what I think should be workable to give a generic
progress interface.

pgstat_progress_parallel_incr_param will be a new API that
can be called by either worker or leader from any parallel
code path that chooses to increment a progress index.

If called by a worker, it will send a 'P' message to the leader,
passing both the progress index, i.e. PROGRESS_VACUUM_INDEXES_PROCESSED,
and the value to increment by, i.e. 1 for index vacuum progress.

With that, the additional shared memory counters in ParallelVacuumState
are not needed, and the poke of the worker to the leader goes directly
through a generic backend_progress API.

Let me know your thoughts.

Thanks!

--
Sami Imseih
Amazon Web Services (AWS)





v27-0001-Report-index-vacuum-progress.patch
Description: v27-0001-Report-index-vacuum-progress.patch


Re: Making background psql nicer to use in tap tests

2023-04-07 Thread Daniel Gustafsson
> On 7 Apr 2023, at 18:14, Daniel Gustafsson  wrote:
>> On 7 Apr 2023, at 17:04, Andres Freund  wrote:

>> Afaict the failures are purely about patch 2, not 1, right?
> 
> Correct.  The attached v6 wraps the interactive_psql test in a SKIP block with
> a conditional on IO::Pty being available.

This version was green in the CFBot, so I ended up pushing it after some
documentation fixups and polish.

--
Daniel Gustafsson





Re: Show various offset arrays for heap WAL records

2023-04-07 Thread Melanie Plageman
Attached v3 is cleaned up and includes a pg_walinspect docs update as
well as some edited comments in rmgr_utils.c

On Mon, Mar 27, 2023 at 6:27 PM Peter Geoghegan  wrote:
>
> On Mon, Mar 27, 2023 at 2:29 PM Melanie Plageman
>  wrote:
> > I went to add dedup records and noticed that since the actual
> > BTDedupInterval struct is what is put in the xlog, I would need access
> > to that type from nbtdesc.c, however, including nbtree.h doesn't seem to
> > work because it includes files that cannot be included in frontend code.
>
> I suppose that the BTDedupInterval struct could have just as easily
> gone in nbtxlog.h, next to xl_btree_dedup. There might have been a
> moment where I thought about doing it that way, but I guess I found it
> slightly preferable to use that symbol name (BTDedupInterval) rather
> than (say) xl_btree_dedup_interval in places like the nearby
> BTDedupStateData struct.
>
> Actually, I suppose that it's hard to make that alternative work, at
> least without
> including nbtxlog.h in nbtree.h. Which sounds wrong.
>
> > I, of course, could make some local struct in nbtdesc.c which has an
> > OffsetNumber and a uint16, since the BTDedupInterval is pretty
> > straightforward, but that seems a bit annoying.
> > I'm probably missing something obvious, but is there a better way to do
> > this?
>
> It was probably just one of those cases where I settled on the
> arrangement that looked least odd overall. Not a particularly
> principled approach. But the approach that I'm going to take once more
> here.  ;-)
>
> All of the available alternatives are annoying in roughly the same
> way, though perhaps to varying degrees. All except one: I'm okay with
> just not adding coverage for deduplication records, for the time being
> -- just seeing the number of intervals alone is relatively informative
> with deduplication records, unlike (say) nbtree delete records. I'm
> also okay with having coverage for dedup records if you feel it's
> worth having. Your call.
>
> If we're going to have coverage for deduplication records then it
> seems to me that we have to have a struct in nbtxlog.h for your code
> to work off of. It also seems likely that we'll want to use that same
> struct within nbtxlog.c. What's less clear is what that means for the
> BTDedupInterval struct. I don't think that we should include nbtxlog.h
> in nbtree.h, nor should we do the converse.
>
> I guess maybe two identical structs would be okay. BTDedupInterval,
> and xl_btree_dedup_interval, with the former still used in nbtdedup.c,
> and the latter used through a pointer at the point that nbtxlog.c
> reads a dedup record. Then maybe at a sizeof() static assert beside
> the existing btree_xlog_dedup() assertions that check that the dedup
> state interval array matches the array taken from the WAL record.
> That's still a bit weird, but I find it preferable to any alternative
> that I can think of.

I've omitted enhancements for the dedup record type for now.

> > On another note, I've thought about how to include some example output
> > in docs, and, for example we could modify the example output in the
> > pgwalinspect docs which includes a PRUNE record already for
> > pg_get_wal_record_info() docs. We'd probably just want to keep it short.
>
> Yeah. Perhaps a PRUNE record for one of the system catalogs whose
> relfilenode is relatively recognizable. Say pg_class. It probably
> doesn't matter that much, but there is perhaps some small value in
> picking an example that is relatively easy to recreate later on (or to
> approximately recreate). I'm certainly not insisting on that, though.

I've added such an example to pg_walinspect docs.

On Tue, Mar 21, 2023 at 6:37 PM Peter Geoghegan  wrote:
>
> On Mon, Mar 13, 2023 at 6:41 PM Peter Geoghegan  wrote:
> > There are several different things that seem important to me
> > personally. These are in tension with each other, to a degree. These
> > are:
> >
> > 1. Like Andres, I'd really like to have some way of inspecting things
> > like heapam PRUNE, VACUUM, and FREEZE_PAGE records in significant
> > detail. These record types happen to be very important in general, and
> > the ability to see detailed information about the WAL record would
> > definitely help with some debugging scenarios. I've really missed
> > stuff like this while debugging serious issues under time pressure.
>
> One problem that I often run into when performing analysis of VACUUM
> using pg_walinspect is the issue of *who* pruned which heap page, for
> any given PRUNE record. Was it VACUUM/autovacuum, or was it
> opportunistic pruning? There is no way of knowing for sure right now.
> You *cannot* rely on an xid of 0 as an indicator of a given PRUNE
> record coming from VACUUM; it could just have been an opportunistic
> prune operation that happened to take place when a SELECT query ran,
> before any XID was ever allocated.
>
> I think that we should do something like the attached, to completely
> avoid this ambiguity. Th

Re: cataloguing NOT NULL constraints

2023-04-07 Thread Andres Freund
Hi,

I think there's some test instability:

Fail:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2023-04-07%2018%3A43%3A02
Subsequent success, without relevant changes:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2023-04-07%2020%3A22%3A01
Followed by a failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2023-04-07%2020%3A31%3A02

Similar failures on other animals:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=komodoensis&dt=2023-04-07%2020%3A27%3A43
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=siskin&dt=2023-04-07%2020%3A09%3A25

There's also as second type of failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2023-04-07%2020%3A23%3A35
..

I suspect there's a naming conflict between tests in different groups.

Greetings,

Andres Freund




Re: cataloguing NOT NULL constraints

2023-04-07 Thread Andres Freund
Hi,

On 2023-04-07 13:38:43 -0700, Andres Freund wrote:
> I suspect there's a naming conflict between tests in different groups.

Yep:

test: create_aggregate create_function_sql create_cast constraints triggers 
select inherit typed_table vacuum drop_if_exists updatable_views roleattributes 
create_am hash_func errors infinite_recurse

src/test/regress/sql/inherit.sql
851:create table child(f1 int not null, f2 text not null) 
inherits(inh_parent_1, inh_parent_2);

src/test/regress/sql/triggers.sql
2127:create table child partition of parent for values in ('AAA');
2266:create table child () inherits (parent);
2759:create table child () inherits (parent);

The inherit.sql part is new.
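
In other words, two scripts in the same parallel group race to create the
same relation name; whichever runs second fails intermittently. A minimal
sketch of the collision:

```sql
-- inherit.sql (parallel group member A)
create table child(f1 int not null, f2 text not null)
  inherits(inh_parent_1, inh_parent_2);

-- triggers.sql (parallel group member B, racing against A)
create table child () inherits (parent);
-- whichever CREATE runs second fails:
-- ERROR:  relation "child" already exists
```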

I'll see how hard it is to fix.

Greetings,

Andres Freund




Re: Minimal logical decoding on standbys

2023-04-07 Thread Drouvot, Bertrand

Hi,

On 4/7/23 8:27 PM, Drouvot, Bertrand wrote:

Hi,




I think some of the patches might have more reviewers than really applicable,
and might also miss some. I'd appreciate if you could go over that...



Sure, will do in a couple of hours.



That looks good to me, just few remarks:

0005 is missing author/reviewer, I'd propose:

Author: "Drouvot, Bertrand" 
Author: Andres Freund 
Reviewed-by: Andres Freund 
Reviewed-by: Robert Haas 
Reviewed-by: Amit Kapila 
Discussion: 
https://postgr.es/m/20230407075009.igg7be27ha2ht...@awork3.anarazel.de

0006, I'd propose:

Author: "Drouvot, Bertrand" 
Reviewed-By: Jeff Davis 
Reviewed-By: Robert Haas 
Reviewed-by: Amit Kapila 
Reviewed-by: Masahiko Sawada 

0007, I'd propose:

Author: "Drouvot, Bertrand" 
Author: Andres Freund 
Author: Amit Khandekar  (in an older version)
Reviewed-by: Fabrízio de Royes Mello 
Reviewed-by: Amit Kapila 
Reviewed-By: Robert Haas 

0009, I'd propose:

Author: "Drouvot, Bertrand" 
Author: Andres Freund 
Author: Amit Khandekar  (in an older version)
Reviewed-by: Fabrízio de Royes Mello 
Reviewed-by: Amit Kapila 
Reviewed-By: Robert Haas 

It's hard (given the amount of emails that have been sent during all this time),
but I do hope it's correct and that nobody has been missed.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Minimal logical decoding on standbys

2023-04-07 Thread Drouvot, Bertrand

Hi,

On 4/7/23 8:24 PM, Drouvot, Bertrand wrote:

Hi,

On 4/7/23 5:47 PM, Andres Freund wrote:

Hi,


- write a test that invalidated logical slots do not lead to retaining WAL


I'm not sure how to do that since pg_switch_wal() and friends can't be
executed on a standby.


You can do it on the primary and wait for the records to have been applied.



Thanks, will give it a try in a couple of hours.


I looked at it but I think we'd also need things like pg_walfile_name() on the
standby, which is not allowed.


Is this patchset sufficient to subscribe to a publication on a physical
standby, assuming the publication is created on the primary? If so, we should
have at least a minimal test. If not, we should note that restriction
explicitly.


I gave it a try and it does work.

"
node3 subscribes to node2 (standby).
Insert done in node1 (primary) where the publication is created => node3 sees
the changes.
"

I started to create the TAP test but currently stuck as the "create 
subscription" waits for a checkpoint/pg_log_standby_snapshot() on the primary.

So, trying to make use of things like:

"my %psql_subscriber = ('stdin' => '', 'stdout' => '');
$psql_subscriber{run} =
   $node_subscriber->background_psql('postgres', \$psql_subscriber{stdin},
     \$psql_subscriber{stdout},
     $psql_timeout);
$psql_subscriber{stdout} = '';
"

But in vain so far...



please find attached sub_in_progress.patch that "should work" but "does not" 
because
the wait_for_subscription_sync() call produces:

"
error running SQL: 'psql::1: ERROR:  recovery is in progress
HINT:  WAL control functions cannot be executed during recovery.'
while running 'psql -XAtq -d port=61441 host=/tmp/45dt3wqs2p dbname='postgres' 
-f - -v ON_ERROR_STOP=1' with sql 'SELECT pg_current_wal_lsn()'
"

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.comdiff --git a/src/test/recovery/t/035_standby_logical_decoding.pl 
b/src/test/recovery/t/035_standby_logical_decoding.pl
index 561dcd33c3..c3c0e718c8 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -8,14 +8,18 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 
-my ($stdin, $stdout, $stderr, $cascading_stdout, $cascading_stderr, $ret, 
$handle, $slot);
+my ($stdin, $stdout, $stderr, $cascading_stdout, $cascading_stderr, 
$subscriber_stdin, $subscriber_stdout, $subscriber_stderr, $ret, $handle, 
$slot);
 
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
 my $node_standby = PostgreSQL::Test::Cluster->new('standby');
 my $node_cascading_standby = 
PostgreSQL::Test::Cluster->new('cascading_standby');
+my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
 my $default_timeout = $PostgreSQL::Test::Utils::timeout_default;
+my $psql_timeout =
+  IPC::Run::timer(2 * $PostgreSQL::Test::Utils::timeout_default);
 my $res;
 
+
 # Name for the physical slot on primary
 my $primary_slotname = 'primary_physical';
 my $standby_physical_slotname = 'standby_physical';
@@ -263,6 +267,7 @@ $node_standby->init_from_backup(
has_restoring => 1);
 $node_standby->append_conf('postgresql.conf',
qq[primary_slot_name = '$primary_slotname']);
+$node_standby->append_conf('postgresql.conf', 'max_replication_slots = 6');
 $node_standby->start;
 $node_primary->wait_for_replay_catchup($node_standby);
 $node_standby->safe_psql('testdb', qq[SELECT * FROM 
pg_create_physical_replication_slot('$standby_physical_slotname');]);
@@ -280,6 +285,20 @@ $node_cascading_standby->append_conf('postgresql.conf',
 $node_cascading_standby->start;
 $node_standby->wait_for_replay_catchup($node_cascading_standby, $node_primary);
 
+###
+# Initialize subscriber node
+###
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', 'max_replication_slots = 4');
+$node_subscriber->start;
+
+my %psql_subscriber = ('subscriber_stdin' => '', 'subscriber_stdout' => '');
+$psql_subscriber{run} =
+  $node_subscriber->background_psql('postgres', 
\$psql_subscriber{subscriber_stdin},
+\$psql_subscriber{subscriber_stdout},
+$psql_timeout);
+$psql_subscriber{subscriber_stdout} = '';
+
 ##
 # Test that logical decoding on the standby
 # behaves correctly.
@@ -360,6 +379,43 @@ is( $node_primary->psql(
 3,
 'replaying logical slot from another database fails');
 
+##
+# Test that we can subscribe on the standby with the publication
+# created on the primary.
+##
+
+# Create a table on the primary
+$node_primary->safe_psql('postgres', "CREATE TABLE tab_rep (a int primary 
key)");
+
+# Create a table (same structure) on the subscriber node
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_rep (a int pri

Re: cataloguing NOT NULL constraints

2023-04-07 Thread Alvaro Herrera
On 2023-Apr-07, Andres Freund wrote:

> src/test/regress/sql/triggers.sql
> 2127:create table child partition of parent for values in ('AAA');
> 2266:create table child () inherits (parent);
> 2759:create table child () inherits (parent);
> 
> The inherit.sql part is new.

Yeah.

> I'll see how hard it is to fix.

Running the tests for it now -- it's a short fix.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Learn about compilers. Then everything looks like either a compiler or
a database, and now you have two problems but one of them is fun."
https://twitter.com/thingskatedid/status/1456027786158776329




Re: Making background psql nicer to use in tap tests

2023-04-07 Thread Daniel Gustafsson
> On 7 Apr 2023, at 22:24, Daniel Gustafsson  wrote:
> 
>> On 7 Apr 2023, at 18:14, Daniel Gustafsson  wrote:
>>> On 7 Apr 2023, at 17:04, Andres Freund  wrote:
> 
>>> Afaict the failures are purely about patch 2, not 1, right?
>> 
>> Correct.  The attached v6 wraps the interactive_psql test in a SKIP block 
>> with
>> a conditional on IO::Pty being available.
> 
> This version was green in the CFBot, so I ended up pushing it after some
> documentation fixups and polish.

Looks like morepork wasn't happy with the interactive \password test.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=morepork&dt=2023-04-07%2020%3A30%3A29

Looking into why the timer timed out in that test.

--
Daniel Gustafsson





Re: Kerberos delegation support in libpq and postgres_fdw

2023-04-07 Thread Stephen Frost
Greetings,

* David Christensen (da...@pgguru.net) wrote:
> On Wed, Apr 5, 2023 at 3:30 PM Stephen Frost  wrote:
> > Per GSS docs, seems like we should be comparing to GSS_C_NO_CREDENTIAL
> > and validating that the gflags has the `deleg_flag` bit set before
> > considering whether there are valid credentials; in practice this might be
> > the same effect (haven't looked at what that symbol actually resolves to,
> > but NULL would be sensible).
> >
> > GSS_C_NO_CREDENTIAL is indeed NULL, but updated to that anyway to be a
> > bit cleaner and also added an explicit check that GSS_C_DELEG_FLAG was
> > set in gflags.
> 
> + proxy = NULL;
> [...]
> + if (proxy != GSS_C_NO_CREDENTIAL && gflags & GSS_C_DELEG_FLAG)
> 
> We should probably also initialize "proxy" to GSS_C_NO_CREDENTIAL as well,
> yes?

Sure, done, and updated for both auth.c and be-secure-gssapi.c

> > > + /*
> > > +  * Set KRB5CCNAME for this backend, so that later calls to
> > gss_acquire_cred
> > > +  * will find the proxied credentials we stored.
> > > +  */
> > >
> > > So I'm not seeing this in other use in the code; I assume this is just
> > used by the krb5 libs?
> >
> > Not sure I'm following.  gss_acquire_cred() is called in
> > src/interfaces/libpq/fe-gssapi-common.c.
> 
> I just meant the KRB5CCNAME envvar itself; looks like my assumption was
> right.

Ah, yes, that's correct.

> So on a re-read of the v7 patch, there seems to be a bit of inconsistent
> usage between delegation and proxying; i.e., the field itself is called
> gss_proxy in the gssstatus struct, authentication messages, etc, but the
> setting and docs refer to GSS delegation.  Are there subtle distinctions
> between these? It seems like this patch is using them interchangeably, so
> it might be good to settle on one terminology here unless there are already
> well-defined categories for where to use one and where to use the other.

That's a fair point and so I've updated the patch to consistently use
'delegated credentials' and similar to match the Kerberos documentation.
In Kerberos there is *also* the concept of proxied credentials, which are
very similar to delegated credentials (they're actually
"constrained delegations"), but they're not exactly the same and that
isn't what we're doing with this particular patch (though I hope that
once we get support for unconstrained delegation, which is what this
patch is doing, we can then go add support for constrained
delegations).

Updated patch attached.

Thanks!

Stephen
From 0be69ca2720b2091d2ae79b7f5c91fa30cc27d13 Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Mon, 28 Feb 2022 20:17:55 -0500
Subject: [PATCH] Add support for Kerberos credential delegation

Support GSSAPI/Kerberos credentials being delegated to the server by a
client.  With this, a user authenticating to PostgreSQL using Kerberos
(GSSAPI) credentials can choose to delegate their credentials to the
PostgreSQL server (which can choose to accept them, or not), allowing
the server to then use those delegated credentials to connect to
another service, such as with postgres_fdw or dblink or theoretically
any other service which is able to be authenticated using Kerberos.

Both postgres_fdw and dblink are changed to allow non-superuser
password-less connections but only when GSSAPI credentials have been
delegated to the server by the client and GSSAPI is used to
authenticate to the remote system.

Authors: Stephen Frost, Peifeng Qiu
Reviewed-By: David Christensen
Discussion: https://postgr.es/m/co1pr05mb8023cc2cb575e0faad7df4f8a8...@co1pr05mb8023.namprd05.prod.outlook.com
---
 contrib/dblink/dblink.c   | 127 ---
 contrib/dblink/expected/dblink.out|   4 +-
 contrib/postgres_fdw/connection.c |  72 +++-
 .../postgres_fdw/expected/postgres_fdw.out|  19 +-
 contrib/postgres_fdw/option.c |   3 +
 contrib/postgres_fdw/sql/postgres_fdw.sql |   3 +-
 doc/src/sgml/config.sgml  |  17 +
 doc/src/sgml/dblink.sgml  |   5 +-
 doc/src/sgml/libpq.sgml   |  41 +++
 doc/src/sgml/monitoring.sgml  |   9 +
 doc/src/sgml/postgres-fdw.sgml|   7 +-
 src/backend/catalog/system_views.sql  |   3 +-
 src/backend/foreign/foreign.c |   1 +
 src/backend/libpq/auth.c  |  13 +-
 src/backend/libpq/be-gssapi-common.c  |  51 +++
 src/backend/libpq/be-secure-gssapi.c  |  26 +-
 src/backend/utils/activity/backend_status.c   |   1 +
 src/backend/utils/adt/pgstatfuncs.c   |  20 +-
 src/backend/utils/init/postinit.c |   8 +-
 src/backend/utils/misc/guc_tables.c   |  10 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/catalog/pg_proc.dat   |   6 +-
 src/include/libpq/auth.h  |   1 +
 src/include/libpq/be-gssapi-common.h  |   3 +
 src/include/libpq/libpq-be.h 

Re: cataloguing NOT NULL constraints

2023-04-07 Thread Andres Freund
Hi,

On 2023-04-07 23:00:01 +0200, Alvaro Herrera wrote:
> On 2023-Apr-07, Andres Freund wrote:
> 
> > src/test/regress/sql/triggers.sql
> > 2127:create table child partition of parent for values in ('AAA');
> > 2266:create table child () inherits (parent);
> > 2759:create table child () inherits (parent);
> > 
> > The inherit.sql part is new.
> 
> Yeah.
> 
> > I'll see how hard it is to fix.
> 
> Running the tests for it now -- it's a short fix.

I just pushed a fix - sorry, I thought you might have stopped working for the
day and CI finished with the modification a few seconds before your email
arrived...

Greetings,

Andres Freund




Re: cataloguing NOT NULL constraints

2023-04-07 Thread Alvaro Herrera
On 2023-Apr-07, Andres Freund wrote:

> I just pushed a fix - sorry, I thought you might have stopped working for the
> day and CI finished with the modification a few seconds before your email
> arrived...

Ah, cool, no worries.  I would have stopped indeed, but I had to stay
around in case of any test failures.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"No me acuerdo, pero no es cierto.  No es cierto, y si fuera cierto,
 no me acuerdo." (Augusto Pinochet a una corte de justicia)




Re: cataloguing NOT NULL constraints

2023-04-07 Thread Andres Freund
Hi,

On 2023-04-07 23:11:55 +0200, Alvaro Herrera wrote:
> On 2023-Apr-07, Andres Freund wrote:
> 
> > I just pushed a fix - sorry, I thought you might have stopped working for
> > the day and CI finished with the modification a few seconds before your
> > email arrived...
> 
> Ah, cool, no worries.  I would have stopped indeed, but I had to stay
> around in case of any test failures.

Looks like there's work for you if you want ;)
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rhinoceros&dt=2023-04-07%2018%3A52%3A13

But IMO fixing sepgsql can easily wait till tomorrow.

Greetings,

Andres Freund




Re: Minimal logical decoding on standbys

2023-04-07 Thread Andres Freund
Hi,

On 2023-04-07 22:54:01 +0200, Drouvot, Bertrand wrote:
> That looks good to me

Cool.

I think I'll push these in a few hours. While this needed more changes than
I'd like shortly before the freeze, I think they're largely not in very
interesting bits and pieces - and this feature has been in the works for about
three eternities, and it is blocking a bunch of highly requested features.

If anybody still has energy, I would appreciate a look at 0001, 0002, the new
pieces I added, to make what's now 0003 and 0004 cleaner.


> 0005 is missing author/reviewer, I'd propose:
> [...]

Thanks, I'll integrate them...


> It's hard (given the amount of emails that have been sent during all this
> time),

Indeed.

Greetings,

Andres Freund




Re: Minimal logical decoding on standbys

2023-04-07 Thread Alvaro Herrera
I gave a very quick look at 0001 and 0003.  I find no fault with 0001.
It was clear back when we added that stuff that invalidated_at was not
terribly useful -- I was just too conservative to not have it -- but now
that a lot of time has passed and we haven't done anything with it,
removing it seems perfectly OK.

As for 0003, I have no further concerns about the translatability.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"El miedo atento y previsor es la madre de la seguridad" (E. Burke)




Re: [PATCH] Allow Postgres to pick an unused port to listen

2023-04-07 Thread Yurii Rashkovskii
Hi Andrew,

On Fri, Apr 7, 2023, 7:07 p.m. Andrew Dunstan  wrote:

>
> On 2023-03-29 We 07:55, Tom Lane wrote:
>
> Yurii Rashkovskii   writes:
>
> I would like to suggest a patch against master (although it may be worth
> backporting it) that makes it possible to listen on any unused port.
>
> I think this is a bad idea, mainly because this:
>
>
> Instead, with this patch, one can specify `port` as `0` (the "wildcard"
> port) and retrieve the assigned port from postmaster.pid
>
> is a horrid way to find out what was picked, and yet there could
> be no other.
>
> Our existing design for this sort of thing is to let the testing
> framework choose the port, and I don't really see what's wrong
> with that approach.  Yes, I know it's theoretically subject to
> race conditions, but that hasn't seemed to be a problem in
> practice.  It's especially not a problem given that modern
> testing practice tends to not open any TCP port at all, just
> a Unix socket in a test-private directory, so that port
> conflicts are a non-issue.
>
>
> For TAP tests we have pretty much resolved the port collisions issue for
> TCP ports too. See commit 9b4eafcaf4
>
> Perhaps the OP could adapt that logic to his use case.
>

Thank you for referencing this commit. The reason I am suggesting my
patch is that I believe my solution is a much better way to avoid
collisions in the first place. Implementing an algorithm similar to the
one in the referenced commit is error-prone and can be difficult in
environments like shell scripts.

I'm also trying to understand what's wrong with reading the port from the
pid file: if Postgres writes information there, it's surely so that
somebody can read it; otherwise, why write it in the first place? The
proposed solution uses the operating system's functionality to achieve
collision-free mechanics with none of the complexity introduced in that
commit.
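
To illustrate the operating-system facility involved (a standalone sketch,
not patch code): binding a socket to port 0 asks the kernel to pick an
unused port, which is what makes this approach collision-free by
construction, with no probe-and-retry loop.

```python
import socket

# Bind to port 0: the kernel picks an unused ephemeral port for us.
# There is no race window, unlike probing candidate ports from a script.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.bind(("127.0.0.1", 0))
port = sock.getsockname()[1]  # the kernel-assigned port number
sock.close()
print(port)
```

A server started this way then only needs to publish the assigned port
somewhere readable, which is exactly what postmaster.pid already does.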


Re: Show various offset arrays for heap WAL records

2023-04-07 Thread Peter Geoghegan
On Fri, Apr 7, 2023 at 1:33 PM Melanie Plageman
 wrote:
> Attached v3 is cleaned up and includes a pg_walinspect docs update as
> well as some edited comments in rmgr_utils.c

Attached v4 has some small tweaks on your v3, mostly just whitespace.
Two slightly notable tweaks:

* I changed the approach to globbing in the Makefile, rather than use
your original overwide formulation for the new rmgrdesc_utils.c file.

What do you think of this approach?

* Removed use of the restrict keyword.

While "restrict" is C99, I'm not completely sure that it's totally
supported by Postgres. I'm a bit surprised that you opted to use it in
this particular patch.

I meant to ask you about this earlier...why use restrict in this patch?

> I've added such an example to pg_walinspect docs.

There already was a PRUNE example, though -- for the
pg_get_wal_record_info function (singular, not to be confused with
pg_get_wal_records_info).

v4 makes the example a VACUUM record, which replaces the previous
pg_get_wal_record_info PRUNE example -- that needed to be updated
anyway. This approach has the advantage of not being too verbose,
while still showing some of this kind of detail.

This has the advantage of allowing pg_get_wal_records_info's example
to continue to be an example that lacks a block reference (and so has
a NULL block_ref). This is a useful contrast against the new
pg_get_wal_block_info function.

> I really like this idea and would find it useful. I reviewed the patch
> and tried it out and it worked for me and code looked fine as well.
>
> I didn't include it in the attached patchset because I don't feel
> confident enough in my own understanding of any potential implications
> of splitting up these record types to definitively endorse it. But, if
> someone else felt comfortable with it, I would like to see it in the
> tree.

I'm not going to move on it now for 16, given the lack of feedback about it.

-- 
Peter Geoghegan


v4-0001-Add-rmgr_desc-utilities.patch
Description: Binary data


Re: Kerberos delegation support in libpq and postgres_fdw

2023-04-07 Thread David Christensen
Reviewed v8; largely looking good, though I notice this hunk, which may
arguably be a bug fix, but doesn't appear to be relevant to this specific
patch, so could probably be debated independently (and if a bug, should
probably be backpatched):

diff --git a/contrib/postgres_fdw/option.c b/contrib/postgres_fdw/option.c
index 4229d2048c..11d41979c6 100644
--- a/contrib/postgres_fdw/option.c
+++ b/contrib/postgres_fdw/option.c
@@ -288,6 +288,9 @@ InitPgFdwOptions(void)
  {"sslcert", UserMappingRelationId, true},
  {"sslkey", UserMappingRelationId, true},

+ /* gssencmode is also libpq option, same to above. */
+ {"gssencmode", UserMappingRelationId, true},
+
  {NULL, InvalidOid, false}
  };

That said, should "gssdeleg" be exposed as a user mapping option?  (This
shows up in postgres_fdw; not sure if there are other places that would be
relevant, like in dblink somewhere as well, just a thought.)

Best,

David


Re: cataloguing NOT NULL constraints

2023-04-07 Thread Tom Lane
Andres Freund  writes:
> On 2023-04-07 23:11:55 +0200, Alvaro Herrera wrote:
>> Ah, cool, no worries.  I would have stopped indeed, but I had to stay
>> around in case of any test failures.

> Looks like there's work for you if you want ;)
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rhinoceros&dt=2023-04-07%2018%3A52%3A13

> But IMO fixing sepgsql can easily wait till tomorrow.

I can deal with that one -- it's a bit annoying to work with sepgsql
if you're not on a Red Hat platform.

After quickly eyeing the diffs, I'm just going to take the new output
as good.  I'm not surprised that there are additional output messages
given the additional catalog entries this made.  I *am* a bit surprised
that some messages seem to have disappeared --- are there places where
this resulted in fewer catalog accesses than before?  Nonetheless,
there's no good reason to assume this test is exposing any bugs.

regards, tom lane




Re: Kerberos delegation support in libpq and postgres_fdw

2023-04-07 Thread Stephen Frost
Greetings,

* David Christensen (da...@pgguru.net) wrote:
> Reviewed v8; largely looking good, though I notice this hunk, which may
> arguably be a bug fix, but doesn't appear to be relevant to this specific
> patch, so could probably be debated independently (and if a bug, should
> probably be backpatched):
> 
> diff --git a/contrib/postgres_fdw/option.c b/contrib/postgres_fdw/option.c
> index 4229d2048c..11d41979c6 100644
> --- a/contrib/postgres_fdw/option.c
> +++ b/contrib/postgres_fdw/option.c
> @@ -288,6 +288,9 @@ InitPgFdwOptions(void)
>   {"sslcert", UserMappingRelationId, true},
>   {"sslkey", UserMappingRelationId, true},
> 
> + /* gssencmode is also libpq option, same to above. */
> + {"gssencmode", UserMappingRelationId, true},
> +
>   {NULL, InvalidOid, false}
>   };

Hmm, yeah, hard to say if that makes sense at a user-mapping level or
not.  Agreed that we could have an independent discussion regarding
that and if it should be back-patched, so removed it from this patch.

> That said, should "gssdeleg" be exposed as a user mapping option?  (This
> shows up in postgres_fdw; not sure if there are other places that would be
> relevant, like in dblink somewhere as well, just a thought.)

Ah, yeah, that certainly makes sense to have as optional for a user
mapping.  dblink doesn't have the distinction between server-level
options and user mapping options (as it doesn't have user mappings at
all really) so it doesn't have something similar.

Updated patch attached.

Thanks!

Stephen
From 87642bc75e7d4f3d986d4f100e6ee00711155bc7 Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Mon, 28 Feb 2022 20:17:55 -0500
Subject: [PATCH] Add support for Kerberos credential delegation

Support GSSAPI/Kerberos credentials being delegated to the server by a
client.  With this, a user authenticating to PostgreSQL using Kerberos
(GSSAPI) credentials can choose to delegate their credentials to the
PostgreSQL server (which can choose to accept them, or not), allowing
the server to then use those delegated credentials to connect to
another service, such as with postgres_fdw or dblink or theoretically
any other service which is able to be authenticated using Kerberos.

Both postgres_fdw and dblink are changed to allow non-superuser
password-less connections but only when GSSAPI credentials have been
delegated to the server by the client and GSSAPI is used to
authenticate to the remote system.

Authors: Stephen Frost, Peifeng Qiu
Reviewed-By: David Christensen
Discussion: https://postgr.es/m/co1pr05mb8023cc2cb575e0faad7df4f8a8...@co1pr05mb8023.namprd05.prod.outlook.com
---
 contrib/dblink/dblink.c   | 127 ---
 contrib/dblink/expected/dblink.out|   4 +-
 contrib/postgres_fdw/connection.c |  72 +++-
 .../postgres_fdw/expected/postgres_fdw.out|  19 +-
 contrib/postgres_fdw/option.c |   6 +
 contrib/postgres_fdw/sql/postgres_fdw.sql |   3 +-
 doc/src/sgml/config.sgml  |  17 +
 doc/src/sgml/dblink.sgml  |   5 +-
 doc/src/sgml/libpq.sgml   |  41 +++
 doc/src/sgml/monitoring.sgml  |   9 +
 doc/src/sgml/postgres-fdw.sgml|   7 +-
 src/backend/catalog/system_views.sql  |   3 +-
 src/backend/foreign/foreign.c |   1 +
 src/backend/libpq/auth.c  |  13 +-
 src/backend/libpq/be-gssapi-common.c  |  51 +++
 src/backend/libpq/be-secure-gssapi.c  |  26 +-
 src/backend/utils/activity/backend_status.c   |   1 +
 src/backend/utils/adt/pgstatfuncs.c   |  20 +-
 src/backend/utils/init/postinit.c |   8 +-
 src/backend/utils/misc/guc_tables.c   |  10 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/catalog/pg_proc.dat   |   6 +-
 src/include/libpq/auth.h  |   1 +
 src/include/libpq/be-gssapi-common.h  |   3 +
 src/include/libpq/libpq-be.h  |   2 +
 src/include/utils/backend_status.h|   1 +
 src/interfaces/libpq/exports.txt  |   1 +
 src/interfaces/libpq/fe-auth.c|  15 +-
 src/interfaces/libpq/fe-connect.c |  17 +
 src/interfaces/libpq/fe-secure-gssapi.c   |  23 +-
 src/interfaces/libpq/libpq-fe.h   |   1 +
 src/interfaces/libpq/libpq-int.h  |   2 +
 src/test/kerberos/Makefile|   3 +
 src/test/kerberos/t/001_auth.pl   | 331 --
 src/test/perl/PostgreSQL/Test/Utils.pm|  27 ++
 src/test/regress/expected/rules.out   |  11 +-
 36 files changed, 752 insertions(+), 136 deletions(-)

diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index 78a8bcee6e..533eb90e62 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -48,6 +48,7 @@
 #include "funcapi.h"
 #include "lib/stringinfo.h"
 #include "libpq-fe.h"
+#include "libpq/libpq-be.h"
 #include "libpq/libpq-be-fe-helpers.

Re: Kerberos delegation support in libpq and postgres_fdw

2023-04-07 Thread David Christensen
Ok, based on the interdiff there, I'm happy with that last change.  Marking
as Ready For Committer.

Best,

David


Re: Making background psql nicer to use in tap tests

2023-04-07 Thread Daniel Gustafsson
> On 7 Apr 2023, at 23:01, Daniel Gustafsson  wrote:

> Looks like morepork wasn't happy with the interactive \password test.
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=morepork&dt=2023-04-07%2020%3A30%3A29
> 
> Looking into why the timer timed out in that test.

Staring at this I've been unable to figure out if there is an underlying
problem here or a flaky testrun, since I can't reproduce it.  Maybe the
animal owner (on cc) has some insight?

The test has passed on several different platforms in the buildfarm, including
Linux, Solaris, macOS, NetBSD, FreeBSD and other OpenBSD animals.  It also
passed in an OpenBSD VM running with our Cirrus framework.

Unless there are objections raised I propose leaving it in for now, and I will
return to it tomorrow after some sleep, and install OpenBSD 6.9 to see if it's
reproducible.

--
Daniel Gustafsson





Re: Commitfest 2023-03 starting tomorrow!

2023-04-07 Thread Kirk Wolak
On Fri, Apr 7, 2023 at 10:21 AM Greg Stark  wrote:

> As announced on this list feature freeze is at 00:00 April 8 AoE.
> That's less than 24 hours away. If you need to set your watches to AoE
> timezone it's currently:
>
> $ TZ=AOE+12 date
> Fri 07 Apr 2023 02:05:50 AM AOE
>
> As we stand we have:
>
> Status summary:
>   Needs review: 82
>   Waiting on Author:16
>   Ready for Committer:  27
>   Committed:   115
>   Moved to next CF: 38
>   Returned with Feedback:   10
>   Rejected:  9
>   Withdrawn:22
> Total: 319.
>
> In less than 24h most of the remaining patches will get rolled forward
> to the next CF. The 16 that are Waiting on Author might be RwF
> perhaps. The only exceptions would be non-features like Bug Fixes and
> cleanup patches that have been intentionally held until the end --
> those become Open Issues for the release.
>
> So if we move forward all the remaining patches (so these numbers are
> high by about half a dozen) the *next* CF would look like:
>
> Commitfest 2023-07:Now  April 8
>   Needs review: 46. 128
>   Waiting on Author:17.  33
>   Ready for Committer:   3.  30
> Total:  66  191
>
> I suppose that's better than the 319 we came into this CF with but
> there's 3 months to accumulate more unreviewed patches...
>
> I had hoped to find lots of patches that I could bring the hammer down
> on and say there's just no interest in or there's no author still
> maintaining. But that wasn't the case. Nearly all the patches still
> had actively interested authors and looked like they were legitimately
> interesting and worthwhile features that people just haven't had the
> time to review or commit.
>
>
> --
> greg
>
The %T added to the PSQL Prompt is about 5 lines of code.  Reviewed and
Ready to commit.
That could knock one more off really quickly :-)

Excellent work to everyone.

Thanks, Kirk


Re: cataloguing NOT NULL constraints

2023-04-07 Thread Andres Freund
Hi,

On 2023-04-07 17:46:33 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2023-04-07 23:11:55 +0200, Alvaro Herrera wrote:
> >> Ah, cool, no worries.  I would have stopped indeed, but I had to stay
> >> around in case of any test failures.
> 
> > Looks like there's work for you if you want ;)
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rhinoceros&dt=2023-04-07%2018%3A52%3A13
> 
> > But IMO fixing sepgsql can easily wait till tomorrow.
> 
> I can deal with that one -- it's a bit annoying to work with sepgsql
> if you're not on a Red Hat platform.

Indeed. I tried to get them running a while back, to enable the tests with
meson, without lot of success. Then I realized that they're also not wired up
in make... ;)


> After quickly eyeing the diffs, I'm just going to take the new output
> as good.  I'm not surprised that there are additional output messages
> given the additional catalog entries this made.  I *am* a bit surprised
> that some messages seem to have disappeared --- are there places where
> this resulted in fewer catalog accesses than before?  Nonetheless,
> there's no good reason to assume this test is exposing any bugs.

I wonder if the issue is that the new paths miss a hook invocation.

@@ -160,11 +160,7 @@
 ALTER TABLE regtest_table ALTER b SET DEFAULT 'XYZ';-- not supported yet
 ALTER TABLE regtest_table ALTER b DROP DEFAULT; -- not supported yet
 ALTER TABLE regtest_table ALTER b SET NOT NULL;
-LOG:  SELinux: allowed { setattr } 
scontext=unconfined_u:unconfined_r:sepgsql_regtest_superuser_t:s0 
tcontext=unconfined_u:object_r:sepgsql_table_t:s0 tclass=db_column 
name="regtest_schema_2.regtest_table.b" permissive=0
-LOG:  SELinux: allowed { setattr } 
scontext=unconfined_u:unconfined_r:sepgsql_regtest_superuser_t:s0 
tcontext=unconfined_u:object_r:sepgsql_table_t:s0 tclass=db_column 
name="regtest_schema.regtest_table_2.b" permissive=0
 ALTER TABLE regtest_table ALTER b DROP NOT NULL;
-LOG:  SELinux: allowed { setattr } 
scontext=unconfined_u:unconfined_r:sepgsql_regtest_superuser_t:s0 
tcontext=unconfined_u:object_r:sepgsql_table_t:s0 tclass=db_column 
name="regtest_schema_2.regtest_table.b" permissive=0
-LOG:  SELinux: allowed { setattr } 
scontext=unconfined_u:unconfined_r:sepgsql_regtest_superuser_t:s0 
tcontext=unconfined_u:object_r:sepgsql_table_t:s0 tclass=db_column 
name="regtest_schema.regtest_table_2.b" permissive=0
 ALTER TABLE regtest_table ALTER b SET STATISTICS -1;
 LOG:  SELinux: allowed { setattr } 
scontext=unconfined_u:unconfined_r:sepgsql_regtest_superuser_t:s0 
tcontext=unconfined_u:object_r:sepgsql_table_t:s0 tclass=db_column 
name="regtest_schema_2.regtest_table.b" permissive=0
 LOG:  SELinux: allowed { setattr } 
scontext=unconfined_u:unconfined_r:sepgsql_regtest_superuser_t:s0 
tcontext=unconfined_u:object_r:sepgsql_table_t:s0 tclass=db_column 
name="regtest_schema.regtest_table_2.b" permissive=0

The 'not supported yet' cases don't emit messages. Previously SET NOT NULL
wasn't among that set, but seemingly it now is.

Greetings,

Andres Freund




Re: cataloguing NOT NULL constraints

2023-04-07 Thread Tom Lane
Andres Freund  writes:
> On 2023-04-07 17:46:33 -0400, Tom Lane wrote:
>> After quickly eyeing the diffs, I'm just going to take the new output
>> as good.  I'm not surprised that there are additional output messages
>> given the additional catalog entries this made.  I *am* a bit surprised
>> that some messages seem to have disappeared --- are there places where
>> this resulted in fewer catalog accesses than before?  Nonetheless,
>> there's no good reason to assume this test is exposing any bugs.

> I wonder if the issue is that the new paths miss a hook invocation.

Perhaps.  I'm content to silence the buildfarm for today; we can
investigate more closely later.

regards, tom lane




Re: Commitfest 2023-03 starting tomorrow!

2023-04-07 Thread Tom Lane
Kirk Wolak  writes:
> The %T added to the PSQL Prompt is about 5 lines of code.  Reviewed and
> Ready to commit.
> That could knock one more off really quickly :-)

I'm still objecting to it, for the same reason as before.

regards, tom lane




Re: Minimal logical decoding on standbys

2023-04-07 Thread Melanie Plageman
Code review only of 0001-0005.

I noticed you had two 0008, btw.

On Fri, Apr 07, 2023 at 11:12:26AM -0700, Andres Freund wrote:
> Hi,
> 
> On 2023-04-07 08:47:57 -0700, Andres Freund wrote:
> > Integrated all of these.
> 
> From 0e038eb5dfddec500fbf4625775d1fa508a208f6 Mon Sep 17 00:00:00 2001
> From: Andres Freund 
> Date: Thu, 6 Apr 2023 20:00:07 -0700
> Subject: [PATCH va67 1/9] Replace a replication slot's invalidated_at LSN with
>  an enum
> 
> diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
> index 8872c80cdfe..ebcb637baed 100644
> --- a/src/include/replication/slot.h
> +++ b/src/include/replication/slot.h
> @@ -37,6 +37,17 @@ typedef enum ReplicationSlotPersistency
>   RS_TEMPORARY
>  } ReplicationSlotPersistency;
>  
> +/*
> + * Slots can be invalidated, e.g. due to max_slot_wal_keep_size. If so, the
> + * 'invalidated' field is set to a value other than _NONE.
> + */
> +typedef enum ReplicationSlotInvalidationCause
> +{
> + RS_INVAL_NONE,
> + /* required WAL has been removed */

I just wonder if RS_INVAL_WAL is too generic. Something like
RS_INVAL_WAL_MISSING or similar may be better, since it seems there are
other invalidation causes that may be related to WAL.

> + RS_INVAL_WAL,
> +} ReplicationSlotInvalidationCause;
> +

0002 LGTM

> From 52c25cc15abc4470d19e305d245b9362e6b8d6a3 Mon Sep 17 00:00:00 2001
> From: Andres Freund 
> Date: Fri, 7 Apr 2023 09:32:48 -0700
> Subject: [PATCH va67 3/9] Support invalidating replication slots due to
>  horizon and wal_level
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
> 
> Needed for supporting logical decoding on a standby. The new invalidation
> methods will be used in a subsequent commit.
> 

You probably are aware, but applying 0003 and 0004 both gives me two
warnings:

warning: 1 line adds whitespace errors.
Warning: commit message did not conform to UTF-8.
You may want to amend it after fixing the message, or set the config
variable i18n.commitEncoding to the encoding your project uses.

> diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
> index df23b7ed31e..c2a9accebf6 100644
> --- a/src/backend/replication/slot.c
> +++ b/src/backend/replication/slot.c
> @@ -1241,8 +1241,58 @@ ReplicationSlotReserveWal(void)
>  }
>  
>  /*
> - * Helper for InvalidateObsoleteReplicationSlots -- acquires the given slot
> - * and mark it invalid, if necessary and possible.
> + * Report that replication slot needs to be invalidated
> + */
> +static void
> +ReportSlotInvalidation(ReplicationSlotInvalidationCause cause,
> +bool terminating,
> +int pid,
> +NameData slotname,
> +XLogRecPtr restart_lsn,
> +XLogRecPtr oldestLSN,
> +TransactionId 
> snapshotConflictHorizon)
> +{
> + StringInfoData err_detail;
> + boolhint = false;
> +
> + initStringInfo(&err_detail);
> +
> + switch (cause)
> + {
> + case RS_INVAL_WAL:
> + hint = true;
> + appendStringInfo(&err_detail, _("The slot's restart_lsn 
> %X/%X exceeds the limit by %llu bytes."),
> +  
> LSN_FORMAT_ARGS(restart_lsn),

I'm not sure what the below cast is meant to do. If you are trying to
protect against overflow/underflow, I think you'd need to cast before
doing the subtraction.

> +  (unsigned long long) 
> (oldestLSN - restart_lsn));
> + break;
> + case RS_INVAL_HORIZON:
> + appendStringInfo(&err_detail, _("The slot conflicted 
> with xid horizon %u."),
> +  
> snapshotConflictHorizon);
> + break;
> +
> + case RS_INVAL_WAL_LEVEL:
> + appendStringInfo(&err_detail, _("Logical decoding on 
> standby requires wal_level to be at least logical on the primary server"));
> + break;
> + case RS_INVAL_NONE:
> + pg_unreachable();
> + }

This ereport is quite hard to read. Is there any simplification you can
do of the ternaries without undue duplication?

> + ereport(LOG,
> + terminating ?
> + errmsg("terminating process %d to release replication 
> slot \"%s\"",
> +pid, NameStr(slotname)) :
> + errmsg("invalidating obsolete replication slot \"%s\"",
> +NameStr(slotname)),
> + errdetail_internal("%s", err_detail.data),
> + hint ? errhint("You might need to increase 
> max_slot_wal_keep_size.") : 0);
> +
> + pfree

Re: Making background psql nicer to use in tap tests

2023-04-07 Thread Tom Lane
Daniel Gustafsson  writes:
>> On 7 Apr 2023, at 23:01, Daniel Gustafsson  wrote:
> Staring at this I've been unable to figure out if there is an underlying
> problem here or a flaky testrun, since I can't reproduce it.  Maybe the
> animal owner (on cc) has some insight?

> The test has passed on several different platforms in the buildfarm, including
> Linux, Solaris, macOS, NetBSD, FreeBSD and other OpenBSD animals.  It also
> passed in an OpenBSD VM running with our Cirrus framework.

prion and mantid have now failed with the same symptom.  I don't
see a pattern, but it's not OpenBSD-only.  It will be interesting
to see if the failure is intermittent or not on those animals.

> Unless there are objections raised I propose leaving it in for now, and I will
> return to it tomorrow after some sleep, and install OpenBSD 6.9 to see if it's
> reproducible.

Agreed, we don't need a hasty revert here.  Better to gather data.

regards, tom lane




Re: cataloguing NOT NULL constraints

2023-04-07 Thread Tom Lane
... BTW, shouldn't
https://commitfest.postgresql.org/42/3869/
now get closed as committed?

regards, tom lane




Re: Making background psql nicer to use in tap tests

2023-04-07 Thread Daniel Gustafsson
> On 8 Apr 2023, at 00:35, Tom Lane  wrote:
> 
> Daniel Gustafsson  writes:
>>> On 7 Apr 2023, at 23:01, Daniel Gustafsson  wrote:
>> Staring at this I've been unable to figure out if there is an underlying
>> problem here or a flaky testrun, since I can't reproduce it.  Maybe the
>> animal owner (on cc) has some insight?
> 
>> The test has passed on several different platforms in the buildfarm,
>> including Linux, Solaris, macOS, NetBSD, FreeBSD and other OpenBSD
>> animals.  It also passed in an OpenBSD VM running with our Cirrus
>> framework.
> 
> prion and mantid have now failed with the same symptom.  I don't
> see a pattern, but it's not OpenBSD-only.  It will be interesting
> to see if the failure is intermittent or not on those animals.

It would be interesting to know how far into the pumped input they get; do
they time out on the first one with nothing going through?  I will
investigate further tomorrow.

--
Daniel Gustafsson





Re: Show various offset arrays for heap WAL records

2023-04-07 Thread Melanie Plageman
On Fri, Apr 7, 2023 at 5:43 PM Peter Geoghegan  wrote:
>
> On Fri, Apr 7, 2023 at 1:33 PM Melanie Plageman
>  wrote:
> > Attached v3 is cleaned up and includes a pg_walinspect docs update as
> > well as some edited comments in rmgr_utils.c
>
> Attached v4 has some small tweaks on your v3. Mostly just whitespace
> tweaks. Two slightly notable tweaks:
>
> * I changed the approach to globbing in the Makefile, rather than use
> your original overwide formulation for the new rmgrdesc_utils.c file.
>
> What do you think of this approach?

Seems fine.

> * Removed use of the restrict keyword.
>
> While "restrict" is C99, I'm not completely sure that it's totally
> supported by Postgres. I'm a bit surprised that you opted to use it in
> this particular patch.
>
> I meant to ask you about this earlier...why use restrict in this patch?


So, I think the signature I meant to have was:

void
array_desc(StringInfo buf, void *array, size_t elem_size, int count,
  void (*elem_desc) (StringInfo buf, const void *elem, void *data),
  void *data)

Basically, I wanted to indicate that elem is not and should not be
modified, that data can be modified, and that the two should not be the
same element or overlap at all.

> > I've added such an example to pg_walinspect docs.
>
> There already was a PRUNE example, though -- for the
> pg_get_wal_record_info function (singular, not to be confused with
> pg_get_wal_records_info).
>
> v4 makes the example a VACUUM record, which replaces the previous
> pg_get_wal_record_info PRUNE example -- that needed to be updated
> anyway. This approach has the advantage of not being too verbose,
> which still showing some of this kind of detail.
>
> This has the advantage of allowing pg_get_wal_records_info's example
> to continue to be an example that lacks a block reference (and so has
> a NULL block_ref). This is a useful contrast against the new
> pg_get_wal_block_info function.

LGTM

- Melanie




Re: Show various offset arrays for heap WAL records

2023-04-07 Thread Peter Geoghegan
On Fri, Apr 7, 2023 at 4:01 PM Melanie Plageman
 wrote:
> LGTM

Pushed, thanks.

-- 
Peter Geoghegan




Re: Making background psql nicer to use in tap tests

2023-04-07 Thread Daniel Gustafsson
> On 8 Apr 2023, at 00:59, Daniel Gustafsson  wrote:
> 
>> On 8 Apr 2023, at 00:35, Tom Lane  wrote:
>> 
>> Daniel Gustafsson  writes:
 On 7 Apr 2023, at 23:01, Daniel Gustafsson  wrote:
>>> Staring at this I've been unable to figure out if there is an underlying
>>> problem here or a flaky testrun, since I can't reproduce it.  Maybe the
>>> animal owner (on cc) has some insights?
>> 
>>> The test has passed on several different platforms in the buildfarm, 
>>> including
>>> Linux, Solaris, macOS, NetBSD, FreeBSD and other OpenBSD animals.  It also
>>> passed in an OpenBSD VM running with our Cirrus framework.
>> 
>> prion and mantid have now failed with the same symptom.  I don't
>> see a pattern, but it's not OpenBSD-only.  It will be interesting
>> to see if the failure is intermittent or not on those animals.

morepork has failed again, which is good, since intermittent failures are
harder to track down.

> It would be interesting to know how far in the pumped input they get, if they
> time out on the first one with nothing going through?  Will investigate
> further tomorrow to see.

Actually, one quick datapoint.  prion and mantid report running IPC::Run
version 0.92, and morepork 0.96.  Animals that pass are running 20180523.0,
20200505.0, 20220807.0 or similar versions.  We don't print the IO::Pty version
during configure, but maybe this is related to older versions of the modules,
and these tests (not all of them, apparently) need to SKIP if IO::Pty is missing
or too old?  Somewhere to start looking at the very least.

--
Daniel Gustafsson





Re: Show various offset arrays for heap WAL records

2023-04-07 Thread Melanie Plageman
On Fri, Apr 7, 2023 at 7:09 PM Peter Geoghegan  wrote:
>
> On Fri, Apr 7, 2023 at 4:01 PM Melanie Plageman
>  wrote:
> > LGTM
>
> Pushed, thanks.

It's come to my attention that I forgot to include the btree patch earlier.

PFA
From 4f502b2513ba79d738e7ed87aaf7d18ed2a2e30f Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Mon, 13 Mar 2023 18:15:17 -0400
Subject: [PATCH v4 2/2] Add detail to some btree xlog record descs

Suggested by Peter Geoghegan

Discussion: https://postgr.es/m/flat/20230109215842.fktuhesvayno6o4g%40awork3.anarazel.de
---
 src/backend/access/rmgrdesc/nbtdesc.c| 84 +---
 src/backend/access/rmgrdesc/rmgrdesc_utils.c |  6 ++
 src/include/access/rmgrdesc_utils.h  |  2 +
 3 files changed, 81 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index c5dc543a0f..9ffece109d 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -15,6 +15,58 @@
 #include "postgres.h"
 
 #include "access/nbtxlog.h"
+#include "access/rmgrdesc_utils.h"
+
+static void btree_del_desc(StringInfo buf, char *block_data, uint16 ndeleted,
+		   uint16 nupdated);
+static void btree_update_elem_desc(StringInfo buf, void *restrict update, void
+   *restrict data);
+
+static void
+btree_del_desc(StringInfo buf, char *block_data, uint16 ndeleted, uint16 nupdated)
+{
+	OffsetNumber *updatedoffsets;
+	xl_btree_update *updates;
+	OffsetNumber *data = (OffsetNumber *) block_data;
+
+	appendStringInfoString(buf, ", deleted:");
+	array_desc(buf, data, sizeof(OffsetNumber), ndeleted, &offset_elem_desc, NULL);
+
+	appendStringInfoString(buf, ", updated:");
+	array_desc(buf, data, sizeof(OffsetNumber), nupdated, &offset_elem_desc, NULL);
+
+	if (nupdated <= 0)
+		return;
+
+	updatedoffsets = (OffsetNumber *)
+		((char *) data + ndeleted * sizeof(OffsetNumber));
+	updates = (xl_btree_update *) ((char *) updatedoffsets +
+   nupdated *
+   sizeof(OffsetNumber));
+
+	appendStringInfoString(buf, ", updates:");
+	array_desc(buf, updates, sizeof(xl_btree_update),
+			   nupdated, &btree_update_elem_desc,
+			   &updatedoffsets);
+}
+
+static void
+btree_update_elem_desc(StringInfo buf, void *restrict update, void *restrict data)
+{
+	xl_btree_update *new_update = (xl_btree_update *) update;
+	OffsetNumber *updated_offset = *((OffsetNumber **) data);
+
+	appendStringInfo(buf, "{ updated offset: %u, ndeleted tids: %u", *updated_offset, new_update->ndeletedtids);
+
+	appendStringInfoString(buf, ", deleted tids:");
+
+	array_desc(buf, (char *) new_update + SizeOfBtreeUpdate,
+			   sizeof(uint16), new_update->ndeletedtids, &uint16_elem_desc, NULL);
+
+	updated_offset++;
+
+	appendStringInfo(buf, " }");
+}
 
 void
 btree_desc(StringInfo buf, XLogReaderState *record)
@@ -31,7 +83,7 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 xl_btree_insert *xlrec = (xl_btree_insert *) rec;
 
-appendStringInfo(buf, "off %u", xlrec->offnum);
+appendStringInfo(buf, "off: %u", xlrec->offnum);
 break;
 			}
 		case XLOG_BTREE_SPLIT_L:
@@ -39,7 +91,7 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 xl_btree_split *xlrec = (xl_btree_split *) rec;
 
-appendStringInfo(buf, "level %u, firstrightoff %d, newitemoff %d, postingoff %d",
+appendStringInfo(buf, "level: %u, firstrightoff: %d, newitemoff: %d, postingoff: %d",
  xlrec->level, xlrec->firstrightoff,
  xlrec->newitemoff, xlrec->postingoff);
 break;
@@ -48,31 +100,41 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 xl_btree_dedup *xlrec = (xl_btree_dedup *) rec;
 
-appendStringInfo(buf, "nintervals %u", xlrec->nintervals);
+appendStringInfo(buf, "nintervals: %u", xlrec->nintervals);
 break;
 			}
 		case XLOG_BTREE_VACUUM:
 			{
 xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
 
-appendStringInfo(buf, "ndeleted %u; nupdated %u",
+appendStringInfo(buf, "ndeleted: %u, nupdated: %u",
  xlrec->ndeleted, xlrec->nupdated);
+
+if (!XLogRecHasBlockImage(record, 0))
+	btree_del_desc(buf, XLogRecGetBlockData(record, 0, NULL),
+   xlrec->ndeleted, xlrec->nupdated);
+
 break;
 			}
 		case XLOG_BTREE_DELETE:
 			{
 xl_btree_delete *xlrec = (xl_btree_delete *) rec;
 
-appendStringInfo(buf, "snapshotConflictHorizon %u; ndeleted %u; nupdated %u",
+appendStringInfo(buf, "snapshotConflictHorizon: %u, ndeleted: %u, nupdated: %u",
  xlrec->snapshotConflictHorizon,
  xlrec->ndeleted, xlrec->nupdated);
+
+if (!XLogRecHasBlockImage(record, 0))
+	btree_del_desc(buf, XLogRecGetBlockData(record, 0, NULL),
+   xlrec->ndeleted, xlrec->nupdated);
+
 break;
 			}
 		case XLOG_BTREE_MARK_PAGE_HALFDEAD:
 			{
 xl_btree_mark_page_halfdead *xlrec = (xl_btree_mark_page_halfdead *) rec;
 
-appendStringInfo(buf, "topparent %u; leaf %u; left %u; right %u",
+appendStringInfo

Re: Disable rdns for Kerberos tests

2023-04-07 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > Push, thanks again!
> 
> Why'd you only change HEAD?  Isn't the test equally fragile in the
> back branches?

Back-patched.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: Disable rdns for Kerberos tests

2023-04-07 Thread Stephen Frost
Greetings,

* Stephen Frost (sfr...@snowman.net) wrote:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
> > Stephen Frost  writes:
> > > Push, thanks again!
> > 
> > Why'd you only change HEAD?  Isn't the test equally fragile in the
> > back branches?
> 
> Following on from this after some additional cross-platform testing,
> turns out there's other options we should be disabling in these tests to
> avoid depending on DNS for the test.
> 
> Attached is another patch which, for me at least, seems to prevent the
> tests from causing any DNS requests to happen.  This also means that the
> tests run in a reasonable time even in cases where DNS is entirely
> broken (the resolver set in /etc/resolv.conf doesn't respond).
> 
> Barring objections, my plan is to commit this change soon and to
> back-patch both patches to supported branches.

Done.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: Kerberos delegation support in libpq and postgres_fdw

2023-04-07 Thread Stephen Frost
Greetings,

* David Christensen (da...@pgguru.net) wrote:
> Ok, based on the interdiff there, I'm happy with that last change.  Marking
> as Ready For Committer.

Great, thanks!

I'm going to go through it again myself but I feel reasonably good about
it and if nothing else pops and there aren't objections, I'll push it
before feature freeze.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: Show various offset arrays for heap WAL records

2023-04-07 Thread Peter Geoghegan
On Fri, Apr 7, 2023 at 4:21 PM Melanie Plageman
 wrote:
> It's come to my attention that I forgot to include the btree patch earlier.

Pushed that one too.

Also removed the use of the "restrict" keyword here.

Thanks
--
Peter Geoghegan




Re: Making background psql nicer to use in tap tests

2023-04-07 Thread Andrew Dunstan


On 2023-04-07 Fr 19:14, Daniel Gustafsson wrote:

On 8 Apr 2023, at 00:59, Daniel Gustafsson  wrote:


On 8 Apr 2023, at 00:35, Tom Lane  wrote:

Daniel Gustafsson  writes:

On 7 Apr 2023, at 23:01, Daniel Gustafsson  wrote:

Staring at this I've been unable to figure out if there is an underlying problem
here or a flaky testrun, since I can't reproduce it.  Maybe the animal owner
(on cc) has some insights?
The test has passed on several different platforms in the buildfarm, including
Linux, Solaris, macOS, NetBSD, FreeBSD and other OpenBSD animals.  It also
passed in an OpenBSD VM running with our Cirrus framework.

prion and mantid have now failed with the same symptom.  I don't
see a pattern, but it's not OpenBSD-only.  It will be interesting
to see if the failure is intermittent or not on those animals.

morepork has failed again, which is good, since intermittent failures are
harder to track down.


It would be interesting to know how far in the pumped input they get, if they
time out on the first one with nothing going through?  Will investigate further
tomorrow to see.

Actually, one quick datapoint.  prion and mantid report running IPC::Run
version 0.92, and morepork 0.96.  Animals that pass are running 20180523.0,
20200505.0, 20220807.0 or similar versions.  We don't print the IO::Pty version
during configure, but maybe this is related to older versions of the modules,
and these tests (not all of them, apparently) need to SKIP if IO::Pty is missing
or too old?  Somewhere to start looking at the very least.



Those aren't CPAN version numbers. See 


prion was running 1.10 (dated to 2010). I have just updated it to 1.17 
(the CPAN latest). We'll see if that makes a difference.



cheers


andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com


Re: Is RecoveryConflictInterrupt() entirely safe in a signal handler?

2023-04-07 Thread Michael Paquier
On Sat, Apr 08, 2023 at 01:32:22AM +1200, Thomas Munro wrote:
> I'm hoping to get just the regex changes in ASAP, and then take a
> little bit longer on the recovery conflict patch itself (v6-0005) on
> the basis that it's bugfix work and not subject to the feature freeze.

Agreed.  It would be good to check with the RMT, but as long as that's
not at the middle/end of the beta cycle I guess that's OK for this
one, even if it is only for HEAD.
--
Michael


signature.asc
Description: PGP signature


Re: Track IO times in pg_stat_io

2023-04-07 Thread Andres Freund
Hi,

On 2023-04-07 12:17:38 -0400, Melanie Plageman wrote:
> Attached v9 addresses review feedback as well as resolving merge
> conflicts with recent relation extension patchset.

I've edited it a bit more:

- removed pgstat_tracks_io_time() and replaced it by returning the new
  IO_COL_INVALID = -1 from pgstat_get_io_time_index() when there's no time

- moved PgStat_Counter count, time into the respective branches. It feels
  somewhat wrong to access the time when we then decide there is no time.

- s/io_object/io_obj/ in pgstat_count_io_op_time(), combined with added
  linebreaks, got the code to under 80 chars

- renamed pg_stat_microseconds_to_milliseconds to pg_stat_us_to_ms

- removed a spurious newline

- the times reported by pg_stat_io had their fractional part removed, due to
  pg_stat_us_to_ms returning an integer


Verifying this, I saw that the write time visible in pg_stat_io didn't quite
match what I saw in log_checkpoints. But not always. Eventually I figured out
that that's not pg_stat_io's fault - log_checkpoint's write includes a lot of
things, including several other CheckPoint* routines, flushing WAL, asking the
kernel to flush things to disk...  The biggest portion in my case were the
smgrwriteback() calls - which pg_stat_io doesn't track - oops.

Pushed up to and including 0003.


> I've changed pgstat_count_io_op_time() to take a count and call
> pgstat_count_io_op_n() so it can be used with smgrzeroextend(). I do
> wish that the parameter to pgstat_count_io_op_n() was called "count" and
> not "cnt"...

Heh.


> I've also reordered the call site of pgstat_count_io_op_time() in a few
> locations, but I have some questions about this.
> 
> Before, I didn't think it mattered much that we didn't finish counting
> IO time until after setting BM_VALID or BM_DIRTY and unsetting
> BM_IO_IN_PROGRESS. With the relation extension code doing this for many
> buffers at once, though, I wondered if this will make the IO timing too
> inaccurate.

> As such, I've moved pgstat_count_io_op_time() to before we set those
> flags in all locations. I did wonder if it is bad to prolong having the
> buffer pinned and not having those flags set, though.

I went back and forth about this before. I think it's ok the way you did it.


I think 0004 needs a bit more work. At the very least we would have to swap
the order of pgstat_flush_pending_entries() and pgstat_flush_io() - entirely
doable. Unlike 0003, this doesn't make pg_stat_io more complete, or such, so
I'm inclined to leave it for 17.  I think there might be some more
opportunities for having counts "flow down", like the patch does.

Greetings,

Andres Freund




Re: Is RecoveryConflictInterrupt() entirely safe in a signal handler?

2023-04-07 Thread Tom Lane
Michael Paquier  writes:
> On Sat, Apr 08, 2023 at 01:32:22AM +1200, Thomas Munro wrote:
>> I'm hoping to get just the regex changes in ASAP, and then take a
>> little bit longer on the recovery conflict patch itself (v6-0005) on
>> the basis that it's bugfix work and not subject to the feature freeze.

> Agreed.  It would be good to check with the RMT, but as long as that's
> not at the middle/end of the beta cycle I guess that's OK for this
> one, even if it is only for HEAD.

Right.  regex changes pass an eyeball check here.

regards, tom lane




Re: monitoring usage count distribution

2023-04-07 Thread Nathan Bossart
On Fri, Apr 07, 2023 at 02:29:31PM -0400, Tom Lane wrote:
> I'm not sure if there is consensus for 0002, but I reviewed and pushed
> 0001.  I made one non-cosmetic change: it no longer skips invalid
> buffers.  Otherwise, the row for usage count 0 would be pretty useless.
> Also it seemed to me that sum(buffers) ought to agree with the
> shared_buffers setting.

Makes sense.  Thanks!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: cataloguing NOT NULL constraints

2023-04-07 Thread Andres Freund
Hi,

On 2023-04-07 18:26:28 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2023-04-07 17:46:33 -0400, Tom Lane wrote:
> >> After quickly eyeing the diffs, I'm just going to take the new output
> >> as good.  I'm not surprised that there are additional output messages
> >> given the additional catalog entries this made.  I *am* a bit surprised
> >> that some messages seem to have disappeared --- are there places where
> >> this resulted in fewer catalog accesses than before?  Nonetheless,
> >> there's no good reason to assume this test is exposing any bugs.
> 
> > I wonder if the issue is that the new paths miss a hook invocation.
> 
> Perhaps.  I'm content to silence the buildfarm for today; we can
> investigate more closely later.

Makes sense.

I think
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2023-04-07%2021%3A16%3A04
might point out a problem with the pg_dump or pg_upgrade backward compat
paths:

--- C:\\prog\\bf/root/upgrade.drongo/HEAD/origin-REL9_5_STABLE.sql.fixed	2023-04-07 23:51:27.641328600 +
+++ C:\\prog\\bf/root/upgrade.drongo/HEAD/converted-REL9_5_STABLE-to-HEAD.sql.fixed	2023-04-07 23:51:27.672571900 +
@@ -416,9 +416,9 @@
 -- Name: entry; Type: TABLE; Schema: public; Owner: buildfarm
 --
 CREATE TABLE public.entry (
-accession text,
-eid integer,
-txid smallint
+accession text NOT NULL,
+eid integer NOT NULL,
+txid smallint NOT NULL
 );
 ALTER TABLE public.entry OWNER TO buildfarm;
 --

Looks like we're making up NOT NULL constraints when migrating from 9.5, for
some reason?

Greetings,

Andres Freund




Re: Making background psql nicer to use in tap tests

2023-04-07 Thread Tom Lane
Andrew Dunstan  writes:
>> Actually, one quick datapoint.  prion and mantid report running IPC::Run
>> version 0.92, and morepork 0.96.  Animals that pass are running 20180523.0,
>> 20200505.0, 20220807.0 or similar versions.  We don't print the IO::Pty
>> version during configure, but maybe this is related to older versions of the
>> modules, and these tests (not all of them, apparently) need to SKIP if
>> IO::Pty is missing or too old?  Somewhere to start looking at the very least.

> prion was running 1.10 (dated to 2010). I have just updated it to 1.17 
> (the CPAN latest). We'll see if that makes a difference.

I've been doing some checking with perlbrew locally.  It appears to not
be about IO::Pty so much as IPC::Run: it works with IPC::Run 0.99 but
not 0.79.  Still bisecting to identify exactly what's the minimum
okay version.

regards, tom lane




Re: cataloguing NOT NULL constraints

2023-04-07 Thread Andres Freund
Hi,

On 2023-04-07 17:19:42 -0700, Andres Freund wrote:
> I think
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2023-04-07%2021%3A16%3A04
> might point out a problem with the pg_dump or pg_upgrade backward compat
> paths:
> 
> --- C:\\prog\\bf/root/upgrade.drongo/HEAD/origin-REL9_5_STABLE.sql.fixed	2023-04-07 23:51:27.641328600 +
> +++ C:\\prog\\bf/root/upgrade.drongo/HEAD/converted-REL9_5_STABLE-to-HEAD.sql.fixed	2023-04-07 23:51:27.672571900 +
> @@ -416,9 +416,9 @@
>  -- Name: entry; Type: TABLE; Schema: public; Owner: buildfarm
>  --
>  CREATE TABLE public.entry (
> -accession text,
> -eid integer,
> -txid smallint
> +accession text NOT NULL,
> +eid integer NOT NULL,
> +txid smallint NOT NULL
>  );
>  ALTER TABLE public.entry OWNER TO buildfarm;
>  --
> 
> Looks like we're making up NOT NULL constraints when migrating from 9.5, for
> some reason?

My compiler complains:

../../../../home/andres/src/postgresql/src/backend/catalog/heap.c: In function ‘AddRelationNotNullConstraints’:
../../../../home/andres/src/postgresql/src/backend/catalog/heap.c:2829:37: warning: ‘conname’ may be used uninitialized [-Wmaybe-uninitialized]
 2829 | if (strcmp(lfirst(lc2), conname) == 0)
      | ^~~~
../../../../home/andres/src/postgresql/src/backend/catalog/heap.c:2802:29: note: ‘conname’ was declared here
 2802 | char   *conname;
      | ^~~

I think the compiler may be right - I think the first use of conname might
have been intended as constr->conname?

Greetings,

Andres Freund




Re: Making background psql nicer to use in tap tests

2023-04-07 Thread Tom Lane
I wrote:
> I've been doing some checking with perlbrew locally.  It appears to not
> be about IO::Pty so much as IPC::Run: it works with IPC::Run 0.99 but
> not 0.79.  Still bisecting to identify exactly what's the minimum
> okay version.

The answer is: it works with IPC::Run >= 0.98.  The version of IO::Pty
doesn't appear significant; it works at least back to 1.00 from early
2002.

IPC::Run 0.98 is relatively new (2018), so I don't think it'd fly
to make that our new minimum version across-the-board.  I recommend
just setting up this one test to SKIP if IPC::Run is too old.

regards, tom lane




Re: Making background psql nicer to use in tap tests

2023-04-07 Thread Andres Freund
Hi,

On 2023-04-07 20:49:39 -0400, Tom Lane wrote:
> I wrote:
> > I've been doing some checking with perlbrew locally.  It appears to not
> > be about IO::Pty so much as IPC::Run: it works with IPC::Run 0.99 but
> > not 0.79.  Still bisecting to identify exactly what's the minimum
> > okay version.
> 
> The answer is: it works with IPC::Run >= 0.98.  The version of IO::Pty
> doesn't appear significant; it works at least back to 1.00 from early
> 2002.
> 
> IPC::Run 0.98 is relatively new (2018), so I don't think it'd fly
> to make that our new minimum version across-the-board.  I recommend
> just setting up this one test to SKIP if IPC::Run is too old.

Does the test actually take a while before it fails, or is it quick? It's
possible the failure is caused by 001_password.pl's use of
set_query_timer_restart(). I don't think other tests do something quite
comparable.

Greetings,

Andres Freund




Re: Making background psql nicer to use in tap tests

2023-04-07 Thread Tom Lane
Andres Freund  writes:
> On 2023-04-07 20:49:39 -0400, Tom Lane wrote:
>> IPC::Run 0.98 is relatively new (2018), so I don't think it'd fly
>> to make that our new minimum version across-the-board.  I recommend
>> just setting up this one test to SKIP if IPC::Run is too old.

> Does the test actually take a while before it fails, or is it quick?

It times out at whatever your PG_TEST_TIMEOUT_DEFAULT is.  I waited
3 minutes the first time, and then reduced that to 20sec for the
rest of the tries ...

regards, tom lane




Re: daitch_mokotoff module

2023-04-07 Thread Tom Lane
I wrote:
> That seems fine to me.  I'll check this over and see if I can get
> it pushed today.

I pushed this after some mostly-cosmetic fiddling.  Most of the
buildfarm seems okay with it, but crake's perlcritic run is not:

./contrib/fuzzystrmatch/daitch_mokotoff_header.pl: I/O layer ":utf8" used at line 15, column 5.  Use ":encoding(UTF-8)" to get strict validation.  ([InputOutput::RequireEncodingWithUTF8Layer] Severity: 5)

Any suggestions on exactly how to pacify that?

regards, tom lane




check_GUC_init(wal_writer_flush_after) fails with non-default block size

2023-04-07 Thread Thomas Munro
Hi,

If you build with --with-wal-blocksize=/-Dwal_blocksize= anything but
8, this breaks:

running bootstrap script ... LOG:  GUC (PGC_INT) wal_writer_flush_after, boot_val=256, C-var=128
TRAP: failed Assert("check_GUC_init(hentry->gucvar)"), File: "guc.c", Line: 1519, PID: 84605
From 48d971e0b19f770991e334b8dc38422462b4485e Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sat, 8 Apr 2023 13:12:48 +1200
Subject: [PATCH] Fix default wal_writer_flush_after value.

Commit a73952b7956 requires default values in guc_table.c and C variable
initializers to match.  This one only matched for XLOG_BLCKSZ == 8kB.
Fix by using the same expression in both places with a new DEFAULT_XXX
macro, as done for other GUCs.

diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index 65e84be39b..266fbc2339 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -68,7 +68,7 @@
  * GUC parameters
  */
 int			WalWriterDelay = 200;
-int			WalWriterFlushAfter = 128;
+int			WalWriterFlushAfter = DEFAULT_WAL_WRITER_FLUSH_AFTER;
 
 /*
  * Number of do-nothing loops before lengthening the delay time, and the
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index cf7f465ddb..916f6e2cfa 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2778,7 +2778,7 @@ struct config_int ConfigureNamesInt[] =
 			GUC_UNIT_XBLOCKS
 		},
 		&WalWriterFlushAfter,
-		(1024 * 1024) / XLOG_BLCKSZ, 0, INT_MAX,
+		DEFAULT_WAL_WRITER_FLUSH_AFTER, 0, INT_MAX,
 		NULL, NULL, NULL
 	},
 
diff --git a/src/include/postmaster/walwriter.h b/src/include/postmaster/walwriter.h
index 22281a97ba..af25cf0025 100644
--- a/src/include/postmaster/walwriter.h
+++ b/src/include/postmaster/walwriter.h
@@ -16,6 +16,8 @@
 extern PGDLLIMPORT int WalWriterDelay;
 extern PGDLLIMPORT int WalWriterFlushAfter;
 
+#define DEFAULT_WAL_WRITER_FLUSH_AFTER ((1024 * 1024) / XLOG_BLCKSZ)
+
 extern void WalWriterMain(void) pg_attribute_noreturn();
 
 #endif			/* _WALWRITER_H */
-- 
2.40.0



Re: daitch_mokotoff module

2023-04-07 Thread Andres Freund
Hi,

On 2023-04-07 21:13:43 -0400, Tom Lane wrote:
> I wrote:
> > That seems fine to me.  I'll check this over and see if I can get
> > it pushed today.
> 
> I pushed this after some mostly-cosmetic fiddling.  Most of the
> buildfarm seems okay with it, but crake's perlcritic run is not:
> 
> ./contrib/fuzzystrmatch/daitch_mokotoff_header.pl: I/O layer ":utf8" used at line 15, column 5.  Use ":encoding(UTF-8)" to get strict validation.  ([InputOutput::RequireEncodingWithUTF8Layer] Severity: 5)
> 
> Any suggestions on exactly how to pacify that?

You could follow its advice and replace the :utf8 with :encoding(UTF-8); that
works here.  Or disable it in that piece of code with ## no critic
(RequireEncodingWithUTF8Layer).  Or we could disable the warning in
perlcriticrc for all files?

Unless it's not available with old versions, using :encoding(UTF-8) seems
sensible?

Greetings,

Andres Freund



